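I'm looking for an efficient way to create sparse vectors in PySpark using DataFrames.
Let's say we are given the following transactional input: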
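from pyspark.sql import SparkSession

# In the pyspark shell `spark` already exists; getOrCreate() reuses it.
spark = SparkSession.builder.getOrCreate()

# One row per transaction: (id, category).
df = spark.createDataFrame([
    (0, "a"),
    (1, "a"),
    (1, "b"),
    (1, "c"),
    (2, "a"),
    (2, "b"),
    (2, "b"),
    (2, "b"),
    (2, "c"),
    (0, "a"),
    (1, "b"),
    (1, "b"),
    (2, "cc"),
    (3, "a"),
    (4, "a"),
    (5, "c")
], ["id", "category"])

df.show()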
+---+--------+
| id|category|
+---+--------+
|  0|       a|
|  1|       a|
|  1|       b|
|  1|       c|
|  2|       a|
|  2|       b|
|  2|       b|
|  2|       b|
|  2|       c|
|  0|       a|
|  1|       b|
|  1|       b|
…
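
To make the target concrete, here is a minimal plain-Python sketch of what a sparse vector per id could look like for this input. It rests on two assumptions that the question text above does not confirm: that each vector should hold per-category counts for one id, and that categories are assigned column indices in sorted order. The names `index_of` and `sparse_by_id` are made up for illustration.

```python
from collections import Counter

# Same (id, category) rows as the input above.
rows = [
    (0, "a"), (1, "a"), (1, "b"), (1, "c"),
    (2, "a"), (2, "b"), (2, "b"), (2, "b"), (2, "c"),
    (0, "a"), (1, "b"), (1, "b"), (2, "cc"),
    (3, "a"), (4, "a"), (5, "c"),
]

# Assumption: each distinct category gets a column index in sorted order.
categories = sorted({category for _, category in rows})
index_of = {category: i for i, category in enumerate(categories)}

# Count how often each (id, column index) pair occurs.
counts = Counter((row_id, index_of[category]) for row_id, category in rows)

# Collect only the non-zero entries per id as {column index: count}.
sparse_by_id = {}
for (row_id, col), n in sorted(counts.items()):
    sparse_by_id.setdefault(row_id, {})[col] = float(n)

# The vector length is the number of distinct categories.
size = len(categories)
for row_id, entries in sparse_by_id.items():
    print(row_id, size, entries)
```

Each `(size, entries)` pair has the shape that `pyspark.ml.linalg.SparseVector(size, entries)` accepts. This sketch runs entirely on the driver, so it only illustrates the desired result; an efficient solution would presumably do the indexing and counting with DataFrame operations instead.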