PySpark: OneHotEncoder output looks odd

Question

Eli*_*hle 0 apache-spark pyspark apache-spark-mllib one-hot-encoding

The Spark documentation includes a PySpark example for its OneHotEncoder:

from pyspark.ml.feature import OneHotEncoder, StringIndexer

df = spark.createDataFrame([
    (0, "a"),
    (1, "b"),
    (2, "c"),
    (3, "a"),
    (4, "a"),
    (5, "c")
], ["id", "category"])

stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df)

encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.transform(indexed)
encoded.show()

I expected the categoryVec column to look like this:

[0.0, 0.0]
[1.0, 0.0]
[0.0, 1.0]
[0.0, 0.0]
[0.0, 0.0]
[0.0, 1.0]

But categoryVec actually looks like this:

(2, [0], [1.0])
(2, [], [])
(2, [1], [1.0])
(2, [0], [1.0])
(2, [0], [1.0])
(2, [1], [1.0])

What does this mean? How should I read this output, and what is the reason behind this somewhat odd format?

Answer 1

小智 5

Nothing odd here. These are just SparseVectors:

The first element is the size of the vector.
The first array [...] is the list of indices.
The second array is the list of values.

Any index that is not explicitly listed has the value 0.0.
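To make the reading rule concrete, here is a minimal plain-Python sketch that expands each printed triple from the question into the dense vector it stands for. The helper name sparse_to_dense is hypothetical and not part of PySpark; it only mirrors the size / indices / values rule described above.

```python
# Hypothetical helper, not part of PySpark: expands the (size, indices, values)
# triple that Spark prints for a SparseVector into a plain dense list.
def sparse_to_dense(size, indices, values):
    dense = [0.0] * size              # every position defaults to 0.0
    for i, v in zip(indices, values):
        dense[i] = v                  # only the listed indices carry a value
    return dense

# The six rows of the categoryVec column from the question, in order.
rows = [
    (2, [0], [1.0]),
    (2, [], []),
    (2, [1], [1.0]),
    (2, [0], [1.0]),
    (2, [0], [1.0]),
    (2, [1], [1.0]),
]

dense_rows = [sparse_to_dense(*row) for row in rows]
for row, dense in zip(rows, dense_rows):
    print(row, "->", dense)
```

In PySpark itself, pyspark.ml.linalg.SparseVector offers toArray(), which returns the same dense form as a NumPy array. The vectors have length 2 for three categories because OneHotEncoder's dropLast parameter defaults to true, so the last category index (here "b", row 1) is encoded as all zeros. As for the format, a likely reason is compactness: a one-hot vector has at most one non-zero entry, so storing only that entry is cheaper when there are many categories.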
