相关疑难解决方法(0)

为什么Spark的OneHotEncoder默认删除最后一个类别？

我想了解Spark的OneHotEncoder默认情况下放弃最后一个类别的理性.

例如:

>>> fd = spark.createDataFrame( [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x","c"])
>>> ss = StringIndexer(inputCol="c",outputCol="c_idx")
>>> ff = ss.fit(fd).transform(fd)
>>> ff.show()
+----+---+-----+
|   x|  c|c_idx|
+----+---+-----+
| 1.0|  a|  0.0|
| 1.5|  a|  0.0|
|10.0|  b|  1.0|
| 3.2|  c|  2.0|
+----+---+-----+

Run Code Online (Sandbox Code Playgroud)

默认情况下,OneHotEncoder将删除最后一个类别:

>>> oe = OneHotEncoder(inputCol="c_idx",outputCol="c_idx_vec")
>>> fe = oe.transform(ff)
>>> fe.show()
+----+---+-----+-------------+
|   x|  c|c_idx|    c_idx_vec|
+----+---+-----+-------------+
| 1.0|  a|  0.0|(2,[0],[1.0])|
| 1.5|  a|  0.0|(2,[0],[1.0])|
|10.0|  b|  1.0|(2,[1],[1.0])|
| 3.2|  c|  2.0|    (2,[],[])|
+----+---+-----+-------------+ …

Run Code Online (Sandbox Code Playgroud)

machine-learning bigdata apache-spark pyspark one-hot-encoding

Cor*_*rey

2017 09-23

11
推荐指数

1
解决办法

2673
查看次数