Context: I have a DataFrame with two columns, word and vector, where the "vector" column is of type VectorUDT.
An example:
word | vector
assert | [435,323,324,212...]
I want to get this:
word | v1 | v2 | v3 | v4 | v5 | v6 ......
assert | 435 | 323 | 324 | 212 | ....
Question:
How can I split a column containing a vector into one column per dimension using PySpark?
Thanks in advance.
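A minimal sketch of one approach, assuming Spark 3.0+, where vector_to_array turns a VectorUDT column into a plain array that can be indexed per dimension. It is shown in Scala to match the later snippets in this document; PySpark exposes the same function in pyspark.ml.functions. The DataFrame df and the dimensionality n are hypothetical:

import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.sql.functions.col

// df is assumed to hold the "word" (string) and "vector" (VectorUDT) columns.
val asArray = df.withColumn("arr", vector_to_array(col("vector")))

// With a known, fixed dimensionality n, project each array element
// into its own column v1..vn next to the word column.
val n = 4 // hypothetical; use the real vector size
val cols = col("word") +: (0 until n).map(i => col("arr").getItem(i).alias(s"v${i + 1}"))
val split = asArray.select(cols: _*)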
When predicting labels with Spark ML, the resulting DataFrame is:
scala> result.show
+-----------+--------------+
|probability|predictedLabel|
+-----------+--------------+
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.1,0.9]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.0,1.0]| 0.0|
| [0.1,0.9]| 0.0|
| [0.6,0.4]| 1.0|
| [0.6,0.4]| 1.0|
| [1.0,0.0]| 1.0|
| [0.9,0.1]| 1.0|
| [0.9,0.1]| 1.0|
| [1.0,0.0]| 1.0|
| [1.0,0.0]| 1.0|
+-----------+--------------+
only showing top 20 rows
I want to create a new DataFrame with an additional column called prob, holding the first value of the Vector in the original DataFrame's probability column, for example:
+-----------+--------------+----------+
|probability|predictedLabel|   prob   |
+-----------+--------------+----------+
…
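A minimal sketch of one way to get there, assuming Spark 3.0+ where org.apache.spark.ml.functions.vector_to_array is available; result is the DataFrame shown above:

import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.sql.functions.col

// Convert the VectorUDT column to array<double>, then take element 0 as prob.
val withProb = result.withColumn("prob", vector_to_array(col("probability")).getItem(0))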
I am trying to do the following. Given this DataFrame:
+-----+-------------------------+----------+-------------------------------------------+
|label|features |prediction|probability |
+-----+-------------------------+----------+-------------------------------------------+
|0.0 |(3,[],[]) |0 |[0.9999999999999979,2.093996169658831E-15] |
|1.0 |(3,[0,1,2],[0.1,0.1,0.1])|0 |[0.999999999999999,9.891337521299582E-16] |
|2.0 |(3,[0,1,2],[0.2,0.2,0.2])|0 |[0.9999999999999979,2.0939961696578572E-15]|
|3.0 |(3,[0,1,2],[9.0,9.0,9.0])|1 |[2.093996169659668E-15,0.9999999999999979] |
|4.0 |(3,[0,1,2],[9.1,9.1,9.1])|1 |[9.89133752128275E-16,0.999999999999999] |
|5.0 |(3,[0,1,2],[9.2,9.2,9.2])|1 |[2.0939961696605603E-15,0.9999999999999979]|
+-----+-------------------------+----------+-------------------------------------------+
I want to convert the DataFrame above by adding two more columns, prob1 and prob2, each holding the corresponding element of the probability vector.
I found similar questions, one in PySpark and another in Scala. I don't know how to translate the PySpark code, and I get an error from the Scala code.
PySpark code:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

split1_udf = udf(lambda value: value[0].item(), FloatType())
split2_udf = udf(lambda value: value[1].item(), FloatType())
output2 = randomforestoutput.select(split1_udf('probability').alias('c1'), split2_udf('probability').alias('c2'))
Or, appending these columns to the original DataFrame:
randomforestoutput.withColumn('c1', split1_udf('probability')).withColumn('c2', split2_udf('probability'))
Scala code:
import org.apache.spark.sql.functions.udf
val getPOne = udf((v: org.apache.spark.mllib.linalg.Vector) => v(1))
model.transform(testDf).select(getPOne($"probability"))
I get the following error when running the Scala code:
scala> predictions.select(getPOne(col("probability"))).show(false)
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(probability)' due …
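The likely cause, given the snippet above: DataFrame-based spark.ml models produce org.apache.spark.ml.linalg.Vector values, while getPOne is typed against the old RDD-era org.apache.spark.mllib.linalg.Vector, so the analyzer cannot match the argument type. A sketch of a fix, re-typing the UDFs against the ml package (column names prob1/prob2 as described above):

import org.apache.spark.ml.linalg.Vector // note: ml, not mllib
import org.apache.spark.sql.functions.{col, udf}

// UDFs typed against ml.linalg.Vector resolve against the VectorUDT
// that DataFrame-based pipelines attach to the probability column.
val getPOne = udf((v: Vector) => v(0))
val getPTwo = udf((v: Vector) => v(1))

val withProbs = predictions
  .withColumn("prob1", getPOne(col("probability")))
  .withColumn("prob2", getPTwo(col("probability")))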
I have two columns: one of type Integer and one of type linalg.Vector. I can convert the linalg.Vector to an array, and each array has 32 elements. I want to turn each element of the array into its own column. So the input looks like:
column1              column2
(3, 5, 25, ...., 12) 3
(2, 7, 15, ...., 10) 4
(1, 10, 12, ..., 35) 2
The output should be:
column1_1 column1_2 column1_3 ......... column1_32 column2
3         5         25        ......... 12         3
2         7         15        ......... 10         4
1         10        12        ......... 35         2
But in my case the array has 32 elements, so the approach from the similar question Convert Array of String column to multiple columns would mean writing out far too many columns by hand in Spark Scala.
I tried several approaches, but none of them worked. What is the correct way to do this?
Thanks a lot.
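A minimal sketch of one approach, assuming column1 has already been converted to an array column of fixed length 32 (the DataFrame name df is hypothetical): generate the 32 projections programmatically instead of writing them out by hand.

import org.apache.spark.sql.functions.col

// Build column1_1 .. column1_32 by indexing into the array column,
// then keep column2 alongside them.
val n = 32
val perElement = (0 until n).map(i => col("column1").getItem(i).alias(s"column1_${i + 1}"))
val result = df.select(perElement :+ col("column2"): _*)

If column1 is still a linalg.Vector rather than an array, vector_to_array (Spark 3.0+) can do that conversion first, as in the earlier sketches.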