
Combine PySpark DataFrame ArrayType fields into a single ArrayType field

I have a PySpark DataFrame with two ArrayType fields:

>>> df
DataFrame[id: string, tokens: array<string>, bigrams: array<string>]
>>> df.take(1)
[Row(id='ID1', tokens=['one', 'two', 'two'], bigrams=['one two', 'two two'])]


I want to combine them into a single ArrayType field:

>>> df2
DataFrame[id: string, tokens_bigrams: array<string>]
>>> df2.take(1)
[Row(id='ID1', tokens_bigrams=['one', 'two', 'two', 'one two', 'two two'])]


The syntax that works for strings does not seem to work here:

df2 = df.withColumn('tokens_bigrams', df.tokens + df.bigrams)


Thanks!
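
A minimal sketch of one way this could be approached, not necessarily what the accepted answer used: `+` is not defined for array columns, but `pyspark.sql.functions.concat` accepts array columns as of Spark 2.4. On older versions, a Python UDF that joins the lists is a possible fallback. The names `merge_lists` and `combine_arrays` are hypothetical helpers introduced here, and the column names are taken from the question. The two paths may treat null arrays differently (the UDF below treats a null as an empty list).

```python
def merge_lists(*lists):
    """Concatenate several Python lists in order; None is treated as empty."""
    merged = []
    for items in lists:
        merged.extend(items or [])
    return merged


def combine_arrays(df, use_builtin=True):
    """Return df with tokens and bigrams merged into tokens_bigrams.

    Assumes a DataFrame shaped like the one in the question:
    id: string, tokens: array<string>, bigrams: array<string>.
    """
    # Imported lazily so merge_lists can be checked without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    if use_builtin:
        # Spark >= 2.4: concat also accepts array columns.
        combined = F.concat(df.tokens, df.bigrams)
    else:
        # Fallback for older Spark versions: wrap the plain function as a UDF.
        merge_udf = F.udf(merge_lists, ArrayType(StringType()))
        combined = merge_udf(df.tokens, df.bigrams)

    return df.select("id", combined.alias("tokens_bigrams"))
```

With a DataFrame like the one above, `df2 = combine_arrays(df)` would be the call. The built-in path is usually preferable when available, since it avoids moving each row through a Python worker.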

python dataframe apache-spark apache-spark-sql pyspark

zem*_*eng

2019-01-07

12
Votes

2
Answers

20k
Views

Counter function on an array column in PySpark

Starting from this DataFrame:

+-----+-----------------+
|store|     values      |
+-----+-----------------+
|    1|[1, 2, 3,4, 5, 6]|
|    2|            [2,3]|
+-----+-----------------+


I want to apply the Counter function to get this:

+-----+------------------------------+
|store|     values                   |
+-----+------------------------------+
|    1|{1:1, 2:1, 3:1, 4:1, 5:1, 6:1}|
|    2|{2:1, 3:1}                    |
+-----+------------------------------+


I got this DataFrame using the answer to another question:

GroupBy and concat array columns pyspark

So I tried modifying the code from that answer in two ways (Option 1 and Option 2), but neither worked.

Does anyone know how I can do this? …
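
A minimal sketch of one possible approach, assuming `values` is an array of integers: wrap `collections.Counter` in a UDF whose declared return type is a map. The names `count_values` and `add_counts` are hypothetical, and the `MapType(IntegerType(), IntegerType())` return type is an assumption that would need to change (for example to `StringType()` keys) if the array held other element types.

```python
from collections import Counter


def count_values(values):
    """Count occurrences of each element; a null array gives an empty dict."""
    # Convert to a plain dict so Spark can serialize it as a map.
    return dict(Counter(values or []))


def add_counts(df):
    """Replace the `values` array column with a map of element -> count.

    Assumes df has a column `values` of type array<int>, as in the question.
    """
    # Imported lazily so count_values can be checked without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType, MapType

    counter_udf = F.udf(count_values, MapType(IntegerType(), IntegerType()))
    return df.withColumn("values", counter_udf(F.col("values")))
```

Keeping the counting logic in a plain function makes it easy to check on ordinary lists before running it on the cluster.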

counter apache-spark apache-spark-sql pyspark

Car*_*llo

2019-01-09

3
Votes

1
Answer

1534
Views