Related problems and solutions (0)

TypeError: Column is not iterable - How to iterate over ArrayType()?

Consider the following DataFrame:

+------+-----------------------+
|type  |names                  |
+------+-----------------------+
|person|[john, sam, jane]      |
|pet   |[whiskers, rover, fido]|
+------+-----------------------+

It can be created with the following code:

import pyspark.sql.functions as f
data = [
    ('person', ['john', 'sam', 'jane']),
    ('pet', ['whiskers', 'rover', 'fido'])
]

# sqlCtx is assumed to be an existing SQLContext (a SparkSession also works)
df = sqlCtx.createDataFrame(data, ["type", "names"])
df.show(truncate=False)

Is there a way to directly modify the ArrayType() column "names" by applying a function to each element, without using a udf?

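For example, suppose I want to apply the function foo to the "names" column. (I use str.upper as foo purely for illustration; my question is about any valid function that can be applied to the elements of an iterable.)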

foo = lambda x: x.upper()  # defining it as str.upper as an example
df.withColumn('X', [foo(x) for x in f.col("names")]).show()

TypeError: Column is not iterable

I could do this with a udf:

foo_udf = f.udf(lambda row: [foo(x) …
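
One udf-free route that may fit this question, assuming Spark 2.4 or later, is the SQL higher-order function transform, which applies a lambda to each array element inside the JVM. It only covers functions that can be written in Spark SQL (here upper), not arbitrary Python callables. The names elementwise_expr and with_upper_names are hypothetical helpers introduced for this sketch, not part of the original question.

```python
def elementwise_expr(column, sql_func):
    """Build a Spark SQL `transform` expression applying `sql_func` per element."""
    return "transform({col}, x -> {fn}(x))".format(col=column, fn=sql_func)


def with_upper_names(df):
    """Hypothetical helper: add column X holding the upper-cased `names` array."""
    # Imported here so the string helper above works without Spark installed.
    from pyspark.sql import functions as f

    return df.withColumn("X", f.expr(elementwise_expr("names", "upper")))


# Show the expression that would be handed to Spark.
print(elementwise_expr("names", "upper"))
```

Calling with_upper_names(df) on the DataFrame above should add a column X with the upper-cased names, without serializing rows to a Python worker.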

apache-spark pyspark spark-dataframe pyspark-sql

pau*_*ult

2018 03-30

9
votes

1
answer

4438
views

Remove an element from a Python list in a PySpark DataFrame

I am trying to remove an element from a Python list:

+---------------+
|        sources|
+---------------+
|           [62]|
|        [7, 32]|
|           [62]|
|   [18, 36, 62]|
|[7, 31, 36, 62]|
|    [7, 32, 62]|

I want to be able to remove the element rm from each of the lists in the column above. I wrote a function that does this for a list of lists:

def asdf(df, rm):
    # Build new inner lists rather than aliasing and mutating the input
    return [[x for x in row if x != rm] for row in df]

With rm = 1 it removes the element as expected:

a = [[1,2,3],[1,2,3,4],[1,2,3,4,5]]
In:  asdf(a,1)
Out: [[2, 3], [2, 3, 4], [2, 3, 4, 5]]

But I cannot get it to work on the DataFrame:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

asdfUDF = udf(asdf, ArrayType(IntegerType()))

In: df.withColumn("src_ex", asdfUDF("sources", 32))

Out: Py4JError: …
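
The error message is cut off, so the following is only a guess at the cause: a udf is called once per row with that row's array rather than with a list of lists, and every argument passed to it must be a column, so the literal 32 cannot be passed directly. A minimal sketch under those assumptions closes over rm instead. The names make_remover, remove_from_row and add_src_ex are hypothetical and do not come from the question.

```python
def make_remover(rm):
    """Return a per-row function that drops every occurrence of `rm`."""
    def remove_from_row(values):
        # A udf sees one row's array at a time; null arrays arrive as None.
        if values is None:
            return None
        return [x for x in values if x != rm]
    return remove_from_row


def add_src_ex(df, rm):
    """Hypothetical helper: add `src_ex` as `sources` with `rm` removed."""
    # Imported here so the pure-Python part runs without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, IntegerType

    remover_udf = udf(make_remover(rm), ArrayType(IntegerType()))
    return df.withColumn("src_ex", remover_udf("sources"))


# The per-row logic can be checked without a Spark session.
print(make_remover(32)([7, 32, 62]))  # prints [7, 62]
```

On Spark 2.4 or later, the built-in array_remove function in pyspark.sql.functions may make the udf unnecessary.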

python apache-spark apache-spark-sql pyspark pyspark-sql

use*_*916

2019 01-16

4
votes

1
answer

3982
views

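Finding and removing matching column values in pyspark

I have a pyspark DataFrame in which one column occasionally holds a bad value that matches another column. It looks like this: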

| Date         | Latitude      |
| 2017-01-01   | 43.4553       |
| 2017-01-02   | 42.9399       |
| 2017-01-03   | 43.0091       |
| 2017-01-04   | 2017-01-04    |

Clearly, the last Latitude value is incorrect. I need to remove any and all rows like this. I thought about using .isin(), but I cannot seem to get it to work. If I try