I have a Spark 1.5.0 DataFrame with a mix of null and empty strings in the same column. I want to convert all empty strings in all columns to null (None, in Python). The DataFrame may have hundreds of columns, so I am trying to avoid hard-coding the operation for each column.

See my attempt below, which results in an error.
from pyspark.sql import Row, SQLContext
sqlContext = SQLContext(sc)
## Create a test DataFrame
testDF = sqlContext.createDataFrame([Row(col1='foo', col2=1), Row(col1='', col2=2), Row(col1=None, col2='')])
testDF.show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo| 1|
## | | 2|
## |null|null|
## +----+----+
## Try to replace an empty string with None/null
testDF.replace('', None).show()
## ValueError: value should be a float, int, long, string, list, or tuple
## A string value of …

We are trying to use PySpark to filter out rows that contain an empty array in a field. Here is the schema of the DF:
root
|-- created_at: timestamp (nullable = true)
|-- screen_name: string (nullable = true)
|-- text: string (nullable = true)
|-- retweet_count: long (nullable = true)
|-- favorite_count: long (nullable = true)
|-- in_reply_to_status_id: long (nullable = true)
|-- in_reply_to_user_id: long (nullable = true)
|-- in_reply_to_screen_name: string (nullable = true)
|-- user_mentions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- id_str: string (nullable = true)
| …
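Both transformations asked about above can be sketched without a live SparkContext. As a minimal sketch (the PySpark expressions in the comments are my suggestion, not code from the original posts): the empty-string-to-null rewrite is typically built per column with `F.when(...).otherwise(...)` over `df.columns`, and the empty-array filter with `F.size(...)` (which returns -1 for null, so `== 0` matches only genuinely empty arrays). The plain-Python mirror below reproduces the same semantics so they can be checked locally.

```python
# Plain-Python mirror of the two PySpark transformations discussed above.
# The PySpark equivalents would look roughly like this (assumed, untested here):
#   from pyspark.sql import functions as F
#   exprs = [F.when(F.col(c) == '', None).otherwise(F.col(c)).alias(c)
#            for c in testDF.columns]              # '' -> null in every column
#   testDF.select(*exprs)
#   df.where(F.size(F.col('user_mentions')) == 0)  # keep rows with empty arrays

def blank_to_none(value):
    """Map the empty string to None; pass every other value through unchanged."""
    return None if value == '' else value

def is_empty_array(value):
    """True only for an actual empty list, not for null/None."""
    return value is not None and len(value) == 0

rows = [{'col1': 'foo', 'col2': 1},
        {'col1': '', 'col2': 2},
        {'col1': None, 'col2': ''}]

# Apply the per-column '' -> None rule to every field of every row.
cleaned = [{k: blank_to_none(v) for k, v in row.items()} for row in rows]
print(cleaned)
# [{'col1': 'foo', 'col2': 1}, {'col1': None, 'col2': 2}, {'col1': None, 'col2': None}]

tweets = [{'user_mentions': [{'id': 1}]},
          {'user_mentions': []},
          {'user_mentions': None}]
# Keep only rows whose array field is present but empty.
empty_mentions = [t for t in tweets if is_empty_array(t['user_mentions'])]
print(len(empty_mentions))
# 1
```

Note that `blank_to_none` deliberately tests `value == ''` rather than truthiness, so `0` and `None` pass through untouched, matching what a typed `when` expression would do column by column.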