PySpark: Removing UTF null characters from a PySpark DataFrame

Question

Ste*_*eve 5 python postgresql utf-8 apache-spark pyspark

I have a PySpark DataFrame similar to the following:

from pyspark.sql import Row

# sql_context is an existing SQLContext
df = sql_context.createDataFrame([
  Row(a=3, b=[4,5,6], c=[10,11,12], d='bar', e='utf friendly'),
  Row(a=2, b=[1,2,3], c=[7,8,9], d='foo', e=u'ab\u0000the')
  ])

One of the values in column e contains the UTF null character \u0000. If I try to load df into a PostgreSQL database, I get the following error:

ERROR: invalid byte sequence for encoding "UTF8": 0x00

That makes sense. How can I efficiently remove the null characters from the PySpark DataFrame before loading the data into Postgres?

I tried cleaning the data first with some of the functions in pyspark.sql.functions, without success. encode, decode, and regexp_replace did not work:

from pyspark.sql.functions import col, decode, encode, regexp_replace

df.select(regexp_replace(col('e'), u'\u0000', ''))
df.select(encode(col('e'), 'UTF-8'))
df.select(decode(col('e'), 'UTF-8'))

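Ideally, I would like to clean the entire DataFrame without specifying which columns to clean or what the offending characters are, since I won't necessarily know this ahead of time.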

I am using a Postgres 9.4.9 database with UTF8 encoding.

Answer 1

Ste*_*eve 2

Ah wait, I think I have it. If I do something like this, it seems to work:

null = u'\u0000'
new_df = df.withColumn('e', regexp_replace(df['e'], null, ''))

And then map over all the string columns:

string_columns = ['d', 'e']
new_df = df.select(
  *(regexp_replace(col(c), null, '').alias(c) if c in string_columns else c
    for c in df.columns)
  )
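
The question asked for a way to clean the DataFrame without naming columns by hand. A minimal sketch of one way to do that is below, assuming the string columns can be picked out from df.dtypes (a list of (name, type) pairs). The helper names string_column_names and strip_null_chars are hypothetical, not part of PySpark, and the sketch only touches top-level string columns, so strings nested inside arrays or structs are left alone.

```python
NULL_CHAR = u'\u0000'


def string_column_names(dtypes):
    """Return the names of columns whose Spark type is 'string'.

    `dtypes` is a list of (name, type) pairs, the shape of DataFrame.dtypes.
    (Hypothetical helper, not part of PySpark.)
    """
    return [name for name, dtype in dtypes if dtype == 'string']


def strip_null_chars(df):
    """Return `df` with NULL_CHAR removed from every top-level string column.

    (Hypothetical helper; uses the same regexp_replace call as the answer above.)
    """
    # Imported here so string_column_names can be used without a Spark install.
    from pyspark.sql.functions import col, regexp_replace

    targets = set(string_column_names(df.dtypes))
    return df.select(*[
        regexp_replace(col(c), NULL_CHAR, '').alias(c) if c in targets else col(c)
        for c in df.columns
    ])
```

With the df from the question, new_df = strip_null_chars(df) would rewrite columns d and e and pass a, b, and c through unchanged.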

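
As for why encode and decode did not help: a likely explanation is that U+0000 is a perfectly valid character in UTF-8 (it encodes to the single byte 0x00), so re-encoding the string cannot remove it. It is PostgreSQL's text types that refuse to store that byte. The plain-Python sketch below illustrates this outside Spark; the variable names are only for illustration.

```python
text = u'ab\u0000the'

# U+0000 is valid UTF-8 and encodes to the single byte 0x00.
encoded = text.encode('utf-8')

# Decoding gives back the original string, null character included.
round_tripped = encoded.decode('utf-8')

# Only an explicit replacement actually removes the character.
cleaned = text.replace(u'\u0000', '')
```

This is why the working approach replaces the character itself instead of changing the encoding.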
