Removing specific characters from text in Spark

Bak*_*war 3 python apache-spark

One of my Spark DataFrame columns contains some strange characters that I want to remove. When I select that column and call .show(), I see the following:


Dominant technology firm seeks ambitious, assertive, confident, headstrong salesperson to lead our organization into the next era! If you are ready to thrive in a highly competitive environment, this is the job for you. \xc2\xa5 Superior oral and written communication skills\xc2\xa5 Extensive experience with negotiating and closing sales \xc2\xa5 Outspoken \xc2\xa5 Thrives in competitive environment\xc2\xa5 Self-reliant and able to succeed in an independent setting \xc2\xa5 Manage portfolio of clients \xc2\xa5 Aggressively close sales to exceed quarterly quotas \xc2\xa5 Deliver expertise to clients as needed \xc2\xa5 Lead the company into new markets |


The character in question is \xc2\xa5.


I wrote the following code to remove it from the DataFrame's "description" column:

from pyspark.sql.functions import udf

charReplace=udf(lambda x: x.replace('\xc2\xa5',''))

train_cleaned=train_triLabel.withColumn('description',charReplace('description'))
train_cleaned.show(2,truncate=False)

However, it throws an error:

File "/Users/i854319/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/Users/i854319/spark/python/pyspark/sql/functions.py", line 1563, in <lambda>
    func = lambda _, it: map(lambda x: returnType.toInternal(f(*x)), it)
  File "<ipython-input-32-864efe6f3257>", line 3, in <lambda>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
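The traceback points at Python 2's implicit decoding: inside the UDF, `x` is a unicode string while `'\xc2\xa5'` is a byte string, so Python 2 silently tries to decode the bytes with the ASCII codec and fails on byte 0xc2. A minimal sketch of the same failure, written for Python 3 where the bytes/text distinction is explicit (variable names are illustrative):

```python
# '\xc2\xa5' is the UTF-8 byte sequence for the yen sign '¥'.
raw = b"\xc2\xa5"

# ASCII cannot represent byte 0xc2 -- this is the same failure Python 2
# hits when it implicitly decodes the byte-string argument to .replace().
try:
    raw.decode("ascii")
except UnicodeDecodeError as e:
    print(e)

# Decoding with the correct codec yields the single character.
print(raw.decode("utf-8"))  # ¥
```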

Yet when I try the same thing on a test string, replace() recognizes the character just fine:

s='hello \xc2\xa5'
print s
s.replace('\xc2\xa5','')

hello ¥
Out[37]:
'hello '

Any idea where I'm going wrong?


use*_*271 5

Use a Unicode literal:

charReplace = udf(lambda x: x.replace(u'\xc2\xa5',''))
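With the u'' prefix, both sides of the replace are unicode, so Python 2 never attempts the implicit ASCII decode. Note that in a unicode literal, \xc2\xa5 denotes the two code points U+00C2 U+00A5 ('Â¥'), which is exactly the mojibake that appears when UTF-8 bytes are read as Latin-1 -- and also what the DataFrame column actually contains here. The substitution can be checked outside Spark (a pure-Python sketch; the sample text is illustrative):

```python
# The replacement applied by the UDF, isolated for testing.
# u"\xc2\xa5" is the two-character sequence 'Â¥' (Latin-1 mojibake
# of the UTF-8 bytes 0xC2 0xA5), not the raw bytes themselves.
char_replace = lambda x: x.replace(u"\xc2\xa5", "")

text = u"Superior communication skills\xc2\xa5 Outspoken"
print(char_replace(text))  # Superior communication skills Outspoken
```

In practice the UDF can be avoided entirely: pyspark.sql.functions.regexp_replace(column, pattern, replacement) performs the same cleanup inside the JVM, without serializing each row out to a Python worker.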