There are some strange characters in one of my Spark DataFrame columns, and I want to remove them. When I select that column and call .show(), I see the following:
    Dominant technology firm seeks ambitious, assertive, confident, headstrong salesperson to lead our organization into the next era! If you are ready to thrive in a highly competitive environment, this is the job for you. \xc2\xa5 Superior oral and written communication skills\xc2\xa5 Extensive experience with negotiating and closing sales \xc2\xa5 Outspoken \xc2\xa5 Thrives in competitive environment\xc2\xa5 Self-reliant and able to succeed in an independent setting \xc2\xa5 Manage portfolio of clients \xc2\xa5 Aggressively close sales to exceed quarterly quotas \xc2\xa5 Deliver expertise to clients as needed \xc2\xa5 Lead the company into new markets |
The characters in question are \xc2\xa5.

I wrote the following code to remove them from the DataFrame's 'description' column:
    from pyspark.sql.functions import udf

    charReplace = udf(lambda x: x.replace('\xc2\xa5', ''))

    train_cleaned = train_triLabel.withColumn('description', charReplace('description'))
    train_cleaned.show(2, truncate=False)

However, it throws an error:
    File "/Users/i854319/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
        vs = list(itertools.islice(iterator, batch))
      File "/Users/i854319/spark/python/pyspark/sql/functions.py", line 1563, in <lambda>
        func = lambda _, it: map(lambda x: returnType.toInternal(f(*x)), it)
      File "<ipython-input-32-864efe6f3257>", line 3, in <lambda>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

Yet when I run the same replacement on a plain test string, replace recognizes the character just fine:
    s = 'hello \xc2\xa5'
    print s
    s.replace('\xc2\xa5', '')

    hello \xc2\xa5
    Out[37]: 'hello '

Any idea where I'm going wrong?
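The failure can be reproduced outside Spark. In Python 2, PySpark hands column values to a UDF as unicode objects, and calling .replace() on a unicode string with a byte-string argument makes Python implicitly decode that argument with the ASCII codec, which fails on the byte 0xc2. A minimal sketch of the same failure:

    # Python 2 sketch: the column value arrives as unicode, so the
    # byte-string argument must first be decoded (with the ASCII
    # codec) before the replace can run -- and that decode fails.
    s = u'hello \xc2\xa5'         # unicode, as Spark supplies it to the UDF
    s.replace('\xc2\xa5', '')     # UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0

The test string above works only because it is a byte string, so no implicit decoding takes place.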
Use a Unicode literal:
    charReplace = udf(lambda x: x.replace(u'\xc2\xa5', ''))
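As an aside, the same cleanup can be done without a Python UDF by using Spark's built-in regexp_replace column function, which keeps the work inside the JVM. A minimal sketch, reusing the DataFrame and column names from the question:

    from pyspark.sql.functions import regexp_replace

    # Built-in column function: no per-row Python round-trip needed.
    train_cleaned = train_triLabel.withColumn(
        'description', regexp_replace('description', u'\xc2\xa5', ''))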