PySpark: how do I remove punctuation and convert to lowercase in an RDD?

Question

mel*_*lik 5 lowercase punctuation pyspark

I want to remove punctuation and convert the letters to lowercase in an RDD. Here is my dataset:

l = sc.parallelize(["How are you", "Hello\\ then% you",
                    "I think he's fine+ COMING"])

I tried the following function, but I get an error message:

punc='!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

def lower_clean_str(x):
    lowercased_str = x.lower()
    clean_str = lowercased_str.translate(punc) 
    return clean_str

one_RDD = l.flatMap(lambda x: lower_clean_str(x).split())
one_RDD.collect()

But this gives me an error. What could be the problem, and how can I fix it? Thanks.

Answer 1

Gau*_*hah 6

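You are using Python's translate function incorrectly: it expects a translation table, not a string of characters to remove. Since I'm not sure whether you are on Python 2.7 or Python 3, I suggest an alternative approach.

The translate function changed in Python 3. In Python 2, str.translate takes a 256-character table plus an optional deletechars argument; in Python 3 it takes a single mapping, usually built with str.maketrans.

The following code works regardless of the Python version.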

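def lower_clean_str(x):
    # Same character set as string.punctuation
    punc = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    lowercased_str = x.lower()
    # Strip each punctuation character in turn; works on Python 2 and 3
    for ch in punc:
        lowercased_str = lowercased_str.replace(ch, '')
    return lowercased_str

# Assumes an existing SparkContext named sc
l = sc.parallelize(["How are you", "Hello\\ then% you", "I think he's fine+ COMING"])
one_RDD = l.map(lower_clean_str)
one_RDD.collect()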

Output:

['how are you', 'hello then you', 'i think hes fine coming']
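
If you know you are on Python 3, translate itself can still do the job. The following is only a minimal sketch under that assumption (it will not run on Python 2). The module-level name _PUNCT_TABLE and the local list samples are introduced here for illustration and are not part of the original post. The table is built with str.maketrans, whose third argument lists the characters to delete, and string.punctuation holds the same characters as the punc string used above.

```python
import string

# Python 3 only: a table mapping every ASCII punctuation character to None,
# which tells str.translate to delete it. _PUNCT_TABLE is a hypothetical name.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def lower_clean_str(x):
    """Lowercase x and drop ASCII punctuation in a single pass."""
    return x.lower().translate(_PUNCT_TABLE)


# Plain-Python check on the question's sample strings (no Spark needed here).
samples = ["How are you", "Hello\\ then% you", "I think he's fine+ COMING"]
cleaned = [lower_clean_str(s) for s in samples]
print(cleaned)
```

With Spark, the same function can be passed to l.map(lower_clean_str) to get one cleaned string per record, or to l.flatMap(lambda x: lower_clean_str(x).split()) as in the question to get individual words.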

Archived:	7 years, 3 months ago
Views:	11062
Last activity:	7 years, 2 months ago