Remove punctuation from text in Scala - Spark

Roz*_*ita 7 regex scala punctuation apache-spark

Here is a sample of my data:

case time (especially it's purse), read manual care, follow care instructions make stays waterproof -- example, inspect rubber seals doors (especially battery/memory card door open time) 
xm "life support" picture . flip part bit flimsy guessing won't long . sound great altec speaker dock it! chance back base (xm3020) . traveling bag connect laptop extra speaker . amount paid ($25).

I want to remove all punctuation except the dot (.), and also remove words of length <= 2. For example, my expected output is:

case time especially its purse read manual care follow care instructions . make stays waterproof example inspect rubber seals doors especially batterymemory card door open time
life support picture . flip part bit flimsy guessing wont long . sound great altec speaker dock chance back base xm3020 . traveling bag connect laptop extra speaker . amount paid $25 .

This should be implemented in Scala. I tried:

replaceAll( """\\W\s""", "")
replaceAll(""""[^a-zA-Z\.]""", "")

But it doesn't work well. Can anyone help me?

Rég*_*les 23

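Looking at the regex javadoc (http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html), we see that the character class for punctuation is \p{Punct}, and that we can subtract characters from a character class using an intersection such as [a-z&&[^def]]. From there, it is easy to define a regex that removes all punctuation except the dot: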

s.replaceAll("""[\p{Punct}&&[^.]]""", "")
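
As a quick check of the pattern above, here is a minimal sketch applied to a fragment of the sample data. The object and method names (PunctuationDemo, stripPunctuationExceptDot) are made up for illustration; only the regex comes from the answer.

```scala
object PunctuationDemo {
  // Removes every character in \p{Punct} except the dot.
  // Hypothetical helper name; the pattern is the one shown above.
  def stripPunctuationExceptDot(s: String): String =
    s.replaceAll("""[\p{Punct}&&[^.]]""", "")

  def main(args: Array[String]): Unit = {
    val sample = """xm "life support" picture . amount paid ($25)."""
    // prints: xm life support picture . amount paid 25.
    println(stripPunctuationExceptDot(sample))
  }
}
```

Note that the dollar sign disappears along with the quotes and parentheses, while the dots stay.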

Removing words of size <= 2 can be done like this:

s.replaceAll("""\b\p{IsLetter}{1,2}\b""", "")
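
To see what this pattern does and does not touch, here is a small sketch (ShortWordsDemo and dropShortWords are illustrative names, not part of the answer). Because \b only fires between a word character and a non-word character, a token such as xm3020 is left alone: the letters xm are followed directly by a digit, so there is no boundary after them.

```scala
object ShortWordsDemo {
  // Drops standalone runs of one or two letters; the spaces around them stay.
  // Hypothetical helper name; the pattern is the one shown above.
  def dropShortWords(s: String): String =
    s.replaceAll("""\b\p{IsLetter}{1,2}\b""", "")

  def main(args: Array[String]): Unit = {
    // "of", "it" and "by" are removed, leaving the spaces that separated them.
    println(dropShortWords("read the manual of it by hand"))
    // The standalone "xm" is removed, "xm3020" is kept.
    println(dropShortWords("xm life xm3020"))
  }
}
```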

Combining the two gives:

s.replaceAll("""([\p{Punct}&&[^.]]|\b\p{IsLetter}{1,2}\b)\s*""", "")

Note how I added \s* to remove the redundant whitespace.
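
One thing worth checking with the combined pattern is how it treats contractions. The apostrophe is not a word character, so \b sees the t in won't as a standalone one-letter word and removes it together with the space that follows. The sketch below compares the single pass with a two-pass variant that strips punctuation first. The two-pass version is my assumption about what the expected output in the question calls for, and the names CombinedDemo, cleanSinglePass and cleanTwoPass are illustrative only.

```scala
object CombinedDemo {
  // Single pass: the combined regex from the answer.
  def cleanSinglePass(s: String): String =
    s.replaceAll("""([\p{Punct}&&[^.]]|\b\p{IsLetter}{1,2}\b)\s*""", "")

  // Two passes (assumed variant): strip punctuation first, so that "won't"
  // has already become "wont" when short words are dropped.
  def cleanTwoPass(s: String): String = {
    val noPunct = s.replaceAll("""[\p{Punct}&&[^.]]""", "")
    noPunct.replaceAll("""\b\p{IsLetter}{1,2}\b\s*""", "")
  }

  def main(args: Array[String]): Unit = {
    val text = "guessing won't long . dock it! chance"
    println(cleanSinglePass(text)) // guessing wonlong . dock chance
    println(cleanTwoPass(text))    // guessing wont long . dock chance
  }
}
```

In a Spark job, either function could be passed to a map over the lines, since both are plain String => String functions.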

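Also, you can see that the regex above removes '$' entirely, since it counts as a punctuation character (under the POSIX-style \p{Punct} class that Java uses by default; Unicode itself classifies '$' as a currency symbol). If this is not desired (as your expected output seems to indicate), define more precisely what you consider punctuation. For example, you might want to treat only the following characters as punctuation: ?.!:()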

s.replaceAll("""([?!:()]|\b\p{IsLetter}{1,2}\b)\s*""", "")

Alternatively, you can add '$' along with the dot to your list of "not-punctuation" characters:

s.replaceAll("""([\p{Punct}&&[^.$]]|\b\p{IsLetter}{1,2}\b)\s*""", "")
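
A short sketch of this last variant on the price fragment from the sample data (KeepDollarDemo and clean are illustrative names). Inside a character class '$' is a literal, and in a plain triple-quoted Scala string it needs no escaping.

```scala
object KeepDollarDemo {
  // Same combined regex as above, with '$' excluded from the punctuation class.
  def clean(s: String): String =
    s.replaceAll("""([\p{Punct}&&[^.$]]|\b\p{IsLetter}{1,2}\b)\s*""", "")

  def main(args: Array[String]): Unit = {
    println(clean("amount paid ($25).")) // amount paid $25.
  }
}
```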