Roz*_*ita 7 regex scala punctuation apache-spark
Here is a sample of my data:
case time (especially it's purse), read manual care, follow care instructions make stays waterproof -- example, inspect rubber seals doors (especially battery/memory card door open time)
xm "life support" picture . flip part bit flimsy guessing won't long . sound great altec speaker dock it! chance back base (xm3020) . traveling bag connect laptop extra speaker . amount paid ($25).
I want to remove all punctuation except the dot (.), and also remove words of length <= 2. For example, my expected output is:
case time especially its purse read manual care follow care instructions . make stays waterproof example inspect rubber seals doors especially batterymemory card door open time
life support picture . flip part bit flimsy guessing wont long . sound great altec speaker dock chance back base xm3020 . traveling bag connect laptop extra speaker . amount paid $25 .
This should be implemented in Scala. I tried:
replaceAll( """\\W\s""", "")
replaceAll(""""[^a-zA-Z\.]""", "")
But it does not work well. Can anyone help me?
Rég*_*les 23
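Looking at the regex javadoc (http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html), we see that the character class for punctuation is \p{Punct}, and that we can subtract characters from a character class with something like [a-z&&[^def]]. From there, it is easy to define a regex that removes all punctuation except the dot: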
s.replaceAll("""[\p{Punct}&&[^.]]""", "")
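As a quick check of the class-subtraction idea, here is a minimal sketch. The object and method names (`PunctDemo`, `stripPunctExceptDot`) are made up for illustration and are not part of the original answer.

```scala
object PunctDemo {
  // [\p{Punct}&&[^.]] reads as "punctuation AND not a dot":
  // the intersection keeps every \p{Punct} character except '.'.
  def stripPunctExceptDot(s: String): String =
    s.replaceAll("""[\p{Punct}&&[^.]]""", "")

  def main(args: Array[String]): Unit = {
    // prints: wont long . amount paid 25.
    println(stripPunctExceptDot("won't long . amount paid ($25)."))
  }
}
```

Note that the apostrophe, the parentheses and the `$` all disappear, while both dots survive.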
Removing words of size <= 2 can be done like this:
s.replaceAll("""\b\p{IsLetter}{1,2}\b""", "")
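A small sketch of how the short-word pattern behaves, using a hypothetical helper name (`dropShortWords`). Two details are worth seeing: tokens that mix letters and digits, such as `xm3020`, are left alone because there is no word boundary between `m` and `3`, and the spaces around a removed word are left behind.

```scala
object ShortWords {
  // Matches a run of one or two letters that is delimited by word
  // boundaries on both sides, and replaces it with nothing.
  def dropShortWords(s: String): String =
    s.replaceAll("""\b\p{IsLetter}{1,2}\b""", "")

  def main(args: Array[String]): Unit = {
    // "xm" and "it" are removed, "xm3020" is kept;
    // brackets make the leftover spaces visible.
    println("[" + dropShortWords("xm xm3020 it") + "]")
  }
}
```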
Combining the two gives:
s.replaceAll("""([\p{Punct}&&[^.]]|\b\p{IsLetter}{1,2}\b)\s*""", "")
Note how I added \s* to remove the redundant whitespace.
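One thing worth checking with a single combined pattern is how it treats contractions. The apostrophe is a non-word character, so `\b` sees the "it" and the "s" of "it's" as two separate short words and the whole token is removed, whereas the expected output in the question keeps "its". The following is a sketch of a two-pass alternative, not the answer's own method: it strips punctuation first and removes short words afterwards. The names `TwoPassClean`, `clean` and `singlePass` are assumptions made for this example.

```scala
object TwoPassClean {
  // Pass 1: drop punctuation except the dot, so "it's" becomes "its".
  // Pass 2: drop words of one or two letters.
  // Pass 3: collapse the whitespace left behind and trim the ends.
  def clean(s: String): String =
    s.replaceAll("""[\p{Punct}&&[^.]]""", "")
      .replaceAll("""\b\p{IsLetter}{1,2}\b""", "")
      .replaceAll("""\s+""", " ")
      .trim

  // The single-pass pattern, kept here only for comparison.
  def singlePass(s: String): String =
    s.replaceAll("""([\p{Punct}&&[^.]]|\b\p{IsLetter}{1,2}\b)\s*""", "")

  def main(args: Array[String]): Unit = {
    println(clean("it's purse"))      // prints: its purse
    println(singlePass("it's purse")) // prints: purse
  }
}
```

Whether "its" should survive depends on which behaviour you want; the two-pass version matches the expected output shown in the question.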
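Also, you can see that the regex above removes '$' entirely, because it counts as a punctuation character (Java's \p{Punct} is the POSIX class, which includes '$'). If this is undesirable (as your expected output seems to indicate), think more precisely about what you consider punctuation. For example, you might want to treat only the characters ?.!:() as punctuation: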
s.replaceAll("""([?!:()]|\b\p{IsLetter}{1,2}\b)\s*""", "")
Alternatively, you can add '$' alongside the dot to your list of "not-punctuation" characters:
s.replaceAll("""([\p{Punct}&&[^.$]]|\b\p{IsLetter}{1,2}\b)\s*""", "")
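To see the `$`-preserving variant applied to several lines at once, here is a minimal sketch over a plain `Seq`. Since the question is tagged apache-spark, the same function could presumably be passed to `map` on an RDD or Dataset, but that part is an assumption and is not shown. The names `CleanLines` and `clean` are hypothetical.

```scala
object CleanLines {
  // Removes punctuation other than '.' and '$', and words of one or
  // two letters, together with any whitespace that follows a match.
  def clean(line: String): String =
    line.replaceAll("""([\p{Punct}&&[^.$]]|\b\p{IsLetter}{1,2}\b)\s*""", "")

  def main(args: Array[String]): Unit = {
    val lines = Seq("amount paid ($25).", "sound great altec speaker dock it!")
    // The first line prints as: amount paid $25.
    lines.map(clean).foreach(println)
  }
}
```

Inside a character class `$` is a literal, so no escaping is needed in the pattern.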