the*_*ide 8 regex unicode twitter r substitution
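I have a list of tweets, many of which contain emoji that need to be removed. What is the most efficient way to do this in R?
I tried the following, which should replace every word starting with "\" with an empty string, but I get this error: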
some_tweets <- gsub("\\\w+ *", "", some_tweets)
Error: '\w' is an unrecognized escape in character string starting ""\\\w"
Here is a sample of the data:
> head(some_tweets)
[1] "??? ???? ??????? ????? \U0001f625\U0001f625\U0001f625"
[2] "?????? ?????????? \U0001f913\U0001f913\U0001f913"
[3] "???? ????? ?????? ??????? \U0001f602\U0001f602\U0001f602\U0001f602"
[4] "???"
[5] "RT : ?????????? ??????.. ~ ?????"
[6] "?????? ??????? ???? ??????????? ????????? ????????? ??????? ????????? \U0001f608\U0001f608\U0001f608"
> dput(head(some_tweets))
c("??? ???? ??????? ????? \U0001f625\U0001f625\U0001f625",
"?????? ?????????? \U0001f913\U0001f913\U0001f913",
"???? ????? ?????? ??????? \U0001f602\U0001f602\U0001f602\U0001f602",
"???", "RT : ?????????? ??????.. ~ ?????",
"?????? ??????? ???? ??????????? ????????? ????????? ??????? ????????? \U0001f608\U0001f608\U0001f608"
)
ali*_*ire 11
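See the Unicode page on regular-expressions.info, which explains Unicode in regular expressions thoroughly. The important part here is that you can match Unicode characters with \p{xx}, where xx is the name of the category they belong to (for example, L for letters, M for marks). Here, your emoji appear to be in the So (short for Other_Symbol) and Cn (short for Unassigned) categories, so we can strip them out with: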
gsub('\\p{So}|\\p{Cn}', '', some_tweets, perl = TRUE)
## [1] "??? ???? ??????? ????? "
## [2] "?????? ?????????? "
## [3] "???? ????? ?????? ??????? "
## [4] "???"
## [5] "RT : ?????????? ??????.. ~ ?????"
## [6] "?????? ??????? ???? ??????????? ????????? ????????? ??????? ????????? "
Note that you need perl = TRUE, because this notation is not enabled in R's default POSIX 1003.2 regular expressions; see ?base::regex and ?grep.
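For anyone who wants to try this without the original data, here is a minimal self-contained sketch. The `tweets` vector is made-up sample input, not the asker's data, and the `trimws()` step is an optional addition that tidies the whitespace left behind. The `nchar()` call shows that each `\U0001f625` is one character rather than a backslash followed by letters. That is likely why a pattern aimed at a literal "\" would not have matched the emoji even with its escaping fixed.

```r
# Hypothetical sample data: plain text plus emoji written as \U escapes.
tweets <- c("so tired \U0001f625\U0001f625", "lol \U0001f602", "no emoji here")

# Each emoji is a single character in the string, not "\" plus letters.
nchar("\U0001f625")
# 1

# Remove characters in the So (Other_Symbol) and Cn (Unassigned) categories.
# perl = TRUE is required for the \p{..} notation.
cleaned <- gsub("\\p{So}|\\p{Cn}", "", tweets, perl = TRUE)

# Optional extra step: drop the leading/trailing whitespace left behind.
cleaned <- trimws(cleaned)

print(cleaned)
# "so tired" "lol" "no emoji here"
```

Keep in mind that So covers more than emoji (for example, the copyright sign © is also in that category), so this may remove other symbols you wanted to keep.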