Remove emoji from strings in R

the*_*ide 8 regex unicode twitter r substitution

I have a list of tweets, many of which contain emoji that need to be removed. What is the most efficient way to do this in R?

I tried the following, which should replace every word starting with "\" with an empty string, but I get this error:

some_tweets <- gsub("\\\w+ *", "", some_tweets)
Error: '\w' is an unrecognized escape in character string starting ""\\\w"

Here is a sample of the data:

> head(some_tweets)
[1] "??? ???? ??????? ????? \U0001f625\U0001f625\U0001f625"                               
[2] "?????? ?????????? \U0001f913\U0001f913\U0001f913"                                  
[3] "???? ????? ?????? ??????? \U0001f602\U0001f602\U0001f602\U0001f602"                        
[4] "???"                                                                           
[5] "RT : ?????????? ??????.. ~ ?????"                                                      
[6] "?????? ??????? ???? ??????????? ????????? ????????? ??????? ????????? \U0001f608\U0001f608\U0001f608"


> dput(head(some_tweets))
c("??? ???? ??????? ????? \U0001f625\U0001f625\U0001f625", 
"?????? ?????????? \U0001f913\U0001f913\U0001f913", 
"???? ????? ?????? ??????? \U0001f602\U0001f602\U0001f602\U0001f602", 
"???", "RT : ?????????? ??????.. ~ ?????", 
"?????? ??????? ???? ??????????? ????????? ????????? ??????? ????????? \U0001f608\U0001f608\U0001f608"
)

ali*_*ire 11

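See the Unicode page on regular-expressions.info, which explains Unicode in regular expressions thoroughly. The important part here is that you can match Unicode characters with \p{xx}, where xx is the name of the class they belong to (e.g., L for letters, M for marks). Here, your emoji appear to be in the So (short for Other_Symbol) and Cn (short for Unassigned) classes, so we can strip them out with: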

gsub('\\p{So}|\\p{Cn}', '', some_tweets, perl = TRUE)
## [1] "??? ???? ??????? ????? "                                       
## [2] "?????? ?????????? "                                           
## [3] "???? ????? ?????? ??????? "                                       
## [4] "???"                                                        
## [5] "RT : ?????????? ??????.. ~ ?????"                               
## [6] "?????? ??????? ???? ??????????? ????????? ????????? ??????? ????????? "

Note that you need to set perl = TRUE, because this notation is not enabled in R's default POSIX 1003.2 regular expressions; see ?base::regex and ?grep.
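
As a minimal, self-contained sketch of the same approach: the sample vector below is made up for illustration and stands in for some_tweets, and the trimws() step is an addition of this sketch rather than part of the answer above. It folds the two classes into a single character class and then trims the spaces left behind once the emoji are gone:

```r
# Hypothetical sample data; the ASCII text stands in for the original tweets.
tweets <- c("so sad \U0001f625\U0001f625", "plain text", "lol \U0001f602")

# Each emoji is a single character, not a backslash followed by letters,
# so a pattern that looks for a literal "\" has nothing to match.
nchar("\U0001f625")

# \p{So} = Other_Symbol, \p{Cn} = Unassigned. This is PCRE syntax,
# so perl = TRUE is required.
cleaned <- gsub("[\\p{So}\\p{Cn}]", "", tweets, perl = TRUE)

# Removing the emoji leaves trailing spaces; trimws() drops them.
cleaned <- trimws(cleaned)
print(cleaned)
```

Whether a given emoji is reported as So or Cn likely depends on the Unicode tables of the PCRE library that R was built against, which is presumably why both classes are needed. Note that So also covers non-emoji symbols such as ©, so this may remove more than emoji.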