How can I write a custom removePunctuation() function to better handle Unicode characters?

knb*_*knb 7 unicode r text-mining tm

In the source code of the tm text-mining R package, in the file transform.R, there is a removePunctuation() function, currently defined as:

function(x, preserve_intra_word_dashes = FALSE)
{
    if (!preserve_intra_word_dashes)
        gsub("[[:punct:]]+", "", x)
    else {
        # Assume there are no ASCII 1 characters.
        x <- gsub("(\\w)-(\\w)", "\\1\1\\2", x)
        x <- gsub("[[:punct:]]+", "", x)
        gsub("\1", "-", x, fixed = TRUE)
    }
}
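
For reference, here is a quick illustration of what the built-in transformer does (a small sketch; the example string is mine, the function is tm's exported removePunctuation()):

library(tm)

# Default: every ASCII punctuation character is dropped
removePunctuation("state-of-the-art (really)!")
# [1] "stateoftheart really"

# With preserve_intra_word_dashes = TRUE, dashes inside words survive
removePunctuation("state-of-the-art (really)!", preserve_intra_word_dashes = TRUE)
# [1] "state-of-the-art really"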

I need to parse and mine some abstracts from a scientific conference (retrieved from their website as UTF-8). The abstracts contain some Unicode characters that need to be removed, especially at word boundaries. There is the usual ASCII punctuation, but also Unicode dashes, Unicode quotation marks, mathematical symbols...

There are also URLs in the text, whose punctuation characters need to be preserved. tm's built-in removePunctuation() is too aggressive for this.

So I need a custom removePunctuation() function that removes punctuation according to my requirements.
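
To make the requirement concrete, here is a rough sketch of the behaviour I am after (strip_punct_keep_urls and the URL test are only illustrative, nothing from tm): split on whitespace, leave tokens that look like URLs alone, and strip Unicode punctuation and symbols from everything else.

# Illustrative sketch only: keep URL-like tokens, strip Unicode punctuation/symbols elsewhere
strip_punct_keep_urls <- function(x) {
  vapply(strsplit(x, "\\s+"), function(tokens) {
    cleaned <- ifelse(grepl("^https?://", tokens),
                      tokens,                                          # URL token: keep as-is
                      gsub("[\\p{P}\\p{S}]+", "", tokens, perl = TRUE))
    paste(cleaned[cleaned != ""], collapse = " ")                      # drop tokens that were pure punctuation
  }, character(1))
}

strip_punct_keep_urls("Results (p ± 0.05) – see http://example.org/x?y=1")
# [1] "Results p 005 see http://example.org/x?y=1"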

My custom Unicode-aware function currently looks like this, but it does not work as expected. I use R only rarely, so getting things done in R takes me some time, even for the simplest tasks.

My function:

corpus <- tm_map(corpus, rmPunc =  function(x){ 
# lookbehinds 
# need to be careful to specify fixed-width conditions 
# so that it can be used in lookbehind

x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{5})([[:alnum:]])'," \\2", x, perl=TRUE) ;
x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{4})([[:alnum:]])'," \\2", x, perl=TRUE) ;
x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{3})([[:alnum:]])'," \\2", x, perl=TRUE) ;
x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{2})([[:alnum:]])'," \\2", x, perl=TRUE) ;
x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>])([[:alnum:]])'," \\2", x, perl=TRUE) ; 
# lookaheads (can use variable-width conditions) 
x <- gsub('(.*?)(?=[[:alnum:]])([[:punct:]’“”:±]+)$',"\1 ", x, perl=TRUE) ;

# remove all strings that consist *only* of punct chars 
gsub('^[[:punct:]’“”:±</>]+$',"", x, perl=TRUE) ;

})

It does not work as expected. As far as I can tell, it does nothing at all. The punctuation marks are still in the term-document matrix, see:

 head(Terms(tdm), n=30)

  [1] "<></>"                      "---"                       
  [3] "--,"                        ":</>"                      
  [5] ":()"                        "/)."                       
  [7] "/++"                        "/++,"                      
  [9] "..,"                        "..."                       
 [11] "...,"                       "..)"                       
 [13] "“”,"                        "(|)"                       
 [15] "(/)"                        "(.."                       
 [17] "(..,"                       "()=(|=)."                  
 [19] "(),"                        "()."                       
 [21] "(&)"                        "++,"                       
 [23] "(0°"                        "0.001),"                   
 [25] "0.003"                      "=0.005)"                   
 [27] "0.006"                      "=0.007)"                   
 [29] "000km"                      "0.01)" 
...

So my questions are:

  1. Why does this call to my function(){} not have the desired effect? How can my function be improved?
  2. Are Unicode regular expression classes like \P{ASCII} or \P{PUNCT} supported in R's Perl-compatible regexes? I think they are not fully supported (by default); the PCRE notes say: "Only the support for various Unicode properties with \p is incomplete, though the most important ones are supported." (A quick check is sketched below.)
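
A quick way I could check this myself (a sketch; the sample string is made up, \u escapes are used only to keep the snippet ASCII-safe):

# If \p{...} works with perl = TRUE (PCRE), this should replace the en dash,
# the curly quotes, the plus-minus sign and the ellipsis with spaces:
x <- "dash \u2013 quotes \u201cquoted\u201d math \u00b1 end\u2026"
gsub("[\\p{P}\\p{S}]+", " ", x, perl = TRUE)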

Joc*_*hen 2

Much as I like Susana's answer, it breaks the corpus in newer versions of tm (the documents are no longer PlainTextDocuments and the meta data gets broken).


You will get a list and the following error:

Error in UseMethod("meta", x) :
  no applicable method for 'meta' applied to an object of class "character"

Using

tm_map(your_corpus, PlainTextDocument)

will give you your corpus back, but with $meta broken (in particular, the document IDs will be lost).


Solution


Use a content transformer:

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
your_corpus <- tm_map(your_corpus, toSpace, "„")  # "„" is the Unicode double low quotation mark (U+201E)

Source: Hands-On Data Science with R, Text Mining, Graham.Williams@togaware.com, http://onepager.togaware.com/
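
The same pattern can also carry the custom cleaning from the question: wrap the whole punctuation-stripping function in content_transformer() so that tm_map() keeps the PlainTextDocuments and their meta data intact (a sketch; the name rmPunc and the \p{...} classes are my illustrative choices, not code from the question):

library(tm)

# Sketch: a Unicode-aware punctuation remover wrapped as a content transformer
rmPunc <- content_transformer(function(x) {
  gsub("[\\p{P}\\p{S}]+", " ", x, perl = TRUE)  # Unicode punctuation and symbols
})
your_corpus <- tm_map(your_corpus, rmPunc)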


Update


This function removes everything that is not alphanumeric or whitespace (UTF-8 emoticons and the like):

removeNonAlnum <- function(x) {
  # keep only alphanumeric characters and whitespace
  gsub("[^[:alnum:][:space:]]", "", x)
}
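
To apply it without breaking the corpus, it would be wrapped the same way as above (a sketch, assuming the your_corpus object from the earlier snippets):

your_corpus <- tm_map(your_corpus, content_transformer(removeNonAlnum))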