knb 7 unicode r text-mining tm
In the source code of the tm text-mining R package, in the file transform.R, there is a removePunctuation() function, currently defined as:
    function(x, preserve_intra_word_dashes = FALSE)
    {
        if (!preserve_intra_word_dashes)
            gsub("[[:punct:]]+", "", x)
        else {
            # Assume there are no ASCII 1 characters.
            x <- gsub("(\\w)-(\\w)", "\\1\1\\2", x)
            x <- gsub("[[:punct:]]+", "", x)
            gsub("\1", "-", x, fixed = TRUE)
        }
    }
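I need to parse and mine a number of abstracts from a scientific conference (fetched from their website as UTF-8). The abstracts contain some Unicode characters that need to be removed, especially at word boundaries. There are the usual ASCII punctuation characters, but also a few Unicode dashes, Unicode quotes, math symbols...
There are also URLs in the text, and there the intra-word punctuation characters need to be preserved. tm's built-in removePunctuation() function is too aggressive.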
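To make "too aggressive" concrete, here is a minimal base-R sketch that copies the definition quoted above into a stand-alone function and applies it to a URL. The name `remove_punct_like_tm` and the sample string are assumptions made for illustration; they are not part of tm.

```r
# Stand-alone copy of the definition quoted above, renamed (hypothetical name)
# so that it does not mask tm's own removePunctuation().
remove_punct_like_tm <- function(x, preserve_intra_word_dashes = FALSE) {
  if (!preserve_intra_word_dashes) {
    gsub("[[:punct:]]+", "", x)
  } else {
    # Protect intra-word dashes with an ASCII 1 placeholder, strip, restore.
    x <- gsub("(\\w)-(\\w)", "\\1\1\\2", x)
    x <- gsub("[[:punct:]]+", "", x)
    gsub("\1", "-", x, fixed = TRUE)
  }
}

url_text <- "see http://example.org/a-b"  # made-up sample input

remove_punct_like_tm(url_text)
# [1] "see httpexampleorgab"
remove_punct_like_tm(url_text, preserve_intra_word_dashes = TRUE)
# [1] "see httpexampleorga-b"
```

Either way the URL is flattened into a single unusable token; only the intra-word dash survives in the second call.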
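So I need a custom removePunctuation() function that removes punctuation according to my requirements.
My custom Unicode function currently looks like this, but it does not work as expected. I rarely use R, so getting things done in R takes some time, even for the simplest tasks.
My function: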
    corpus <- tm_map(corpus, rmPunc = function(x){
        # lookbehinds
        # need to be careful to specify fixed-width conditions
        # so that it can be used in lookbehind
        x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{5})([[:alnum:]])'," \\2", x, perl=TRUE) ;
        x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{4})([[:alnum:]])'," \\2", x, perl=TRUE) ;
        x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{3})([[:alnum:]])'," \\2", x, perl=TRUE) ;
        x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{2})([[:alnum:]])'," \\2", x, perl=TRUE) ;
        x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>])([[:alnum:]])'," \\2", x, perl=TRUE) ;
        # lookaheads (can use variable-width conditions)
        x <- gsub('(.*?)(?=[[:alnum:]])([[:punct:]’“”:±]+)$',"\1 ", x, perl=TRUE) ;
        # remove all strings that consist *only* of punct chars
        gsub('^[[:punct:]’“”:±</>]+$',"", x, perl=TRUE) ;
    })
It does not work as expected. I think it does nothing at all. The punctuation is still in the term-document matrix, see:
    head(Terms(tdm), n=30)
     [1] "<></>"      "---"
     [3] "--,"        ":</>"
     [5] ":()"        "/)."
     [7] "/++"        "/++,"
     [9] "..,"        "..."
    [11] "...,"       "..)"
    [13] "“”,"        "(|)"
    [15] "(/)"        "(.."
    [17] "(..,"       "()=(|=)."
    [19] "(),"        "()."
    [21] "(&)"        "++,"
    [23] "(0°"        "0.001),"
    [25] "0.003"      "=0.005)"
    [27] "0.006"      "=0.007)"
    [29] "000km"      "0.01)"
    ...
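So my question is:
Are \P{ASCII} or \P{PUNCT} supported in Perl-compatible regular expressions in R? I think they are not (by default). PCRE: "Only the support for various Unicode properties with \p is incomplete, though the most important ones are supported."

As much as I like Susana's answer, it breaks the corpus in newer versions of tm (the documents are no longer PlainTextDocument objects, and the metadata is destroyed).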

You will get a list and the following error:

    Error in UseMethod("meta", x) :
      no applicable method for 'meta' applied to an object of class "character"

Using

    tm_map(your_corpus, PlainTextDocument)

gives you your corpus back, but with $meta broken (in particular, the document IDs will be missing).

Solution

Use content_transformer:

    toSpace <- content_transformer(function(x, pattern)
        gsub(pattern, " ", x))
    your_corpus <- tm_map(your_corpus, toSpace, "„")

Source: Hands-On Data Science with R, Text Mining, Graham.Williams@togaware.com http://onepager.togaware.com/
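
The following sketch may help to check the claim that content_transformer keeps the documents and their metadata intact. It assumes tm 0.6 or later is installed; the two sample documents and the variable names `docs`, `ids_before` and `ids_after` are invented for the example.

```r
library(tm)  # assumption: tm >= 0.6, where content_transformer() is available

# Two made-up documents; the first contains the quotes U+201E and U+201C.
docs <- c("a \u201equoted\u201c word", "plain text")
corpus <- VCorpus(VectorSource(docs))

# Record the document IDs before transforming.
ids_before <- vapply(seq_len(length(corpus)),
                     function(i) meta(corpus[[i]], "id"), character(1))

# Wrap gsub() so that only the content is changed, not the document object.
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "[\u201e\u201c]")

ids_after <- vapply(seq_len(length(corpus)),
                    function(i) meta(corpus[[i]], "id"), character(1))

identical(ids_before, ids_after)              # IDs should be unchanged
inherits(corpus[[1]], "PlainTextDocument")    # class should be unchanged
```

If both checks come back TRUE, the transformation did not degrade the documents to plain character vectors, which is the failure described above.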

This function removes everything that is neither alphanumeric nor whitespace (such as UTF-8 emoji):

    removeNonAlnum <- function(x){
        # a second "^" inside the brackets would be a literal caret, so it is omitted
        gsub("[^[:alnum:][:space:]]", "", x)
    }
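
On the question about Unicode property classes: as far as I know, PCRE spells them as general categories, for example \p{P} for punctuation and \p{S} for symbols (which covers ±, < and >), and they only work with perl=TRUE on a PCRE build that has Unicode property support (pcre_config() should report this). Below is a minimal base-R sketch under those assumptions. The helper name `strip_edge_punct`, the URL test and the sample sentence are hypothetical, not taken from tm or from the answers above.

```r
# Hypothetical helper: strips punctuation and symbols from the edges of each
# whitespace-separated token, leaves URL-like tokens untouched, and drops
# tokens that consisted only of punctuation.
strip_edge_punct <- function(x) {
  edge <- "^[\\p{P}\\p{S}]+|[\\p{P}\\p{S}]+$"
  vapply(x, function(s) {
    tokens <- strsplit(s, "\\s+", perl = TRUE)[[1]]
    # Assumption: a token is a URL if it starts with one of these schemes.
    is_url <- grepl("^(https?|ftp)://", tokens, perl = TRUE)
    tokens[!is_url] <- gsub(edge, "", tokens[!is_url], perl = TRUE)
    paste(tokens[nzchar(tokens)], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

# Made-up sample with Unicode quotes, a math symbol, a URL and a dash run.
x <- "\u201cMean \u00b1 SD\u201d, see http://example.org/a-b --- (p<0.01)."
strip_edge_punct(x)
# [1] "Mean SD see http://example.org/a-b p<0.01"
```

To use it on a corpus without breaking the metadata, it could be wrapped as tm_map(your_corpus, content_transformer(strip_edge_punct)). Note that punctuation glued to the end of a URL, such as a closing parenthesis, is kept by this sketch.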