R tm package invalid input in 'utf8towcs'

Question


I am trying to use the tm package in R to perform some text analysis. I tried the following:

require(tm)
dataSet <- Corpus(DirSource('tmp/'))
dataSet <- tm_map(dataSet, tolower)
Error in FUN(X[[6L]], ...) : invalid input 'RT @noXforU Erneut riesiger (Alt-)?lteppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp' in 'utf8towcs'


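The problem is that some characters are not valid. I would like to exclude the invalid characters from the analysis, either from within R or before importing the files for processing.

I tried using iconv to convert all files to UTF-8 and to exclude anything that cannot be converted, as follows: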

find . -type f -exec iconv -t utf-8 "{}" -c -o tmpConverted/"{}" \;


as pointed out here: Batch convert latin-1 files to utf-8 using iconv

But I still get the same error.

I would appreciate any help.

Answer 1

小智 57

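None of the above answers worked for me. The only way I found to work around this problem was to remove all non-graphical characters (http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html).

The code is this simple: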

library(stringr)  # provides str_replace_all()
usableText <- str_replace_all(tweets$text, "[^[:graph:]]", " ")


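This should be marked as the solution. It works and has been popular for years, but the OP did not stick around to mark it as correct. (2 upvotes)
As an alternative using base R, you could try: `usableText <- iconv(tweets$text, "ASCII", "UTF-8", sub="")` (2 upvotes)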
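
To make the effect of the `[^[:graph:]]` pattern concrete, here is a minimal base R sketch. The input string is made up for illustration and contains only ASCII control characters; it uses `gsub` rather than `str_replace_all`, which is a swapped-in technique and not what the answer ran. How `[:graph:]` classifies non-ASCII characters may depend on the locale and the regex engine, so results on real tweets can differ.

```r
# Hypothetical input: a tab, a newline and a control character (\001)
# mixed into otherwise plain ASCII text.
x <- "RT @user\toil\nspill\001!"

# Replace every character that is not a visible (graphical) character
# with a space. Ordinary spaces are also non-graphical, so they are
# "replaced" by a space as well and stay in place.
usable <- gsub("[^[:graph:]]", " ", x)

usable
# [1] "RT @user oil spill !"
```

Because the replacement is a space rather than an empty string, word boundaries are preserved for later tokenisation.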

Answer 2

小智 24

This is from the tm FAQ:

It will replace the non-convertible bytes in yourCorpus with strings showing their hex codes.

I hope this helps.

tm_map(yourCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))


http://tm.r-forge.r-project.org/faq.html
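
As a rough illustration of what `sub = "byte"` does, the following base R sketch works on a plain character vector instead of a corpus. The sample string and the explicit `from = "UTF-8", to = "UTF-8"` arguments are assumptions made for this example (the FAQ snippet relies on the native encoding instead), and `validUTF8()` requires R 3.3.0 or later.

```r
# Hypothetical input: a Latin-1 byte (0xD6, "Ö") inside text that is
# otherwise treated as UTF-8.
x <- "Erneut riesiger (Alt-)\xd6lteppich"

validUTF8(x)
# [1] FALSE

# Replace each non-convertible byte with its hex code in angle brackets.
marked <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "byte")
marked
# [1] "Erneut riesiger (Alt-)<d6>lteppich"

# Alternatively, drop the offending bytes altogether.
dropped <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "")

# Both cleaned strings are valid UTF-8 and can be lower-cased.
tolower(marked)
tolower(dropped)
```

Keeping the hex codes makes it possible to see afterwards which bytes were invalid, at the cost of leaving tokens such as `<d6>` in the text.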

Answer 3

Sau*_*dav 13

I think it is clear by now that the problem is caused by emojis that tolower is not able to understand

#to remove emojis
dataSet <- iconv(dataSet, 'UTF-8', 'ASCII')

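
One caveat worth sketching: when `iconv` is called without a `sub` argument, any element that contains a non-convertible character comes back as `NA` rather than with just the emoji removed. The example below uses a made-up character vector, not a corpus, to show the difference.

```r
# Hypothetical input: one plain ASCII string and one containing an emoji.
x <- c("plain ascii", "oil spill \U0001F622")

# Without sub, the whole second element becomes NA.
strict <- iconv(x, from = "UTF-8", to = "ASCII")
is.na(strict)
# [1] FALSE  TRUE

# With sub = "", only the non-convertible bytes are dropped.
lenient <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
lenient
# [1] "plain ascii" "oil spill "
```

If losing whole documents to `NA` is not acceptable, passing `sub = ""` (or `sub = "byte"`) is probably the safer choice.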

Answer 4

Ken*_*ton 10

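I have just run into this problem. By any chance, are you using a machine running OS X? I seem to have traced the problem to the definition of the character set that R is compiled against on this operating system (see https://stat.ethz.ch/pipermail/r-sig-mac/2012-July/009374.html)

What I was seeing is that using the solution from the FAQ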

tm_map(yourCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))


was giving me this warning:

Warning message:
it is not known that wchar_t is Unicode on this platform


I traced this to the enc2utf8 function. The bad news is that this is a problem with my underlying OS, not with R.

So here is what I did as a workaround:

tm_map(yourCorpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))


This forces iconv to use the UTF-8 encoding on the Macintosh and works fine without the need to recompile.

Answer 5

Jac*_*lis 7

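I run into this problem often, and this Stack Overflow post is always what comes up first. I have used the top solution before, but it can strip out characters and replace them with garbage (for example, converting it’s to itâ€™s).

I have found that there is actually a much better solution! If you install the stringi package, you can replace tolower() with stri_trans_tolower(), and then everything should work fine.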

Answer 6

Ben*_*Ben 2

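This is a common problem with the tm package (1, 2, 3).

One non-R way to fix it is to use a text editor to find and replace all the fancy characters (i.e. those with diacritics) in the text before loading it into R (or to use gsub in R). For example, you would search for and replace all instances of the O-umlaut in Öl-Teppich. Others have had success with this (I have too), but if you have thousands of individual text files, obviously this is no good.

For an R solution, I found that using VectorSource instead of DirSource seems to solve the problem: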

# I put your example text in a file and tested it with both ANSI and
# UTF-8 encodings, both enabled me to reproduce your problem
#
tmp <- Corpus(DirSource('C:\\...\\tmp/'))
tmp <- tm_map(tmp, tolower)
Error in FUN(X[[1L]], ...) : 
  invalid input 'RT @noXforU Erneut riesiger (Alt-)Öl–teppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp' in 'utf8towcs'
# quite similar error to what you got, both from ANSI and UTF-8 encodings
#
# Now try VectorSource instead of DirSource
tmp <- readLines('C:\\...\\tmp.txt')
tmp
[1] "RT @noXforU Erneut riesiger (Alt-)Öl–teppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp"
# looks ok so far
tmp <- Corpus(VectorSource(tmp))
tmp <- tm_map(tmp, tolower)
tmp[[1]]
rt @noxforu erneut riesiger (alt-)öl–teppich im golf von mexiko (#pics vom freitag) http://bit.ly/bw1hvu http://bit.ly/9r7jcf #oilspill #bp
# seems like it's worked just fine. It worked best for ANSI encoding.
# There was no error with UTF-8 encoding, but the Ö was returned
# as ã– which is not good


But this seems like a bit of a lucky coincidence. There must be a more direct way to do it. Do let us know what works for you!

Archived:	13 years, 11 months ago
Viewed:	43176 times
Last activity:	7 years, 9 months ago