Removing strings with a regex produces a special character: â

n1k*_*1t4 0 r

Short version:

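I have a very large number of .txt files that end up with unwanted â characters scattered throughout after I use a regex to remove URLs and the whitespace that follows them. I need to remove all of these characters from all of the files.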

These â characters are not present before the files are cleaned; they are produced by the cleaning itself.

Long version:

I found a regex that works on my text and removes the URLs. First, my cleaning process (the commented-out lines are other things I tried):

clean_file <-  sapply(curr_file, function(x) {
    gsub("&amp;", "&", x) %>%
        gsub("http\\S+\\s*", "", .) %>%
        gsub("[^[:alpha:][:space:]&']", "", .) %>%
        #gsub("[^[:alnum:][:space:]\\'-]", "", .) %>%
        stripWhitespace() %>%
        gsub("^ ", "", .) %>%
        gsub(" $", "", .)
        #gsub("â", "", .)
})

Sample input text (each line is one string):

Gluskin’s Rosenberg: Don’t Bet on a Bear Market for Treasurys -  Rising Treasury yields?... http://j.mp/UVM31t   #FederalReserve
Jacquiline Chabolla liked Capital Preservation In a Secular Bear Market: Large investment asset losses can be… http://goo.gl/fb/cgzGv 
Thank You http://pages.townhall.com/campaign/will-2013-be-a-bull-or-bear-market …  via @townhallcom
Calif. GHG cap-and-trade: a bull or a bear market? http://bit.ly/VG9DTr 

Unfortunately they do not show up here, but the text above also contains some non-standard characters, namely \302. R sees them like this:

> x = _                                   <-- appears as an underscore in my text editor
Error: object '\302' not found

This may be because they come from shift+space, as suggested here, but they are an artifact of my data, so I need to remove them; I cannot prevent them.

Resulting output (as seen in the saved .txt file):

Gluskinâs Rosenberg Donât Bet on a Bear Market for Treasurys - Rising Treasury yields FederalReserve
Jacquiline Chabolla liked Capital Preservation In a Secular Bear Market Large investment asset losses can beâ
Thank You â via townhallcom
Calif GHG cap-and-trade a bull or a bear market

Output as seen in the R console:

> head(clean_file)
      ..text                                                                                                        
[1,] "Nice bear market rally for the Lakers NBA"                                                                    
[2,] "Commented on StockTwits your scenario is entirely possible and as long as SPX doesn't exceed the bear market" 
[3,] "Gluskin\342s Rosenberg Don\342t Bet on a Bear Market for Treasurys Rising Treasury yields FederalReserve"           
[4,] "Jacquiline Chabolla liked Capital Preservation In a Secular Bear Market Large investment asset losses can be\342"
[5,] "Thank You \342 via townhallcom"
[6,] "Calif GHG capandtrade a bull or a bear market"
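One plausible reading of the \342 in the console output, offered as a sketch rather than a confirmed diagnosis: the typographic apostrophe (U+2019) and the ellipsis (U+2026) in the input are each three bytes in UTF-8, and the first byte of both is octal 342 (hex E2), which is â in Latin-1. If the character class is applied byte by byte in a single-byte locale, [:alpha:] would keep that first byte as a letter and drop the other two. The variable names below are illustrative only.

```r
# Typographic apostrophe and ellipsis, as they occur in the sample input.
apostrophe <- enc2utf8("\u2019")
ellipsis   <- enc2utf8("\u2026")

# Inspect the UTF-8 byte sequences as integers.
apostrophe_bytes <- as.integer(charToRaw(apostrophe))
ellipsis_bytes   <- as.integer(charToRaw(ellipsis))

# Show the bytes in octal, the notation the R console uses for escapes.
print(format(as.octmode(apostrophe_bytes)))  # "342" "200" "231"
print(format(as.octmode(ellipsis_bytes)))    # "342" "200" "246"
```

Both sequences start with the same byte, which would match the single \342 left behind where each character used to be.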

Before I came to see this as an encoding problem, I tried simply replacing the character, which fails:

gsub("â", "", myText)

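I have tried some other solutions for changing the encoding of the files (found in the solutions here). I tried forcing the output encoding when writing the file with fileEncoding = 'ascii' instead of the default utf-8 (I believe), but ascii only gave me warnings and truncated many lines, leaving some completely empty. In addition, there did not seem to be any correlation between the lines that were removed and the places where the â character had appeared earlier.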

Is there anything I can try to prevent these characters from being created when writing the files in the future?

G. *_*eck 5

This keeps only the characters from hex 00 to hex 7f, where Lines is a character vector whose components are the lines of the file:

gsub("[^\\x{00}-\\x{7f}]", "", Lines, perl = TRUE)
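As a hedged illustration of how this could slot into the cleaning steps from the question: the sketch below applies the ASCII filter first, so that multi-byte characters are dropped whole before the URL and punctuation steps run. The two sample strings are shortened from the question's input, and the names ascii_only, no_urls and cleaned are made up for this example. It assumes the lines are UTF-8 encoded.

```r
# Two shortened sample lines containing a typographic apostrophe and an ellipsis.
Lines <- enc2utf8(c(
  "Gluskin\u2019s Rosenberg: Don\u2019t Bet on a Bear Market",
  "asset losses can be\u2026 http://goo.gl/fb/cgzGv "
))

# Step 1: drop every character outside the ASCII range (hex 00 to hex 7f).
ascii_only <- gsub("[^\\x{00}-\\x{7f}]", "", Lines, perl = TRUE)

# Step 2: remove URLs and the whitespace that follows them, as in the question.
no_urls <- gsub("http\\S+\\s*", "", ascii_only)

# Step 3: keep only letters, whitespace, & and ', then trim the ends.
cleaned <- trimws(gsub("[^[:alpha:][:space:]&']", "", no_urls))

print(cleaned)
```

Running the ASCII filter before the [:alpha:] step means no stray first byte of a multi-byte character can survive as a letter.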