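I am working with a Twitter dataset in R, and I am having trouble removing usernames from the tweets.
Here is an example of the tweets in the tweet column of my dataset: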
[1] "@danimottale: 2 bad our inalienable rights offend their sensitivities. U cannot reason with obtuse zealotry. // So very well said."
[2] "@FreeMktMonkey @drleegross Want to build HSA throughout lifetime for when older thus need HDHP not to deplete it if ill before 65y/o.thanks"
I want to remove/replace every word that starts with "@" to get this output:
[1] "2 bad our inalienable rights offend their sensitivities. U cannot reason with obtuse zealotry. // So very well said."
[2] "Want to build HSA throughout lifetime for when older thus need HDHP …
I know how to remove punctuation while keeping apostrophes, on its own:
gsub( "[^[:alnum:]']", " ", db$text )
Or how to keep intra-word dashes using the tm package:
removePunctuation(db$text, preserve_intra_word_dashes = TRUE)
But I cannot find a way to do both at the same time. For example, if my original sentence is:
"Interested in energy/the environment/etc.? Congrats to our new e-board! Ben, Nathan, Jenny, and Adam, y'all are sure to lead the club in a great direction next year! #obama #swag"
I want it to become:
"Interested in energy the environment etc Congrats to our new e-board Ben Nathan Jenny and Adam y'all are sure to lead the club in a great direction next year obama swag"
Of course there will be extra whitespace, but I can remove that later.
I would greatly appreciate your help.
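One possible way to combine the two requirements, offered as a sketch rather than a definitive answer, is to run two gsub() passes in base R. The first pass replaces everything except letters, digits, whitespace, apostrophes and dashes. The second pass uses lookarounds to replace any dash that does not sit between two alphanumeric characters. The helper names clean_punct and squish are hypothetical, and the sketch assumes plain ASCII text, since the lookaround classes with perl = TRUE may not treat accented letters as alphanumeric.

```r
# Hypothetical helper: keep apostrophes and intra-word dashes,
# replace every other punctuation character with a space.
clean_punct <- function(x) {
  # Pass 1: replace anything that is not a letter, digit, whitespace,
  # apostrophe or dash (the trailing "-" in the class is a literal dash).
  x <- gsub("[^[:alnum:][:space:]'-]", " ", x)
  # Pass 2: replace dashes that are not between two alphanumerics,
  # i.e. a dash with no alphanumeric before it or none after it.
  gsub("(?<![[:alnum:]])-|-(?![[:alnum:]])", " ", x, perl = TRUE)
}

# Hypothetical helper: collapse runs of whitespace and trim the ends.
squish <- function(x) trimws(gsub("[[:space:]]+", " ", x))

sentence <- "Interested in energy/the environment/etc.? Congrats to our new e-board! Ben, Nathan, Jenny, and Adam, y'all are sure to lead the club in a great direction next year! #obama #swag"
cleaned <- squish(clean_punct(sentence))
print(cleaned)
```

Because gsub() is vectorised, the same helper could be applied to a whole column such as db$text. Replacing with a space instead of an empty string keeps words like "energy/the" from being glued together.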
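I asked a similar question before, but this one is more specific and needs a different solution from the ones given earlier, so I hope it is OK to post it. I need to keep only apostrophes and intra-word dashes in my text (removing all other punctuation). For example, I want to get str2 from str1: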
str1 <- "I'm dash before word -word, dash &%$,. in-between word, two before word --word just dashes ------, between words word - word"
str2 <- "I'm dash before word word dash in-between word two before word word just dashes between words word word"
My solution so far first removes the dashes between words:
gsub(" - ", " ", str1)
Then it keeps letters and digits plus the remaining dashes:
gsub("[^[:alnum:]['-]", " ", str1)
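The problem is that it does not remove dashes that are next to each other, such as "--", or dashes at the beginning or end of a word, such as "-word" or "word-".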
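A token-based approach may sidestep the problem of leading, trailing and repeated dashes. The sketch below is one possible way to do it, not the only one: split the string on whitespace, drop every character that is not alphanumeric, an apostrophe or a dash, then trim dashes from both ends of each token and discard tokens that end up empty. The function name clean_tokens is hypothetical. One assumption to note: punctuation that joins two words without a space (for example "a/b") would be deleted and the words glued together, so this fits str1 but may need adjusting for other input.

```r
str1 <- "I'm dash before word -word, dash &%$,. in-between word, two before word --word just dashes ------, between words word - word"

# Hypothetical helper: works on a single string.
clean_tokens <- function(x) {
  # Split into whitespace-separated tokens.
  tokens <- strsplit(x, "[[:space:]]+")[[1]]
  # Drop everything except letters, digits, apostrophes and dashes.
  tokens <- gsub("[^[:alnum:]'-]", "", tokens)
  # Trim dashes at the start or end of each token; a token made only
  # of dashes becomes empty.
  tokens <- gsub("^-+|-+$", "", tokens)
  # Discard empty tokens and rejoin with single spaces.
  paste(tokens[nzchar(tokens)], collapse = " ")
}

result <- clean_tokens(str1)
print(result)
```

Since empty tokens are dropped before rejoining, this version leaves no extra whitespace to clean up afterwards.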