R Programming: need a unique string-based solution for splitting a large text

Question

I have a text file with sample text like the following:

"venezuela probes ex-oil czar ramirez over alleged graft scheme
caracas/houston (reuters) - venezuela is investigating rafael ramirez, a 
once powerful oil minister and former head of state oil company pdvsa, in 
connection with an alleged $4.8 billion vienna-based corruption scheme, the 
state prosecutor's office announced on friday.


5.5 hours ago
— reuters


amazon ordered not to pull in customers who can't spell `birkenstock'
a german court has ordered amazon not to lure internet shoppers to its 
online marketplace when they mistakenly search for "brikenstock", 
"birkenstok", "bierkenstock" and other variations in google.


6 hours ago
— business standard"


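What I need in R is to separate these two pieces of text.

The first piece of text should go into the variable text1, and the second into text2.

Keep in mind that the file contains many paragraphs like these. The solution must work for 100,000 texts.

The only thing I think could be used as a delimiter is " - ", but splitting on it loses the source information such as "reuters" or "business standard". I need that too.

Do you know how to accomplish this in R?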

Answer 1

42-*_*42- 6

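Read the text in from the file with readLines, then split on a shifted cumulative sum (cumsum) of the occurrences of the special dash that precedes the publisher: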

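 Lines <- readLines("Lines.txt")  # read from a file in the working directory (see getwd())
 # Start a new group on the line after each source line containing the special dash
 split(Lines, cumsum(c(0, head(grepl("—", Lines),-1))) )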
#--------------
$`0`
[1] "venezuela probes ex-oil czar ramirez over alleged graft scheme"              
[2] "caracas/houston (reuters) - venezuela is investigating rafael ramirez, a "   
[3] "once powerful oil minister and former head of state oil company pdvsa, in "  
[4] "connection with an alleged $4.8 billion vienna-based corruption scheme, the "
[5] "state prosecutor's office announced on friday."                              
[6] "5.5 hours ago"                                                               
[7] "— reuters"                                                                   

$`1`
[1] "amazon ordered not to pull in customers who can't spell `birkenstock'"  
[2] "a german court has ordered amazon not to lure internet shoppers to its "
[3] "online marketplace when they mistakenly search for \"brikenstock\", "   
[4] "\"birkenstok\", \"bierkenstock\" and other variations in google."       
[5] "6 hours ago"                                                            
[6] "— business standard'"

Run Code Online (Sandbox Code Playgroud)
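To see why the shifted cumulative sum works, it may help to look at the grouping vector on its own. The sketch below uses a small made-up vector (`x`, `is_source` and `group` are illustrative names, not part of the asker's file) and writes the special dash as the escape `\u2014`, so the example does not depend on how the script file is encoded.

```r
# Toy input: two records, each closed by a "<dash> source" line (made-up data)
x <- c("headline one", "body one", "\u2014 reuters",
       "headline two", "body two", "\u2014 business standard")

# TRUE on the line that closes each record
is_source <- grepl("\u2014", x, fixed = TRUE)

# Shift the flags right by one position so the source line stays in the
# record it closes, then count how many source lines came before each line.
group <- cumsum(c(0, head(is_source, -1)))
print(group)   # 0 0 0 1 1 1

# One list element per record, named "0", "1", ...
records <- split(x, group)
print(records)
```

Without the shift, `cumsum(is_source)` would increment on the source line itself, so each publisher would be attached to the following article instead of its own.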

Note that this is not the regular " - ". It is the longer dash "—". ~~Also note that readLines omits blank lines by default.~~
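Since the question also asks to keep the source next to each text, one possible extension is to collapse every record into a row of a data frame. This is only a sketch under a few assumptions: `split_articles` is a hypothetical helper name, every record ends with a line that starts with the long dash (written here as `\u2014`), blank lines are dropped explicitly, and the "hours ago" line is left inside the text.

```r
# Hypothetical helper: one row per article, with its text and its source.
split_articles <- function(lines) {
  # Drop empty or whitespace-only lines
  lines <- lines[nzchar(trimws(lines))]

  # A line starting with the long dash closes a record
  is_source <- grepl("^\u2014", lines)
  group <- cumsum(c(0, head(is_source, -1)))
  records <- split(lines, group)

  data.frame(
    # Everything except the source line, joined into one string
    text = vapply(records, function(r) {
      paste(r[!grepl("^\u2014", r)], collapse = " ")
    }, character(1), USE.NAMES = FALSE),
    # The source line with the dash and following spaces removed;
    # NA if a trailing record has no source line
    source = vapply(records, function(r) {
      src <- r[grepl("^\u2014", r)]
      if (length(src) > 0) sub("^\u2014\\s*", "", src[1]) else NA_character_
    }, character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Made-up input that mimics the layout of the file in the question
toy <- c("headline one", "body one", "", "5.5 hours ago", "\u2014 reuters", "",
         "headline two", "body two", "6 hours ago", "\u2014 business standard")
articles <- split_articles(toy)
print(articles)
```

On the real file this could be called as `split_articles(readLines("Lines.txt", encoding = "UTF-8"))`, assuming the file is saved as UTF-8. Each text and its publisher are then available as `articles$text[i]` and `articles$source[i]` rather than as separate text1, text2, ... variables, which should be easier to manage for 100,000 records.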
