I have a text file with sample text like the following:
"venezuela probes ex-oil czar ramirez over alleged graft scheme
caracas/houston (reuters) - venezuela is investigating rafael ramirez, a
once powerful oil minister and former head of state oil company pdvsa, in
connection with an alleged $4.8 billion vienna-based corruption scheme, the
state prosecutor's office announced on friday.
5.5 hours ago
— reuters
amazon ordered not to pull in customers who can't spell `birkenstock'
a german court has ordered amazon not to lure internet shoppers to its
online marketplace when they mistakenly search for "brikenstock",
"birkenstok", "bierkenstock" and other variations in google.
6 hours ago
— business standard"
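What I need in R is to separate these two pieces of text.
The first piece should be assigned to a variable text1 and the second to a variable text2.
Keep in mind that the file contains many passages like these. The solution must work for 100,000 texts.
The only thing I think I can use as a separator is the " - ", but if I split on it I lose the source information such as "reuters" or "business standard". I need that too.
Do you know how to do this in R?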
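Read the text from the file with readLines, then split on the cumulative sum of a shifted version of the occurrences of that special dash in the publisher line: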
Lines <- readLines("Lines.txt")  # file in the working directory, see getwd()
# start a new group right after each line that contains the special dash
split(Lines, cumsum(c(0, head(grepl("—", Lines), -1))))
#--------------
$`0`
[1] "venezuela probes ex-oil czar ramirez over alleged graft scheme"
[2] "caracas/houston (reuters) - venezuela is investigating rafael ramirez, a "
[3] "once powerful oil minister and former head of state oil company pdvsa, in "
[4] "connection with an alleged $4.8 billion vienna-based corruption scheme, the "
[5] "state prosecutor's office announced on friday."
[6] "5.5 hours ago"
[7] "— reuters"
$`1`
[1] "amazon ordered not to pull in customers who can't spell `birkenstock'"
[2] "a german court has ordered amazon not to lure internet shoppers to its "
[3] "online marketplace when they mistakenly search for \"brikenstock\", "
[4] "\"birkenstok\", \"bierkenstock\" and other variations in google."
[5] "6 hours ago"
[6] "— business standard"
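That is not a regular hyphen "-". It is an em dash "—" (U+2014). Also note that readLines returns blank lines as empty strings rather than dropping them, so if the file has blank lines between articles, filter them out first (for example with Lines[nzchar(Lines)]) before splitting.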
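To keep the publisher alongside each article, the split result can be collapsed into one row per article. The following is only a sketch under some assumptions: the sample lines are shortened stand-ins for the file contents, the publisher line is assumed to always start with the em dash (U+2014), and the names is_source, group, articles and result are illustrative choices, not part of the original answer.

```r
# Shortened sample lines standing in for readLines("Lines.txt").
# "\u2014" is the em dash that opens each publisher line.
Lines <- c(
  "venezuela probes ex-oil czar ramirez over alleged graft scheme",
  "state prosecutor's office announced on friday.",
  "5.5 hours ago",
  "\u2014 reuters",
  "",
  "amazon ordered not to pull in customers who can't spell `birkenstock'",
  "6 hours ago",
  "\u2014 business standard"
)

# Drop blank lines, if the file has any.
Lines <- Lines[nzchar(Lines)]

# TRUE on each publisher line (assumed to start with the em dash).
is_source <- grepl("^\u2014", Lines)

# Shift the flags by one so the publisher line stays in the group it closes.
group <- cumsum(c(0, head(is_source, -1)))

articles <- split(Lines, group)

# One row per article: everything before the last line as text,
# and the last line (minus the dash) as the source.
result <- data.frame(
  text   = vapply(articles, function(x) paste(head(x, -1), collapse = " "),
                  character(1)),
  source = vapply(articles, function(x) sub("^\u2014\\s*", "", tail(x, 1)),
                  character(1)),
  stringsAsFactors = FALSE
)
```

With 100,000 articles, a list or data frame like this is probably easier to work with than separate text1, text2, ... variables; a single article is then available as result$text[1] and its publisher as result$source[1].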