Luc*_*hon 7 string text-processing r data.table
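This is my first question on SO, so please tell me if it can be improved. I am working on a natural language processing project in R and am trying to build a data.table containing test cases. Here is a simplified example: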
library(data.table)

texts.dt <- data.table(string = c("one",
"two words",
"three words here",
"four useless words here",
"five useless meaningless words here",
"six useless meaningless words here just",
"seven useless meaningless words here just to",
"eigth useless meaningless words here just to fill",
"nine useless meaningless words here just to fill up",
"ten useless meaningless words here just to fill up space"),
word.count = 1:10,
stop.at.word = c(0, 1, 2, 2, 4, 3, 3, 6, 7, 5))
This returns the data.table we will be working with:
string word.count stop.at.word
1: one 1 0
2: two words 2 1
3: three words here 3 2
4: four useless words here 4 2
5: five useless meaningless words here 5 4
6: six useless meaningless words here just 6 3
7: seven useless meaningless words here just to 7 3
8: eigth useless meaningless words here just to fill 8 6
9: nine useless meaningless words here just to fill up 9 7
10: ten useless meaningless words here just to fill up space 10 5
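In the real application, the values in the stop.at.word column are determined randomly (with an upper bound of word.count - 1). Also, the strings are not sorted by length, but that should not make a difference.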
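The code should add two columns, input and output, where input contains the substring from the first word up to word number stop.at.word, and output contains the word that follows (a single word), like this: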
>desired_result
string word.count stop.at.word input
1: one 1 0
2: two words 2 1 two
3: three words here 3 2 three words
4: four useless words here 4 2 four useless
5: five useless meaningless words here 5 4 five useless meaningless words
6: six useless meaningless words here just 6 3 six useless meaningless
7: seven useless meaningless words here just to 7 3 seven useless meaningless
8: eigth useless meaningless words here just to fill 8 6 eigth useless meaningless words here just
9: nine useless meaningless words here just to fill up 9 7 nine useless meaningless words here just to
10: ten useless meaningless words here just to fill up space 10 5 ten useless meaningless words here
output
1:
2: words
3: here
4: words
5: here
6: words
7: words
8: to
9: fill
10: just
Unfortunately, what I get is:
string word.count stop.at.word input output
1: one 1 0
2: two words 2 1 NA NA
3: three words here 3 2 NA NA
4: four useless words here 4 2 NA NA
5: five useless meaningless words here 5 4 NA NA
6: six useless meaningless words here just 6 3 NA NA
7: seven useless meaningless words here just to 7 3 NA NA
8: eigth useless meaningless words here just to fill 8 6 NA NA
9: nine useless meaningless words here just to fill up 9 7 NA NA
10: ten useless meaningless words here just to fill up space 10 5 ten NA
Note the inconsistent results: row 1 contains an empty string and row 10 returns "ten".
Here is the code I am using:
texts.dt[, c("input", "output") := .(
substr(string,
1,
sapply(gregexpr(" ", string),"[", stop.at.word) - 1),
substr(string,
sapply(gregexpr(" ", string),"[", stop.at.word),
sapply(gregexpr(" ", string),"[", stop.at.word + 1) - 1)
)]
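I ran many tests, and the substr instruction works fine when I try it on individual strings in the console, but it fails when applied to the data.table. I suspect I am missing something related to scoping in data.table, but I have not been using the package for long, so I am quite confused.
I would greatly appreciate some help. Thanks in advance!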
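One plausible reading of the NA values, sketched on a three-row subset rather than offered as a confirmed diagnosis: `sapply(x, "[", stop.at.word)` hands the entire `stop.at.word` vector to `[` for every string, so it returns a matrix instead of one position per row, and `substr` then takes its stop values from that matrix in column order. Pairing each string with its own index through `mapply` gives one position per row. The names `spaces`, `wrong`, `right`, and `input` are introduced here for illustration only.

```r
string <- c("two words", "three words here", "four useless words here")
stop.at.word <- c(1, 2, 2)

# positions of every space in each string (a list with one vector per string)
spaces <- gregexpr(" ", string)

# sapply() passes the whole stop.at.word vector to `[` for every string,
# so the result is a 3 x 3 matrix rather than one position per row
wrong <- sapply(spaces, "[", stop.at.word)

# mapply() walks spaces and stop.at.word in parallel: one position per row
right <- mapply(function(pos, n) pos[n], spaces, stop.at.word)

input <- substr(string, 1, right - 1)
print(dim(wrong))
print(input)
```

This points to the vectorisation of the index argument rather than to data.table scoping as the likely cause.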
I would probably do it like this:
texts.dt[stop.at.word > 0, c("input","output") := {
sp = strsplit(string, " ")
list(
mapply(function(p,n) paste(p[seq_len(n)], collapse = " "), sp, stop.at.word),
mapply(`[`, sp, stop.at.word+1L)
)
}]
# partial result
head(texts.dt, 4)
string word.count stop.at.word input output
1: one 1 0 NA NA
2: two words 2 1 two words
3: three words here 3 2 three words here
4: four useless words here 4 2 four useless words
Alternatively:
library(stringi)
texts.dt[stop.at.word > 0, c("input","output") := {
# the last group captures only the next word, not the rest of the string
patt = paste0("((\\w+ ){", stop.at.word-1, "}\\w+) (\\w+)")
m = stri_match(string, regex = patt)
list(m[, 2], m[, 4])
}]
An alternative to @Frank's mapply solution is to use by = 1:nrow(texts.dt) with strsplit and paste:
library(data.table)
texts.dt[, `:=` (input = paste(strsplit(string, ' ')[[1]][1:stop.at.word][stop.at.word>0],
collapse = " "),
output = strsplit(string, ' ')[[1]][stop.at.word + 1]),
by = 1:nrow(texts.dt)]
This gives:
> texts.dt
string word.count stop.at.word input output
1: one 1 0 one
2: two words 2 1 two words
3: three words here 3 2 three words here
4: four useless words here 4 2 four useless words
5: five useless meaningless words here 5 4 five useless meaningless words here
6: six useless meaningless words here just 6 3 six useless meaningless words
7: seven useless meaningless words here just to 7 3 seven useless meaningless words
8: eigth useless meaningless words here just to fill 8 6 eigth useless meaningless words here just to
9: nine useless meaningless words here just to fill up 9 7 nine useless meaningless words here just to fill
10: ten useless meaningless words here just to fill up space 10 5 ten useless meaningless words here just
Instead of using [[1]], you can also wrap strsplit in unlist, as follows: unlist(strsplit(string, ' ')) (instead of strsplit(string, ' ')[[1]]). This will give you the same result.
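A minimal check of that equivalence, assuming a single string per call, which is what grouping by row provides. The variable names are illustrative only.

```r
s <- "three words here"
words.a <- strsplit(s, " ")[[1]]
words.b <- unlist(strsplit(s, " "))
print(identical(words.a, words.b))

# with more than one string the two forms differ:
# [[1]] keeps only the first string's words, unlist() concatenates all of them
two <- c("two words", "three words here")
first.only <- strsplit(two, " ")[[1]]
all.words <- unlist(strsplit(two, " "))
print(first.only)
print(all.words)
```

So the two forms are interchangeable here only because each group holds exactly one row.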
Two other options:
1) Using the stringi package:
library(stringi)
texts.dt[, `:=`(input = paste(stri_extract_all_words(string[stop.at.word>0],
simplify = TRUE)[1:stop.at.word],
collapse = " "),
output = stri_extract_all_words(string[stop.at.word>0],
simplify = TRUE)[stop.at.word+1]),
1:nrow(texts.dt)]
2) Or, adapted from this answer:
texts.dt[stop.at.word>0,
c('input','output') := tstrsplit(string,
split = paste0("(?=(?>\\s+\\S*){",
word.count - stop.at.word,
"}$)\\s"),
perl = TRUE)
][, output := sub('(\\w+).*','\\1',output)]
Both give:
> texts.dt
string word.count stop.at.word input output
1: one 1 0 NA NA
2: two words 2 1 two words
3: three words here 3 2 three words here
4: four useless words here 4 2 four useless words
5: five useless meaningless words here 5 4 five useless meaningless words here
6: six useless meaningless words here just 6 3 six useless meaningless words
7: seven useless meaningless words here just to 7 3 seven useless meaningless words
8: eigth useless meaningless words here just to fill 8 6 eigth useless meaningless words here just to
9: nine useless meaningless words here just to fill up 9 7 nine useless meaningless words here just to fill
10: ten useless meaningless words here just to fill up space 10 5 ten useless meaningless words here just
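The split pattern in option 2 is dense, so here is a sketch of what it does on one string, assuming the same word.count and stop.at.word as row 4. The lookahead only accepts a whitespace character that is followed by exactly word.count - stop.at.word further words before the end of the string, so the string is cut in a single place. The names `s`, `patt`, `parts`, and `output` are illustrative.

```r
s <- "four useless words here"
word.count <- 4
stop.at.word <- 2

# split at the whitespace that has exactly (word.count - stop.at.word)
# whitespace-plus-word chunks between it and the end of the string
patt <- paste0("(?=(?>\\s+\\S*){", word.count - stop.at.word, "}$)\\s")
parts <- strsplit(s, patt, perl = TRUE)[[1]]

# the second piece holds all remaining words; keep only the first of them
output <- sub("(\\w+).*", "\\1", parts[2])
print(parts)
print(output)
```

This is why the answer needs the second step on the output column: tstrsplit leaves the whole remainder there, and sub trims it to one word.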
texts.dt[, `:=`(input = sub(paste0('((\\s*\\w+){', stop.at.word, '}).*'), '\\1', string),
output = sub(paste0('(\\s*\\w+){', stop.at.word, '}\\s*(\\w+).*'), '\\2', string))
, by = stop.at.word][]
# string word.count stop.at.word
# 1: one 1 0
# 2: two words 2 1
# 3: three words here 3 2
# 4: four useless words here 4 2
# 5: five useless meaningless words here 5 4
# 6: six useless meaningless words here just 6 3
# 7: seven useless meaningless words here just to 7 3
# 8: eigth useless meaningless words here just to fill 8 6
# 9: nine useless meaningless words here just to fill up 9 7
#10: ten useless meaningless words here just to fill up space 10 5
# input output
# 1: one
# 2: two words
# 3: three words here
# 4: four useless words
# 5: five useless meaningless words here
# 6: six useless meaningless words
# 7: seven useless meaningless words
# 8: eigth useless meaningless words here just to
# 9: nine useless meaningless words here just to fill
#10: ten useless meaningless words here just
I am not sure I understand the logic of having nothing in output on the first row, but if that is really needed, the trivial fix is left to the OP.
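A sketch of what happens inside one group of that sub approach, assuming a group where stop.at.word is 2. Grouping by stop.at.word presumably matters because sub() uses only the first element of a pattern vector (with a warning), so each group must share a single pattern. The variables `n`, `input`, and `output` are illustrative.

```r
string <- c("three words here", "four useless words here")
n <- 2  # the stop.at.word value shared by every row of this group

# keep the first n words and drop everything after them
input <- sub(paste0("((\\s*\\w+){", n, "}).*"), "\\1", string)

# skip the first n words and keep only the single word that follows
output <- sub(paste0("(\\s*\\w+){", n, "}\\s*(\\w+).*"), "\\2", string)

print(input)
print(output)
```

With n equal to 0 the second pattern skips nothing and returns the first word, which matches the "one" shown in the output column of row 1.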