I have a data frame with many terms (n-grams of different sizes, up to five-grams) and their respective frequencies:
df = data.frame(term = c("a", "a a", "a a card", "a a card base", "a a card base ne",
"a a divorce", "a a divorce lawyer", "be", "be the", "be the one"),
freq = c(131, 13, 3, 2, 1, 1, 1, 72, 17, 5))
Which gives us:
                 term freq
1                   a  131
2                 a a   13
3            a a card    3
4       a a card base    2
5    a a card base ne    1
6         a a divorce    1
7  a a divorce lawyer    1
8                  be   72
9              be the   17
10         be the one    5
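What I want is to separate the unigrams (terms with only one word), bigrams (terms with two words), trigrams, four-grams and five-grams into different data frames.
For example, "df1", containing only the unigrams, would look like this: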
  term freq
1    a  131
2   be   72
"df2" (bigrams):
    term freq
1    a a   13
2 be the   17
"df3" (trigrams):
         term freq
1    a a card    3
2 a a divorce    1
3  be the one    5
And so on. Any ideas? Is this possible with a regular expression?
You can split on the number of whitespace runs in each term (a term with k words contains k - 1 of them), i.e.
split(df, stringr::str_count(df$term, '\\s+'))
#$`0`
#  term freq
#1    a  131
#8   be   72
#$`1`
#    term freq
#2    a a   13
#9 be the   17
#$`2`
#          term freq
#3     a a card    3
#6  a a divorce    1
#10  be the one    5
#$`3`
#                term freq
#4      a a card base    2
#7 a a divorce lawyer    1
#$`4`
#              term freq
#5 a a card base ne    1
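The list returned by `split()` is named after the whitespace counts ("0" to "4"), not after the "df1", "df2", ... labels used in the question. A minimal sketch of one way to bridge that gap, assuming the `stringr` package is installed; the variable names `n_words` and `by_n` are illustrative and not part of the original answer:

```r
# Same sample data as in the question (terms kept as plain character strings)
df <- data.frame(
  term = c("a", "a a", "a a card", "a a card base", "a a card base ne",
           "a a divorce", "a a divorce lawyer", "be", "be the", "be the one"),
  freq = c(131, 13, 3, 2, 1, 1, 1, 72, 17, 5),
  stringsAsFactors = FALSE
)

# Words per term = number of whitespace runs + 1
n_words <- stringr::str_count(df$term, "\\s+") + 1

# One data frame per word count; the list names are "1", "2", ...
by_n <- split(df, n_words)

# Rename the elements to df1, df2, ... (labels taken from the question)
names(by_n) <- paste0("df", names(by_n))

by_n$df1  # unigrams
by_n$df3  # trigrams
```

Keeping the pieces in one named list, rather than creating five separate variables, makes it easy to loop over them with `lapply()` later.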
A solution using only base R (as @akrun mentioned) would be:
split(df, lengths(gregexpr("\\S+", df$term)))
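Note that this version counts words (runs of `\\S+`), so the list elements are named "1" to "5" rather than "0" to "4". For clean input the two approaches produce the same groups, but they can diverge when a term has stray leading or trailing whitespace. The sketch below illustrates this on a small, hypothetical messy vector (`terms` is made up for the example). It goes through `regmatches()` so that a term with no match is counted as 0, whereas `lengths(gregexpr(...))` alone reports 1 in that case.

```r
# Hypothetical messy input: padding around a term and a doubled space
terms <- c("a", "a a", " a a card ", "be  the")

# Number of whitespace runs per term (base R equivalent of str_count(x, "\\s+"))
space_runs <- lengths(regmatches(terms, gregexpr("\\s+", terms)))

# Number of words per term (runs of non-whitespace characters)
word_runs <- lengths(regmatches(terms, gregexpr("\\S+", terms)))

# The third term has 3 words but 4 whitespace runs because of the padding
data.frame(term = terms, space_runs = space_runs, word_runs = word_runs)
```

If the terms may carry such padding, counting words, or calling `trimws()` on the terms first, is probably the safer choice.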