Subset a data frame row-wise by n-gram length

Ant*_*ano 3 split r dataframe

I have a data frame with many terms (n-grams of different sizes, up to five-grams) and their respective frequencies:

df = data.frame(term = c("a", "a a", "a a card", "a a card base", "a a card base ne",
                         "a a divorce", "a a divorce lawyer", "be", "be the", "be the one"), 
                freq = c(131, 13, 3, 2, 1, 1, 1, 72, 17, 5))

which gives us:

                 term freq
1                   a  131
2                 a a   13
3            a a card    3
4       a a card base    2
5    a a card base ne    1
6         a a divorce    1
7  a a divorce lawyer    1
8                  be   72
9              be the   17
10         be the one    5

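What I want is to split the unigrams (terms with only one word), bigrams (terms with two words), trigrams, four-grams and five-grams into separate data frames:

For example, "df1", containing only the unigrams, would look like this: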

                 term freq
1                   a  131
2                  be   72

"df2" (bigrams):

                 term freq
1                 a a   13
2              be the   17

"df3" (trigrams):

                 term freq
1            a a card    3
2         a a divorce    1
3          be the one    5

And so on. Any ideas? Is this possible with a regex?

Sot*_*tos 6

You can split by the number of whitespace separators in each term, i.e.

split(df, stringr::str_count(df$term, '\\s+'))

#$`0`
#  term freq
#1    a  131
#8   be   72

#$`1`
#    term freq
#2    a a   13
#9 be the   17

#$`2`
#          term freq
#3     a a card    3
#6  a a divorce    1
#10  be the one    5

#$`3`
#                term freq
#4      a a card base    2
#7 a a divorce lawyer    1

#$`4`
#              term freq
#5 a a card base ne    1
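The list above is keyed by separator count (0 to 4), not by n-gram order. A minimal sketch of how the pieces could be renamed to match the `df1`, `df2`, ... naming from the question follows. It swaps `stringr::str_count` for a base R equivalent built on `regmatches()` so that it runs without extra packages; the name `parts` and the `df` prefix are illustrative choices, not part of the original answer.

```r
# Sample data from the question
df <- data.frame(
  term = c("a", "a a", "a a card", "a a card base", "a a card base ne",
           "a a divorce", "a a divorce lawyer", "be", "be the", "be the one"),
  freq = c(131, 13, 3, 2, 1, 1, 1, 72, 17, 5),
  stringsAsFactors = FALSE
)

# Count runs of whitespace per term. regmatches() yields character(0) when
# nothing matches, so single-word terms get a count of 0.
n_spaces <- lengths(regmatches(df$term, gregexpr("\\s+", df$term)))

# n-gram order = number of separators + 1
parts <- split(df, n_spaces + 1)

# Hypothetical names chosen to mirror the question: df1, df2, ..., df5
names(parts) <- paste0("df", names(parts))

parts$df1  # unigrams: rows 1 and 8
parts$df3  # trigrams: rows 3, 6 and 10
```

If separate variables are really wanted, `list2env(parts, envir = globalenv())` would create them, although keeping the data frames together in one list is usually easier to loop over.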

A base R only solution (as @akrun mentioned) would be,

split(df, lengths(gregexpr("\\S+", df$term)))
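One practical difference between the two approaches: matching `\\S+` counts the words themselves, whereas matching `\\s+` counts the gaps between them. The two agree on clean input such as the sample data, but they can diverge when a term has leading or trailing whitespace. A small sketch with made-up strings (the vector `terms` is hypothetical, not taken from the question):

```r
# Hypothetical terms: clean, padded with spaces, and with a double space
terms <- c("be the one", " be the one ", "be  the")

# Word counts: one match per run of non-whitespace characters
word_counts <- lengths(gregexpr("\\S+", terms))
word_counts  # 3 3 2

# Separator counts: one match per run of whitespace characters
gap_counts <- lengths(regmatches(terms, gregexpr("\\s+", terms)))
gap_counts   # 2 4 1

# gap_counts + 1 overstates the padded term (5 instead of 3);
# trimming first brings the two methods back in line
trimmed <- trimws(terms)
lengths(regmatches(trimmed, gregexpr("\\s+", trimmed))) + 1  # 3 3 2
```

Note that `gregexpr()` returns a single `-1` for a string with no match, so an empty string would still be counted as length 1 by the word-count version; this only matters if the data can contain empty terms.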

  • Here is another option using `base R`, i.e. `split(df, lengths(gregexpr("\\S+", df$term)))` (3 upvotes)