How do I remove rows from a data frame in R that contain only a few words?

cpt*_*ptn 1 grep row r dataframe tm

I am trying to remove rows from a data frame that contain fewer than 5 words. For example:

mydf <- as.data.frame(read.xlsx("C:\\data.xlsx", 1, header=TRUE))

head(mydf)

     NO    ARTICLE
1    34    The New York Times reports a lot of words here.
2    12    Greenwire reports a lot of words.
3    31    Only three words.
4     2    The Financial Times reports a lot of words.
5     9    Greenwire short.
6    13    The New York Times reports a lot of words again.

I want to remove the rows with 5 or fewer words. How can I do that?

jlh*_*ard 5

Here are two ways:

mydf[sapply(gregexpr("\\W+", mydf$ARTICLE), length) >4,]
#   NO                                          ARTICLE
# 1 34  The New York Times reports a lot of words here.
# 2 12                Greenwire reports a lot of words.
# 4  2      The Financial Times reports a lot of words.
# 6 13 The New York Times reports a lot of words again.


mydf[sapply(strsplit(as.character(mydf$ARTICLE)," "),length)>5,]
#   NO                                          ARTICLE
# 1 34  The New York Times reports a lot of words here.
# 2 12                Greenwire reports a lot of words.
# 4  2      The Financial Times reports a lot of words.
# 6 13 The New York Times reports a lot of words again.

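The first finds the start position of every run of non-word characters (the separators between words, plus any trailing punctuation) and takes the length of that vector. Because each sentence here ends with a period, that count equals the number of words rather than one less, so the > 4 threshold also keeps five-word sentences; use > 5 if rows with exactly five words should be dropped.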
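To see what the first approach actually counts, here is a small sketch on two of the sample sentences. The helper name count_matches is hypothetical and not part of the answer.

```r
# Two sample sentences taken from the question's data
x <- c("Only three words.", "Greenwire reports a lot of words.")

# gregexpr() returns, for each string, the start positions of all matches.
# "\\W+" matches runs of non-word characters: the spaces and the final period.
m <- gregexpr("\\W+", x)

# Hypothetical helper: number of matches per string.
# Caveat: a string with no match yields -1, which still has length 1.
count_matches <- function(matches) sapply(matches, length)

counts <- count_matches(m)
print(counts)
```

For sentences that end in punctuation, each count matches the number of words; for a sentence without trailing punctuation it would be one less.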

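The second splits each ARTICLE entry into a vector of its constituent words and takes the length of that vector. This is probably the better approach.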
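A self-contained sketch of the second approach might look like the following. It assumes the data frame is rebuilt inline instead of being read from the Excel file, and it varies the answer slightly by splitting on runs of whitespace and using lengths() and trimws(), both of which need R 3.2.0 or later.

```r
# Sample data rebuilt inline (assumption: stands in for the Excel file)
mydf <- data.frame(
  NO = c(34, 12, 31, 2, 9, 13),
  ARTICLE = c(
    "The New York Times reports a lot of words here.",
    "Greenwire reports a lot of words.",
    "Only three words.",
    "The Financial Times reports a lot of words.",
    "Greenwire short.",
    "The New York Times reports a lot of words again."
  ),
  stringsAsFactors = FALSE
)

# Split each article on runs of whitespace and count the pieces.
# trimws() avoids an empty first piece when a string has leading spaces.
word_count <- lengths(strsplit(trimws(mydf$ARTICLE), "\\s+"))

# Keep only rows with more than 5 words
result <- mydf[word_count > 5, ]
print(result)
```

Splitting on "\\s+" rather than a single space keeps double spaces or tabs from inflating the count, which is a design choice of this sketch and not something the original answer specifies.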