I'm trying to remove rows that contain fewer than 5 words from a data frame. For example:
mydf <- as.data.frame(read.xlsx("C:\\data.xlsx", 1, header=TRUE))
head(mydf)
NO ARTICLE
1 34 The New York Times reports a lot of words here.
2 12 Greenwire reports a lot of words.
3 31 Only three words.
4 2 The Financial Times reports a lot of words.
5 9 Greenwire short.
6 13 The New York Times reports a lot of words again.
I want to remove the rows with 5 or fewer words. How can I do this?
Here are two ways:
mydf[sapply(gregexpr("\\W+", mydf$ARTICLE), length) >4,]
# NO ARTICLE
# 1 34 The New York Times reports a lot of words here.
# 2 12 Greenwire reports a lot of words.
# 4 2 The Financial Times reports a lot of words.
# 6 13 The New York Times reports a lot of words again.
mydf[sapply(strsplit(as.character(mydf$ARTICLE)," "),length)>5,]
# NO ARTICLE
# 1 34 The New York Times reports a lot of words here.
# 2 12 Greenwire reports a lot of words.
# 4 2 The Financial Times reports a lot of words.
# 6 13 The New York Times reports a lot of words again.
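The first uses gregexpr to find, for each row, the starting position of every run of non-word characters (\\W+: spaces and punctuation), then takes the length of that vector. Because every sentence in the example ends with a period, that count equals the number of words, so > 4 keeps rows with at least 5 words.
The second splits each entry of the ARTICLE column into a vector of its component words and takes the length of that vector; > 5 keeps rows with more than 5 words, which matches "remove rows with 5 or fewer words". This is probably the better approach.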
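As a minimal sketch of the second approach, the snippet below rebuilds the sample data inline (an assumption, so that it runs without the Excel file) and splits on runs of whitespace instead of a single space, which avoids empty "words" when an entry has double spaces. The helper name count_words and the n_words column are hypothetical and not part of the original answer.

```r
# Sample data mirroring the question, built inline instead of read from Excel
mydf <- data.frame(
  NO = c(34, 12, 31, 2, 9, 13),
  ARTICLE = c(
    "The New York Times reports a lot of words here.",
    "Greenwire reports a lot of words.",
    "Only three words.",
    "The Financial Times reports a lot of words.",
    "Greenwire short.",
    "The New York Times reports a lot of words again."
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count words by splitting on runs of whitespace.
# as.character() guards against ARTICLE being a factor;
# trimws() drops leading/trailing spaces so they do not create empty tokens.
count_words <- function(x) {
  lengths(strsplit(trimws(as.character(x)), "\\s+"))
}

# Store the count, then keep only rows with more than 5 words
mydf$n_words <- count_words(mydf$ARTICLE)
result <- mydf[mydf$n_words > 5, ]
print(result)
```

Keeping the count in its own column makes the threshold easy to inspect and adjust; change `> 5` to `>= 5` if rows with exactly 5 words should be kept.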