Removing stop words from a data frame

Question


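My data is already in a data frame, with one token per row. I want to filter out the rows that contain stop words.

The data frame looks like this: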

docID <- c(1,2,2)
token <- c('the', 'cat', 'sat')
count <- c(10,20,30)
df <- data.frame(docID, token, count)


I tried the following, but got an error:

library(tidyverse)
library(tidytext)
library(topicmodels)
library(stringr)
data('stop_words')
clean_df <- df %>%
  anti_join(stop_words, by=df$token)


The error:

Error: `by` can't contain join column `the`, `cat`, `sat` which is missing from LHS


How can I fix this?

Answer 1

Jul*_*lge 8

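When you set up anti_join(), you need to say what the column names are on the left-hand side and the right-hand side. The `by` argument takes column names, not the column's values, which is why `by=df$token` fails. In the stop_words data object from tidytext, the column is called word, while in your data frame it is called token.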

library(tidyverse)
library(tidytext)

docID <- c(1, 2, 2, 2, 3)
token <- c("the", "cat", "sat", "on-the-mat", "with3hats")
count <- c(10, 20, 30, 10, 20)
df <- tibble(docID, token, count)  # tibble() replaces the deprecated data_frame()


clean_df <- df %>%
  anti_join(stop_words, by= c("token" = "word"))

clean_df
#> # A tibble: 4 x 3
#>   docID token      count
#>   <dbl> <chr>      <dbl>
#> 1  2.00 cat         20.0
#> 2  2.00 sat         30.0
#> 3  2.00 on-the-mat  10.0
#> 4  3.00 with3hats   20.0


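Notice that "the" is now gone, because it is in the stop_words dataset.

In the comments, you asked about removing tokens that contain punctuation or digits. I would use filter() for this. (You can actually also use filter() to remove the stop words, if you prefer.)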

clean_df <- df %>%
  filter(!str_detect(token, "[:punct:]|[:digit:]"))

clean_df
#> # A tibble: 3 x 3
#>   docID token count
#>   <dbl> <chr> <dbl>
#> 1  1.00 the    10.0
#> 2  2.00 cat    20.0
#> 3  2.00 sat    30.0
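
As for the note that filter() can also remove stop words, one way to do that is sketched below. Matching with `%in%` against `stop_words$word` is an assumption about how you might write it, not necessarily the only or preferred form; the example data is the same as above.

```r
library(dplyr)
library(tidytext)

# Same example data as above, one token per row
df <- tibble(
  docID = c(1, 2, 2, 2, 3),
  token = c("the", "cat", "sat", "on-the-mat", "with3hats"),
  count = c(10, 20, 30, 10, 20)
)

# Keep only the rows whose token does not appear in the stop word list
clean_df <- df %>%
  filter(!token %in% stop_words$word)

clean_df
```

This should drop the same rows as the anti_join() version, since both only check whether token appears in the word column.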


If you want to do both at once, use the pipe to build up your object with both lines.
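
A minimal sketch of that combined pipeline, using the same example data. The pattern is written here in the bracketed form `[[:punct:][:digit:]]`, which is an assumption on my part that it matches the same tokens as the alternation used earlier.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Same example data as above, one token per row
df <- tibble(
  docID = c(1, 2, 2, 2, 3),
  token = c("the", "cat", "sat", "on-the-mat", "with3hats"),
  count = c(10, 20, 30, 10, 20)
)

clean_df <- df %>%
  # Step 1: drop rows whose token is a stop word
  anti_join(stop_words, by = c("token" = "word")) %>%
  # Step 2: drop rows whose token contains punctuation or a digit
  filter(!str_detect(token, "[[:punct:][:digit:]]"))

clean_df
```

With this data, only the rows for "cat" and "sat" should remain.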
