Ben*_*amp 2 r text-mining tidytext
Hi, I'm working with the tidy text format, and I'm trying to replace the strings "emails" and "emailing" with "email".
library(dplyr)
library(tidytext)
library(ggplot2)

set.seed(123)
terms <- c("emails are nice", "emailing is fun", "computer freaks", "broken modem")
df <- data.frame(sentence = sample(terms, 100, replace = TRUE))
df
str(df)
# unnest_tokens() needs a character column, not a factor
df$sentence <- as.character(df$sentence)
# One row per word
tidy_df <- df %>%
  unnest_tokens(word, sentence)
# Bar chart of the words that occur more than 20 times
tidy_df %>%
  count(word, sort = TRUE) %>%
  filter(n > 20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()
This works fine, but when I use:
tidy_df <- gsub("emailing", "email", tidy_df)
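to replace the word and then run the bar chart again, I get the following error message:
Error in UseMethod("group_by_") : no applicable method for 'group_by_' applied to an object of class "character"
Does anyone know how to easily replace words in the tidy text format without changing the structure/class of tidy_text?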
Jul*_*lge 10
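Removing the endings of words like this is called stemming, and there are a couple of packages in R that can do it for you, if you'd like. One is the hunspell package from rOpenSci, and another option is the SnowballC package, which implements Porter algorithm stemming. You would implement that like so: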
library(dplyr)
library(tidytext)
library(SnowballC)
terms <- c("emails are nice", "emailing is fun", "computer freaks", "broken modem")
set.seed(123)
# tibble() replaces the deprecated data_frame()
tibble(txt = sample(terms, 100, replace = TRUE)) %>%
  unnest_tokens(word, txt) %>%
  mutate(word = wordStem(word))  # Porter stem of every token
#> # A tibble: 253 × 1
#> word
#> <chr>
#> 1 email
#> 2 i
#> 3 fun
#> 4 broken
#> 5 modem
#> 6 email
#> 7 i
#> 8 fun
#> 9 broken
#> 10 modem
#> # ... with 243 more rows
Notice that this stems all of your text, and that some of the words no longer look like real words; you may or may not care about that.
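To see which tokens the stemmer actually alters before committing to it, you could run wordStem() on the distinct words alone. This is only a sketch; the `vocab` vector is made up from the example sentences above.

```r
library(SnowballC)

# Hypothetical vocabulary taken from the example sentences
vocab <- c("emails", "emailing", "is", "fun", "broken", "modem")

# Pair each word with its Porter stem
stems <- data.frame(word = vocab, stem = wordStem(vocab), stringsAsFactors = FALSE)

# Keep only the words that the stemmer changed
changed <- stems[stems$word != stems$stem, ]
changed
```

Any row in `changed` whose stem is not a real word (such as "is" becoming "i") is a candidate for the targeted replacements instead.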
If you don't want to stem all of your text with a stemmer like SnowballC or hunspell, you can use dplyr's if_else() within mutate() to replace just the specific words.
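set.seed(123)
# tibble() replaces the deprecated data_frame()
tibble(txt = sample(terms, 100, replace = TRUE)) %>%
  unnest_tokens(word, txt) %>%
  # Replace only the listed words; every other token is left as-is
  mutate(word = if_else(word %in% c("emailing", "emails"), "email", word))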
#> # A tibble: 253 × 1
#> word
#> <chr>
#> 1 email
#> 2 is
#> 3 fun
#> 4 broken
#> 5 modem
#> 6 email
#> 7 is
#> 8 fun
#> 9 broken
#> 10 modem
#> # ... with 243 more rows
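If the list of words to replace grows, a named character vector can act as a lookup table. The sketch below uses base R only; `lookup` and `words` are illustrative names that do not appear in the code above.

```r
# Hypothetical lookup table: names are the words to replace, values the replacements
lookup <- c(emails = "email", emailing = "email")

# Stand-in for the word column of a tidy data frame
words <- c("emails", "is", "fun", "emailing")

# Use the lookup value where the word is listed, otherwise keep the word
replaced <- unname(ifelse(words %in% names(lookup), lookup[words], words))
replaced
#> [1] "email" "is"    "fun"   "email"
```

The same expression should also work inside mutate() on the word column, but I have only checked it on a plain vector.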
Or it might make more sense to use str_replace() from the stringr package.
library(stringr)
set.seed(123)
# tibble() replaces the deprecated data_frame()
tibble(txt = sample(terms, 100, replace = TRUE)) %>%
  unnest_tokens(word, txt) %>%
  # Strip a trailing "s" or "ing" after "email"
  mutate(word = str_replace(word, "email(s|ing)", "email"))
#> # A tibble: 253 × 1
#> word
#> <chr>
#> 1 email
#> 2 is
#> 3 fun
#> 4 broken
#> 5 modem
#> 6 email
#> 7 is
#> 8 fun
#> 9 broken
#> 10 modem
#> # ... with 243 more rows
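One caveat, offered as an aside: the pattern "email(s|ing)" is not anchored, so it also matches inside longer tokens. The sketch below compares it with an anchored variant; "emailserver" is a made-up token used only for illustration.

```r
library(stringr)

# "emailserver" is a hypothetical token that merely starts with "emails"
words <- c("emails", "emailing", "emailserver")

# Unanchored: the match inside "emailserver" gets rewritten too
unanchored <- str_replace(words, "email(s|ing)", "email")
unanchored
# "email" "email" "emailerver"

# Anchored with ^ and $: only whole tokens are replaced
anchored <- str_replace(words, "^email(s|ing)$", "email")
anchored
# "email" "email" "emailserver"
```

For the four example sentences both patterns give the same result, so the anchors only matter on messier text.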
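As for the error in the question: gsub() works on character vectors, so passing it the whole data frame coerces the data frame with as.character(), and what comes back is a plain character vector. That is why the later dplyr call complains about class "character". A minimal base R sketch, using a small stand-in for tidy_df:

```r
# Small stand-in for the tidy data frame from the question
tidy_df <- data.frame(word = c("emailing", "is", "fun"), stringsAsFactors = FALSE)

# Passing the whole data frame: the result is no longer a data frame
broken <- gsub("emailing", "email", tidy_df)
class(broken)
#> [1] "character"

# Replacing inside the column keeps the structure intact
tidy_df$word <- gsub("emailing", "email", tidy_df$word)
class(tidy_df)
#> [1] "data.frame"
```

So the original gsub() approach also works, as long as it is applied to the word column (directly or inside mutate()) rather than to the data frame itself.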