I have a data frame with a text column:
df = data.frame(id=c(1,2), text = c("My best friend John works and Google", "However he would like to work at Amazon as he likes to use python and stay at Canada"))
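Without any preprocessing, how can I extract the named entities (named entity recognition) from each text, like this?
Example result: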
dfresults = data.frame(id=c(1,2), ner_words = c("John, Google", "Amazon, python, Canada"))
You can do this without quanteda, using the spacyr package (a wrapper around the spaCy library mentioned in the linked article).
Here, I have slightly edited your input data.frame.
df <- data.frame(id = c(1, 2),
                 text = c("My best friend John works at Google.",
                          "However he would like to work at Amazon as he likes to use Python and stay in Canada."),
                 stringsAsFactors = FALSE)
Then:
library("spacyr")
library("dplyr")
# -- need to do these before the next function will work:
# spacy_install()
# spacy_download_langmodel(model = "en_core_web_lg")
spacy_initialize(model = "en_core_web_lg")
#> Found 'spacy_condaenv'. spacyr will use this environment
#> successfully initialized (spaCy Version: 2.0.10, language model: en_core_web_lg)
#> (python options: type = "condaenv", value = "spacy_condaenv")
txt <- df$text
names(txt) <- df$id
spacy_parse(txt, lemma = FALSE, entity = TRUE) %>%
    entity_extract() %>%
    group_by(doc_id) %>%
    summarize(ner_words = paste(entity, collapse = ", "))
#> # A tibble: 2 x 2
#> doc_id ner_words
#> <chr> <chr>
#> 1 1 John, Google
#> 2 2 Amazon, Python, Canada
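If you need the result in the `dfresults` shape from the question (keyed by the original `id`), one possible follow-up is to join the summarized entities back onto `df`. The sketch below is only an illustration in base R: the `ner` data frame is a hand-typed stand-in for the tibble shown above (it is not produced by spaCy here), and the `all.x = TRUE` left join is my own choice so that a document with no detected entities would be kept with `NA` rather than dropped.

```r
# Input data, as in the answer above.
df <- data.frame(id = c(1, 2),
                 text = c("My best friend John works at Google.",
                          "However he would like to work at Amazon as he likes to use Python and stay in Canada."),
                 stringsAsFactors = FALSE)

# Hypothetical stand-in for the summarized spacyr output shown above
# (typed by hand for illustration; not generated by spaCy in this snippet).
ner <- data.frame(doc_id = c("1", "2"),
                  ner_words = c("John, Google", "Amazon, Python, Canada"),
                  stringsAsFactors = FALSE)

# doc_id is character because it came from names(txt), so build a matching key.
df$doc_id <- as.character(df$id)

# Left join: all.x = TRUE keeps documents without any entity (ner_words is NA).
dfresults <- merge(df, ner, by = "doc_id", all.x = TRUE)[, c("id", "ner_words")]
dfresults
```

Note that `merge()` sorts by the character key, so with many documents the row order may differ from the original `df`; reorder by `id` afterwards if that matters.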