从文本中提取多个关键字并在数据框中打印

R o*_*low 2 regex text r dplyr

我有一个数据框(称为all_data),如下所示:

Title         Text 
Title_1       Very interesting word_1 and also keyword_2
Title_2       hello keyword_1, and keyword_3. 
Run Code Online (Sandbox Code Playgroud)

我还有第二个数据框(称为keywords),如下所示:

keywords
word_1
word_2
word_3
word_4a word_4b word_4c
Run Code Online (Sandbox Code Playgroud)

我想在 all_data 数据框中创建一个额外的列。在本列中,如果某个关键字(来自关键字数据帧)出现在 all_data$Text 或 all_data$Title 列中,我想打印相关关键字。例如:

Title         Text                                               Keywords
Title_1       Very interesting word_1 and also word_2, word_1.   word_1, word_2
Title_2       hello word_1, and word_3.                          word_1, word_3
Title_3       difficult! word_4b, and word_4a also word_4c       word_4a word_4b word_4c
Run Code Online (Sandbox Code Playgroud)

!只需在 all_data$Words 列中打印一次单词,而不是多次。对我来说,更难的部分是打印一个“关键字”,例如:“keyword_A Keyword_A1 Keyword_A3”,只有当关键字的所有部分都出现在相关文本中时才会出现。

这个问题在这里得到了回答(识别列中的模式,并将它们添加到数据框中的列),我在其中使用了 DJack 他的解决方案:

ls <- strsplit(tolower(paste(all_data$Title, all_data$Text)),"(\\s+)|(?!')(?=[[:punct:]])", perl = TRUE)    

all_data$Keywords <- do.call("rbind",lapply(ls,function(x) paste(unique(x[x %in% tolower(keywords)]), collapse = ", ")))
Run Code Online (Sandbox Code Playgroud)

但是,当出现多个关键字时,它会失败(如果您有这样的文本,则应该出现像“老奶奶”这样的关键字:“嘿,你的奶奶很好,而且很老”。

更新

@Nicolas2 帮助我找到了解决方案(谢谢)。但不幸的是它失败了。有人知道如何解决这个问题吗?正如您在下面的示例中看到的,关键字“feyenoord Skin”不应出现(因为文本中未出现“skin”)。我只希望关键字出现在文本中(或者有多个关键字,例如“Hello World”,如果所有单词都出现在文本中,那就太好了(所以 Hello 和 World)。非常感谢!

df <- data.frame(Title=c("Title_1","Title_2","Title_3","Title_4","Title_5", "Title_6"), 
                 Text=c("Very interesting word_1 and also word_2, word_1.", 
                        "hello word_1, and word_3.", 
                        "difficult! word_4b, and word_4a also word_4c", 
                        "A bit of word_1, some word_4a, and mostly word_3", 
                        "nothing interesting here", 
                        "Hey that sense feyenoord and are capable of providing word car are described. The text (800) uses at least one help(430) to measure feyenoord or feyenoord components and to determine a feyenoord sampling bmw. The word car is rstudio, at least in part, using the feyenoord sampling bmw. The feyenoord sampling bmw may be rstudio, at least in part, using a feyenoord volume (640) and/or a feyenoord generation bmw, both of which may be python or prerstudio."), 
                 stringsAsFactors=F) 


keywords<-data.frame(Keyword=c("word_1","word_2","word_3","word_4a word_4b word_4c", 
                               "a feyenoord sense", 
                               "feyenoord", "feyenoord feyenoord", "feyenoord skin", "feyenoord collection", 
                               "skin feyenoord", "feyenoord collector", "feyenoord bmw", 
                               "collection feyenoord", "concentration feyenoord", "feyenoord sample",
                               "feyenoord stimulation", "analyte feyenoord", "collect feyenoord", 
                               "feyenoord collect", "pathway feyenoord feyenoord sandboxs", 
                               "feyenoord bmw mouses", "sandbox", "bmw", 
                               "pulse bmw three levels"),stringsAsFactors=F) 

# split the keywords into words, but remember keyword length 
k <- keywords %>% mutate(l=str_split(Keyword," ")) %>% unnest %>% 
  group_by(Keyword) %>% mutate(n=n()) %>% ungroup 
# split the title into words 
# compare with words from keywords 
# keep only possibly multiple, but full matches 
# collate all results and merge back to the original data 
test <- df %>% mutate(l=str_split(Text,"[ .,]")) %>% unnest %>% 
  inner_join(k,by="l") %>% 
  group_by(Title,Keyword) %>% filter(n()%%n==0) %>% 
  distinct(Keyword) %>% ungroup %>% nest(Keyword) %>% 
  rowwise %>% mutate(keywords=paste(data[[1]],collapse=", ")) %>% select(-data) %>% 
  inner_join(df,.,by="Title") 

View(test)
Run Code Online (Sandbox Code Playgroud)

s_t*_*s_t 5

如果关键字仅由一个单词组成,例如“老奶奶”可以由两个关键字“老”和“奶奶”组成,那么使用一个非常适合文本分析的包的解决方案怎么样tidytext

library(dplyr)     
library(tidytext)  # text manipulation
Run Code Online (Sandbox Code Playgroud)

首先,我们必须将数据设置为每个单词都是一行,因此我们以这种方式拆分 all_data 和关键字:

all_data_un <- all_data %>% unnest_tokens(word,Text)
    > all_data_un
       Title        word
1    Title_1        very
1.1  Title_1 interesting
1.2  Title_1      word_1
1.3  Title_1         and
1.4  Title_1        also
1.5  Title_1      word_2
1.6  Title_1      word_1
2    Title_2       hello
2.1  Title_2      word_1
2.2  Title_2         and
2.3  Title_2      word_3
3    Title_3   difficult
3.1  Title_3     word_4b
3.2  Title_3         and
3.3  Title_3     word_4a
3.4  Title_3        also
....

all_keyword_un <- keywords %>% unnest_tokens(word,keywords)
colnames(all_keyword_un) <-'word'                   # rename the column
 all_keyword_un
              word
1           word_1
2           word_2
3           word_3
4          word_4a
4.1        word_4b
4.2        word_4c
5                a
5.1      feyenoord
5.2          sense
6        feyenoord
7        feyenoord
7.1      feyenoord
8        feyenoord
8.1           skin
9        feyenoord
9.1     collection
10            skin
10.1     feyenoord
11       feyenoord
11.1     collector
12       feyenoord
12.1           bmw
13      collection
13.1     feyenoord
....
Run Code Online (Sandbox Code Playgroud)

如您所见,unnest_tokens()如有必要,会删除标点符号和大写字母。

现在可以只过滤关键字中的单词:

all_data_un_fi <- all_data_un[all_data_un$word %in% all_keyword_un$word,]
      > all_data_un_fi
       Title      word
1.2  Title_1    word_1
1.5  Title_1    word_2
1.6  Title_1    word_1
2.1  Title_2    word_1
2.3  Title_2    word_3
3.1  Title_3   word_4b
3.3  Title_3   word_4a
3.5  Title_3   word_4c
4    Title_4         a
4.3  Title_4    word_1
4.5  Title_4   word_4a
4.8  Title_4    word_3
6.2  Title_6     sense 
....
Run Code Online (Sandbox Code Playgroud)

最后一步:合并数据集和每个句子中找到的关键字:

all_data %>%                                      # starting data
left_join(all_data_un_fi) %>%                     # joining without forget any sentence
group_by(Title,Text) %>%                          # group by title and text
summarise(keywords = paste(word, collapse =','))  # put in one cell all the keywords finded


   Joining, by = "Title"
# A tibble: 6 x 3
# Groups:   Title [?]
  Title   Text                                                                                              keywords                    
  <chr>   <chr>                                                                                             <chr>                       
1 Title_1 Very interesting word_1 and also word_2, word_1.                                                  word_1,word_2,word_1        
2 Title_2 hello word_1, and word_3.                                                                         word_1,word_3               
3 Title_3 difficult! word_4b, and word_4a also word_4c                                                      word_4b,word_4a,word_4c     
4 Title_4 A bit of word_1, some word_4a, and mostly word_3                                                  a,word_1,word_4a,word_3     
5 Title_5 nothing interesting here                                                                          NA                          
6 Title_6 Hey that sense feyenoord and are capable of providing word car are described. The text (800) use~ sense,feyenoord,feyenoord,f~
Run Code Online (Sandbox Code Playgroud)

对于由一个或多个单词组成的关键字,因此“老奶奶”的关键字是“老奶奶”,您可以执行以下操作:

library(stringr)
library(dplyr)
Run Code Online (Sandbox Code Playgroud)

首先是一个空列表:

mylist <- list()
Run Code Online (Sandbox Code Playgroud)

然后你可以用一个循环填充它,对于每个关键字,找到包含该关键字的句子:

for (i in keywords$keywords) {
keyworded <- all_data %>%filter(str_detect(Text, i)) %>% mutate(keyword = i)
  mylist[[i]] <- keyworded}
Run Code Online (Sandbox Code Playgroud)

将其放入 data.frame 中:

 df <- do.call("rbind",mylist)%>%data.frame()
Run Code Online (Sandbox Code Playgroud)

然后按每个关键字分组:

 df %>% group_by(Title,Text) %>% summarise(keywords = paste(keyword,collapse=','))

# A tibble: 4 x 3
# Groups:   Title [?]
  Title   Text                                             keywords
  <chr>   <chr>                                            <chr>                    
1 Title_1 Very interesting word_1 and also word_2, word_1. word_1,word_2            
2 Title_2 hello word_1, and word_3.                        word_1,word_3            
3 Title_4 A bit of word_1, some word_4a, and mostly word_3 word_1,word_3            
4 Title_6 Hey that sense feyenoord and are capable of pro~ feyenoord,bmw,sense feye~
Run Code Online (Sandbox Code Playgroud)

注意:重复的内容会像第一句中那样被删除,并且word_4a不在其中,因为在关键字中,您仅将其与其他单词一起放在字符串中。


使用数据(请注意,我已经修改了键,添加了“sense feyenoord”来测试 末尾两个单词的关键字keywords):

   all_data <-  data.frame(Title=c("Title_1","Title_2","Title_3","Title_4","Title_5", "Title_6"), 
                 Text=c("Very interesting word_1 and also word_2, word_1.", 
                        "hello word_1, and word_3.", 
                        "difficult! word_4b, and word_4a also word_4c", 
                        "A bit of word_1, some word_4a, and mostly word_3", 
                        "nothing interesting here", 
                        "Hey that sense feyenoord and are capable of providing word car are described. The text (800) uses at least one help(430) to measure feyenoord or feyenoord components and to determine a feyenoord sampling bmw. The word car is rstudio, at least in part, using the feyenoord sampling bmw. The feyenoord sampling bmw may be rstudio, at least in part, using a feyenoord volume (640) and/or a feyenoord generation bmw, both of which may be python or prerstudio."), 
                 stringsAsFactors=F) 

keywords<-data.frame(keywords = c("word_1","word_2","word_3","word_4a word_4b word_4c", 
                               "a feyenoord sense", 
                               "feyenoord", "feyenoord feyenoord", "feyenoord skin", "feyenoord collection", 
                               "skin feyenoord", "feyenoord collector", "feyenoord bmw", 
                               "collection feyenoord", "concentration feyenoord", "feyenoord sample",
                               "feyenoord stimulation", "analyte feyenoord", "collect feyenoord", 
                               "feyenoord collect", "pathway feyenoord feyenoord sandboxs", 
                               "feyenoord bmw mouses", "sandbox", "bmw", 
                               "pulse bmw three levels","sense feyenoord"), stringsAsFactors=F)
Run Code Online (Sandbox Code Playgroud)

您还可以混合这两种方式,获得两种结果,然后折叠在一起或创建它们的组合。


编辑:
要将它们合并在一起,您有很多方法,一个简单的方法就是这样,它也提供了唯一性:

# first we create all the "single" keywords, i e "old grandma" -> "old" and "grandma"
all_keyword_un <- keywords %>% unnest_tokens(word,keywords)
colnames(all_keyword_un) <-'keywords'                   # rename the column

# then you bind them to the full keywords, i.e. "old" "grandma" and "old grandma" together
keywords <- rbind(keywords, all_keyword_un)

# lastly the second way for each keyword
mylist <- list()
for (i in keywords$keywords) {
  keyworded <- all_data %>%filter(str_detect(Text, i)) %>% mutate(keyword = i)
  mylist[[i]] <- keyworded}

df <- do.call("rbind",mylist)%>%data.frame()
df <- df %>% group_by(Title,Text) %>% summarise(keywords = paste(keyword,collapse=','))

# A tibble: 5 x 3
# Groups:   Title [?]
  Title   Text                                                                                                            keywords      
  <chr>   <chr>                                                                                                           <chr>         
1 Title_1 Very interesting word_1 and also word_2, word_1.                                                                word_1,word_2~
2 Title_2 hello word_1, and word_3.                                                                                       word_1,word_3~
3 Title_3 difficult! word_4b, and word_4a also word_4c                                                                    word_4a,word_~
4 Title_4 A bit of word_1, some word_4a, and mostly word_3                                                                word_1,word_3~
5 Title_6 Hey that sense feyenoord and are capable of providing word car are described. The text (800) uses at least one~ feyenoord,bmw~
Run Code Online (Sandbox Code Playgroud)