Text mining - remove content if it is contained in another cell

jak*_*der 2 nlp r

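I have a large dataset with text in cells. Some of the texts are just an earlier cell with more text appended, and I don't want to include the earlier cell in my analysis unless its date is different. Here is an example of what it looks like:

  1. 10-01-17 | Hi, how are you?
  2. 10-01-17 | Hi, how are you? Oh, I'm just fine.
  3. 11-01-17 | Hi, how are you? Oh, I'm just fine. The weather is nice today.

If 1 is contained in 2 and the dates are the same, I want to remove 1. If 2 is contained in 3, 2 should be removed only if the dates are the same. Here the only rows I want to keep are 2 and 3.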

ali*_*ire 5

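You can use grepl on the whole column, with each observation as the pattern, and count only the matches that share the same date. If the resulting Boolean vector sums to more than 1, the row matches more than just itself, so its text is contained in another row and it is a duplicate. Setting fixed = TRUE makes grepl treat the text as a literal string rather than a regular expression.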

df[mapply(function(d, t) {
    sum(grepl(t, df$text, fixed = TRUE) & d == df$date) == 1
}, df$date, df$text), ]

##       date                                                            text
## 2 10-01-17                             Hi, how are you? Oh, I'm just fine.
## 3 11-01-17  Hi, how are you? Oh, I'm just fine. The weather is nice today.
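To see why the count works, it may help to run the test by hand for a single row. The sketch below rebuilds the sample data from the Data section (stringsAsFactors = FALSE is set so the text stays character on older R versions) and checks rows 1 and 2. The variable names contained and same_date are only illustrative, and "Cost (USD)" is a made-up string that is not part of the question's data.

```r
# Same sample data as in the Data section
df <- data.frame(
  date = c("10-01-17", "10-01-17", "11-01-17"),
  text = c("Hi, how are you?",
           "Hi, how are you? Oh, I'm just fine.",
           "Hi, how are you? Oh, I'm just fine. The weather is nice today."),
  stringsAsFactors = FALSE
)

# Row 1 used as the pattern against the whole column
contained <- grepl(df$text[1], df$text, fixed = TRUE)  # TRUE TRUE TRUE
same_date <- df$date == df$date[1]                      # TRUE TRUE FALSE
row1_hits <- sum(contained & same_date)                 # 2: itself plus row 2, so drop row 1

# Row 2 is contained in row 3, but row 3 has a different date
row2_hits <- sum(grepl(df$text[2], df$text, fixed = TRUE) &
                 df$date == df$date[2])                 # 1: only itself, so keep row 2

# Without fixed = TRUE the pattern is a regular expression,
# so metacharacters such as parentheses change the meaning
regex_match   <- grepl("Cost (USD)", "Cost (USD)")                # FALSE
literal_match <- grepl("Cost (USD)", "Cost (USD)", fixed = TRUE)  # TRUE
```

One consequence of counting matches this way: two identical texts on the same date each match twice, so both would be dropped. If that can happen in your data, de-duplicating first (for example with unique(df)) is probably safer.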

Or in dplyr,

library(dplyr)

df %>% rowwise() %>% filter(sum(grepl(text, .$text, fixed = TRUE) & date == .$date) == 1)

## Source: local data frame [2 x 2]
## Groups: <by row>
## 
## # A tibble: 2 × 2
##       date                                                            text
##      <chr>                                                           <chr>
## 1 10-01-17                             Hi, how are you? Oh, I'm just fine.
## 2 11-01-17  Hi, how are you? Oh, I'm just fine. The weather is nice today.
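Both versions above compare every row against the whole column. Since the question mentions a large dataset, one possible variation (not part of the original answer) is to split the text by date first, so each text is only compared with texts from the same date. This is a minimal sketch using base R's split and unsplit; the function name keep_within_date is hypothetical.

```r
# Same sample data as in the Data section
df <- data.frame(
  date = c("10-01-17", "10-01-17", "11-01-17"),
  text = c("Hi, how are you?",
           "Hi, how are you? Oh, I'm just fine.",
           "Hi, how are you? Oh, I'm just fine. The weather is nice today."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for rows whose text is not contained
# in any other text with the same date
keep_within_date <- function(df) {
  by_date <- split(df$text, df$date)
  flags <- lapply(by_date, function(txt) {
    # Within one date, a text is kept if it matches only itself
    vapply(txt,
           function(t) sum(grepl(t, txt, fixed = TRUE)) == 1,
           logical(1), USE.NAMES = FALSE)
  })
  # Put the per-date flags back into the original row order
  unsplit(flags, df$date)
}

result <- df[keep_within_date(df), ]
```

On the sample data this keeps rows 2 and 3, the same as the answer's code. The number of comparisons then grows with the size of each date group rather than with the whole dataset.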

Data

df <- structure(list(date = c("10-01-17", "10-01-17", "11-01-17"
    ), text = c("Hi, how are you?", "Hi, how are you? Oh, I'm just fine.", 
    "Hi, how are you? Oh, I'm just fine. The weather is nice today."
    )), class = "data.frame", row.names = c(NA, -3L), .Names = c("date", "text"))