根据重叠模式删除部分字符串

Question

根据重叠模式删除部分字符串

我有以下数据：

dat <- data.frame(x               = c("this is my example text", "and here is my other text example", "my other text is short"),
                  some_other_cols = c(1, 2, 2))

Run Code Online (Sandbox Code Playgroud)

此外，我有以下模式向量：

my_patterns <- c("my example", "is my", "my other text")

Run Code Online (Sandbox Code Playgroud)

我想实现的是，以消除任何文本my_patterns发生在dat$x。

我尝试了下面的解决方案，但问题是，一旦我从文本中删除第一个模式（此处：“我的示例”），我的解决方案就无法检测到第二个模式的出现（此处：“是我的”） ) 或第三种模式了。

错误的解决方法：

library(tidyverse)
my_patterns_c <- str_c(my_patterns, collapse = "|")

dat_new <- dat %>%
  mutate(short_x = str_replace_all(x, pattern = my_patterns_c, replacement = ""))

Run Code Online (Sandbox Code Playgroud)

我想我可以做某事。就像遍历所有模式一样，收集 dat$x 中与我的模式匹配的字符串位置，然后将它们组合成一个范围并从文本中删除该范围。例如，我将列添加到我dat喜欢的数据帧start_pattern_1和end_pattern_1等。因此，对于第一行 1，我得到第一个模式的 9（开始）和 18（结束），第二个模式的 6/10。然后我需要检查是否有任何end位置与任何位置重叠start（这里是从 9 开始到 10 结束）并将它们组合到 6-18 范围内，然后从文本中删除这个范围。

问题是我可能有很多新的开始/结束列（在我的情况下可能是几百个模式），如果我需要成对比较重叠范围，我的计算机可能会崩溃。

所以我想知道如何让它工作或者我应该如何最好地处理这个解决方案。也许（我希望如此）有一个更好/更优雅/更简单的解决方案。

所需的输出dat将是：

x                                    some_other_cols    short_x
this is my example text              1                  this text
and here is my other text example    2                  and here example
my other text is short               2                  is short

Run Code Online (Sandbox Code Playgroud)

感谢你的帮助！谢谢。

Answer 1

Ten*_*bai 6

Uwe 在问题下的评论中提到了带有 str_locate_all 的新选项，这大大简化了代码：

library(stringr)
# Create function to remove matching part of text
# First argument is text, second argument is a list of start and length 
remove_matching_parts <- function(text, positions) {
  if (nrow(positions) == 0) return(text)
  ret <- strsplit(text,"")[[1]]
  lapply(1:nrow(positions), function(x) { ret[ positions[x,1]:positions[x,2] ] <<- NA } )
  paste0(ret[!is.na(ret)],separator="",collapse="")
}

# Loop over the data to apply the pattern
# row = length of vector, columns = length of pattern
matches <- lapply(dat$x, function(x) {
  do.call(rbind,str_locate_all(x, my_patterns)) # transform the list output of str_locate in a table of start/end
})

# Avoid growing a vector in a for loop, create it beforehand, it will be the same length as teh vector we work against
dat$result <- vector("character",length(dat$x))
# Loop on each value to remove the matching parts
for (i in 1:length(dat$x)) {
 dat$result[i] <- remove_matching_parts(as.character(dat$x[i]),matches[[i]])
}

Run Code Online (Sandbox Code Playgroud)

如果您可以控制模式定义并且可以手动创建它，那么可以使用正则表达式解决方案来实现：

> gsub("(is )?my (other text|example)?","",dat$x)
[1] "this  text"        "and here  example" " is short"

Run Code Online (Sandbox Code Playgroud)

这个想法是创建带有可选部分的模式（?在分组括号之后。

所以我们大致有：

(is )? <= 可选的“是”后跟空格
my <= 文字“我的”后跟空格
(other text|example)?<=“我的”之后的可选文本，“其他文本”或（|）“示例”

如果您没有控制权，事情会变得一团糟，我希望我已经发表了足够多的评论以使其易于理解，根据包含的循环数量不要期望它很快：

# Given datas
dat <- data.frame(x               = c("this is my example text", "and here is my other text example", "my other text is short","yet another text"),
                some_other_cols = c(1, 2, 2, 4))

my_patterns <- c("my example", "is my", "my other text")

# Create function to remove matching part of text
# First argument is text, second argument is a list of start and length 
remove_matching_parts <- function(text, positions) {
  ret <- strsplit(text,"")[[1]]
  lapply(positions, function(x) { ifelse(is.na(x),,ret[ x[1]:x[2] ] <<- NA ) } )
  paste0(ret[!is.na(ret)],separator="",collapse="")
}

# Create the matches between a vector and a pattern
# First argument is the pattern to match, second is the vector of charcaters
match_pat_to_vector <- function(pattern,vector) {
  sapply(regexec(pattern,vector), 
         function(x) {
           if(x>-1) { 
             c(start=as.numeric(x), end=as.numeric(x+attr(x,"match.length")) ) # Create a start/end vector from the index and length of the match
           }
         })
}

# Loop over the patterns to create a dataframe of matches
# row = length of vector, columns = length of pattern
matches <- sapply(my_patterns,match_pat_to_vector,vector=dat$x)

# Avoid growing a vector in a for loop, create it beforehand, it will be the same length as teh vector we work against
dat$result <- vector("character",length(dat$x))
# Loop on each value to remove the matching parts
for (i in 1:length(dat$x)) {
 dat$result[i] <- remove_matching_parts(as.character(dat$x[i]),matches[i,])
}

Run Code Online (Sandbox Code Playgroud)

运行后结果：

> dat
                                  x some_other_cols           result
1           this is my example text               1        this text
2 and here is my other text example               2 and here example
3            my other text is short               2         is short
4                  yet another text               4 yet another text

Run Code Online (Sandbox Code Playgroud)

Answer 2

Uwe*_*Uwe 6

这里有两个关键点：

从字符串中删除的模式可能会重叠
可能有多个非重叠模式要从字符串中删除

下面的解决方案尝试使用我最喜欢的工具解决这两个问题

library(data.table)
setDT(dat)[, rn := .I] # add row numbers to join on later

library(stringr)
library(magrittr) # piping used to improve readability

pos <- 
  # find start and end positions for each pattern
  lapply(my_patterns, function(pat) str_locate_all(dat$x, pat) %>% 
           lapply(as.data.table) %>% 
           rbindlist(idcol = "rn")) %>% 
  rbindlist() %>% 
  # collapse overlapping positions
  setorder(rn, start, end) %>% 
  .[, grp := cumsum(cummax(shift(end, fill = 0)) < start), by = rn] %>% 
  .[, .(start = min(start), end = max(end)), by = .(rn, grp)]

Run Code Online (Sandbox Code Playgroud)

现在，pos变成了：

    rn grp start end
 1:  1   1     6  18
 2:  2   1    10  25
 3:  3   1     1  13
 4:  5   1     6  10
 5:  5   2    24  28
 6:  6   1     1  13
 7:  6   2    15  27
 8:  7   1     3   7
 9:  8   1     1  10
10:  8   2    12  16
11:  8   3    22  34
12:  9   1     1  10
13:  9   2    19  31

Run Code Online (Sandbox Code Playgroud)

# remove patterns from strings from back to front
dat[, short_x := x]
for (g in rev(seq_len(max(pos$grp)))) {
  # update join 
  dat[pos[grp == g], on = .(rn), short_x := `str_sub<-`(short_x, start, end, value = "")]
}
dat[, rn := NULL][   #remove row number
  , short_x := str_squish(short_x)][]   # remove whitespace

Run Code Online (Sandbox Code Playgroud)

                                             x some_other_cols                          short_x
1:                     this is my example text               1                        this text
2:           and here is my other text example               2                 and here example
3:                      my other text is short               2                         is short
4:                            yet another text               4                 yet another text
5: this is my text where 'is my' appears twice               5 this text where '' appears twice
6:                 my other text is my example               6                                 
7:                                 This myself               7                           Thself
8:          my example is my not my other text               8                              not
9:             my example is not my other text               9                           is not

Run Code Online (Sandbox Code Playgroud)

从这个答案修改了折叠重叠位置的代码。

中间结果

lapply(my_patterns, function(pat) str_locate_all(dat$x, pat) %>% 
           lapply(as.data.table) %>% 
           rbindlist(idcol = "rn"))

Run Code Online (Sandbox Code Playgroud)

[[1]]
   rn start end
1:  1     9  18
2:  6    18  27
3:  8     1  10
4:  9     1  10

[[2]]
   rn start end
1:  1     6  10
2:  2    10  14
3:  5     6  10
4:  5    24  28
5:  6    15  19
6:  7     3   7
7:  8    12  16

[[3]]
   rn start end
1:  2    13  25
2:  3     1  13
3:  6     1  13
4:  8    22  34
5:  9    19  31

Run Code Online (Sandbox Code Playgroud)

显示模式 1 和 2 在第 1 行重叠，模式 2 和 3 在第 2 行重叠。第 5、8 和 9 行具有非重叠模式。第 7 行是为了显示模式的提取与单词边界无关。

编辑：`dplyr`版本

OP提到他/她“到目前为止成功避免了 data.table ”。所以，我觉得很难添加一个dplyr版本：

library(dplyr)
library(stringr)

pos <- 
  # find start end end positions for each pattern
  lapply(my_patterns, function(pat) str_locate_all(dat$x, pat) %>% 
           lapply(as_tibble) %>% 
           bind_rows(.id = "rn")) %>% 
  bind_rows() %>% 
  # collapse overlapping positions
  arrange(rn, start, end) %>% 
  group_by(rn) %>% 
  mutate(grp = cumsum(cummax(lag(end, default = 0)) < start)) %>% 
  group_by(rn, grp) %>% 
  summarize(start = min(start), end = max(end))
# remove patterns from strings from back to front
dat <- dat %>%
  mutate(rn = row_number() %>% as.character(),
         short_x = x %>% as.character())
for (g in rev(seq_len(max(pos$grp)))) {
  dat <- dat %>% 
    left_join(pos %>% filter(grp == g), by = "rn") %>% 
    mutate(short_x = ifelse(is.na(grp), short_x, `str_sub<-`(short_x, start, end, value = ""))) %>% 
    select(-grp, -start, -end)
}
# remove row number
dat %>% 
  select(-rn) %>% 
  mutate(short_x = str_squish(short_x))

Run Code Online (Sandbox Code Playgroud)

                                            x some_other_cols                          short_x
1                     this is my example text               1                        this text
2           and here is my other text example               2                 and here example
3                      my other text is short               2                         is short
4                            yet another text               4                 yet another text
5 this is my text where 'is my' appears twice               5 this text where '' appears twice
6                 my other text is my example               6                                 
7                              This is myself               7                        This self
8          my example is my not my other text               8                              not
9             my example is not my other text               9                           is not

Run Code Online (Sandbox Code Playgroud)

算法本质上是一样的。但是，这里有两个挑战dplyr不同于data.table：

dplyr需要从factor到的显式强制转换character
中没有可用的更新连接dplyr，因此for循环变得比data.table对应的循环更冗长（也许有人知道一个花哨的purrr函数或map-reduce技巧来完成相同的任务？）

编辑 2

以上代码有一些错误修复和改进：

折叠位置已得到纠正，也适用于我添加到dat.
seq()已被取代seq_len()。
str_squish() 减少字符串内的重复空格并从字符串的开头和结尾删除空格。

数据

我添加了一些用例来测试非重叠模式和完全删除，例如：

dat <- data.frame(
  x = c(
    "this is my example text",
    "and here is my other text example",
    "my other text is short",
    "yet another text",
    "this is my text where 'is my' appears twice",
    "my other text is my example",
    "This myself",
    "my example is my not my other text",
    "my example is not my other text"
  ),
  some_other_cols = c(1, 2, 2, 4, 5, 6, 7, 8, 9)
)
my_patterns <- c("my example", "is my", "my other text")

Run Code Online (Sandbox Code Playgroud)

归档时间：	6 年前
查看次数：	780 次
最近记录：	6 年前

根据重叠模式删除部分字符串

编辑：dplyr版本

编辑 2

数据

编辑：`dplyr`版本