R中最长的公共子串查找两个字符串之间的非连续匹配

IAM*_*bby 14 r lcs

我有一个关于在R中找到最长公共子字符串的问题.在StackOverflow上搜索几个帖子时,我了解了qualV包.但是,我看到这个包中的LCS函数实际上找到了string1中存在的所有字符,即使它们不是连续的.

为了解释,如果字符串字符串1:" HEL LO"字符串2:" HEL 12345lo"我希望可以将输出为HEL,但是我得到的输出为hello.我一定做错了什么.请参阅下面的代码.

library(qualV)
a= "hello"
b="hel123l5678o" 
sapply(seq_along(a), function(i)
    paste(LCS(substring(a[i], seq(1, nchar(a[i])), seq(1, nchar(a[i]))),
              substring(b[i], seq(1, nchar(b[i])), seq(1, nchar(b[i]))))$LCS,
          collapse = ""))

Run Code Online (Sandbox Code Playgroud)

我也尝试了Rlibstree方法,但我仍然得到不连续的子串.此外,子串的长度也与我的预期不同.请参阅下文.

> a = "hello"
> b = "h1e2l3l4o5"

> ll <- list(a,b)
> lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), function(x) getLongestCommonSubstring(x))
$do.call.rbind..ll.
[1] "h" "e" "l" "o"

> nchar(lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), function(x) getLongestCommonSubstring(x)))
do.call.rbind..ll.
                21

Run Code Online (Sandbox Code Playgroud)

利用adist可以使用的@RichardScriven 的洞察力（它计算“近似字符串距离”。我做了一个更全面的函数。请注意"trafos"代表用于确定两个字符串之间“距离”的“转换”（底部示例）

编辑此答案可能会产生错误/意外的结果；正如@wdkrnls 所指出的：

我针对“apple”和“big apple bagels”运行了你的函数，它返回了“appl”。我会期待“苹果”。

有关错误结果，请参阅下面的说明。我们从一个获取longest_string列表的函数开始：

longest_string <- function(s){return(s[which.max(nchar(s))])}

Run Code Online (Sandbox Code Playgroud)

然后我们可以使用@RichardSriven 的工作和stringi库：

library(stringi)
lcsbstr <- function(a,b) { 
  sbstr_locations<- stri_locate_all_regex(drop(attr(adist(a, b, counts=TRUE), "trafos")), "M+")[[1]]
  cmn_sbstr<-stri_sub(longest_string(c(a,b)), sbstr_locations)
  longest_cmn_sbstr <- longest_string(cmn_sbstr)
   return(longest_cmn_sbstr) 
}

Run Code Online (Sandbox Code Playgroud)

或者我们可以重写我们的代码以避免使用任何外部库（仍然使用 R 的本机adist函数）：

lcsbstr_no_lib <- function(a,b) { 
    matches <- gregexpr("M+", drop(attr(adist(a, b, counts=TRUE), "trafos")))[[1]];
    lengths<- attr(matches, 'match.length')
    which_longest <- which.max(lengths)
    index_longest <- matches[which_longest]
    length_longest <- lengths[which_longest]
    longest_cmn_sbstr  <- substring(longest_string(c(a,b)), index_longest , index_longest + length_longest - 1)
    return(longest_cmn_sbstr ) 
}

Run Code Online (Sandbox Code Playgroud)

上述两个函数仅标识'hello '为最长的公共子字符串，而不是 ' hello r'（无论哪个参数是两者中较长的）：

identical('hello',
    lcsbstr_no_lib('hello', 'hello there'), 
    lcsbstr(       'hello', 'hello there'),
    lcsbstr_no_lib('hello there', 'hello'), 
    lcsbstr(       'hello there', 'hello'))

Run Code Online (Sandbox Code Playgroud)

上次编辑 请注意以下结果的一些奇怪行为：

lcsbstr('hello world', 'hello')
#[1] 'hell'

Run Code Online (Sandbox Code Playgroud)

我期待'hello'，但由于转型实际上移动（通过删除）的“O”以W Ø RLD成为地狱“O” Ø -只有地狱部分被认为根据比赛M：

drop(attr(adist('hello world', 'hello', counts=TRUE), "trafos"))
#[1] "MMMMDDDMDDD"
#[1]  vvvv   v
#[1] "hello world"

Run Code Online (Sandbox Code Playgroud)

使用这个 Levenstein 工具观察到这种行为——它给出了两种可能的解决方案，相当于这两种转换

#[1] "MMMMDDDMDDD"
#[1] "MMMMMDDDDDD"

Run Code Online (Sandbox Code Playgroud)

我不知道我们是否可以配置adist为更喜欢一种解决方案？（转换具有相同的“权重”——相同数量的“M”和“D”——不知道如何更喜欢连续数量更多的转换M）

最后，别忘了 adist 允许你传入ignore.case = TRUE(FALSE是默认值)

"trafos"属性的关键adist；从一个字符串到另一个字符串的“转换”：

转换序列作为返回值的“trafos”属性返回，作为包含元素M, 的字符串I，D并S指示匹配、插入、删除和替换

这有三种可能的解决方案.

library(stringi)
library(stringdist)

a <- "hello"
b <- "hel123l5678o"

## get all forward substrings of 'b'
sb <- stri_sub(b, 1, 1:nchar(b))
## extract them from 'a' if they exist
sstr <- na.omit(stri_extract_all_coll(a, sb, simplify=TRUE))
## match the longest one
sstr[which.max(nchar(sstr))]
# [1] "hel"

Run Code Online (Sandbox Code Playgroud)

也有adist()与agrep()在基R,并且stringdist包具有运行LCS方法的一些功能.这是一个看看stringsidt.它返回不成对字符的数量.

stringdist(a, b, method="lcs")
# [1] 7

Filter("!", mapply(
    stringdist, 
    stri_sub(b, 1, 1:nchar(b)),
    stri_sub(a, 1, 1:nchar(b)),
    MoreArgs = list(method = "lcs")
))
#  h  he hel 
#  0   0   0

Run Code Online (Sandbox Code Playgroud)

现在我已经对此进行了更多的探索,我认为adist()可能是要走的路.如果我们设置,counts=TRUE我们得到一系列匹配,插入等.所以如果你给stri_locate()我们,我们可以使用该矩阵来获得从a到b的匹配.

ta <- drop(attr(adist(a, b, counts=TRUE), "trafos")))
# [1] "MMMIIIMIIIIM"

Run Code Online (Sandbox Code Playgroud)

所以这些M值表示直接匹配.我们可以去获得子串stri_sub()

stri_sub(b, stri_locate_all_regex(ta, "M+")[[1]])
# [1] "hel" "l"   "o"

Run Code Online (Sandbox Code Playgroud)

对不起,我没有解释得那么好,因为我不熟悉字符串距离算法.

我不确定你做了什么来获得“你好”的输出。根据下面的试错实验，该LCS函数似乎会 (a) 如果一个字符跟在本来是 LCS 的字符后面，则不会将字符串视为 LCS；(b) 查找多个等长的 LCS（与仅查找第一个的 sub() 不同）；(c) 字符串中元素的顺序并不重要——下面没有说明；(b) LCS 调用中字符串的顺序并不重要——也未显示。

因此，a 的“hello”在 b 中没有 LCS，因为 b 的“hel”后面跟着一个字符。嗯，这就是我目前的假设。

上面A点：

a= c("hello", "hel", "abcd")
b= c("hello123l5678o", "abcd") 
print(LCS(a, b)[4]) # "abcd" - perhaps because it has nothing afterwards, unlike hello123...

a= c("hello", "hel", "abcd1") # added 1 to abcd
b= c("hello123l5678o", "abcd") 
print(LCS(a, b)[4]) # no LCS!, as if anything beyond an otherwise LCS invalidates it

a= c("hello", "hel", "abcd") 
b= c("hello1", "abcd") # added 1 to hello
print(LCS(a, b)[4]) # abcd only, since the b hello1 has a character

Run Code Online (Sandbox Code Playgroud)

上面B点：

a= c("hello", "hel", "abcd") 
b= c("hello", "abcd") 
print(LCS(a, b)[4]) # found both, so not like sub vs gsub of finding first or all

Run Code Online (Sandbox Code Playgroud)

归档时间：	10 年，8 月前
查看次数：	6237 次
最近记录：	7 年，4 月前

查找两个字符变量之间的公共子串 12

更多相关链接

脚本中的ggplot图不会显示在Rstudio中 66

许多文本文件中的大量数据 - 如何处理？ 32

使用dplyr按组连接字符串 30

所有/任何列都大于特定值的子集行 15

使用docker文件安装R软件包 15

如何在r图中调整轴从零原点开始 14

如何使用两列条件过滤数据框？ 12

如何为每个组应用geom_smooth()？ 12

如何在R中生成多元正态数据？ 12

删除ggplot2图例中的点透明度 11

如何将空目录添加到Git存储库？ 4039

如何使用Curl从终端/命令行发布JSON数据到测试Spring REST？ 2606

在Windows命令行上是否有相应的"哪个"？ 2231

如何避免JSP文件中的Java代码？ 1649

按值对地图<键,值>进行排序 1569

Dockerfile中CMD和ENTRYPOINT有什么区别？ 1484

match_parent和fill_parent有什么区别？ 1371

什么是右值,左值,x值,glvalues和prvalues？ 1291

使用node.js作为简单的Web服务器 1068

在jQuery中删除事件处理程序的最佳方法？ 1038