How to get possibly overlapping matches in a string

wub*_*dub 22 javascript ruby regex

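I'm looking for a way, in either Ruby or JavaScript, to get all matches of a regular expression in a string, including ones that overlap.


Suppose I have str = "abcadc" and I want to find every occurrence of a, followed by any number of characters, followed by c. The result I'm looking for is ["abc", "adc", "abcadc"]. Any ideas on how to achieve this?

str.scan(/a.*c/) gives me ["abcadc"], and str.scan(/(?=(a.*c))/).flatten gives me ["abcadc", "adc"].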

ndn*_*kov 11

def matching_substrings(string, regex)
  # \A and \z anchor to the whole substring (^ and $ only anchor to lines in Ruby)
  anchored = /\A#{regex}\z/
  string.size.times.each_with_object([]) do |start_index, matches|
    start_index.upto(string.size.pred) do |end_index|
      substring = string[start_index..end_index]
      matches.push(substring) if substring =~ anchored
    end
  end
end

matching_substrings('abcadc', /a.*c/) # => ["abc", "abcadc", "adc"]
matching_substrings('foobarfoo', /(\w+).*\1/) 
  # => ["foobarf",
  #     "foobarfo",
  #     "foobarfoo",
  #     "oo",
  #     "oobarfo",
  #     "oobarfoo",
  #     "obarfo",
  #     "obarfoo",
  #     "oo"]
matching_substrings('why is this downvoted?', /why.*/)
  # => ["why",
  #     "why ",
  #     "why i",
  #     "why is",
  #     "why is ",
  #     "why is t",
  #     "why is th",
  #     "why is thi",
  #     "why is this",
  #     "why is this ",
  #     "why is this d",
  #     "why is this do",
  #     "why is this dow",
  #     "why is this down",
  #     "why is this downv",
  #     "why is this downvo",
  #     "why is this downvot",
  #     "why is this downvote",
  #     "why is this downvoted",
  #     "why is this downvoted?"]


aef*_*aef 11

In Ruby, you can get the expected result with:

str = "abcadc"
[/(a[^c]*c)/, /(a.*c)/].flat_map{ |pattern| str.scan(pattern) }.reduce(:+)
# => ["abc", "adc", "abcadc"]

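Whether this approach suits you depends heavily on what you are really trying to achieve.

I tried to put this into a single expression, but I couldn't get it to work. I'd really like to know whether there is some theoretical reason why this can't be done with a single regular expression, or whether I simply don't know Ruby's regex engine, Oniguruma, well enough to do it.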
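One observation on the single-expression attempt: scan reports at most one match per starting position, so a single pattern can only surface as many substrings per start as it has capture groups. The sketch below (the helper name shortest_and_longest is made up for illustration, and the a...c pattern is hard-coded from the question) pairs a lazy and a greedy lookahead. It happens to cover "abcadc", but it is only a partial workaround, since it misses any match that is neither the shortest nor the longest for its start.

```ruby
# Hypothetical helper: collects the shortest and the longest a...c match
# beginning at each position, using two capturing lookaheads.
def shortest_and_longest(str)
  str.scan(/(?=(a.*?c))(?=(a.*c))/).flatten.uniq
end

p shortest_and_longest('abcadc')
# => ["abc", "abcadc", "adc"]
p shortest_and_longest('abcadcdc')
# => ["abc", "abcadcdc", "adc", "adcdc"]   ("abcadc" is missing)
```

The second call shows the limitation: with three possible end positions for the first a, the middle one is never reported.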

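  • Assuming the OP's string and regex are just an example, this does not give a general answer to the question. (4 upvotes)
  • This solution can easily be adapted to your first example. For the second one, you may be right; I don't know how to adapt it. That's why I wrote that it depends on what exactly the OP wants to achieve. (2 upvotes)
  • @WilliamFeng Should the [expected result](http://ideone.com/52UNhp) for `abcadcdc` include `abcadc` and `adcdc`? (2 upvotes)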

Wik*_*żew 8

In JS:

function doit(r, s) {
  var res = [], cur;
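  // Anchor the pattern and drop the g/y flags: `global` is read-only, and a
  // global regex would carry lastIndex from one test() call to the next.
  r = new RegExp('^(?:' + r.source + ')$', r.flags.replace(/[gy]/g, ''));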
  for (var q = 0; q < s.length; ++q)
    for (var w = q; w <= s.length; ++w)
      if (r.test(cur = s.substring(q, w)))
        res.push(cur);
  return res;
}
document.body.innerHTML += "<pre>" + JSON.stringify(doit( /a.*c/g, 'abcadc' ), 0, 4) + "</pre>";


Mar*_*eed 8

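You want all possible matches, including overlapping ones. As you noted, the lookahead trick from "How to find overlapping matches with a regexp?" doesn't work for your case.

In the general case, the only thing I can think of is to generate all possible substrings of the string and check each one against an anchored version of the regex. It's brute force, but it works.

Ruby: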

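def all_matches(str, regex)
  # Collect every substring (by start i and end j), then keep the unique
  # ones that the regex matches in full.
  (n = str.length).times.reduce([]) do |subs, i|
    subs + (i..n).map { |j| str[i, j - i] }
  end.uniq.grep(/\A#{regex}\z/)
end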

all_matches("abcadc", /a.*c/) 
#=> ["abc", "abcadc", "adc"]

JavaScript:

function allMatches(str, regex) {
  var i, j, len = str.length, subs={};
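  // Group the source so alternations stay anchored; keep flags such as i, minus g/y
  var anchored = new RegExp('^(?:' + regex.source + ')$', regex.flags.replace(/[gy]/g, ''));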
  for (i=0; i<len; ++i) {
    for (j=i; j<=len; ++j) {
       subs[str.slice(i,j)] = true;
    }
  }
  return Object.keys(subs).filter(function(s) { return s.match(anchored); });
}


Ale*_*kin 5

str = "abcadc"
from = str.split(/(?=\p{L})/).map.with_index { |c, i| i if c == 'a' }.compact
to   = str.split(/(?=\p{L})/).map.with_index { |c, i| i if c == 'c' }.compact
from.product(to).select { |f,t| f < t }.map { |f,t| str[f..t] }
# => ["abc", "abcadc", "adc"]

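I'm sure there is a fancy way to find all the indices of a character in a string, but I couldn't find it :( Any ideas?

Splitting on a "unicode char boundary" lets this work with strings like 'a?bc?' or 'U?ve Østergaard'.

A more general solution that accepts any "from" and "to" sequences needs only a small modification: find all the indices of "from" and "to" in the string.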
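A minimal sketch of that generalization, assuming "from" and "to" are plain non-empty strings rather than patterns. The helper names all_indices and between are invented for illustration; the index lookup relies only on String#index with a start offset.

```ruby
# Hypothetical helper: every character index at which `sub` occurs in `str`.
def all_indices(str, sub)
  indices = []
  pos = str.index(sub)
  while pos
    indices << pos
    pos = str.index(sub, pos + 1) # resume one character further on
  end
  indices
end

# Hypothetical helper: every substring that starts with `from` and ends
# with a later, non-overlapping occurrence of `to`.
def between(str, from, to)
  starts = all_indices(str, from)
  ends   = all_indices(str, to)
  starts.product(ends)
        .select { |f, t| f + from.size <= t }
        .map    { |f, t| str[f...(t + to.size)] }
end

p between('abcadc', 'a', 'c')
# => ["abc", "abcadc", "adc"]
```

For single-character delimiters the condition f + from.size <= t reduces to the f < t check used above.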


Car*_*and 5

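Here is an approach, similar to @ndn's and @Mark's, that works with any string and regex. I've implemented it as a method of String because that's where I'd like to see it. Wouldn't it be a great complement to String#[] and String#scan?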

class String
  def all_matches(regex)
    return [] if empty?
    r = /\A#{regex}\z/
    1.upto(size).with_object([]) { |i,a|
      a.concat(each_char.each_cons(i).map(&:join).select { |s| s =~ r }) }
  end
end

'abcadc'.all_matches /a.*c/
  # => ["abc", "adc", "abcadc"]
'aaabaaa'.all_matches(/a.*a/)
  #=> ["aa", "aa", "aa", "aa", "aaa", "aba", "aaa", "aaba", "abaa", "aaaba",
  #    "aabaa", "abaaa", "aaabaa", "aabaaa", "aaabaaa"] 