ASCII"图像"中的"垂直"正则表达式匹配

Qta*_*tax 58 .net php regex perl pcre

注意:这是关于现代正则表达口味的可能性的问题.这不是使用其他方法解决这个问题的最佳方法.它受到早期问题的启发,但不仅限于正则表达式.

问题

在ASCII"图像"/ art/map/string中:

....X.......
..X..X...X....
X.X...X..X.....
X....XXXXXX.....
X..XXX...........
.....X..........
..............X
..X...........X....
..X...........X....X...
....X.....
Run Code Online (Sandbox Code Playgroud)

我想找到三个简单的垂直线形式X:

X
X
X
Run Code Online (Sandbox Code Playgroud)

线条的数量在图像中是可变的,每条线条的宽度也是可变的.

问题

使用正则表达式(PCRE/PHP,Perl,.NET或类似)可以:

  1. 确定是否存在这样的形成
  2. 计算这些结构的数量/匹配它们的起点(上例中为4)

Qta*_*tax 38

回答问题1

要回答第一个问题,可以使用:

(?xm)                    # ignore comments and whitespace, ^ matches beginning of line
^                        # beginning of line
(?:
    .                    # any character except \n
    (?=                  # lookahead
        .*+\n            # go to next line
        ( \1?+ . )       # add a character to the 1st capturing group
        .*+\n            # next line
        ( \2?+ . )       # add a character to the 2nd capturing group
    )
)*?                      # repeat as few times as needed
X .*+\n                  # X on the first line and advance to next line
\1?+                     # if 1st capturing group is defined, use it, consuming exactly the same number of characters as on the first line
X .*+\n                  # X on the 2nd line and advance to next line
\2?+                     # if 2st capturing group is defined, use it, consuming exactly the same number of characters as on the first line
X                        # X on the 3rd line
Run Code Online (Sandbox Code Playgroud)

Online demo

此表达式适用于Perl,PCRE,Java,并且应该在.NET中工作.

该表达式使用具有自引用捕获组的前瞻为每个前瞻重复添加一个字符(这用于"计数").

\1?+表示\1匹配(或已定义)是否消耗它,并且不返回(不要回溯).在这种情况下,它相当于(?(1) \1 ).这意味着匹配\1if是否\1已定义.

polygenelubricants他的答案中很好地解释了这种具有反向引导的前瞻性我们如何才能将^ nb ^ n与Java正则表达式相匹配?.(他还撰写了关于Java正则表达式的其他令人印象深刻的技巧,包括反向引用和外观.)

回答问题2

简单匹配

当只使用匹配并要求匹配数量的答案(计数)时,问题2答案将是:

不能直接解决具有有限外观的正则表达式风格.而其他风格如Java和.NET可以(例如在m.buettner的.NET解决方案中).

因此,在这种情况下,Perl和PCRE(PHP等)中的普通正则表达式匹配不能直接回答这个问题.

(半?)证明

假设没有可变长度的lookbehinds可用.

你必须以某种方式计算一行前的一行中的字符数X.
唯一的方法就是匹配它们,因为没有可变长度的外观可用,你必须在行的开头开始匹配(至少).
如果您在一行的开头开始匹配,则每行最多只能获得一个匹配.

由于每行可能有多次出现,因此不会全部计算,也不会给出正确的答案.

长/间接解决方案

另一方面,如果我们接受答案作为匹配或替换结果的长度,那么第二个问题可以用 PCRE和Perl(以及其他类型)来回答.

这个解决方案基于m.buettner的"部分PCRE解决方案".

可以简单地将以下表达式的所有匹配替换为$3,将问题2的答案(兴趣模式的数量)作为结果字符串的长度.

^
(?:
    (?:                   # match .+? characters
        .
        (?=               # counting the same number on the following two lines
            .*+\n
            ( \1?+ . )
            .*+\n
            ( \2?+ . )
        )
    )+?
    (?<= X )              # till the above consumes an X
    (?=                   # that matches the following conditions
        .*+\n
        \1?+
        (?<= X )
        .*+\n
        \2?+
        (?<= X )
    )
    (?=                   # count the number of matches
        .*+\n
        ( \3?+ . )        # the number of matches = length of $3
    )
)*                        # repeat as long as there are matches on this line
.*\n?                     # remove the rest of the line
Run Code Online (Sandbox Code Playgroud)

在Perl中可以写成:

$in =~ s/regex/$3/gmx;
$count = length $in;
Run Code Online (Sandbox Code Playgroud)

Online demo

该表达式类似于上述问题1的解决方案,其中包括X在第一个先行中匹配的字符中的一些修改,用量词包裹并计算量词的匹配数.

除了直接匹配之外,它尽可能接近(除了正则表达式之外的额外代码),并且可以是问题2的可接受答案.

测试用例

上述解决方案的一些测试用例和结果.结果显示数字答案(结果字符串的长度),并在括号中显示替换后的结果字符串.

Test #0:
--------------------
X
X
X

result: 1 (X)


Test #1:
--------------------
..X....
..X....
..X....

result: 1 (.)


Test #2:
--------------------
..X.X..
..X.X..
....X..

result: 1 (.)


Test #3:
--------------------
..X....
..X....
...X...

result: 0 ()


Test #4:
--------------------
..X....
...X...
..X....

result: 0 ()


Test #5:
--------------------
....X..
.X..X..
.X.....

result: 0 ()


Test #6:
--------------------
.X..X..
.X.X...
.X.X...

result: 1 (.)


Test #7:
--------------------
.X..X..
.X..X..
.X..X..

result: 2 (.X)


Test #8:
--------------------
XXX
XXX
XXX

result: 3 (XXX)


Test #9:
--------------------
X.X.X
XXXXX
XXXXX
.X.X.

result: 5 (XXXXX)


Test #10:
--------------------
1....X.......
2..X..X...X....
3X.X...X..X.....
4X....XXXXXX.....
5X..XXX...........
6.....X..........
7.........X....X
8..X......X....X....
9..X......X....X....X...
A....X.....
B.X..X..
C.....
XXX
XXX
XXX
.

result: 8 (3458.XXX)
Run Code Online (Sandbox Code Playgroud)

  • +1 ...占有率 - 可选的自我引用群体......今天学到了新东西!:) ...那里有一些非常好的正则表达式魔法;)...也感谢通过polygenelubricants链接该系列 - 我之前没有遇到过,现在知道Java意外地支持无限长的外观. (2认同)

amo*_*mon 28

编辑

以下解决方案存在两个严重问题:

  1. 它们不能匹配XXX从同一行开始的多个序列,因为pos进展太多.
  2. 第二种解决方案是错误的:它匹配连续的行,其中两个X在彼此之上.没有必要连续三个.

因此,所有的upvotes(和奖金)都应该进入的m.buettner全面.NET答案迷人的PCRE答案Qtax自己.


原始答案

这是使用Perl代码嵌入正则表达式的答案.因为Perl正则表达式可以使用代码在正则表达式中声明任意条件或发出部分正则表达式,所以它们不仅限于匹配常规语言或无上下文语言,而是可以匹配Chomsky层次结构中较高级别的语言的某些部分.

您要匹配的语言可以用正则表达式来描述

^ .{n} X .*\n
  .{n} X .*\n
  .{n} X
Run Code Online (Sandbox Code Playgroud)

在哪里n是一个数字.这是一个关于作为匹配的复杂一个ñ b Ñ Ç Ñ语言,其是用于上下文敏感的语言规范的例子.

我们可以轻松匹配第一行,并使用一些Perl代码为其他行发出正则表达式:

    /^ (.*?) X
       (?: .*\n (??{"." x length($1)}) X){2}
    /mx
Run Code Online (Sandbox Code Playgroud)

那很短!它有什么作用?

  • ^ (.*?) X在一行开头的锚点,尽可能少地匹配非换行符,然后匹配X.我们记得X作为捕获组的一行$1.

  • 我们重复一个组两次,与该行的其余部分匹配,换行,然后注入一个与长度相同的字符串匹配的正则表达式$1.在那之后,必须有一个X.

此正则表达式现在将匹配每个具有三个X彼此顶部的字符串.

如果我们想要提取所有这些序列,我们必须是漂亮的.因为序列可能重叠,例如

.X
XX
XX
X.
Run Code Online (Sandbox Code Playgroud)

下一场比赛开始的位置不得超过第一场比赛X.我们可以通过后视和前瞻来做到这一点.Perl仅支持恒定长度的lookbehind,但具有\K提供类似语义的 转义.从而

/^ (.*?) \K X
   (?=( (?: .*\n (??{"."x length($1)}) X ){2} ))
/gmx
Run Code Online (Sandbox Code Playgroud)

将匹配三个垂直Xes的每个序列.测试时间:

$ perl -E'my$_=join"",<>; say "===\n$1X$2" while /^(.*?)\KX(?=((?:.*\n(??{"."x length($1)})X){2}))/gmx' <<'END'
....X.......
..X..X...X....
X.X...X..X.....
X....XXXXXX.....
X..XXX...........
.....X..........
..............X
..X...........X....
..X...........X....X...
....X.....
END
===
..X..X...X....
X.X...X..X.....
X....XXXXX
===
X.X...X..X.....
X....XXXXXX.....
X
===
X....XXXXXX.....
X..XXX...........
.....X
===
..............X
..X...........X....
..X...........X
Run Code Online (Sandbox Code Playgroud)

注意:这依赖于至少从Perl 5,v10开始提供的实验性正则表达式功能.代码用v16 perl测试.


解决方案没有嵌入代码

让我们来看看这些台词

...X...\n
...X..\n
Run Code Online (Sandbox Code Playgroud)

我们想断言...每行的前导部分长度相同.我们可以通过基本案例的递归来实现X.*\n:

(X.*\n|.(?-1).)X
Run Code Online (Sandbox Code Playgroud)

如果我们在一行的开头锚定它,我们可以匹配两个垂直Xes.要匹配两行以上,我们必须在前瞻中进行递归,然后将匹配位置前进到下一行,然后重复.为此,我们简单匹配.*\n.

这导致以下正则表达式可以匹配具有三个垂直Xes 的字符串:

/ ^
  (?:
    (?=( X.*\n | .(?-1). ) X)
    .*\n # go to next line
  ){2}
/mx
Run Code Online (Sandbox Code Playgroud)

但这还不够好,因为我们希望匹配所有这些序列.要做到这一点,我们基本上把整个正则表达式放在了前瞻性.正则表达式引擎确保每次都能提升位置以产生新的匹配.

/ ^
  (?=
    (
      (?:
          (?= (X.*\n | .(?-1). ) X)
          .*\n # go to next line
      ){2}
      .* # include next line in $1
    )
  )
/mx
Run Code Online (Sandbox Code Playgroud)

测试时间:

$ perl -E'my$_=join"",<>; say "===\n$1" while /^(?=((?:(?=(X.*\n|.(?-1).)X).*\n){2}.*))/gmx' <<'END'
....X.......
..X..X...X....
X.X...X..X.....
X....XXXXXX.....
X..XXX...........
.....X..........
..............X
..X...........X....
..X...........X....X...
....X.....
END
===
..X..X...X....
X.X...X..X.....
X....XXXXXX.....
===
X.X...X..X.....
X....XXXXXX.....
X..XXX...........
===
X....XXXXXX.....
X..XXX...........
.....X..........
===
..............X
..X...........X....
..X...........X....X...
Run Code Online (Sandbox Code Playgroud)

因此,这与嵌入代码的解决方案一样有效,也就是说,它将每组线与垂直Xes 匹配,而不是每组es 匹配X.(实际上,这个解决方案对我来说比嵌入式代码更脆弱)

  • 没有嵌入代码的解决方案不起作用.请参阅:http://ideone.com/UxVxJ9 +1尝试,但它比这更棘手.;-) (3认同)

Mar*_*der 27

首先:精彩的问题.我认为尝试将正则表达式引擎发挥到极限是非常有益的.

基本的.NET解决方案

你们在评论中说,使用.NET会很容易,但由于目前还没有答案,我以为我会写一个.

您可以使用.NET的可变长度lookbehind和balance groups来解决问题1和问题2.大多数工作都是由平衡组完成的,但是可变长度的后视对于能够在同一条线上检测到多个匹配至关重要.

无论如何,这是模式:

(?<=                  # lookbehind counts position of X into stack
  ^(?:(?<a>).)*       # push an empty capture on the 'a' stack for each character
                      # in front of X
)                     # end of lookbehind

X                     # match X

(?=.*\n               # lookahead checks that there are two more Xs right below
  (?:(?<-a>)(?<b>).)* # while we can pop an element from stack 'a', push an
                      # element onto 'b' and consume a character
  (?(a)(?!))          # make sure that stack 'a' is empty
  X.*\n               # match X and the rest of the line
  (?:(?<-b>).)*       # while we can pop an element from stack 'b', and consume
                      # a character
  (?(b)(?!))          # make sure that stack 'b' is empty
  X                   # match a final X
)                     # end of lookahead
Run Code Online (Sandbox Code Playgroud)

此图案具有与其一起使用RegexOptions.Multiline^(并与明显线的开端匹配RegexOptions.IgnorePatternWhitespace为freespacing模式下工作).

以下是一些额外的评论:

通过将除了初始X之外的所有内容放入外观中,我们可以解决重叠匹配甚至是从同一行开始的匹配问题.但是,lookbehind必须具有可变长度,这肯定会限制.NET中的任何此类解决方案.

其余的依赖于对平衡群体的良好把握.我不会在这里详细讨论这个问题,因为它本身就有很长的答案.(有关更多信息,请参阅MSDN此博客文章)

lookbehind只是匹配^.*,所以一直到行的开始,但是每次.我们将空捕获推到堆栈上a,从而计算我们X作为堆栈大小的位置.

然后在前瞻中消耗剩余的线之后,我们再次匹配.*,但在消耗每个之前.,我们从堆栈中弹出一个元素a(导致失败,一旦a为空)并将空捕获推到b(这样我们就不知道了) t忘记第三行必须有多少个字符).

为了确保我们真正清空整个堆栈,我们使用(?(a)(?!)).这是一个条件模式,(?!)如果堆栈a不为空则尝试匹配(否则只是跳过).并且(?!)是一个空洞的负面前瞻,总是失败.因此,这只是编码," a不是空的?失败.否则,继续".

现在知道我们已经在新行中消耗了恰当数量的字符,我们尝试匹配a X和该行的其余部分.然后我们再次使用堆栈重复相同的过程b.现在没有必要推进任何新的堆栈,因为如果这有效,我们就完成了.我们b在此之后检查是否为空,并匹配第三个X.

最后,一个优化方面的注意事项:如果所有重复都包含在原子组中(从而模拟.NET所不具备的占有量词),这种模式仍然有效!这样可以节省大量的回溯.此外,如果我们至少将堆栈弹出量词放在原子组中,我们就可以摆脱这两种(?(...)(?!))检查(因为只有在前一次重复必须回溯的情况下才需要这些检查).

完整的.NET解决方案

(只有最勇敢的冒险者才能跟随我进入真正黑暗的洞穴,我即将进入......)

正如评论中所讨论的,这个解决方案有一个缺点:它计算重叠匹配.例如

..X..
..X..
..X..
..X..
Run Code Online (Sandbox Code Playgroud)

给出两个匹配,一个在第一行,一个在第二行.我们想要避免这种情况,只报告一个匹配(如果有6到8 X秒,则报告两个,如果有9到11 X秒则报告三个,依此类推).此外,我们想报告第1,第4,第7,......的比赛X.

我们可以调整上面的模式以允许这个解决方案,要求第一个X模式前面X有满足我们要求的3个其他的整数倍.检查它的基本思想使用与之前相同的堆栈操作(除了我们在3个堆栈之间移动东西,以便在找到三个Xs之后我们最终在我们开始的地方).为了做到这一点,我们不得不改变一下后面的观点.

虽然有一个问题..NET的可变长度lookbehind使用另一个.NET独有的功能,RightToLeftMode其中从右到左读取(和匹配)模式.通常这不需要打扰我们,但是当我们将它与平衡组合在一起时,我们可能会遇到一些令人不快的意外.特别是,在考虑我们的捕获堆栈如何演变时,我们还需要构建(并读取)从右到左(或从下到上)的表达式.

因此,当您阅读以下表达式(和我的注释)时,从最外层的lookbehind结束开始(您将需要滚动一点) - 即在唯一的顶级之前X; 然后一直读到顶部.然后继续看后面.

(?<=                  
  # note that the lookbehind below does NOT affect the state of stack 'a'!
  # in fact, negative lookarounds can never change any capturing state.
  # this is because they have to fail for the engine to continue matching.
  # and if they fail, the engine needs to backtrack out of them, in which
  # case the previous capturing state will be restored.
  (?<!                # if we get here, there is another X on top of the last
                      # one in the loop, and the pattern fails
    ^                 # make sure we reached the beginning of the line
    (?(a)(?!))        # make sure that stack 'a' is empty
    (?:(?<-a>).)*     # while we can pop an element from stack 'a', and consume
                      # a character
    X.*\n             # consume the next line and a potential X
  )
  # at this point we know that there are less than 3 Xs in the same column
  # above this position. but there might still be one or two more. these
  # are the cases we now have to eliminate, and we use a nested negative
  # lookbehind for this. the lookbehind simply checks the next row and
  # asserts that there is no further X in the same column.
  # this, together with the loop, below means that the X we are going to match
  # is either the topmost in its column or preceded by an integer multiple of 3
  # Xs - exactly what we are looking for.
  (?:

    # at this point we've advanced the lookbehind's "cursor" by exactly 3 Xs
    # in the same column, AND we've restored the same amount of captures on
    # stack 'a', so we're left in exactly the same state as before and can
    # potentially match another 3 Xs upwards this way.
    # the fact that stack 'a' is unaffected by a full iteration of this loop is
    # also crucial for the later (lookahead) part to work regardless of the
    # amount of Xs we've looked at here.

    ^                 # make sure we reached the beginning of the line
    (?(c)(?!))        # make sure that stack 'a' is empty
    (?:(?<-c>)(?<a>).)* # while we can pop an element from stack 'c', push an
                      # element onto 'a' and consume a character
    X.*\n             # consume the next line and a potential X
    (?(b)(?!))        # make sure that stack 'b' is empty
    (?:(?<-b>)(?<c>).)* # while we can pop an element from stack 'b', push an
                      # element onto 'c' and consume a character
    X.*\n             # consume the next line and a potential X
    (?(a)(?!))        # make sure that stack 'a' is empty
    (?:(?<-a>)(?<b>).)* # while we can pop an element from stack 'a', push an
                      # element onto 'b' and consume a character
    X.*\n             # consume the next line and a potential X
  )*                  # this non-capturing group will match exactly 3 leading
                      # Xs in the same column. we repeat this group 0 or more
                      # times to match an integer-multiple of 3 occurrences.
  ^                   # make sure we reached the beginning of the line
  (?:(?<a>).)*        # push an empty capture on the 'a' stack for each
                      # character in front of X
)                     # end of lookbehind (or rather beginning)

# the rest is the same as before    

X                     # match X
(?=.*\n               # lookahead checks that there are two more Xs right below
  (?:(?<-a>)(?<b>).)* # while we can pop an element from stack 'a', push an
                      # element onto 'b' and consume a character
  (?(a)(?!))          # make sure that stack 'a' is empty
  X.*\n               # match X and the rest of the line
  (?:(?<-b>).)*       # while we can pop an element from stack 'b', and consume
                      # a character
  (?(b)(?!))          # make sure that stack 'b' is empty
  X                   # match a final X
)                     # end of lookahead
Run Code Online (Sandbox Code Playgroud)

RegexHero.net上的工作演示.

I interspersed all explanation right with the pattern this time. So if you read the pattern in the way I recommended above, you get the explanation right when you need it...

Now that was one hell of a beast. But it satisfies the entire specification now and shows off just how powerful .NET's regex flavor really is. And, although this seems quite horrible, I think (once you realise the right-to-left-thing) this is much more easily understandable than a comparable solution with PCRE (using recursion or otherwise).

As Kobi mentioned in a comment below, this could be shortened a good bit, if you accept that your results are found in multiple captures of a single match (e.g., if you have a column of 7 Xs you only get one match, but with 2 captures in a certain group). You can do this by repeating the main (lookahead) part 1 or more times and capturing the initial X (put everything in a lookahead though). Then the lookbehind does not need to count off triples of Xs, but only has to check that there is no leading X. This would probably cut the size of the pattern in half.

The partial PCRE solution

(If only the bravest of adventurers followed me through the last solution, I am probably only left with madmen on the next journey...)

To prove what I just said about how the above solution compares to PCRE, let's look at how we can even remotely solve the full problem in PCRE. We'll have to work a good bit harder without variable-length lookbehinds and balancing groups.

Qtax (the OP) provided a brilliant solution to his first question (checking whether the string contains any X-column) using self-referencing groups to count. This is a very elegant and compact solution. But because each match goes from the beginning of the line to the X that starts the column, and matches cannot overlap, we can't get multiple matches per line. We could try to put everything in a lookahead (so that nothing is actually matched), but two zero-width matches will also never start at the same position - so we'll still get only one match per candidate line.

However it is indeed possible to solve at least the first part of question 2 with PCRE: count the number of columns starting in each line (and hence to total amount of X columns). Since we cannot get this count in the form of individual matches (see previous paragraph), and we cannot get this count in the form of individual groups or captures (since PCRE provides only a fixed and finite number of captures, as opposed to .NET). What we can do instead is to encode the number of columns in the matches.

Here is how: for each line we check if there's a column starting. If so, we include one character in a certain capturing group. Then, before reporting a successful match, we try to find as many further columns as possible - for each one adding a character to that particular group. By doing this, we encode the number of columns starting in each line in the length of that particular capture.

Actually realizing this concept in a regex is a lot more complicated than it may first sound (and it already sounds quite complicated). Anyway, here it is:

^                        
(?:(?|
  (?(5)(?![\s\S]*+\5))      
  (?!(?!)()()) 
  (?=
    (?:
      .                  
      (?=                
        .*+\n            
        ( \3? . )   
        .*+\n        
        ( \4? . )    
      )
    )*?              
    X .*+\n          
    \3               
    X .*+\n          
    \4               
  )
  ()
|
  (?(5)(?=[\s\S]*+\5)|(?!))
  (?:
    .
    (?=
      .*+\n
      ( \1? .)
      .*+\n
      ( \2? .)
    )
  )+?
  (?=
    (?<=X).*+\n
    (\1)         
    (?<=X).*+\n
    (\2)         
    (?<=X)     
  )
  (?=
    ([\s\S])   
    [\s\S]*
    ([\s\S] (?(6)\6))
  )
){2})+
Run Code Online (Sandbox Code Playgroud)

(Actually, it's a bit easier than that - see Qtax's answer for how to simplify this approach. I'll leave this approach here anyway for academic reasons, as some very advanced and interesting techniques can be learned from it - see the summary at the end.)

Yes, there are no annotations. I figured, no one would actually read them anyway, so instead I'll try to break this expression down in parts (I'll go for a top-down approach).

So let's look at the outer layer of onion from hell:

^                        
(?:(?|
  checkForNextColumn
|
  countAndAdvance
){2})+
Run Code Online (Sandbox Code Playgroud)

So our matches are again anchored to the beginnings of lines. Then we have a (?:...{2})+ which means an even number of repetitions of something. And that something is an alternation of two subpatterns. These subpatterns represent the steps I mentioned above. The first one checks that there is another column starting in the current line, the second one registers a count and prepares the engine's state for another application of the first subpattern. So control is given to the second pattern - the first just checks for another column using a lookahead and is hence a zero-width pattern. This is why I cannot simply wrap everything in + but have to do the {2})+ thing - otherwise the zero-width component would be tried only once; that's a necessary optimization applied by pretty much all engines to avoid infinite loops with patterns like (a*)+.

There is one more (very important detail): I used (?|...) for the alternation. In this kind of grouping, each alternative starts with the same group number. Hence in /(?|(a)|(b))/ both a and b can be captured into group 1. This is the crucial trick that allows "communication" between subpatterns, as they can modify the same groups.

Anyway... so we have these two subpatterns. We'd like to make sure that control really alternates between them. So that each group fails if it was the last one that matched. We do this by wrapping the pattern in some grouping-and-referencing magic:

^(?:(?|
  (?(5)(?![\s\S]*+\5))       # if group 5 has matched before make sure that
                             # it didn't match empty
  checkForNextColumn         # contains 4 capturing groups
  ()                         # this is group 5, match empty
|
  (?(5)(?=[\s\S]*+\5)|(?!))  # make sure that group 5 is defined and that it
                             # matched empty
  advanceEngineState         # contains 4 capturing groups
  (?=
    ([\s\S])                 # this is group 5, match non-empty
    [\s\S]*                  # advance to the end very end of the string
    ([\s\S] (?(6)\6))             # add a character from the end of the string to
                             # group 6
  )
){2})+
Run Code Online (Sandbox Code Playgroud)

So at the end of each alternative, we'll invalidate the condition for this alternative to even start matching. At the end of the second alternative we'll also include a character into group 6, using the technique outlined by Qtax. This is the counting step. I.e., group 6 will contain as many characters as there are columns starting in the current line.

Now checkForNextColumn will really just be Qtax's solution inside a lookahead. It needs one more modification though and to justify this we'll look into advanceEngineState first.

Let's think about how we would want to modify the state, for Qtax's solution to match a second column in a row. Say we have input

..X..X..
..X..X..
..X..X..
Run Code Online (Sandbox Code Playgroud)

and we want to find the second column. This could be accomplished, by starting the match from the position just after the first X and having groups \1 and \2 already initialised to the first three characters (..X) of rows 2 and 3, respectively (instead of them being empty).

Now let's try to do this: match everything up to and including the next X that starts a column, then fill two groups with the corresponding line-prefixes for use in the checkForNextColumn pattern. This is again pretty much Qtax's pattern, except that we count the X in (instead of stopping right before it), and that we need to add the capturing into a separate group. So here is advanceEngineState:

(?:
  .
  (?=
    .*+\n
    ( \1? .)
    .*+\n
    ( \2? .)
  )
)+?
(?=
  (?<=X) .*+\n
  (\1)        
  (?<=X) .*+\n
  (\2)        
  (?<=X)
)
Run Code Online (Sandbox Code Playgroud)

Note how I turned the Xs into lookbehinds, to go one character further, and how I effectively copy the final contents of \1 into \3 and those of \2 into \4.

So if we now use Qtax's solution as checkForNextColumn in a lookahead, using groups \3 and \4 instead of \1 and \2, we should be done.

But just how do we make those groups \3 and \4 instead of \1 and \2? We could start the pattern with ()(), which would always match, without affecting the engine's cursor, but increase the group count by 2. However, this is problematic: this resets groups 1 and 2 to empty strings, which if we find a second column, advanceEngineState will be in an inconsistent state (as the engine's global position has been advanced, but the counting groups are zero again). So we want to get those two groups into the pattern, but without affecting what they are currently capturing. We can do this by utilizing something I already mentioned with the .NET solutions: groups in negative lookarounds do not affect the captured contents (because the engine needs to backtrack out of the lookaround to proceed). Hence we can use (?!(?!)()()) (a negative lookahead that can never cause the pattern to fail) to include two sets of parentheses in our pattern, that are never used. This allows us to work with groups 3 and 4 in our first subpattern, while keeping groups 1 and 2 untouched for the second subpatterns next iteration. In conclusion this is checkForNextColumn:

(?!(?!)()()) 
(?=
  (?:
    .                  
    (?=                
      .*+\n            
      ( \3? . )   
      .*+\n        
      ( \4? . )    
    )
  )*?              
  X .*+\n          
  \3               
  X .*+\n          
  \4               
)
Run Code Online (Sandbox Code Playgroud)

Which, for the most part actually looks really familiar.

So this is it. Running this against some input will give us a group 6 which contains one capture for each line that has a column starting - and the capture's length will tell us how many columns started there.

Yes, it really works (live demo).

Note that this (like the basic .NET solution) will overcount columns that are more than 3 Xs long. I suppose it is possible to correct this count with lookaheads (in a similar way to the lookbehind of the full .NET solution), but this is left as an exercise to the reader.

It's a bit unfortunate that the base problem of this solution is already very convoluted and bloats the solution (75% of the lines are mostly just copies of Qtax's solution). Because the surrounding framework has some really interesting techniques and lessons:

  • We can have multiple subpatterns that accomplish specific matching/counting tasks, and have them "communicate" through mutual capturing groups, by putting them in a (?|...) alternation and looping over them.
  • We can force zero-width alternatives to be carried out over and over again by wrapping them in a finite quantifier like {2} before putting everything into +.
  • Group numbers can be skipped in one subpattern (without affecting the captured contents) by putting them into a never-failing negative lookahead like (?!(?!)()).
  • Control can be passed back and forth between subpatterns by capturing something or nothing in a certain group that is checked upon entering the alternation.

This allows for some very powerful computations (I've seen claims that PCRE is in fact Turing-complete) - although this is certainly the wrong approach for productive use. But still trying to understand (and come up) with such solutions can be a very challenging and somehow rewarding exercise in problem solving.


bri*_*foy 11

如果你想找到一个"垂直"模式,这是一个解决方案.如果您还想匹配"水平"模式,请尝试使用单独的匹配,也许检查重叠匹配位置.请记住,计算机不知道线是什么.这是由人类组成的任意事物.字符串只是一维序列,我们将某些字符表示为行尾.

#!/usr/local/perls/perl-5.18.0/bin/perl
use v5.10;

my $pattern = qr/XXX/p;

my $string =<<'HERE';
....X.......
..X..X...X....
X.X...X..X.....
X....XXXXXX.....
X..XXX...........
.....X..........
..............X
..X...........X....
..X...........X....X...
....X.....
HERE


$transposed = transpose_string( $string );

open my $tfh, '<', \ $transposed;
while( <$tfh> ) {
    while( /$pattern/g ) {
        my $pos = pos() - length( ${^MATCH} );
        push @found, { row => $pos, col => $. - 1 };
        pos = $pos + 1; # for overlapping matches
        }
    }

# row and col are 0 based
print Dumper( \@found ); use Data::Dumper;

sub transpose_string {
    my( $string ) = @_;

    open my $sfh, '<', \ $string;

    my @transposed;
    while( <$sfh> ) {
        state $row = 0;
        chomp;
        my @chars = split //;

        while( my( $col, $char ) = each @chars ) {
            $transposed[$col][$row] = $char;
            }

        $row++;
        }

    my @line_end_positions = ( 0 );
    foreach my $col ( 0 .. $#transposed ) {
        $transposed .= join '', @{ $transposed[$col] };
        $transposed .= "\n";
        }
    close $sfh;

    return $transposed;
    }
Run Code Online (Sandbox Code Playgroud)

  • 我确实喜欢简单的转换图像优先于深奥的正则表达式... (3认同)

jay*_*tea 5

问题2的工作解决方案

在Perl/PCRE中完全可以实现!:)

对不起,我参加聚会的时间有点晚了,但我想指出的是,你实际上可以算出发现的XXX数量; 也就是说,构造表达式,使得在执行全局匹配时每个阵列只有一个匹配.不过,这很棘手.

这里是:

\A(?:(?=(?(3)[\s\S]*(?=\3\z))(?|.(?=.*\n(\1?+.).*\n(\2?+.))|.*\n()())+?(?<=X)(?=.*\n\1(?<=X).*\n\2(?<=X))(?=([\s\S]*\z)))(?=[\s\S]*([\s\S](?(4)\4)))[\s\S])+[\s\S]*(?=\4\z)|\G(?!\A|[\s\S]?\z)
Run Code Online (Sandbox Code Playgroud)

在regex101上行动

我应该添加一些评论!对于那些感兴趣的人:

\A(?:
    (?=
        (?(3)[\s\S]*(?=\3\z))                   # Resume from where we ended last iteration

        (?|                                     # Branch-reset group used to reset \1
            .(?=.*\n(\1?+.).*\n(\2?+.))         # and \2 to empty values when a new line
            |                                   # is reached. ".*\n" is used to skip the
            .*\n()()                            # rest of a line that is longer than the
        )+?                                     # ones below it.

        (?<=X)(?=.*\n\1(?<=X).*\n\2(?<=X))      # Find a XXX formation

        (?=([\s\S]*\z))                         # Save the rest of the line in \3 for 
    )                                           # when we resume looking next iteration

    (?=[\s\S]*([\s\S](?(4)\4)))                 # For every formation found, consume another 
                                                # character at the end of the subject

    [\s\S]                                      # Consume a character so we can move on

    )+

[\s\S]*(?=\4\z)                                 # When all formations around found, consume
                                                # up to just before \4 at the subject end.

|

\G(?!\A|[\s\S]?\z)                              # Now we just need to force the rest of the 
                                                # matches. The length of \4 is equal to the 
                                                # number of formations. So to avoid overmatching,
                                                # we need to exclude a couple of cases.
Run Code Online (Sandbox Code Playgroud)

我很害怕; 发生了什么??

基本上,我们在一个重复的非捕获组中穿过整个主体,从一个XXX阵型移动到下一个.对于找到的每个构造,将另一个角色添加到主题末尾的计数器上(存储在\ 4中).有几个障碍需要克服:

  • 如果我们一次性匹配所有内容,我们如何从一条线移到另一条线?解决方案:使用分支重置组在遇到新行时重置\ 1和\ 2.

  • 如果我们在末尾有一个带有小"\nX \nX \nX"的大型ASCII网格怎么办?如果我们将主题从一个阵型消耗到下一个阵型,我们将开始进入我们的柜台.解决方案:一次只消耗一个字符,并将形成搜索逻辑包装在前瞻中.

  • 但是,如果我们不从一个阵型消耗到下一个阵型,我们如何知道在非捕获组的下一次迭代中何处继续查看?解决方案:当找到阵型时,将该位置的角色捕捉到主体的最末端 - 一个始终可以参考的固定点.这与我用来匹配嵌套括号而不递归的技巧相同,这真正体现了前向引用的强大功能.

这很有趣,我很想看到更多这样的帖子!


归档时间:

查看次数:

2591 次

最近记录:

7 年,1 月 前