匹配正确添加两个二进制数与PCRE正则表达式

Question

匹配正确添加两个二进制数与PCRE正则表达式

是否有可能匹配形式的添加(?<a>[01]+)\s*\+\s*(?<b>[01]+)\s*=\s*(?<c>[01]+),在哪里a + b == c(如二进制加法)必须保持？

这些应匹配:

0 + 0 = 0
0 + 1 = 1
1 + 10 = 11
10 + 111 = 1001
001 + 010 = 0011
1111 + 1 = 10000
1111 + 10 = 10010

Run Code Online (Sandbox Code Playgroud)

这些不符合:

0 + 0 = 10
1 + 1 = 0
111 + 1 = 000
1 + 111 = 000
1010 + 110 = 1000
110 + 1010 = 1000

Run Code Online (Sandbox Code Playgroud)

Answer 1

bwo*_*ebi 63

TL;DR: Yes, it indeed is possible (use with Jx flags):

(?(DEFINE)
(?<add> \s*\+\s* )
(?<eq> \s*=\s* )
(?<carry> (?(cl)(?(cr)|\d0)|(?(cr)\d0|(*F))) )
(?<digitadd> (?(?= (?(?= (?(l1)(?(r1)|(*F))|(?(r1)(*F))) )(?&carry)|(?!(?&carry))) )1|0) )
(?<recursedigit>
  (?&add) 0*+ (?:\d*(?:0|1(?<r1>)))? (?(ro)|(?=(?<cr>1)?))\k<r> (?&eq) \d*(?&digitadd)\k<f>\b
| (?=\d* (?&add) 0*+ (?:\k<r>(?<ro>)|\d*(?<r>\d\k<r>)) (?&eq) \d*(?<f>\d\k<f>)\b) \d(?&recursedigit)
)
(?<checkdigit> (?:0|1(?<l1>)) (?=(?<cl>1)?) (?<r>) (?<f>) (?&recursedigit) )
(?<carryoverflow>
  (?<x>\d+) 0 (?<y> \k<r> (?&eq) 0*\k<x>1 | 1(?&y)0 )
| (?<z> 1\k<r> (?&eq) 0*10 | 1(?&z)0 )
)
(?<recurseoverflow>
  (?&add) 0*+ (?(rlast) \k<r> (?&eq) 0*+(?(ro)(?:1(?=0))?|1)\k<f>\b
                | (?:(?<remaining>\d+)(?=0\d* (?&eq) \d*(?=1)\k<f>\b)\k<r> (?&eq) (*PRUNE) 0*\k<remaining>\k<f>\b
                   | (?&carryoverflow)\k<f>\b))
| (?=\d* (?&add) 0*+ (?:\k<r>(?<ro>)|(?=(?:\d\k<r>(?&eq)(?<rlast>))?)\d*(?<r>\d\k<r>)) (?&eq) \d*(?<f>\d\k<f>)\b)
  \d(?&recurseoverflow)
)
(?<s>
  (| 0*? (?<arg>[01]+) (?&add) 0+ | 0+ (?&add) 0*? (?<arg>[01]+)) (?&eq) (*PRUNE) 0* \k<arg>
| 0*+
  (?=(?<iteratedigits> (?=(?&checkdigit))\d (?:\b|(?&iteratedigits)) ))
  (?=[01]+ (?&add) [01]+ (?&eq) [01]+ \b)
  (?<r>) (?<f>) (?&recurseoverflow)
)
)
\b(?&s)\b

Run Code Online (Sandbox Code Playgroud)

Live demo: https://regex101.com/r/yD1kL7/26

[Update: Due to a bug in PCRE, the code only works for all cases with PCRE JIT active; thanks to Qwerp-Derp for noticing; without JIT e.g. 100 + 101 = 1001 fails to match.]

That's quite a monster regex. So, let's build it step by step to understand what's going on.

Hint: For easier memorization and following through the explanation, let me explain the names of the single or two digit capturing group names (all except the first two are flags [see below]):

r => right; it contains the part of the right operand right to a given offset
f => final; it contains the part of the final operand (the result) right to a given offset
cl => carry left; the preceding digit of the left operand was 1
cr => carry right; the preceding digit of the right operand was 1
l1 => left (is) 1; the current digit of the left operand is 1
r1 => right (is) 1; the current digit of the right operand is 1
ro => right overflow; the right operand is shorter than the current offset
rlast => right last; the right operand is at most as long as the current offset

Run Code Online (Sandbox Code Playgroud)

对于更具可读性+和=可能的前导和尾随空格,有两个捕获组(?<add> \s*\+\s*)和(?<eq> \s*=\s*).

我们正在进行一项补充.由于它是正则表达式,我们需要立即验证每个数字.那么,看看这背后的数学:

检查添加一位数

current digit = left operand digit + right operand digit + carry of last addition

Run Code Online (Sandbox Code Playgroud)

我们怎么知道随身携带？

我们可以简单地看一下最后一位数字:

carry =    left operand digit == 1 && right operand digit == 1
        || (left operand digit == 1 || right operand digit == 1) && result digit == 0

Run Code Online (Sandbox Code Playgroud)

该逻辑由捕获组提供carry,定义如下:

(?<carry> (?(cl)(?(cr)|\d0)|(?(cr)\d0|(*F))) )

Run Code Online (Sandbox Code Playgroud)

同cl是是否最后左操作数位为1或不和cr是否最后一次正确的操作数位为1或不; \d0是检查结果中的最后一位数是否为0.

Note: (?(cl) ... | ...) is a conditional construct checking whether a capturing group has been defined or not. Due to capturing groups being scoped to each level of recursion, this is easily usable as a mechanism to set a boolean flag (can be set with (?<cl>) anywhere) which can later be conditionally acted upon.

Then the actual addition is a simple:

is_one = ((left operand digit == 1) ^ (right operand digit == 1)) ^ carry

Run Code Online (Sandbox Code Playgroud)

expressed by the digitadd capturing group (using a ^ b == (a && !b) || (!a && b), using l1 being whether the digit of the left operand is equal to 1 and r1 equivalently for the right digit:

(?<digitadd> (?(?= (?(?= (?(l1)(?(r1)|(*F))|(?(r1)(*F))) )(?&carry)|(?!(?&carry))) )1|0) )

Run Code Online (Sandbox Code Playgroud)

Checking addition at a given offset

Now, that we can verify, given a defined cr, cl, l1 and r1 whether the digit at the result is correct by simply invoking (?&digitadd) at that offset.

... at that offset. That's the next challenge, finding said offset.

The fundamental issue is, given three strings with a known separator in between, how to find the correct offset from the right.

For example 1***+****0***=****1*** (the separators are + and = here, and * denoting any arbitrary digit).

Or even, more fundamentally: 1***+****0***=1.

This can be matched with:

(?<fundamentalrecursedigit>
  \+ \d*(?:1(?<r1>)|0)\k<r> = (?(r1) (?(l1) 0 | 1) | (?(l1) 1 | 0) ) \b
| (?=\d* + \d*(?<r>\d\k<r>) =) \d (?&fundamentalrecursedigit)
)
(?<fundamentalcheckdigit>
  # Note: (?<r>) is necessary to initialize the capturing group to an empty string
  (?:1(?<l1>)|0) (?<r>) (?&fundamentalrecursedigit)
)

Run Code Online (Sandbox Code Playgroud)

(Many thanks here to nhahdth for his solution to this issue — altered here a little to fit the example)

First we're storing information about the digit at the current offset ((?:1(?<l1>)|0) - set the flag (i.e. capturing group which can be checked against with (?(flag) ... | ...)) l1 when the current digit is 1.

Then we build the string to the right of the searched offset of the right operand recursively with (?=\d* + \d*(?<r>\d\k<r>) =) \d (?&fundamentalrecursedigit), which advances by one digit (on the left operand) on each level of recursion and adds one digit more to the right part of the right operand: (?<r> \d \k<r>) re-defines the r capturing group and adds another digit to the already existing capture (referenced with \k<r>).

Thus, as this recurses as long as there are digits on the left operand and exactly one digit is added to the r capturing group per level of recursion, that capturing group will contain exactly as many characters as there were digits on the left operand.

Now, at the end of the recursion (i.e. when the separator + is reached), we can go straight to the correct offset via \d*(?:1(?<r1>)|0)\k<r>, as the searched digit will now be exactly the digit before what the capturing group r matched.

Having now also the r1 flag conditionally set, we can go to the end, checking the result for correctness with simple conditionals: (?(r1) (?(l1) 0 | 1) | (?(l1) 1 | 0).

Given this, it is trivial to extend this to 1***+****0***=****1***:

(?<fundamentalrecursedigit>
  \+ \d*(?:1(?<r1>)|0)\k<r> = \d*(?(r1) (?(l1) 0 | 1) | (?(l1) 1 | 0) )\k<f> \b
| (?=\d* + \d*(?<r>\d\k<r>) = \d*(?<f>\d\k<f>)\b) \d (?&fundamentalrecursedigit)
)
(?<fundamentalcheckdigit>
  (?:1(?<l1>)|0) (?<r>) (?<f>) (?&fundamentalrecursedigit)
)

Run Code Online (Sandbox Code Playgroud)

by using the exactly same trick, collecting also the right part of the result in an f capturing group and accessing the offset right before what this capturing group f matched.

Let's add support for a carry, which really is just setting the cr and cl flags whether the next digit was 1 via (?=(?<cr/cl>1)?) after the current digits of the left and right operands:

(?<carryrecursedigit>
  \+ \d* (?:1(?<r1>)|0) (?=(?<cr>1)?) \k<r> = \d* (?&digitadd) \k<f> \b
| (?=\d* + \d*(?<r>\d\k<r>) = \d*(?<f>\d\k<f>)\b) \d (?&carryrecursedigit)
)
(?<carrycheckdigit>
  (?:1(?<l1>)|0) (?=(?<cl>1)?) (?<r>) (?<f>) (?&carryrecursedigit)
)

Run Code Online (Sandbox Code Playgroud)

Checking inputs of equal lengths

Now, we could be done here if we were to pad all the inputs with enough zeroes:

\b
(?=(?<iteratedigits> (?=(?&carrycheckdigit)) \d (?:\b|(?&iteratedigits)) ))
[01]+ (?&add) [01]+ (?&eq) [01]+
\b

Run Code Online (Sandbox Code Playgroud)

i.e. assert for each digit of the left operand recursively that the addition can be performed and then succeed.

But obviously, we aren't done yet. What about:

the left operand being longer than the right?
the right operand being longer than the left?
the left operand being as long or longer than the right operand and the result having a carry at the most significant digit (i.e. needs a leading 1)?

Handle a left operand longer than the right one

That one is pretty trivial, just stop trying to add digits to the r capturing group when we have captured it completely, set a flag (here: ro) to not consider it eligible for carry anymore and make a digit leading r optional (by (?:\d* (?:1(?<r1>)|0))?):

(?<recursedigit>
  \+ (?:\d* (?:1(?<r1>)|0))? (?(ro)|(?=(?<cr>1)?)) \k<r> = \d* (?&digitadd) \k<f> \b
| (?=\d* + (?:\k<r>(?<ro>)|\d*(?<r>\d\k<r>)) = \d*(?<f>\d\k<f>)\b) \d (?&recursedigit)
)
(?<checkdigit>
  (?:1(?<l1>)|0) (?=(?<cl>1)?) (?<r>) (?<f>) (?&recursedigit)
)

Run Code Online (Sandbox Code Playgroud)

This is now handling the right operand as if it were zero-padded; r1 and cr now just never will be set after that point. Which is all what we need.

It might be easy to be confused here why we can set the ro flag immediately when exceeding the right operators length and immediately ignore the carry; the reason is checkdigit already consuming the first digit at the current position, thus we are actually already more than one digit past the end of the right operand.

The right operand happens to be longer than the left

This is now a bit harder. We can't cram it into recursedigit as that one will just iterate as often as there are digits in the left operand. Thus, we need a separate match for that.

There are now a few cases to consider:

There is a carry from the addition of the most significant digit of the left operand
There is no carry.

For the former case we want to match 10 + 110 = 1000, 11 + 101 = 1000; for the latter case we want to match 1 + 10 = 11 or 1 + 1000 = 1001.

To simplify our task, we're going to ignore leading zeroes for now. Then we know that the most significant digit is 1. There is now no carry only and only if:

the digit at the current offset (i.e. the offset of the most significant digit of the left operand) in the right operand is 0
and there was no carry from the previous offset, meaning the digit at the current in the result is 1.

This translates into the following assertion:

\d+(?=0)\k<r> (?&eq) \d*(?=1)\k<f>\b

Run Code Online (Sandbox Code Playgroud)

In that case, we can capture the first \d+ with (?<remaining>\d+) and require it to be in front of \k<f> (the part on the right of the current offset of the result):

(?<remaining>\d+)\k<r> (?&eq) \k<remaining>\k<f>\b

Run Code Online (Sandbox Code Playgroud)

Otherwise, if there is a carry, we need to increment the left part of the right operand:

(?<carryoverflow>
  (?<x>\d+) 0 (?<y> \k<r> (?&eq) \k<x>1 | 1(?&y)0 )
| (?<z> 1\k<r> (?&eq) 10 | 1(?&z)0 )
)

Run Code Online (Sandbox Code Playgroud)

and use it as:

(?&carryoverflow)\k<f>\b

Run Code Online (Sandbox Code Playgroud)

This carryoverflow capturing group works by copying the left part of the right operand, finding the last zero there and replacing all ones less significant than that zero by zeroes and the zero by an one. In case there is no zero in that part, the ones are simply all replaced by a zero and a leading one is added.

Or to express it less wordily (with n being arbitrary, so that it fits):

  (?<x>\d+) 0 1^n \k<r> (?&eq) \k<x> 1 0^n \k<f>\b
| 1^n \k<r> (?&eq) 1 0^n \k<f>\b

Run Code Online (Sandbox Code Playgroud)

So, let's apply our usual technique to figure out the parts on the right of the operands:

(?<recurselongleft>
  (?&add) (?:(?<remaining>\d+)(?=(?=0)\k<r> (?&eq) \d*(?=1)\k<f>\b)\k<r> (?&eq) (*PRUNE) \k<remaining>\k<f>\b
            | (?&carryoverflow)\k<f>\b)
| (?=\d* (?&add) \d*(?<r>\d\k<r>) (?&eq) \d*(?<f>\d\k<f>)\b) \d(?&recurselongleft)
)

Run Code Online (Sandbox Code Playgroud)

Note that I've added a (*PRUNE) after (?&eq) in the first branch in order to avoid backtracking into the second branch with carryoverflow, in case there happens to be no carry and the result not matching.

Note: We do not do any checks against the \k<f> part here, this is managed by the carrycheckdigit capturing group from above.

The case of the leading 1

We sure do not want 1 + 1 = 0 to be matched. Which it is though if we go by checkdigit alone. So, there are different circumstances when that leading 1 is necessary (if not covered yet by the previous case of the right operand being longer):

Both operands (being without leading zeroes) are of equal length (i.e. they both have an 1 at their most significant digit, which, added together, leaves a carry)
The left operand is longer and there is a carry at the most significant digit or both strings are just as long.

The former case handles inputs like 10 + 10 = 100, the second case handles 110 + 10 = 1000 as well as 1101 + 11 = 10100 and the last one trivially 111 + 10 = 1001.

The first case can be handled by setting a flag ro whether the left operand is longer than the right one, which then can be checked against at the end of recursion:

(?<recurseequallength>
  (?&add) \k<r> (?&eq) (?(ro)|1)\k<f>\b
| (?=\d* (?&add) (?:\k<r>(?<ro>) | \d*(?<r>\d\k<r>)) (?&eq) \d*(?<f>\d\k<f>)\b) \d(?&recurseequallength)
)

Run Code Online (Sandbox Code Playgroud)

The second case means we just need to check for the existence of a carry in case of ro (i.e. the right operand being shorter). The carry can as usually be checked (as the most significant digit is always 1 and the right operands digit is implicitly 0 then) with a trivial (?:1(?=0))?\k<f>\b - if there was a carry the digit at the current offset in the result will be 0.

That's easily possible, as, after all, all the other digits up to the current offset will all be verified by checkdigit before. Hence we could just check the carry locally here.

Thus we can add this to the first branch of the recurseequallength alternation:

(?<recurseoverflow>
  (?&add) (?(rlast) \k<r> (?&eq) (?(ro)(?:1(?=0))?|1)\k<f>\b
                | (?:(?<remaining>\d+)(?=0\d* (?&eq) \d*(?=1)\k<f>\b)\k<r> (?&eq) (*PRUNE) \k<remaining>\k<f>\b
                   | (?&carryoverflow)\k<f>\b))
| (?=\d* (?&add) (?:\k<r>(?<ro>)|(?=(?:\d\k<r>(?&eq)(?<rlast>))?)\d*(?<r>\d\k<r>)) (?&eq) \d*(?<f>\d\k<f>)\b)
  \d(?&recurseoverflow)
)

Run Code Online (Sandbox Code Playgroud)

Then to wire everything together: first check checkdigit for each individual digit (same as for the simple zero-padded case from before) and then initialize the different capturing groups used by recurseoverflow:

\b
(?=(?<iteratedigits> (?=(?&checkdigit))\d (?:\b|(?&iteratedigits)) ))
(?=[01]+ (?&add) [01]+ (?&eq) [01]+ \b)
(?<r>) (?<f>) (?&recurseoverflow)
\b

Run Code Online (Sandbox Code Playgroud)

What about the zeroes?

0 + x = x and x + 0 = x are still unhandled and will fail.

Instead of hacking out big capturing groups to handle it uglily, we resort to handling them manually:

(0*? (?<arg>[01]+) (?&add) 0+ | 0+ (?&add) 0*? (?<arg>[01]+)) (?&eq) 0* \k<arg>

Run Code Online (Sandbox Code Playgroud)

Note: When using in alternation with the main branch, we need to put a (*PRUNE) after the (?&eq) in order to avoid jumping into that main branch when any operand is zero and the match failing.

Now, we also always assumed there were no leading zeroes in the inputs for simplifying our expressions. If you look at the initial regex, you will find many occurrences of 0* and 0*+ (possessive to avoid backtracking into it and … unexpected things happening) in order to skip the leading zeroes as we are assuming at some places that the leftmost digit is an 1.

Conclusion

And that's it. We achieved matching only correct additions of binary numbers.

A small note about the relatively new J flag: it allows to redefine capturing groups. This is in first place important in order to initialize the capturing groups to an empty value. Additionally it simplifies some conditionals (like addone) as we only have to check for one value instead of two. Compare (?(a) ... | ...) vs. (?(?=(?(a)|(?(b)|(*F)))) ... | ...). Also, it is not possible to reorder the multiply defined capturing groups arbitrarily inside the (?(DEFINE) ...) construct.

Final note: Binary addition is not a Chomsky type 3 (i.e. regular) language. This is a PCRE specific answer, using PCRE specific features. [Other regex flavors like .NET may be able to solve it too, but not all may.]

If there are any questions, please leave a comment, I'll then try to clarify this in this answer.

@ Qwerp-Derp非常感谢您注意,我已经向PCRE错误跟踪器报告了一个错误:https://bugs.exim.org/show_bug.cgi？id = 1887 (3认同)

归档时间：	9 年，4 月前
查看次数：	1409 次
最近记录：	9 年，1 月前