gui*_*ier 1 regex parsing pcre
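I'm working on a gettext JavaScript parser and I'm stuck on the parsing regex.
I need to catch every argument passed to two specific method calls, _n( and _(. For example, if I have these in my JavaScript files: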
_("foo") // want "foo"
_n("bar", "baz", 42); // want "bar", "baz", 42
_n(domain, "bux", var); // want domain, "bux", var
_( "one (optional)" ); // want "one (optional)"
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) // could have on the same line two calls..
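This refers to this documentation: http://poedit.net/trac/wiki/Doc/Keywords
I'm planning to do this in two passes (with two regexes), starting by catching all the function arguments of the _n( or _( method calls.
Basically, I'd like a regex that could say "catch everything after _n( or _( and stop at the last parenthesis ) found when the function is done". I don't know whether this is possible with a regex and without a JavaScript parser.
What could also be done is "catch every "string" or 'string' after _n( or _( and stop at the end of the line or at the beginning of a new _n( or _(".
In everything I've tried, I get stuck either on _( "one (optional)" ); with its inner parentheses, or on apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) with its two calls on the same line.
Here is what I've implemented so far, with imperfect regexes: a generic parser and the JavaScript one or the Handlebars one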
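To make the two failure modes concrete, here is a minimal PHP sketch. The patterns $naive and $greedy are hypothetical illustrations of the obvious attempts, not the regexes from the parsers mentioned in this question:

```php
<?php
// Hypothetical naive pattern: capture everything up to the first ")".
$naive = '~_n?\s*\(([^)]*)\)~';
$line = '_( "one (optional)" );';
preg_match($naive, $line, $m);
// The capture stops at the ")" inside the string literal.
var_dump($m[1]);

// Hypothetical greedy pattern: capture everything up to the last ")".
$greedy = '~_n?\s*\((.*)\)~';
$line2 = 'apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples)';
preg_match($greedy, $line2, $m2);
// The capture now swallows both calls as if they were one.
var_dump($m2[1]);
```

Stopping at the first closing parenthesis cuts strings short, while stopping at the last one merges separate calls, so the pattern has to track nesting and string literals.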
Note: read this answer if you're not familiar with recursion.
Who said that regexes can't be modular? Well, PCRE regex to the rescue!
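As a quick refresher, here is a minimal sketch of recursion in PCRE, reduced to balanced brackets only. It deliberately ignores string literals, and the variable names and the sample subject are made up for illustration:

```php
<?php
// (?R) re-enters the whole pattern, so nested pairs are matched to any depth.
$wholePattern = '~\((?:[^()]|(?R))*\)~';

// Same idea with a named group: (?&brackets) re-enters only that group.
$namedGroup = '~(?P<brackets>\((?:[^()]|(?&brackets))*\))~';

$subject = 'f(a, (b, (c)), d) + g(e)';
preg_match($wholePattern, $subject, $whole);
preg_match($namedGroup, $subject, $named);

echo $whole[0], "\n"; // (a, (b, (c)), d)
echo $named[0], "\n"; // (a, (b, (c)), d)
```

The named form is the one that scales to several reusable sub-patterns, which is why it is the one used in the rest of this answer.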
~ # Delimiter
(?(DEFINE) # Start of definitions
(?P<str_double_quotes>
(?<!\\) # Not escaped
" # Match a double quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
" # Match the ending double quote
)
(?P<str_single_quotes>
(?<!\\) # Not escaped
' # Match a single quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
' # Match the ending single quote
)
(?P<brackets>
\( # Match an opening bracket
(?: # A non capturing group
(?&str_double_quotes) # Recurse/use the str_double_quotes pattern
| # Or
(?&str_single_quotes) # Recurse/use the str_single_quotes pattern
| # Or
[^()] # Anything not a bracket
| # Or
(?&brackets) # Recurse the bracket pattern
)*
\)
)
) # End of definitions
# Let's start matching for real now:
_n? # Match _ or _n
\s* # Optional white spaces
(?P<results>(?&brackets)) # Recurse/use the brackets pattern and put it in the results group
~sx
The s modifier makes . match newlines as well, and the x modifier allows this fancy spacing and the comments inside our regex.
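A small sketch of what the two modifiers change, using a made-up multi-line call as the subject (the variable names are only for illustration):

```php
<?php
$call = "_ (\n    \"Foo\",\n    'Bar'\n)";

// Without s, "." stops at a newline, so the multi-line call is not matched.
$withoutS = preg_match('~\(.*\)~', $call);

// With s, "." also matches newlines.
$withS = preg_match('~\(.*\)~s', $call, $m);

// With x, whitespace and "# comments" inside the pattern are ignored.
$verbose = '~
    \(   # opening bracket
    .*   # anything, newlines included thanks to s
    \)   # closing bracket
~sx';
$withX = preg_match($verbose, $call, $mx);

var_dump($withoutS, $withS, $withX, $m[0] === $mx[0]);
```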
Since our regex also captures the opening and closing brackets (), we may need to filter them out. We will use preg_replace() on the results:
~ # Delimiter
^ # Assert begin of string
\( # Match an opening bracket
\s* # Match optional whitespaces
| # Or
\s* # Match optional whitespaces
\) # Match a closing bracket
$ # Assert end of string
~x
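A quick sketch of that filter applied to a single captured value. The pattern is the same one written on a single line without the x modifier, and $captured is a made-up sample:

```php
<?php
// Strip a leading "(" plus whitespace, or whitespace plus a trailing ")".
$strip = '~^\(\s*|\s*\)$~';

$captured = '( "one (optional)" )';
$arguments = preg_replace($strip, '', $captured);

echo $arguments, "\n"; // "one (optional)"
```

Because both alternatives are anchored, the brackets inside the string literal are left alone.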
So here's another modular regex to extract the arguments; you could even add your own grammar:
~ # Delimiter
(?(DEFINE) # Start of definitions
(?P<str_double_quotes>
(?<!\\) # Not escaped
" # Match a double quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
" # Match the ending double quote
)
(?P<str_single_quotes>
(?<!\\) # Not escaped
' # Match a single quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
' # Match the ending single quote
)
(?P<array>
Array\s*
(?&brackets)
)
(?P<variable>
[^\s,()]+ # I don't know the exact grammar for a variable in ECMAScript
)
(?P<brackets>
\( # Match an opening bracket
(?: # A non capturing group
(?&str_double_quotes) # Recurse/use the str_double_quotes pattern
| # Or
(?&str_single_quotes) # Recurse/use the str_single_quotes pattern
| # Or
(?&array) # Recurse/use the array pattern
| # Or
(?&variable) # Recurse/use the variable pattern
| # Or
[^()] # Anything not a bracket
| # Or
(?&brackets) # Recurse the bracket pattern
)*
\)
)
) # End of definitions
# Let's start matching for real now:
(?&array)
|
(?&str_double_quotes)
|
(?&str_single_quotes)
|
(?&variable) # Fallback, tried last so it cannot swallow part of a quoted string
~xis
We will loop over the results and use preg_match_all(). The final code looks like this:
$functionPattern = <<<'regex'
~ # Delimiter
(?(DEFINE) # Start of definitions
(?P<str_double_quotes>
(?<!\\) # Not escaped
" # Match a double quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
" # Match the ending double quote
)
(?P<str_single_quotes>
(?<!\\) # Not escaped
' # Match a single quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
' # Match the ending single quote
)
(?P<brackets>
\( # Match an opening bracket
(?: # A non capturing group
(?&str_double_quotes) # Recurse/use the str_double_quotes pattern
| # Or
(?&str_single_quotes) # Recurse/use the str_single_quotes pattern
| # Or
[^()] # Anything not a bracket
| # Or
(?&brackets) # Recurse the bracket pattern
)*
\)
)
) # End of definitions
# Let's start matching for real now:
_n? # Match _ or _n
\s* # Optional white spaces
(?P<results>(?&brackets)) # Recurse/use the brackets pattern and put it in the results group
~sx
regex;
$argumentsPattern = <<<'regex'
~ # Delimiter
(?(DEFINE) # Start of definitions
(?P<str_double_quotes>
(?<!\\) # Not escaped
" # Match a double quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
" # Match the ending double quote
)
(?P<str_single_quotes>
(?<!\\) # Not escaped
' # Match a single quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
' # Match the ending single quote
)
(?P<array>
Array\s*
(?&brackets)
)
(?P<variable>
[^\s,()]+ # I don't know the exact grammar for a variable in ECMAScript
)
(?P<brackets>
\( # Match an opening bracket
(?: # A non capturing group
(?&str_double_quotes) # Recurse/use the str_double_quotes pattern
| # Or
(?&str_single_quotes) # Recurse/use the str_single_quotes pattern
| # Or
(?&array) # Recurse/use the array pattern
| # Or
(?&variable) # Recurse/use the variable pattern
| # Or
[^()] # Anything not a bracket
| # Or
(?&brackets) # Recurse the bracket pattern
)*
\)
)
) # End of definitions
# Let's start matching for real now:
(?&array)
|
(?&str_double_quotes)
|
(?&str_single_quotes)
|
(?&variable)
~six
regex;
$input = <<<'input'
_ ("foo") // want "foo"
_n("bar", "baz", 42); // want "bar", "baz", 42
_n(domain, "bux", var); // want domain, "bux", var
_( "one (optional)" ); // want "one (optional)"
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) // could have on the same line two calls..
// misleading cases
_n("foo (")
_n("foo (\)", 'foo)', aa)
_n( Array(1, 2, 3), Array(")", '(') );
_n(function(foo){return foo*2;}); // Is this even valid?
_n (); // Empty
_ (
"Foo",
'Bar',
Array(
"wow",
"much",
'whitespaces'
),
multiline
); // PCRE is awesome
input;
if(preg_match_all($functionPattern, $input, $m)){
    $filtered = preg_replace(
        '~          # Delimiter
            ^       # Assert begin of string
            \(      # Match an opening bracket
            \s*     # Match optional whitespaces
        |           # Or
            \s*     # Match optional whitespaces
            \)      # Match a closing bracket
            $       # Assert end of string
        ~x', // Regex
        '', // Replace with nothing
        $m['results'] // Subject
    ); // Getting rid of opening & closing brackets

    // Part 3: extract arguments:
    $parsedTree = array();
    foreach($filtered as $arguments){ // Loop
        if(preg_match_all($argumentsPattern, $arguments, $m)){ // If there's a match
            $parsedTree[] = array(
                'all_arguments' => $arguments,
                'branches' => $m[0]
            ); // Add an array to our tree and fill it
        }else{
            $parsedTree[] = array(
                'all_arguments' => $arguments,
                'branches' => array()
            ); // Add an array with empty branches
        }
    }
    print_r($parsedTree); // Let's see the results;
}else{
    echo 'no matches';
}
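You might want to create a recursive function to generate a full tree. See this answer.
You might notice that the function(){} part isn't parsed correctly. I will leave that as an exercise for the reader :)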