jm6*_*666 4 regex unicode perl camelcasing utf-8
This is a question about a CamelCase regular expression. Prompted by tchrist's posts, I'm wondering what a correct UTF-8 CamelCase regex looks like.
Starting with (brian d foy's) regex:
/
\b # start at word boundary
[A-Z] # start with upper
[a-zA-Z]* # followed by any alpha
(?: # non-capturing grouping for alternation precedence
[a-z][a-zA-Z]*[A-Z] # next bit is lower, any zero or more, ending with upper
| # or
[A-Z][a-zA-Z]*[a-z] # next bit is upper, any zero or more, ending with lower
)
[a-zA-Z]* # anything that's left
\b # end at word
/x
and modifying it to:
/
\b # start at word boundary
\p{Uppercase_Letter} # start with upper
\p{Alphabetic}* # followed by any alpha
(?: # non-capturing grouping for alternation precedence
\p{Lowercase_Letter}[a-zA-Z]*\p{Uppercase_Letter} ### next bit is lower, any zero or more, ending with upper
| # or
\p{Uppercase_Letter}[a-zA-Z]*\p{Lowercase_Letter} ### next bit is upper, any zero or more, ending with lower
)
\p{Alphabetic}* # anything that's left
\b # end at word
/x
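The lines marked with "###" are the problem.
Also, how should the regex be modified if digits and underscores are treated as equivalent to lowercase letters, so that W2X3 is a valid CamelCase word?
Update (re ysth's comment):
In what follows, "any" means "uppercase or lowercase letter, or digit, or underscore". The regex should match CamelWord, CaW
Please don't mark this as a duplicate, because it isn't. The original question (and its answers too) considered only ASCII.
I really can't tell what you're trying to do, but this should be closer to what your original intent seems to be. I still can't tell what you mean by it, though.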
m{
\b
\p{Upper} # start with uppercase code point (NOT LETTER)
\w* # optional ident chars
# note that upper and lower are not related to letters
(?: \p{Lower} \w* \p{Upper}
| \p{Upper} \w* \p{Lower}
)
\w*
\b
}x
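To see how the pattern above behaves, here is a minimal sketch that runs it against a few sample words. The helper name `is_camel` and the word list are assumptions made for illustration; they are not part of the original post. The pattern is anchored with `\A` and `\z` so that each word is tested as a whole.

```perl
use strict;
use warnings;
use utf8;                              # this source file contains non-ASCII literals

# The pattern from the answer, compiled once.
my $camel = qr{
    \b
    \p{Upper} \w*                      # uppercase code point, then ident chars
    (?: \p{Lower} \w* \p{Upper}
      | \p{Upper} \w* \p{Lower}
    )
    \w*
    \b
}x;

# is_camel is a hypothetical helper: true if the whole string matches.
sub is_camel {
    my ($word) = @_;
    return $word =~ /\A$camel\z/ ? 1 : 0;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $word (qw(CamelWord CaW ÉcoleNormale Camel W2X3 UPPER)) {
    printf "%-14s %s\n", $word, is_camel($word) ? 'match' : 'no match';
}
```

With this word list, CamelWord, CaW and ÉcoleNormale should match, while Camel, W2X3 and UPPER should not: the alternation requires at least one lowercase code point and a second uppercase one. So, as written, the pattern does not treat digits as lowercase, and W2X3 is rejected.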
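Never use [a-z]. In fact, don't use \p{Lowercase_Letter} or \p{Ll} either, because those are not the same as the more desirable and more correct \p{Lowercase} and \p{Lower}.
And remember that \w is really just an alias for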
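The difference shows up on code points that are lowercase without being lowercase *letters*. The following is a small sketch, assuming a reasonably recent Perl in which `\p{Lower}` is the binary Lowercase property; the sample code points and the helper name `lower_flags` are chosen here for illustration only.

```perl
use strict;
use warnings;

# Sample code points (an assumption of this sketch):
#   U+0061 LATIN SMALL LETTER A          General_Category = Ll
#   U+2170 SMALL ROMAN NUMERAL ONE       General_Category = Nl
#   U+24D0 CIRCLED LATIN SMALL LETTER A  General_Category = So
my @code_points = (0x0061, 0x2170, 0x24D0);

# lower_flags is a hypothetical helper:
# returns (matches \p{Ll}, matches \p{Lower}) as 0/1 flags.
sub lower_flags {
    my ($cp) = @_;
    my $ch = chr $cp;
    return ( $ch =~ /\p{Ll}/ ? 1 : 0, $ch =~ /\p{Lower}/ ? 1 : 0 );
}

for my $cp (@code_points) {
    my ($ll, $lower) = lower_flags($cp);
    printf "U+%04X  Ll=%d  Lower=%d\n", $cp, $ll, $lower;
}
```

Only the plain letter `a` satisfies both properties. The small Roman numeral and the circled letter are Lowercase but not Lowercase_Letter, so a pattern built on `\p{Ll}` would miss them.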
[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Letter_Number}\p{Connector_Punctuation}]
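A quick way to get a feel for that expansion is to test a handful of code points against both `\w` and the bracketed class. This is only a spot check on hand-picked samples, not a proof that the two are identical on every code point or every Perl version; the helper name `word_flags` and the sample list are assumptions of this sketch.

```perl
use strict;
use warnings;

# The expansion quoted above, written as an ordinary bracketed class.
my $expanded = qr/[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Letter_Number}\p{Connector_Punctuation}]/;

# Hand-picked sample code points (not from the original post).
my @samples = (
    [ 0x0301, 'COMBINING ACUTE ACCENT (Mark)' ],
    [ 0x0661, 'ARABIC-INDIC DIGIT ONE (Decimal_Number)' ],
    [ 0x2170, 'SMALL ROMAN NUMERAL ONE (Letter_Number)' ],
    [ 0x203F, 'UNDERTIE (Connector_Punctuation)' ],
    [ 0x002D, 'HYPHEN-MINUS (not a word character)' ],
);

# word_flags is a hypothetical helper:
# returns (matches \w, matches the expanded class) as 0/1 flags.
sub word_flags {
    my ($cp) = @_;
    my $ch = chr $cp;
    return ( $ch =~ /\w/ ? 1 : 0, $ch =~ $expanded ? 1 : 0 );
}

for my $s (@samples) {
    my ($w, $e) = word_flags($s->[0]);
    printf "U+%04X  \\w=%d  expanded=%d  %s\n", $s->[0], $w, $e, $s->[1];
}
```

On these samples the two columns should agree: the combining mark, the non-ASCII digit, the Roman numeral and the undertie are all word characters, and the hyphen is not. This is why `\w*` in the pattern accepts far more than ASCII letters, digits and underscore.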