Below is a Perl function I wrote many years ago. It is a smart tokenizer that recognizes some cases where things are stuck together that probably shouldn't be. For example, given the input on the left, it splits the string as shown on the right:
abc123  -> abc|123
abcABC  -> abc|ABC
ABC123  -> ABC|123
123abc  -> 123|abc
123ABC  -> 123|ABC
AbcDef  -> Abc|Def     (e.g. CamelCase)
ABCDef  -> ABC|Def
1stabc  -> 1st|abc     (recognize valid ordinals)
1ndabc  -> 1|ndabc     (but not invalid ordinals)
11thabc -> 11th|abc    (recognize that 11th - 13th are different than 1st - 3rd)
11stabc -> 11|stabc
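I'm now doing some machine-learning experiments, and I'd like to run some that use this tokenizer. But first I need to port it from Perl to Python. The key to this code is the loop that uses the \G anchor, something I've heard doesn't exist in Python. I've tried googling how this is done in Python, but I'm not sure exactly what to search for, so I'm having trouble finding an answer.
How would you write this function in Python?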
sub Tokenize
# Breaks a string into tokens using special rules,
# where a token is any sequence of characters, be they a sequence of letters,
# a sequence of numbers, or a sequence of non-alpha-numeric characters
# the list of tokens found are returned to the caller
{
    my $value = shift;
    my @list = ();
    my $word;
    while ( $value ne '' && $value =~ m/
        \G                                   # start where previous left off
        ([^a-zA-Z0-9]*)                      # capture non-alpha-numeric characters, if any
        ([a-zA-Z0-9]*?)                      # capture everything up to a token boundary
        (?:                                  # identify the token boundary
            (?=[^a-zA-Z0-9])                 # next character is not a word character
          | (?=[A-Z][a-z])                   # Next two characters are upper lower
          | (?<=[a-z])(?=[A-Z])              # lower followed by upper
          | (?<=[a-zA-Z])(?=[0-9])           # letter followed by digit
          # ordinal boundaries
          | (?<=^1(?i:st))                   # first
          | (?<=[^1][1](?i:st))              # first but not 11th
          | (?<=^2(?i:nd))                   # second
          | (?<=[^1]2(?i:nd))                # second but not 12th
          | (?<=^3(?i:rd))                   # third
          | (?<=[^1]3(?i:rd))                # third but not 13th
          | (?<=1[123](?i:th))               # 11th - 13th
          | (?<=[04-9](?i:th))               # other ordinals
          # non-ordinal digit-letter boundaries
          | (?<=^1)(?=[a-zA-Z])(?!(?i)st)        # digit-letter but not first
          | (?<=[^1]1)(?=[a-zA-Z])(?!(?i)st)     # digit-letter but not 11th
          | (?<=^2)(?=[a-zA-Z])(?!(?i)nd)        # digit-letter but not second
          | (?<=[^1]2)(?=[a-zA-Z])(?!(?i)nd)     # digit-letter but not 12th
          | (?<=^3)(?=[a-zA-Z])(?!(?i)rd)        # digit-letter but not third
          | (?<=[^1]3)(?=[a-zA-Z])(?!(?i)rd)     # digit-letter but not 13th
          | (?<=1[123])(?=[a-zA-Z])(?!(?i)th)    # digit-letter but not 11th - 13th
          | (?<=[04-9])(?=[a-zA-Z])(?!(?i)th)    # digit-letter but not ordinal
          | (?=$)                            # end of string
        )
    /xg )
    {
        push @list, $1 if $1 ne '';
        push @list, $2 if $2 ne '';
    }
    return @list;
}
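I did try using re.split() with a variation of the above. However, split() refuses to split on a zero-width match (an ability that ought to be available to someone who really knows what they are doing).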
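For readers on a newer interpreter: since Python 3.7, re.split() does split on empty matches, so a purely lookaround-based boundary can be used directly. The following is a minimal sketch with a reduced, illustrative set of boundaries (the name `boundary` and the three rules are assumptions for the example, not the full tokenizer above):

```python
import re

# Illustrative boundaries only: lower|UPPER, letter|digit, digit|letter.
boundary = re.compile(
    r'(?<=[a-z])(?=[A-Z])'
    r'|(?<=[a-zA-Z])(?=[0-9])'
    r'|(?<=[0-9])(?=[a-zA-Z])'
)

# Every alternative is zero-width, so nothing is consumed by the split.
parts = boundary.split('abcABC123def')
print(parts)  # ['abc', 'ABC', '123', 'def'] on Python 3.7+
```

On interpreters older than 3.7 this approach is not available, which matches the behavior described above.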
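I did come up with a solution for this particular problem, but not for the general problem of "how do I do \G-based parsing". I have some sample code that uses \G in a loop and then, inside the loop, uses other regexes anchored at \G to decide how to continue parsing. So I'm still looking for an answer.
That said, here is my final working code translating the above into Python: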
import re

# Small helpers that build regex fragments with readable names
IsA = lambda s: '[' + s + ']'
IsNotA = lambda s: '[^' + s + ']'
Upper = IsA( 'A-Z' )
Lower = IsA( 'a-z' )
Letter = IsA( 'a-zA-Z' )
Digit = IsA( '0-9' )
AlphaNumeric = IsA( 'a-zA-Z0-9' )
NotAlphaNumeric = IsNotA( 'a-zA-Z0-9' )
EndOfString = '$'
OR = '|'
ZeroOrMore = lambda s: s + '*'
ZeroOrMoreNonGreedy = lambda s: s + '*?'
OneOrMore = lambda s: s + '+'
OneOrMoreNonGreedy = lambda s: s + '+?'
StartsWith = lambda s: '^' + s
Capture = lambda s: '(' + s + ')'
PreceededBy = lambda s: '(?<=' + s + ')'
FollowedBy = lambda s: '(?=' + s + ')'
NotFollowedBy = lambda s: '(?!' + s + ')'
StopWhen = lambda s: s
CaseInsensitive = lambda s: '(?i:' + s + ')'
ST = '(?:st|ST)'
ND = '(?:nd|ND)'
RD = '(?:rd|RD)'
TH = '(?:th|TH)'

def OneOf( *args ):
    return '(?:' + '|'.join( args ) + ')'

pattern = '(.+?)' + \
    OneOf(
        # ABC | !!! - break at whitespace or non-alpha-numeric boundary
        PreceededBy( AlphaNumeric ) + FollowedBy( NotAlphaNumeric ),
        PreceededBy( NotAlphaNumeric ) + FollowedBy( AlphaNumeric ),
        # ABC | Abc - break at what looks like the start of a word or sentence
        FollowedBy( Upper + Lower ),
        # abc | ABC - break when a lower-case letter is followed by an upper case
        PreceededBy( Lower ) + FollowedBy( Upper ),
        # abc | 123 - break between words and digits
        PreceededBy( Letter ) + FollowedBy( Digit ),
        # 1st | oak - recognize when the string starts with an ordinal
        PreceededBy( StartsWith( '1' + ST ) ),
        PreceededBy( StartsWith( '2' + ND ) ),
        PreceededBy( StartsWith( '3' + RD ) ),
        # 1st | abc - contains an ordinal
        PreceededBy( IsNotA( '1' ) + '1' + ST ),
        PreceededBy( IsNotA( '1' ) + '2' + ND ),
        PreceededBy( IsNotA( '1' ) + '3' + RD ),
        PreceededBy( '1' + IsA( '123' ) + TH ),
        PreceededBy( IsA( '04-9' ) + TH ),
        # 1 | abcde - recognize when it starts with or contains a non-ordinal digit/letter boundary
        PreceededBy( StartsWith( '1' ) ) + FollowedBy( Letter ) + NotFollowedBy( ST ),
        PreceededBy( StartsWith( '2' ) ) + FollowedBy( Letter ) + NotFollowedBy( ND ),
        PreceededBy( StartsWith( '3' ) ) + FollowedBy( Letter ) + NotFollowedBy( RD ),
        PreceededBy( IsNotA( '1' ) + '1' ) + FollowedBy( Letter ) + NotFollowedBy( ST ),
        PreceededBy( IsNotA( '1' ) + '2' ) + FollowedBy( Letter ) + NotFollowedBy( ND ),
        PreceededBy( IsNotA( '1' ) + '3' ) + FollowedBy( Letter ) + NotFollowedBy( RD ),
        PreceededBy( '1' + IsA( '123' ) ) + FollowedBy( Letter ) + NotFollowedBy( TH ),
        PreceededBy( IsA( '04-9' ) ) + FollowedBy( Letter ) + NotFollowedBy( TH ),
        # abcde | $ - end of the string
        FollowedBy( EndOfString )
    )

matcher = re.compile( pattern )

def tokenize( s ):
    return matcher.findall( s )
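Emulating \G at the beginning of a regex with re.RegexObject.match
You can emulate the effect of \G at the beginning of a regex with the re module by keeping track of the current position and passing it as the starting position to re.RegexObject.match, which forces the match to begin at the position given in pos.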
def tokenize(w):
    index = 0
    m = matcher.match(w, index)
    o = []
    # Although the index != m.end() check does catch a zero-length match,
    # it is more of a guard against an accidental infinite loop.
    # Don't expect a regex which can match the empty string to work.
    # See the Caveat section.
    while m and index != m.end():
        o.append(m.group(1))
        index = m.end()
        m = matcher.match(w, index)
    return o
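Passing pos is not the same as slicing the string, and that difference is what lets the ordinal rules keep working mid-string. A small sketch of the two behaviors, using throwaway patterns (`after_a` and `at_start` are illustrative names, not part of the tokenizer):

```python
import re

after_a = re.compile(r'(?<=a)b')   # 'b' that is preceded by 'a'
at_start = re.compile(r'^b')       # 'b' at the start of the string

text = 'ab'

# With pos, the lookbehind can still see the 'a' that sits before pos.
with_pos = after_a.match(text, 1)
# With a slice, the 'a' is gone, so the lookbehind fails.
with_slice = after_a.match(text[1:])

# '^' keeps meaning the real start of the string, not pos.
caret_with_pos = at_start.match(text, 1)
# A slice creates a new string whose start is the 'b'.
caret_with_slice = at_start.match(text[1:])

print(bool(with_pos), bool(with_slice))              # True False
print(bool(caret_with_pos), bool(caret_with_slice))  # False True
```

This is why lookbehinds such as (?<=^1[sS][tT]) only fire for an ordinal at the true start of the input, even when the match itself is started further along.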
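Caveat
The caveat of this method is that it doesn't work well with a regex that can match the empty string in the main match, since Python has no facility to force the regex to retry the match while preventing a zero-length match.
For example, re.findall(r'(.??)', 'abc') returns a list of 4 empty strings, ['', '', '', ''] (this is the behavior of Python versions before 3.7; since 3.7 a non-empty match may start right after an empty one), whereas in PCRE you can find 7 matches, ['', 'a', '', 'b', '', 'c', ''], where the 2nd, 4th and 6th matches start at the same indices as the 1st, 3rd and 5th matches respectively. The additional matches in PCRE are found by retrying at the same index with a flag that prevents an empty-string match.
I know the question is about Perl, not PCRE, but the global matching behavior should be the same. Otherwise, the original code could not have worked.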
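To make the stall concrete, here is a minimal sketch of the same pos-based loop applied to a pattern that can match the empty string and to one that cannot. The helper name `scan` is hypothetical and only mirrors the loop shown earlier:

```python
import re

def scan(pattern, text):
    """Collect group(1) of successive matches, each starting where the last one ended."""
    pos, out = 0, []
    while True:
        m = pattern.match(text, pos)
        # Stop on failure, or on an empty match that would never advance pos.
        if m is None or m.end() == pos:
            break
        out.append(m.group(1))
        pos = m.end()
    return out

stalled = scan(re.compile(r'(.??)'), 'abc')     # lazy '??' prefers the empty match
progressed = scan(re.compile(r'(.+?)'), 'abc')  # '+?' must consume at least one character

print(stalled)     # []
print(progressed)  # ['a', 'b', 'c']
```

The guard keeps the loop from spinning forever, but it cannot recover the tokens that a retry-without-empty-match facility would have found.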
Rewriting ([^a-zA-Z0-9]*)([a-zA-Z0-9]*?) as (.+?), as done in the question, avoids this problem, although you may want to use the re.S flag.
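The reason re.S can matter is that, without it, `.` does not match a newline, so an anchored match that has to start on a newline fails outright. A tiny sketch with a simplified, made-up boundary (`(?=b|$)` is only for illustration):

```python
import re

plain = re.compile(r'(.+?)(?=b|$)')
dotall = re.compile(r'(.+?)(?=b|$)', re.S)

text = '\nb'

# '.' cannot consume the leading newline, so the anchored match fails.
without_flag = plain.match(text)
# With re.S the newline is consumed like any other character.
with_flag = dotall.match(text)

print(without_flag)               # None
print(repr(with_flag.group(1)))   # '\n'
```

In the pos-based loop, a failed match ends tokenization at that point, so input containing newlines would be cut short without the flag.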
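Since the case-insensitive flag in Python affects the whole pattern (Python versions before 3.6 have no scoped (?i:...) group), the case-insensitive sub-patterns have to be rewritten. I would rewrite (?i:st) as [sS][tT] to preserve the original meaning, but go with (?:st|ST) if that is part of your requirement.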
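The two rewrites are not equivalent for mixed-case suffixes, which a quick comparison shows. This sketch also includes the scoped form (?i:st), which is accepted by Python 3.6 and later; the names `loose`, `strict` and `scoped` are just labels for the example:

```python
import re

loose = re.compile(r'1[sS][tT]')     # any mix of case, like Perl's (?i:st)
strict = re.compile(r'1(?:st|ST)')   # all-lower or all-upper only
scoped = re.compile(r'1(?i:st)')     # scoped flag group, Python 3.6+

samples = ['1st', '1ST', '1St', '1sT']
results = {
    name: [bool(p.fullmatch(s)) for s in samples]
    for name, p in (('loose', loose), ('strict', strict), ('scoped', scoped))
}

for name, flags in results.items():
    print(name, flags)
# loose  [True, True, True, True]
# strict [True, True, False, False]
# scoped [True, True, True, True]
```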
Since Python supports free-spacing mode via the re.X flag, you can write the regex much like the one in the Perl code:
matcher = re.compile(r'''
    (.+?)
    (?:                                          # identify the token boundary
        (?<=[a-zA-Z0-9])(?=[^a-zA-Z0-9])         # word character followed by non-word character
      | (?<=[^a-zA-Z0-9])(?=[a-zA-Z0-9])         # non-word character followed by word character
      | (?=[A-Z][a-z])                           # Next two characters are upper lower
      | (?<=[a-z])(?=[A-Z])                      # lower followed by upper
      | (?<=[a-zA-Z])(?=[0-9])                   # letter followed by digit
      # ordinal boundaries
      | (?<=^1[sS][tT])                          # first
      | (?<=[^1][1][sS][tT])                     # first but not 11th
      | (?<=^2[nN][dD])                          # second
      | (?<=[^1]2[nN][dD])                       # second but not 12th
      | (?<=^3[rR][dD])                          # third
      | (?<=[^1]3[rR][dD])                       # third but not 13th
      | (?<=1[123][tT][hH])                      # 11th - 13th
      | (?<=[04-9][tT][hH])                      # other ordinals
      # non-ordinal digit-letter boundaries
      | (?<=^1)(?=[a-zA-Z])(?![sS][tT])          # digit-letter but not first
      | (?<=[^1]1)(?=[a-zA-Z])(?![sS][tT])       # digit-letter but not 11th
      | (?<=^2)(?=[a-zA-Z])(?![nN][dD])          # digit-letter but not second
      | (?<=[^1]2)(?=[a-zA-Z])(?![nN][dD])       # digit-letter but not 12th
      | (?<=^3)(?=[a-zA-Z])(?![rR][dD])          # digit-letter but not third
      | (?<=[^1]3)(?=[a-zA-Z])(?![rR][dD])       # digit-letter but not 13th
      | (?<=1[123])(?=[a-zA-Z])(?![tT][hH])      # digit-letter but not 11th - 13th
      | (?<=[04-9])(?=[a-zA-Z])(?![tT][hH])      # digit-letter but not ordinal
      | (?=$)                                    # end of string
    )
''', re.X)
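As an end-to-end check, the sketch below combines a free-spacing pattern of this shape with the pos-based loop and runs it over the examples from the question. One assumption to note: the non-alphanumeric boundary uses the two-sided lookarounds from the question's Python version, so that a run of punctuation stays together as one token; the `expected` table is taken from the question, plus one punctuation case added for illustration.

```python
import re

matcher = re.compile(r'''
    (.+?)
    (?:
        (?<=[a-zA-Z0-9])(?=[^a-zA-Z0-9])    # alphanumeric, then other (question's version)
      | (?<=[^a-zA-Z0-9])(?=[a-zA-Z0-9])    # other, then alphanumeric (question's version)
      | (?=[A-Z][a-z])                      # start of a capitalized word
      | (?<=[a-z])(?=[A-Z])                 # lower followed by upper
      | (?<=[a-zA-Z])(?=[0-9])              # letter followed by digit
      # ordinal boundaries
      | (?<=^1[sS][tT]) | (?<=[^1]1[sS][tT])
      | (?<=^2[nN][dD]) | (?<=[^1]2[nN][dD])
      | (?<=^3[rR][dD]) | (?<=[^1]3[rR][dD])
      | (?<=1[123][tT][hH]) | (?<=[04-9][tT][hH])
      # non-ordinal digit-letter boundaries
      | (?<=^1)(?=[a-zA-Z])(?![sS][tT])     | (?<=[^1]1)(?=[a-zA-Z])(?![sS][tT])
      | (?<=^2)(?=[a-zA-Z])(?![nN][dD])     | (?<=[^1]2)(?=[a-zA-Z])(?![nN][dD])
      | (?<=^3)(?=[a-zA-Z])(?![rR][dD])     | (?<=[^1]3)(?=[a-zA-Z])(?![rR][dD])
      | (?<=1[123])(?=[a-zA-Z])(?![tT][hH]) | (?<=[04-9])(?=[a-zA-Z])(?![tT][hH])
      | (?=$)                               # end of string
    )
''', re.X)

def tokenize(text):
    """Emulate Perl's \\G loop: restart each match where the previous one ended."""
    pos, tokens = 0, []
    m = matcher.match(text, pos)
    while m and m.end() != pos:
        tokens.append(m.group(1))
        pos = m.end()
        m = matcher.match(text, pos)
    return tokens

expected = {
    'abc123': ['abc', '123'],
    'abcABC': ['abc', 'ABC'],
    'ABC123': ['ABC', '123'],
    '123abc': ['123', 'abc'],
    '123ABC': ['123', 'ABC'],
    'AbcDef': ['Abc', 'Def'],
    'ABCDef': ['ABC', 'Def'],
    '1stabc': ['1st', 'abc'],
    '1ndabc': ['1', 'ndabc'],
    '11thabc': ['11th', 'abc'],
    '11stabc': ['11', 'stabc'],
    'abc!!def': ['abc', '!!', 'def'],   # illustrative punctuation case
}

for source, tokens in expected.items():
    print(source, '->', '|'.join(tokenize(source)))
```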