NLTK - nltk.tokenize.RegexpTokenizer - regex not working as expected

RAV*_*AVI 5 python regex nlp tokenize nltk

I am trying to tokenize a text using RegexpTokenizer.

Code:

from nltk.tokenize import RegexpTokenizer
#from nltk.tokenize import word_tokenize

line = "U.S.A Count U.S.A. Sec.of U.S. Name:Dr.John Doe J.Doe 1.11 1,000 10--20 10-20"
pattern = '[\d|\.|\,]+|[A-Z][\.|A-Z]+\b[\.]*|[\w]+|\S'
tokenizer = RegexpTokenizer(pattern)

print tokenizer.tokenize(line)
#print word_tokenize(line)

Output:

['U', '.', 'S', '.', 'A', 'Count', 'U', '.', 'S', '.', 'A', '.', 'Sec', '.', 'of', 'U', '.', 'S', '.', 'Name', ':', 'Dr', '.', 'John', 'Doe', 'J', '.', 'Doe', '1.11', '1,000', '10', '-', '-', '20', '10', '-', '20']

Expected output:

['U.S.A', 'Count', 'U.S.A.', 'Sec', '.', 'of', 'U.S.', 'Name', ':', 'Dr', '.', 'John', 'Doe', 'J.', 'Doe', '1.11', '1,000', '10', '-', '-', '20', '10', '-', '20']

Why does the tokenizer split up my expected tokens "U.S.A" and "U.S."? How can I fix this?

My regex: https://regex101.com/r/dS1jW9/1

Wik*_*żew 8

The point is that your \b is a backspace character: you need to use a raw string literal. You also have literal pipes inside your character classes, which further messes up the output.
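A minimal sketch of both points, using only the standard re module (the sample strings are just for illustration):

import re

# In a regular string literal, "\b" is the backspace character (0x08),
# so the regex engine never sees a word-boundary anchor.
print('\b' == '\x08')   # True
print(r'\b' == '\\b')   # True: the raw string keeps backslash + "b"

# Inside a character class, "|" is a literal pipe, not alternation,
# so a class like [\d|.] also matches stray "|" characters.
print(re.findall(r'[\d|.]+', '1|2 3.4'))   # ['1|2', '3.4']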

This works as expected:

>>> pattern = r'[\d.,]+|[A-Z][.A-Z]+\b\.*|\w+|\S'
>>> tokenizer = RegexpTokenizer(pattern)
>>> print(tokenizer.tokenize(line))

['U.S.A', 'Count', 'U.S.A.', 'Sec', '.', 'of', 'U.S.', 'Name', ':', 'Dr', '.', 'John', 'Doe', 'J.', 'Doe', '1.11', '1,000', '10', '-', '-', '20', '10', '-', '20']
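For completeness, a self-contained version of the fix as a plain Python 3 script (same data and names as in the question):

from nltk.tokenize import RegexpTokenizer

line = "U.S.A Count U.S.A. Sec.of U.S. Name:Dr.John Doe J.Doe 1.11 1,000 10--20 10-20"

# Raw string literal, no stray pipes inside the character classes.
pattern = r'[\d.,]+|[A-Z][.A-Z]+\b\.*|\w+|\S'
tokenizer = RegexpTokenizer(pattern)

print(tokenizer.tokenize(line))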

Note that putting a single \w into a character class makes no sense. Also, you do not need to escape every non-word character (such as a dot) inside a character class, since most of them are treated as literals there (only ^, ], - and \ need special attention).
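A small demonstration of that point with the standard re module (the sample string is just for illustration):

import re

text = "1.11 1,000 10--20"

# Inside a character class "." and "," are already literal, so
# [\.\,] and [.,] describe exactly the same set of characters.
print(re.findall(r'[\d\.\,]+', text) == re.findall(r'[\d.,]+', text))  # True

# A lone \w needs no brackets: [\w]+ and \w+ match the same tokens.
print(re.findall(r'[\w]+', text) == re.findall(r'\w+', text))          # True

# Only ^, ], - and \ may need escaping (or careful placement) in a class;
# a literal hyphen is safest at the start or end, e.g. [-.,]
print(re.findall(r'[-.,]+', text))  # ['.', ',', '--']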