RAV*_*AVI 5 python regex nlp tokenize nltk
I am trying to tokenize text using RegexpTokenizer.
Code:
from nltk.tokenize import RegexpTokenizer
#from nltk.tokenize import word_tokenize
line = "U.S.A Count U.S.A. Sec.of U.S. Name:Dr.John Doe J.Doe 1.11 1,000 10--20 10-20"
pattern = '[\d|\.|\,]+|[A-Z][\.|A-Z]+\b[\.]*|[\w]+|\S'
tokenizer = RegexpTokenizer(pattern)
print tokenizer.tokenize(line)
#print word_tokenize(line)
Output:
['U', '.', 'S', '.', 'A', 'Count', 'U', '.', 'S', '.', 'A', '.', 'Sec', '.', 'of', 'U', '.', 'S', '.', 'Name', ':', 'Dr', '.', 'John', 'Doe', 'J', '.', 'Doe', '1.11', '1,000', '10', '-', '-', '20', '10', '-', '20']
Expected output:
['U.S.A', 'Count', 'U.S.A.', 'Sec', '.', 'of', 'U.S.', 'Name', ':', 'Dr', '.', 'John', 'Doe', 'J.', 'Doe', '1.11', '1,000', '10', '-', '-', '20', '10', '-', '20']
Why does the tokenizer also split my expected tokens "U.S.A" and "U.S."? How can I fix this?
My regex: https://regex101.com/r/dS1jW9/1
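The point is that your \b was a backspace character: in a regular string literal Python turns \b into the character 0x08 before the regex engine ever sees it, so you need to use a raw string literal. Also, you have literal pipes in the character classes (inside [...] a | is an ordinary character, not alternation), which would also mess up your output.
This works as expected: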
>>> pattern = r'[\d.,]+|[A-Z][.A-Z]+\b\.*|\w+|\S'
>>> tokenizer = RegexpTokenizer(pattern)
>>> print(tokenizer.tokenize(line))
['U.S.A', 'Count', 'U.S.A.', 'Sec', '.', 'of', 'U.S.', 'Name', ':', 'Dr', '.', 'John', 'Doe', 'J.', 'Doe', '1.11', '1,000', '10', '-', '-', '20', '10', '-', '20']
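To see the two problems in isolation, here is a minimal sketch that uses only the standard re module. It assumes that re.findall is a fair stand-in for what RegexpTokenizer does in its default (non-gaps) mode; the variable names are made up for illustration, and the backspace is written as \x08 to reproduce what Python makes of the question's non-raw pattern.

```python
import re

line = "U.S.A Count U.S.A. Sec.of U.S. Name:Dr.John Doe J.Doe 1.11 1,000 10--20 10-20"

# In a normal string literal, "\b" is a single backspace character (0x08);
# in a raw string it stays as backslash + "b", which re reads as a word boundary.
backspace = "\b"
word_boundary = r"\b"

# The question's pattern as Python actually sees it: the backspace is spelled
# out as \x08 so this snippet does not rely on unrecognised string escapes.
broken = r'[\d|\.|\,]+|[A-Z][\.|A-Z]+' + '\x08' + r'[\.]*|[\w]+|\S'
fixed = r'[\d.,]+|[A-Z][.A-Z]+\b\.*|\w+|\S'

broken_tokens = re.findall(broken, line)
fixed_tokens = re.findall(fixed, line)

# A pipe inside [...] is a literal character, not alternation.
with_pipe = re.findall(r'[\d|.,]+', '1|2')
without_pipe = re.findall(r'[\d.,]+', '1|2')

print(broken_tokens[:5])        # ['U', '.', 'S', '.', 'A']
print(fixed_tokens[:3])         # ['U.S.A', 'Count', 'U.S.A.']
print(with_pipe, without_pipe)  # ['1|2'] ['1', '2']
```

With the backspace in place, the second alternative can never match (the text contains no 0x08 character), so every abbreviation falls through to the \w+ and \S alternatives and is split at each dot.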
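Note that there is no point in putting a single \w into a character class. Also, you do not need to escape every non-word character (like a dot) inside a character class, as they are mostly treated as literal characters there (only ^, ], - and \ require special attention).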
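The character-class remarks can be checked quickly with the standard re module. This is only an illustrative sketch; the sample strings and variable names are invented for the example.

```python
import re

# A dot inside a character class is already literal, so the backslash is optional.
plain_dot = re.findall(r'[.]', 'a.b')
escaped_dot = re.findall(r'[\.]', 'a.b')

# Wrapping a lone \w in a class changes nothing.
wrapped = re.findall(r'[\w]+', 'J.Doe')
bare = re.findall(r'\w+', 'J.Doe')

# The characters that do need care inside [...]:
#   ^ negates when it comes first, ] closes the class,
#   - forms a range between two characters, \ starts an escape.
negated = re.findall(r'[^.]', 'a.b')
specials = re.findall(r'[\^\]\-\\]', r'a^b]c-d\e')

print(plain_dot, escaped_dot)  # ['.'] ['.']
print(wrapped, bare)           # ['J', 'Doe'] ['J', 'Doe']
print(negated)                 # ['a', 'b']
print(specials)                # ['^', ']', '-', '\\']
```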