alv*_*vas 3 python regex tokenize nltk
This is my current code, but the regular expression is wrong and does not split the sentences correctly. Please help me fix my regex. Thanks.
import nltk
import os, sys, re, glob
from nltk.tokenize import RegexpTokenizer
jp_sent_tokenizer = nltk.RegexpTokenizer(u'[^??????]*[???]')
para = []
para.append(jp_sent_tokenizer.tokenize(u' ???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????otak otak ??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? '))
# Print each tokenized sentence, followed by an end-of-sentence marker.
for sentence in para[0]:
    print(sentence)
    print('this is eos')
# End-of-paragraph marker, printed once after all sentences.
print('this is eop')
I get this output:
??????????????????????????????????????????
this is eos
????????????????????????????????????????????????????????????????????????????
this is eos
???????????????????????
this is eos
???????
this is eos
??????????????????????????????????
this is eos
??????????????????????????????????????????????????????????
this is eos
this is eop
The correct output should look like this:
??????????????????????????????????????????????
this is eos
????????????????????????????????????????????????????????????????????????????
this is eos
???????????????????????
this is eos
??????otak otak ???????????????????
this is eos
??????otak otak ?????????????????????????????????????????????????????????????
this is eos
??????????????????????????????????????????????????????????
this is eos
this is eop
Try this:
u'[^???]*[???]'
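It looks like the quotation marks (??) really do belong to the sentence, so you want to allow them inside a match rather than excluding them.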
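As a rough sketch of that idea: the pattern below excludes only the sentence terminators from the "body" character class, so quotation marks and embedded Latin text stay inside the sentence. The actual characters in the post were lost to encoding, so the terminators `。`, `!` and `?` used here are an assumption, as are the sample text and the helper name `split_sentences`. It uses `re.findall` from the standard library as a stand-in for the NLTK tokenizer.

```python
import re

# Assumed terminators (the originals were lost in the post): 。 ! ?
# Body: any run of non-terminator characters; then exactly one terminator.
SENTENCE_RE = re.compile(u'[^。!?]*[。!?]')


def split_sentences(text):
    """Return the sentences in text, each ending with its terminator.

    Trailing text that has no terminator is not matched and is dropped.
    """
    return [s.strip() for s in SENTENCE_RE.findall(text)]


# Hypothetical sample text, not the text from the question.
sample = u'「こんにちは」と言った。otak otak とは何ですか?おいしい!'

for sentence in split_sentences(sample):
    print(sentence)
    print('this is eos')
print('this is eop')
```

Because the quotation marks are no longer in the excluded set, a sentence that starts with a quote or contains one is kept whole instead of being cut at the quote.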
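I should warn you that in general (well, in English grammar at least) splitting text into whole, proper sentences is very hard, perhaps even impossible. (Consider "1.2", "Dr. Fleishman", and so on.)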
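To make that caveat concrete, here is a small illustration of the same "non-terminators followed by a terminator" pattern applied to English. The sample sentence and the name `naive_split` are made up for this sketch; the point is only that a period inside an abbreviation or a decimal number is indistinguishable from a sentence-ending period to such a regex.

```python
import re

# Same shape of pattern as above, with ASCII terminators.
NAIVE_RE = re.compile(r'[^.!?]*[.!?]')


def naive_split(text):
    """Split on every '.', '!' or '?', with no knowledge of abbreviations."""
    return [piece.strip() for piece in NAIVE_RE.findall(text)]


# One sentence, but it contains three periods.
pieces = naive_split('Dr. Fleishman paid 1.2 dollars.')
print(pieces)  # ['Dr.', 'Fleishman paid 1.', '2 dollars.']
```

A single sentence comes back as three fragments, which is why a terminator-only regex is usually a reasonable fit only for text where the terminator characters are not reused for other purposes.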