Regex for a correct Japanese sentence tokenizer (Python)

alv*_*vas 3 python regex tokenize nltk

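This is the code I currently have, but the regular expression is wrong and does not split the sentences correctly. Please help me fix my regex. Thanks.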

import nltk
import os, sys, re, glob
from nltk.tokenize import RegexpTokenizer

jp_sent_tokenizer = nltk.RegexpTokenizer(u'[^??????]*[???]')

para = []
para.append (jp_sent_tokenizer.tokenize(u' ???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????otak otak ??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? ')

# Print each tokenized sentence, followed by an end-of-sentence marker.
for sentence in para[0]:
    print(sentence)
    print('this is eos')
# End-of-paragraph marker.
print('this is eop')

I get this output:

??????????????????????????????????????????
this is eos
????????????????????????????????????????????????????????????????????????????
this is eos
???????????????????????
this is eos
???????
this is eos
??????????????????????????????????
this is eos
??????????????????????????????????????????????????????????
this is eos
this is eop

The correct output should look like this:

 ??????????????????????????????????????????????
this is eos
????????????????????????????????????????????????????????????????????????????
this is eos
???????????????????????
this is eos
??????otak otak ???????????????????
this is eos
??????otak otak ?????????????????????????????????????????????????????????????
this is eos
?????????????????????????????????????????????????????????? 
this is eos
this is eop

Kob*_*obi 5

Try this:

u'[^???]*[???]'

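It looks like the quotation marks (??) do belong to the sentence, so you want to allow them inside a match instead of excluding them.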
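A minimal sketch of the difference, using only `re.findall` from the standard library so it runs without NLTK. The terminator set `。!?` and the corner quotes `「」` are assumptions, because the actual characters in this post were lost in encoding. The sample text is invented for illustration, and the first pattern only approximates what the question's regex appears to do.

```python
# -*- coding: utf-8 -*-
import re

# Assumed sentence terminators: ideographic full stop, full-width '!' and '?'.
TERMINATORS = u'。!?'

# Approximation of the question's pattern: quotes are excluded from the sentence body.
quotes_excluded = re.compile(u'[^「」%s]*[%s]' % (TERMINATORS, TERMINATORS))
# Suggested pattern: only the terminators are excluded, so quotes may appear in a sentence.
quotes_allowed = re.compile(u'[^%s]*[%s]' % (TERMINATORS, TERMINATORS))

# Invented sample text with a quoted phrase in the first sentence.
text = u'彼は「otak otak」を食べた。美味しかった!'

broken = quotes_excluded.findall(text)
fixed = quotes_allowed.findall(text)

for sentence in fixed:
    print(sentence)
```

With the quotes excluded, the first match can only start after the closing quote, so everything up to and including the quoted phrase is silently dropped. Allowing the quotes keeps the whole sentence. A terminator that appears inside a quotation would still end the match early with either pattern.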

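I should warn that in general (well, in English grammar at least) it is very hard, or even impossible, to split text into complete sentences correctly this way. (Consider 1.2, Dr. Fleishman, and so on.)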
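To illustrate that caveat, here is a small sketch that applies the same "anything up to a terminator" idea to English. The pattern and the sample sentence are made up for this example.

```python
import re

# Naive English splitter: everything up to the next '.', '!' or '?'.
naive = re.compile(r'[^.!?]*[.!?]')

# Invented sample containing an abbreviation and a decimal number.
text = 'Dr. Fleishman paid 1.2 dollars. He left.'

pieces = naive.findall(text)
for piece in pieces:
    print(repr(piece))
```

The two real sentences come back as four pieces, because the periods in `Dr.` and `1.2` are treated as sentence ends.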