Tokenizing mixed Chinese and English text improperly splits English words into letters

Question

yhy*_*ord 5 nlp tokenize nltk stanford-nlp python-3.x

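When I tokenize text that contains both Chinese and English, the result splits the English words into individual letters, which is not what I want. Consider the following code: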

from nltk.tokenize.stanford_segmenter import StanfordSegmenter
segmenter = StanfordSegmenter()
segmenter.default_config('zh')
print(segmenter.segment('?????Melissa Dell'))

The output is ???? ? M e l i s s a D e l l. How can I change this behavior?

Answer 1

Sta*_*elp 0

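I can't speak for nltk, but Stanford CoreNLP does not exhibit this behavior when run on this sentence.

If you run this command on your example, you will get the correct tokenization: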

java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -file example.txt -outputFormat text
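
If you would rather launch that same command from Python, a minimal sketch could look like the following. The helper names build_corenlp_command and run_corenlp are made up for illustration, and the sketch assumes the CoreNLP jars and the Chinese models are already on your Java CLASSPATH, as the command above also does. Writing the input file explicitly as UTF-8 helps keep the Chinese characters from being mangled before CoreNLP reads them.

```python
import subprocess
from pathlib import Path


def build_corenlp_command(input_path, memory="8g"):
    """Return the argument list for the CoreNLP command quoted above."""
    return [
        "java", f"-Xmx{memory}",
        "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-props", "StanfordCoreNLP-chinese.properties",
        "-file", str(input_path),
        "-outputFormat", "text",
    ]


def run_corenlp(text, input_path="example.txt"):
    # Hypothetical helper: save the text as UTF-8, then run CoreNLP on the file.
    Path(input_path).write_text(text, encoding="utf-8")
    # Assumption: the CoreNLP jars and Chinese models are on the Java CLASSPATH.
    return subprocess.run(build_corenlp_command(input_path), check=True)
```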

If you want to access Stanford CoreNLP from Python, you may want to consider using stanza.

More information here: https://github.com/stanfordnlp/stanza
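
For reference, here is a rough sketch of tokenizing with stanza. It assumes a current release of the stanza package (pip install stanza) and its Chinese models, which are downloaded on first use. The function names tokens_per_sentence and tokenize_chinese are hypothetical, and this is one way to set it up rather than the only one.

```python
def tokens_per_sentence(doc):
    """Collect token strings from a stanza Document, one list per sentence."""
    return [[token.text for token in sentence.tokens] for sentence in doc.sentences]


def tokenize_chinese(text):
    # Imported lazily so the helper above works without stanza installed.
    import stanza

    # Fetches the Chinese models if they are not cached yet (needs network access).
    stanza.download("zh")
    # Only the tokenizer is loaded; other processors are not needed here.
    nlp = stanza.Pipeline("zh", processors="tokenize")
    return tokens_per_sentence(nlp(text))
```

Calling tokenize_chinese on your mixed Chinese/English sentence returns a list of token lists, one per sentence. How the English names are split depends on the model, so check the result on your own data.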
