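How can I use NLTK's functions for Persian?
For example, 'concordance'. When I use 'concordance', the answer is 'No matches', even though the word I pass as the argument does occur in my text.
The input is very simple. It contains "hello سلام". When the argument of 'concordance' is 'hello', the answer is correct, but if it is 'سلام', the answer is 'No matches'. My expected output is 'Displaying 1 of 1 matches'.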
import nltk
from urllib import urlopen
url = "file:///home/.../1.html"
raw = urlopen(url).read()
raw = nltk.clean_html(raw)
tokens = nltk.word_tokenize(raw)
tokens = tokens[:12]
text = nltk.Text(tokens)
print text.concordance('سلام')
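One possible cause, offered as an assumption rather than a confirmed diagnosis: in Python 2, `urlopen(url).read()` returns a byte string, so the tokens are undecoded bytes and may not compare equal to the query string. The sketch below (Python 3, standard library only) illustrates the effect. `concordance` here is a hypothetical helper written for this example, not NLTK's method, and UTF-8 is an assumed encoding for the page.

```python
def concordance(tokens, word, width=2):
    """Return (left, match, right) context tuples for each occurrence of word."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == word:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, tok, right))
    return hits

# Bytes, as a file or URL read would return them (UTF-8 is an assumption).
raw_bytes = "hello سلام".encode("utf-8")

# Undecoded byte tokens never compare equal to a text query.
byte_tokens = raw_bytes.split()
no_hits = concordance(byte_tokens, "سلام")

# Decoding first puts tokens and query in the same type and encoding.
text_tokens = raw_bytes.decode("utf-8").split()
hits = concordance(text_tokens, "سلام")

print(len(no_hits), len(hits))
```

If this is the cause, decoding the raw bytes with the page's real encoding before tokenizing, and passing the query as a Unicode string, should let the lookup match. Note also that `nltk.clean_html` was removed in NLTK 3, so on current versions the HTML has to be stripped with another tool.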
alv*_*vas 28
I strongly recommend hazm, a Python library for Persian NLP: https://github.com/sobhe/hazm
Usage:
>>> from __future__ import unicode_literals
>>> from hazm import Normalizer
>>> normalizer = Normalizer()
>>> normalizer.normalize('اصلاح نويسه ها و استفاده از نیم‌فاصله پردازش را آسان مي كند')
'اصلاح نویسه‌ها و استفاده از نیم‌فاصله پردازش را آسان می‌کند'
>>> from hazm import sent_tokenize, word_tokenize
>>> sent_tokenize('ما هم برای وصل کردن آمدیم! ولی برای پردازش، جدا بهتر نیست؟')
['ما هم برای وصل کردن آمدیم!', 'ولی برای پردازش، جدا بهتر نیست؟']
>>> word_tokenize('ولی برای پردازش، جدا بهتر نیست؟')
['ولی', 'برای', 'پردازش', '،', 'جدا', 'بهتر', 'نیست', '؟']
>>> from hazm import Stemmer, Lemmatizer
>>> stemmer = Stemmer()
>>> stemmer.stem('کتاب‌ها')
'کتاب'
>>> lemmatizer = Lemmatizer()
>>> lemmatizer.lemmatize('می‌روم')
'رفت#رو'
>>> from hazm import POSTagger
>>> tagger = POSTagger()
>>> tagger.tag(word_tokenize('ما بسیار کتاب می‌خوانیم'))
[('ما', 'PR'), ('بسیار', 'ADV'), ('کتاب', 'N'), ('می‌خوانیم', 'V')]
>>> from hazm import DependencyParser
>>> parser = DependencyParser(tagger=POSTagger())
>>> parser.parse(word_tokenize('زنگ‌ها برای که به صدا درمی‌آید؟'))
<DependencyGraph with 8 nodes>
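Normalization also matters for exact-match lookups such as a concordance, because Persian text often mixes Arabic and Persian code points for what looks like the same letter (Arabic yeh U+064A vs. Persian yeh U+06CC, Arabic kaf U+0643 vs. Persian keheh U+06A9). The following is a minimal sketch of that idea using only the standard library. The two-character mapping and the helper names are assumptions made for illustration, and hazm's `Normalizer` does considerably more than this.

```python
# Hypothetical two-entry mapping for illustration only (not hazm's implementation).
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06A9",  # Arabic kaf -> Persian keheh
})

def normalize_chars(text):
    """Map the Arabic variants above to their Persian code points."""
    return text.translate(ARABIC_TO_PERSIAN)

def find_positions(tokens, word):
    """Return the indexes of tokens that are exactly equal to word."""
    return [i for i, tok in enumerate(tokens) if tok == word]

# Tokens as they might appear in scraped text: written with Arabic yeh and kaf.
tokens = ["آسان", "\u0645\u064A", "\u0643\u0646\u062F"]
# The same word typed on a Persian keyboard: written with Persian yeh.
query = "\u0645\u06CC"

raw_hits = find_positions(tokens, query)
norm_hits = find_positions(
    [normalize_chars(t) for t in tokens],
    normalize_chars(query),
)

print(raw_hits, norm_hits)
```

Applying the same normalization to both the text and the query before building an `nltk.Text` is one way to avoid misses of this kind.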