NLTK for Persian

ikj*_*ikj 13 python nlp nltk

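How can I use NLTK's features for Persian?

For example, 'concordance': when I use it, the result is "No matches", even though the word I pass as the argument does occur in my text.

The input is very simple. It contains "helloسلام". When the argument to 'concordance' is 'hello', the result is correct, but when it is 'سلام', the result is 'No matches'. My expected output is 'Displaying 1 of 1 matches'.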

    import nltk
    from urllib import urlopen
    url = "file:///home/.../1.html"
    raw = urlopen(url).read()
    raw = nltk.clean_html(raw)
    tokens = nltk.word_tokenize(raw)
    tokens = tokens[:12]
    text = nltk.Text(tokens)
    print text.concordance('سلام')

alv*_*vas 28

I strongly recommend this Python library for Persian NLP: https://github.com/sobhe/hazm

Usage:

>>> from __future__ import unicode_literals

>>> from hazm import Normalizer
>>> normalizer = Normalizer()
>>> normalizer.normalize('اصلاح نويسه ها و استفاده از نیم‌فاصله پردازش را آسان مي كند')
'اصلاح نویسه‌ها و استفاده از نیم‌فاصله پردازش را آسان می‌کند'

>>> from hazm import sent_tokenize, word_tokenize
>>> sent_tokenize('ما هم برای وصل کردن آمدیم! ولی برای پردازش، جدا بهتر نیست؟')
['ما هم برای وصل کردن آمدیم!', 'ولی برای پردازش، جدا بهتر نیست؟']
>>> word_tokenize('ولی برای پردازش، جدا بهتر نیست؟')
['ولی', 'برای', 'پردازش', '،', 'جدا', 'بهتر', 'نیست', '؟']

>>> from hazm import Stemmer, Lemmatizer
>>> stemmer = Stemmer()
>>> stemmer.stem('کتاب‌ها')
'کتاب'
>>> lemmatizer = Lemmatizer()
>>> lemmatizer.lemmatize('می‌روم')
'رفت#رو'

>>> from hazm import POSTagger
>>> tagger = POSTagger()
>>> tagger.tag(word_tokenize('ما بسیار کتاب می‌خوانیم'))
[('ما', 'PR'), ('بسیار', 'ADV'), ('کتاب', 'N'), ('می‌خوانیم', 'V')]

>>> from hazm import DependencyParser
>>> parser = DependencyParser(tagger=POSTagger())
>>> parser.parse(word_tokenize('زنگ‌ها برای که به صدا درمی‌آید؟'))
<DependencyGraph with 8 nodes>
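As for the 'No matches' result in the question: one plausible cause (an assumption, nothing in the thread confirms it) is that the tokens and the search word are not the same kind of string, for example undecoded bytes on one side and Unicode text on the other. The following is a minimal Python 3 sketch of that idea using only the standard library. `concordance_lines` is a hypothetical stand-in for `nltk.Text.concordance`, not NLTK's implementation, and the sample string `"hello سلام"` (with a space) is assumed for illustration.

```python
# Sketch (Python 3): decode bytes to Unicode text before tokenizing.
# `concordance_lines` is a hypothetical helper standing in for
# nltk.Text.concordance; it is not part of NLTK.

def concordance_lines(tokens, word, width=2):
    """Return each occurrence of `word` with up to `width` tokens of context per side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == word:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [token] + right))
    return lines

# Simulate what reading a UTF-8 encoded file yields: raw bytes, not text.
# The sample content is an assumption made for this example.
raw_bytes = "hello سلام".encode("utf-8")

# Decoding first gives Unicode tokens that compare equal to a Unicode query.
text_tokens = raw_bytes.decode("utf-8").split()
matches = concordance_lines(text_tokens, "سلام")
print("Displaying {0} of {0} matches:".format(len(matches)))
for line in matches:
    print(line)

# Without decoding, the tokens are bytes and never equal the str query.
byte_tokens = raw_bytes.split()
print(concordance_lines(byte_tokens, "سلام"))
```

If this is indeed the problem, decoding the downloaded page with the encoding it was actually saved in before tokenizing, and then passing the resulting tokens to `nltk.Text`, should let `concordance('سلام')` find the word.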