NLTK - Automatically converting similar words

use*_*931 7 python algorithm nltk wordnet gensim

Overall goal: I am building an LDA model of product reviews in Python using NLTK and Gensim. I want to run it on different n-grams.

Problem: Everything is fine with unigrams, but when I use bigrams I start getting topics with repeated information. For example, topic 1 might contain: ['good product', 'good value'], and topic 4 might contain: ['great product', 'great value']. To a human these clearly convey the same information, but 'good product' and 'great product' are obviously distinct bigrams. How do I determine algorithmically that 'good product' and 'great product' are similar enough that I can convert every occurrence of one into the other (probably the one that occurs more often in the corpus)?

What I have tried: I played around with WordNet's Synset tree, with little luck. It turns out that good is an "adjective" but great is an "adjective satellite", and therefore path similarity returns None. My thought process was to do the following:

  1. POS-tag the sentence
  2. Use those POS tags to find the correct Synsets
  3. Compute the similarity of the two Synsets
  4. If they are above some threshold, count the occurrences of the two words
  5. Replace the less frequent word with the more frequent one
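Steps 4-5 above can be sketched independently of WordNet by injecting whatever similarity score steps 1-3 produce. This is an illustrative outline, not a complete solution; `merge_similar_words`, the threshold, and the toy similarity function are all stand-ins:

```python
from collections import Counter

def merge_similar_words(tokens, similarity, threshold=0.8):
    """Steps 4-5: for any pair of words whose similarity clears the
    threshold, replace the rarer word with the more frequent one.
    `similarity(a, b)` is a stand-in for a synset-based score and may
    return None when no score is available."""
    counts = Counter(tokens)
    # Most frequent words first, so replacements map rare -> frequent.
    words = sorted(counts, key=counts.get, reverse=True)
    replace = {}
    for i, frequent in enumerate(words):
        for rare in words[i + 1:]:
            if rare in replace:
                continue
            score = similarity(frequent, rare)
            if score is not None and score >= threshold:
                replace[rare] = frequent
    return [replace.get(t, t) for t in tokens]
```

With a toy similarity that only relates good and great, `['good', 'product', 'great', 'product', 'great']` collapses onto the more frequent great.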

Ideally, though, I would like an algorithm that can determine that good and great are similar in my corpus (perhaps in a co-occurrence sense), so that it extends to words that are not part of ordinary English but do appear in my corpus, and so that it extends to n-grams (perhaps Oracle and terrible are synonymous in my corpus, or feature engineering and feature creation are similar).
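The co-occurrence idea can be approximated without WordNet by comparing words' context-count vectors with cosine similarity. A minimal stdlib-only sketch, where the window size and tokenization are illustrative choices (in practice gensim's Word2Vec would do this far better):

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Build a context-count vector for every word from token lists."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0
```

Words that appear in the same contexts (good product, great product) end up with high cosine similarity even if neither word is in WordNet, which is what makes this usable for corpus-specific vocabulary and n-grams.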

Any suggestions for an algorithm, or suggestions for making WordNet synsets behave?

alv*_*vas 2

If you are going to use WordNet, then you have

Problem 1: Word sense disambiguation (WSD), i.e. how do you automatically determine which synset to use?

>>> from nltk.corpus import wordnet as wn
>>> for i in wn.synsets('good','a'):
...     print i.name, i.definition
... 
good.a.01 having desirable or positive qualities especially those suitable for a thing specified
full.s.06 having the normally expected amount
good.a.03 morally admirable
estimable.s.02 deserving of esteem and respect
beneficial.s.01 promoting or enhancing well-being
good.s.06 agreeable or pleasing
good.s.07 of moral excellence
adept.s.01 having or showing knowledge and skill and aptitude
good.s.09 thorough
dear.s.02 with or in a close or intimate relationship
dependable.s.04 financially sound
good.s.12 most suitable or right for a particular purpose
good.s.13 resulting favorably
effective.s.04 exerting force or influence
good.s.15 capable of pleasing
good.s.16 appealing to the mind
good.s.17 in excellent physical condition
good.s.18 tending to promote physical well-being; beneficial to health
good.s.19 not forged
good.s.20 not left to spoil
good.s.21 generally admired

>>> for i in wn.synsets('great','a'):
...     print i.name, i.definition
... 
great.s.01 relatively large in size or number or extent; larger than others of its kind
great.s.02 of major significance or importance
great.s.03 remarkable or out of the ordinary in degree or magnitude or effect
bang-up.s.01 very good
capital.s.03 uppercase
big.s.13 in an advanced stage of pregnancy

Assuming you somehow get the correct sense, perhaps you have tried something like pywsd (https://github.com/alvations/pywsd), and suppose you get the right POS and synset:
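For intuition, pywsd implements approaches like the Lesk algorithm; a toy, stdlib-only rendition (gloss overlap against a hand-written sense inventory, purely illustrative — real glosses would come from the WordNet definitions shown above):

```python
def simplified_lesk(word, context, sense_glosses):
    """Pick the sense whose gloss shares the most words with the context.
    `sense_glosses` maps sense name -> gloss string; this is a toy
    stand-in for a proper WSD system."""
    context_words = set(context.lower().split())
    best, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context_words & set(gloss.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best
```

Even this crude overlap count shows why WSD matters here: "a good bank" in a financial context should resolve to dependable.s.04 rather than good.a.01.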

good.a.01: having desirable or positive qualities especially those suitable for a thing specified
great.s.01: relatively large in size or number or extent; larger than others of its kind

Problem 2: How are you going to compare these 2 synsets?

Let's try the similarity functions, but you will realize that they give you no score:

>>> good = wn.synsets('good','a')[0]
>>> great = wn.synsets('great','a')[0]
>>> print max(wn.path_similarity(good,great), wn.path_similarity(great, good))
None
>>> print max(wn.wup_similarity(good,great), wn.wup_similarity(great, good))

>>> print max(wn.res_similarity(good,great,semcor_ic), wn.res_similarity(great, good,semcor_ic))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1312, in res_similarity
    return synset1.res_similarity(synset2, ic, verbose)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 738, in res_similarity
    ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1643, in _lcs_ic
    (synset1, synset2))
nltk.corpus.reader.wordnet.WordNetError: Computing the least common subsumer requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.
>>> print max(wn.jcn_similarity(good,great,semcor_ic), wn.jcn_similarity(great, good,semcor_ic))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1316, in jcn_similarity
    return synset1.jcn_similarity(synset2, ic, verbose)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 759, in jcn_similarity
    ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1643, in _lcs_ic
    (synset1, synset2))
nltk.corpus.reader.wordnet.WordNetError: Computing the least common subsumer requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.
>>> print max(wn.lin_similarity(good,great,semcor_ic), wn.lin_similarity(great, good,semcor_ic))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1320, in lin_similarity
    return synset1.lin_similarity(synset2, ic, verbose)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 789, in lin_similarity
    ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1643, in _lcs_ic
    (synset1, synset2))
nltk.corpus.reader.wordnet.WordNetError: Computing the least common subsumer requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.
>>> print max(wn.lch_similarity(good,great), wn.lch_similarity(great, good))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1304, in lch_similarity
    return synset1.lch_similarity(synset2, verbose, simulate_root)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 638, in lch_similarity
    (self, other))
nltk.corpus.reader.wordnet.WordNetError: Computing the lch similarity requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.
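Since these calls either return None or raise WordNetError depending on the POS combination, a small defensive wrapper (a hypothetical helper, not part of NLTK) keeps an automated pipeline from crashing while still trying both argument orders:

```python
def safe_max_similarity(sim_fn, s1, s2):
    """Try a similarity function in both directions, swallow errors,
    and return the best defined score, or None if nothing usable
    came back."""
    scores = []
    for a, b in ((s1, s2), (s2, s1)):
        try:
            scores.append(sim_fn(a, b))
        except Exception:  # e.g. nltk's WordNetError on a POS mismatch
            pass
    scores = [s for s in scores if s is not None]
    return max(scores) if scores else None
```

Any of the `wn.*_similarity` functions above can be passed in as `sim_fn` (wrapping the IC-based ones in a lambda that supplies `semcor_ic`).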

Let's try a different pair of synsets. Since good has both adjective and satellite-adjective senses while great has only satellite senses, let's use the lowest common denominator:

good.s.13 resulting favorably
great.s.01 relatively large in size or number or extent; larger than others of its kind

You realize that there is still no usable similarity information for comparing satellite-adjectives:

>>> print max(wn.lin_similarity(good,great,semcor_ic), wn.lin_similarity(great, good,semcor_ic))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1320, in lin_similarity
    return synset1.lin_similarity(synset2, ic, verbose)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 789, in lin_similarity
    ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1645, in _lcs_ic
    ic1 = information_content(synset1, ic)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1666, in information_content
    raise WordNetError(msg % synset.pos)
nltk.corpus.reader.wordnet.WordNetError: Information content file has no entries for part-of-speech: s
>>> print max(wn.path_similarity(good,great), wn.path_similarity(great, good))
None

By now it seems that WordNet creates more problems than it solves. Let's try another approach: word clustering, see http://en.wikipedia.org/wiki/Word-sense_induction
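One way to act on the clustering idea: greedily group words whose corpus-based similarity clears a threshold. This is a deliberately naive single-link sketch (the similarity function and threshold are placeholders for something learned from the corpus, e.g. the co-occurrence cosine above):

```python
def cluster_words(words, similarity, threshold=0.7):
    """Greedy single-link clustering: a word joins the first existing
    cluster containing any member similar enough to it, otherwise it
    starts a new cluster."""
    clusters = []
    for word in words:
        for cluster in clusters:
            if any(similarity(word, member) >= threshold
                   for member in cluster):
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters
```

Each resulting cluster can then be collapsed onto its most frequent member, which also works for bigrams like feature engineering / feature creation, since nothing here depends on WordNet.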

This is also where I gave up on answering the broad and open-ended question the OP posted, because a great deal of work has been done on clustering, and none of it is automatic for mere mortals like me =)