Why does the similar() method in the nltk module produce different results on different machines?

Dav*_*les 15 python nlp similarity corpus nltk

I have taught some introductory classes on text mining with Python, and the class tried the similar() method on the provided practice texts. Some students got different results from text1.similar() than others.

All versions etc. were the same.

Does anyone know why these differences occur? Thanks.

The code used at the command line:

python
>>> import nltk
>>> nltk.download() #here you use the pop-up window to download texts
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>> text1.similar("monstrous")
mean part maddens doleful gamesome subtly uncommon careful untoward
exasperate loving passing mouldy christian few true mystifying
imperial modifies contemptible
>>> text2.similar("monstrous")
very heartily so exceedingly remarkably as vast a great amazingly
extremely good sweet

The lists of terms returned by the similar() method vary from user to user; they share many words, but they are not identical lists. All users were on the same operating system and used the same versions of Python and NLTK.

I hope this makes the question clearer. Thanks.

b30*_*000 16

In your example, there are 40 other words that have exactly one context in common with the word 'monstrous'. In the similar() function, a Counter object is used to count the words with similar contexts, and then the most common ones (default 20) are printed. Since all 40 have the same frequency, the order can differ.

From the documentation of Counter.most_common:

Elements with equal counts are ordered arbitrarily

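Since most_common() breaks ties arbitrarily, one way to make the output reproducible is to impose an explicit secondary sort key. This is a minimal sketch; the helper name most_common_stable is my own invention, not part of NLTK or the standard library:

```python
from collections import Counter

def most_common_stable(counter, n=None):
    """most_common() with ties broken alphabetically, so the result
    no longer depends on hash ordering."""
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked if n is None else ranked[:n]

fd = Counter({'foo': 1, 'bar': 1, 'foobar': 2})
print(most_common_stable(fd, 2))  # [('foobar', 2), ('bar', 1)]
```

With this, every run would print the tied words in the same (alphabetical) order, at the cost of diverging from what similar() actually does.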

I checked the frequencies of the similar words with this code (it is essentially a copy of the relevant part of the function's code):

from nltk.book import *
from nltk.util import tokenwrap
from collections import Counter

word = 'monstrous'
num = 20

text1.similar(word)

wci = text1._word_context_index._word_to_contexts

if word in wci.conditions():
    contexts = set(wci[word])
    fd = Counter(w for w in wci.conditions() for c in wci[w]
                 if c in contexts and not w == word)
    words = [w for w, _ in fd.most_common(num)]
    # print(tokenwrap(words))

print(fd)
print(len(fd))
print(fd.most_common(num))

Output (different runs give me different output):

Counter({'doleful': 1, 'curious': 1, 'delightfully': 1, 'careful': 1, 'uncommon': 1, 'mean': 1, 'perilous': 1, 'fearless': 1, 'imperial': 1, 'christian': 1, 'trustworthy': 1, 'untoward': 1, 'maddens': 1, 'true': 1, 'contemptible': 1, 'subtly': 1, 'wise': 1, 'lamentable': 1, 'tyrannical': 1, 'puzzled': 1, 'vexatious': 1, 'part': 1, 'gamesome': 1, 'determined': 1, 'reliable': 1, 'lazy': 1, 'passing': 1, 'modifies': 1, 'few': 1, 'horrible': 1, 'candid': 1, 'exasperate': 1, 'pitiable': 1, 'abundant': 1, 'mystifying': 1, 'mouldy': 1, 'loving': 1, 'domineering': 1, 'impalpable': 1, 'singular': 1})


alv*_*vas 6

In short:

It has to do with how the keys of the Counter dictionary are hashed in python3 when the similar() function is used. See http://pastebin.com/ysAF6p6h

See also: how and why is dictionary hashing different in python2 and python3?


In long:

Let's start with:

from nltk.book import *

The import here comes from https://github.com/nltk/nltk/blob/develop/nltk/book.py, which imports the nltk.text.Text object and reads several corpora into Text objects.

For example, this is how the text1 variable is read in nltk.book:

>>> import nltk.corpus
>>> from nltk.text import Text
>>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))

Now, if we look at the code of the similar() function at https://github.com/nltk/nltk/blob/develop/nltk/text.py#L377, we see this initialization, which builds self._word_context_index the first time it is accessed on the instance:

def similar(self, word, num=20):
    """
    Distributional similarity: find other words which appear in the
    same contexts as the specified word; list most similar words first.
    :param word: The word used to seed the similarity search
    :type word: str
    :param num: The number of words to generate (default=20)
    :type num: int
    :seealso: ContextIndex.similar_words()
    """
    if '_word_context_index' not in self.__dict__:
        #print('Building word-context index...')
        self._word_context_index = ContextIndex(self.tokens, 
                                                filter=lambda x:x.isalpha(), 
                                                key=lambda s:s.lower())


    word = word.lower()
    wci = self._word_context_index._word_to_contexts
    if word in wci.conditions():
        contexts = set(wci[word])
        fd = Counter(w for w in wci.conditions() for c in wci[w]
                      if c in contexts and not w == word)
        words = [w for w, _ in fd.most_common(num)]
        print(tokenwrap(words))
    else:
        print("No matches")

So we are pointed to the nltk.text.ContextIndex object, which is supposed to collect all words with similar context windows and store them. Its docstring says:

A bidirectional index between words and their 'contexts' in a text. The context of a word is usually defined to be the words that occur in a fixed window around the word; but other definitions may also be used by providing a custom context function.

By default, when you call the similar() function, it initializes _word_context_index with the default context settings, i.e. a window of one token to the left and one to the right; see https://github.com/nltk/nltk/blob/develop/nltk/text.py#L40

@staticmethod
def _default_context(tokens, i):
    """One left token and one right token, normalized to lowercase"""
    left = (tokens[i-1].lower() if i != 0 else '*START*')
    right = (tokens[i+1].lower() if i != len(tokens) - 1 else '*END*')
    return (left, right)
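The context rule above can be tried outside NLTK. This is a standalone copy of the same logic, applied to a toy token list (the sentence is made up for illustration):

```python
def default_context(tokens, i):
    """One left token and one right token, normalized to lowercase."""
    left = tokens[i - 1].lower() if i != 0 else '*START*'
    right = tokens[i + 1].lower() if i != len(tokens) - 1 else '*END*'
    return (left, right)

tokens = ['That', 'whale', 'is', 'monstrous', 'big']
print(default_context(tokens, 3))  # ('is', 'big')
print(default_context(tokens, 0))  # ('*START*', 'whale')
print(default_context(tokens, 4))  # ('monstrous', '*END*')
```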

In the similar() function, we see that it iterates over the words in the contexts stored in the word_context_index, i.e. wci = self._word_context_index._word_to_contexts.

Essentially, _word_to_contexts is a dictionary whose keys are the words in the corpus and whose values are the left and right words, from https://github.com/nltk/nltk/blob/develop/nltk/text.py#L55:

    self._word_to_contexts = CFD((self._key(w), self._context_func(tokens, i))
                                 for i, w in enumerate(tokens))

Here we see that it is a CFD, an nltk.probability.ConditionalFreqDist object, which does not involve any smoothing of token probabilities; see the full code at https://github.com/nltk/nltk/blob/develop/nltk/probability.py#L1646.
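As a rough sketch of what _word_to_contexts holds, the mapping can be rebuilt with plain collections (no NLTK required; word_to_contexts is a hypothetical helper, not the library's API):

```python
from collections import defaultdict

def word_to_contexts(tokens):
    """Toy equivalent of ContextIndex._word_to_contexts: map each
    lowercased word to the list of (left, right) contexts it occurs in."""
    wci = defaultdict(list)
    for i, w in enumerate(tokens):
        left = tokens[i - 1].lower() if i != 0 else '*START*'
        right = tokens[i + 1].lower() if i != len(tokens) - 1 else '*END*'
        wci[w.lower()].append((left, right))
    return wci

tokens = ['a', 'monstrous', 'whale', 'and', 'a', 'curious', 'whale']
wci = word_to_contexts(tokens)
print(wci['monstrous'])  # [('a', 'whale')]
print(wci['curious'])    # [('a', 'whale')]
```

Note that 'monstrous' and 'curious' share the context ('a', 'whale'), which is exactly the kind of overlap similar() counts.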


The only possibility of getting different results is when the similar() function loops through the most_common words at https://github.com/nltk/nltk/blob/develop/nltk/text.py#L402

Given that two keys in the Counter object have the same counts, the word whose hash sorts lower will be printed first, and the hash of a key depends on the bit size of the CPU; see http://www.laurentluce.com/posts/python-dictionary-implementation/


The whole process of finding the similar words is itself deterministic, since:

  • the corpus/input is fixed: Text(gutenberg.words('melville-moby_dick.txt'))
  • the default context for each word is also fixed, i.e. self._word_context_index
  • the computation of the conditional frequency distribution _word_context_index._word_to_contexts is discrete

Except when the function outputs the most_common list: whenever there is a tie among the Counter values, it outputs the list of keys according to their hashes.

In python2, there is no reason to get different outputs from different instances on the same machine with the following code:

$ python
>>> from nltk.book import *
>>> text1.similar('monstrous')
>>> exit()
$ python
>>> from nltk.book import *
>>> text1.similar('monstrous')
>>> exit()
$ python
>>> from nltk.book import *
>>> text1.similar('monstrous')
>>> exit()

But in Python3, it gives a different output every time you run text1.similar('monstrous'); see http://pastebin.com/ysAF6p6h
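The per-run variation in Python 3 comes from hash randomization, which is enabled by default since Python 3.3. If that reading is right, fixing the PYTHONHASHSEED environment variable should make the tie order reproducible across interpreter runs; a small check using subprocesses:

```python
import os
import subprocess
import sys

# A Counter where every key is tied, so the order of most_common()
# is decided purely by hash ordering.
cmd = ("from collections import Counter; "
       "print(Counter({'foo': 1, 'bar': 1, 'foobar': 1}).most_common())")

def run(seed):
    """Run the snippet in a fresh interpreter with a given hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    return subprocess.check_output([sys.executable, '-c', cmd], env=env)

# With a fixed hash seed, every interpreter run agrees on the tie order.
assert run('0') == run('0')
print(run('0').decode().strip())
```

This only stabilizes the order between runs on the same interpreter build; it does not guarantee the same order across machines or Python versions.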


Here is a simple experiment to demonstrate the quirky hashing differences between python2 and python3:

alvas@ubi:~$ python -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('foobar', 1), ('foo', 1), ('bar', 1), ('barfoo', 1)]
alvas@ubi:~$ python -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('foobar', 1), ('foo', 1), ('bar', 1), ('barfoo', 1)]
alvas@ubi:~$ python -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('foobar', 1), ('foo', 1), ('bar', 1), ('barfoo', 1)]


alvas@ubi:~$ python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('barfoo', 1), ('foobar', 1), ('bar', 1), ('foo', 1)]
alvas@ubi:~$ python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('foo', 1), ('barfoo', 1), ('bar', 1), ('foobar', 1)]
alvas@ubi:~$ python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('bar', 1), ('barfoo', 1), ('foobar', 1), ('foo', 1)]

  • Accepted b3000's answer, which makes the problem easier to understand; anyone who wants to know more will scroll down to this answer =) (2 upvotes)