Issues in getting trigrams using Gensim

8 python data-mining text-mining gensim word2vec

I want to get bigrams and trigrams from the example sentences below.

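My code works fine for bigrams. However, it does not capture the trigrams in the data (e.g., "human computer interaction", which appears in 5 of my sentences).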

Approach 1: Below is my code using Phrases in Gensim.

from gensim.models import Phrases
documents = ["the mayor of new york was there", "human computer interaction and machine learning has now become a trending research area","human computer interaction is interesting","human computer interaction is a pretty interesting subject", "human computer interaction is a great and new subject", "machine learning can be useful sometimes","new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]

bigram = Phrases(sentence_stream, min_count=1, threshold=1, delimiter=b' ')
trigram = Phrases(bigram[sentence_stream])

for sent in sentence_stream:
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigrams_]

    print(bigrams_)
    print(trigrams_)

Approach 2: I also tried using Phraser together with Phrases, but it did not work either.

from gensim.models import Phrases
from gensim.models.phrases import Phraser
documents = ["the mayor of new york was there", "human computer interaction and machine learning has now become a trending research area","human computer interaction is interesting","human computer interaction is a pretty interesting subject", "human computer interaction is a great and new subject", "machine learning can be useful sometimes","new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]

bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
bigram_phraser = Phraser(bigram)
trigram = Phrases(bigram_phraser[sentence_stream])

for sent in sentence_stream:
    bigrams_ = bigram_phraser[sent]
    trigrams_ = trigram[bigrams_]

    print(bigrams_)
    print(trigrams_)

Please help me fix this problem with getting trigrams.

I am following the example documentation of Gensim.

stj*_*iht 8

I was able to get bigrams and trigrams with a few modifications to your code:

from gensim.models import Phrases
documents = ["the mayor of new york was there", "human computer interaction and machine learning has now become a trending research area","human computer interaction is interesting","human computer interaction is a pretty interesting subject", "human computer interaction is a great and new subject", "machine learning can be useful sometimes","new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]

bigram = Phrases(sentence_stream, min_count=1, delimiter=b' ')
trigram = Phrases(bigram[sentence_stream], min_count=1, delimiter=b' ')

for sent in sentence_stream:
    bigrams_ = [b for b in bigram[sent] if b.count(' ') == 1]
    trigrams_ = [t for t in trigram[bigram[sent]] if t.count(' ') == 2]

    print(bigrams_)
    print(trigrams_)

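I removed the threshold = 1 parameter from the bigram Phrases because otherwise it seems to form odd bigrams, which in turn allow odd trigrams to be built (note that bigram is used to build the trigram Phrases); this parameter will probably be useful when you have more data. For the trigrams, the min_count parameter also needs to be specified, because it defaults to 5 if not provided.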

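To retrieve the bigrams and trigrams of each document, you can use this list comprehension trick to filter out the elements that are not made of exactly two or three words, respectively.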
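To make the filtering step concrete, here is a minimal standalone sketch. It does not call Gensim: the `tokens` list is a made-up example of what a phrase model might return for one sentence, and `ngrams_of_size` is a hypothetical helper name. It assumes the delimiter is a single space, as in the code above.

```python
def ngrams_of_size(tokens, n, delimiter=" "):
    """Keep only the tokens made of exactly n words (n - 1 delimiters)."""
    return [t for t in tokens if t.count(delimiter) == n - 1]


# Hypothetical phrase-model output for one sentence (not produced by Gensim here).
tokens = ["human computer interaction", "is", "a", "great", "and", "new subject"]

bigrams_ = ngrams_of_size(tokens, 2)
trigrams_ = ngrams_of_size(tokens, 3)

print(bigrams_)
print(trigrams_)
```

Single words and longer merged tokens (for example a 4-gram) are dropped by both filters, which is why a 4-gram never shows up in the trigram list.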


Edit - a few details about the threshold parameter:

The estimator uses this parameter to decide whether two words a and b form a phrase, which happens only if:

(count(a followed by b) - min_count) * N/(count(a) * count(b)) > threshold

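where N is the total vocabulary size. By default the parameter value is 10 (see the docs). So the higher the threshold, the harder it is for words to form a phrase.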
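As a rough illustration of the inequality above, the sketch below evaluates it in plain Python. The counts and N = 40 are invented numbers, and `phrase_score` and `is_phrase` are hypothetical helper names, not Gensim functions; the default arguments simply mirror the min_count = 5 and threshold = 10 defaults mentioned in this answer.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Left-hand side of the inequality: (count_ab - min_count) * N / (count_a * count_b)."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)


def is_phrase(count_a, count_b, count_ab, vocab_size, min_count=5, threshold=10.0):
    """True when the score is strictly greater than the threshold."""
    return phrase_score(count_a, count_b, count_ab, vocab_size, min_count) > threshold


# Invented counts: each word seen 5 times, the pair seen 5 times, N = 40.
counts = dict(count_a=5, count_b=5, count_ab=5, vocab_size=40)

score_min1 = phrase_score(min_count=1, **counts)  # (5 - 1) * 40 / 25
score_min5 = phrase_score(min_count=5, **counts)  # (5 - 5) * 40 / 25

print(score_min1, score_min5)
print(is_phrase(min_count=1, threshold=1.0, **counts))   # relaxed threshold
print(is_phrase(min_count=1, threshold=10.0, **counts))  # default threshold
print(is_phrase(**counts))                               # both defaults
```

With these numbers, a pair that occurs exactly min_count times scores zero, so it can never pass a positive threshold. That is one way to see why leaving min_count at 5 on a tiny corpus yields no phrases.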

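For example, in your first approach you used threshold = 1, so you would get ['human computer', 'interaction is'] as the bigrams for 3 of your 5 sentences that begin with "human computer interaction"; that odd second bigram is a result of the more relaxed threshold.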

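Then, when you try to get trigrams with the default threshold = 10, you only get ['human computer interaction is'] for those 3 sentences, and nothing for the remaining two (filtered out by the threshold); and because that is a 4-gram rather than a trigram, it is also filtered out by if t.count(' ') == 2. If, for instance, you lower the trigram threshold to 1, you can get ['human computer interaction'] as the trigram for the two remaining sentences. It does not seem easy to find a good combination of parameters; here is more about it.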
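The interaction between the two passes can be sketched with a simplified re-implementation in plain Python. This is not Gensim's code: `merge_pass` is a hypothetical helper, N is assumed to be the number of distinct tokens in the current pass, and the three-sentence corpus is a toy subset, so the scores differ from what Gensim computes on the full data.

```python
from collections import Counter


def merge_pass(sentences, min_count=1, threshold=1.0, delimiter=" "):
    """One simplified phrase pass: score adjacent pairs, merge greedily left to right."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    pairs = Counter(p for sent in sentences for p in zip(sent, sent[1:]))
    vocab_size = len(unigrams)  # assumption: N = number of distinct tokens

    def score(a, b):
        return (pairs[(a, b)] - min_count) * vocab_size / (unigrams[a] * unigrams[b])

    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and score(sent[i], sent[i + 1]) > threshold:
                out.append(sent[i] + delimiter + sent[i + 1])
                i += 2  # both tokens were consumed by the merge
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged


corpus = [s.split(" ") for s in [
    "human computer interaction is interesting",
    "human computer interaction is a great subject",
    "machine learning is useful",
]]

loose = merge_pass(corpus, threshold=1)   # relaxed first pass
strict = merge_pass(corpus, threshold=2)  # stricter first pass

print(loose[0])
print(merge_pass(loose, threshold=1)[0])   # second pass on the loose bigrams
print(strict[0])
print(merge_pass(strict, threshold=2)[0])  # second pass on the strict bigrams
```

On this toy corpus the relaxed first pass also merges "interaction is", so the second pass can only build the 4-gram "human computer interaction is". The stricter first pass leaves "interaction" alone, and the second pass then joins it to "human computer".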

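I am not an expert, so take this conclusion with a grain of salt: I think it is better to get good bigram results first (nothing like 'interaction is') before moving on, because odd bigrams add confusion to the subsequent trigrams, 4-grams...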

  • You're welcome! Yes, I edited the answer; I hope it is a bit clearer now. (2 upvotes)