How should I train gensim on the Brown corpus?

Question

I'm trying to use gensim word2vec, but I can't train a model on the Brown corpus. Here is my code.

from gensim import models

model = models.Word2Vec([sentence for sentence in models.word2vec.BrownCorpus("E:\\nltk_data\\")],workers=4)
model.save("E:\\data.bin")

I downloaded nltk_data using nltk.download(). I get the following error.

C:\Python27\lib\site-packages\gensim-0.10.1-py2.7.egg\gensim\models\word2vec.py:401: UserWarning: Cython compilation failed, training will be slow. Do you have Cython installed? `pip install cython`
  warnings.warn("Cython compilation failed, training will be slow. Do you have Cython installed? `pip install cython`")
Traceback (most recent call last):
  File "E:\eclipse_workspace\Python_files\Test\Test.py", line 8, in <module>
    model = models.Word2Vec([sentence for sentence in models.word2vec.BrownCorpus("E:\\nltk_data\\")],workers=4)
  File "C:\Python27\lib\site-packages\gensim-0.10.1-py2.7.egg\gensim\models\word2vec.py", line 276, in __init__
    self.train(sentences)
  File "C:\Python27\lib\site-packages\gensim-0.10.1-py2.7.egg\gensim\models\word2vec.py", line 407, in train
    raise RuntimeError("you must first build vocabulary before training the model")
RuntimeError: you must first build vocabulary before training the model

What am I doing wrong?

Answer 1

Jas*_*yne 10

Maybe you're building the sentences the wrong way.
Try this; it works for me.

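import gensim
import logging
from nltk.corpus import brown

# Optional: log training progress to the console.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# brown.sents() yields each sentence as a list of word tokens.
sentences = brown.sents()
# min_count=1 keeps every word, including those seen only once.
model = gensim.models.Word2Vec(sentences, min_count=1)
model.save('/tmp/brown_model')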

The logging part isn't required, and you can change the Word2Vec() parameters as needed.
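
One possible explanation for the original error, offered as an assumption rather than something confirmed in this thread: if the corpus reader only looks at files sitting directly inside the directory it is given, then pointing it at the nltk_data root (which holds subfolders, not corpus files) would yield no sentences, and an empty vocabulary would follow. The sketch below illustrates that idea with the standard library only. `iter_sentences` and `peek` are hypothetical helpers written for this example, not part of gensim or NLTK, and the folder layout is a throwaway imitation.

```python
import os
import tempfile
from itertools import islice


def iter_sentences(dirname):
    """Yield one token list per line from the regular files directly inside dirname.

    Hypothetical stand-in for a directory-based corpus reader. It skips
    subdirectories instead of descending into them (an assumption).
    """
    for name in sorted(os.listdir(dirname)):
        path = os.path.join(dirname, name)
        if not os.path.isfile(path):
            continue  # a subdirectory such as "corpora" contributes nothing
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:
                    yield tokens


def peek(sentences, n=2):
    """Return the first n sentences so an empty corpus is caught before training."""
    return list(islice(sentences, n))


# Build a throwaway layout that mimics <root>/corpora/brown/<file>.
root = tempfile.mkdtemp()
brown_dir = os.path.join(root, "corpora", "brown")
os.makedirs(brown_dir)
with open(os.path.join(brown_dir, "ca01"), "w", encoding="utf-8") as handle:
    handle.write("the/at jury/nn said/vbd\n")

from_root = peek(iter_sentences(root))        # root holds only a subdirectory
from_brown = peek(iter_sentences(brown_dir))  # this folder holds the corpus file

print(from_root)   # empty: nothing to build a vocabulary from
print(from_brown)  # one tokenized sentence
```

If that is what is happening, running a similar check on your own sentence source before calling Word2Vec() should show an empty list, and passing the folder that actually contains the Brown files (typically corpora\brown inside nltk_data) instead of the nltk_data root may be enough.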

Archived:	11 years ago
Views:	2810
Last activity:	8 years ago