Word2vec Gensim accuracy analysis

Sam*_*Sam 6 python nlp gensim word2vec

I'm working on an NLP application where I have a corpus of text files. I want to create word vectors with the Gensim word2vec algorithm.

I made a 90% training / 10% testing split. I trained the model on the training set, but now I'd like to evaluate the model's accuracy on the test set.

I've searched the documentation online for anything about accuracy evaluation, but I can't find a method that lets me do this. Does anyone know of a function that performs accuracy analysis?

The way I handled the test data was to take all the sentences from the text files in the testing folder and turn them into one big list of sentences. After that I used a function I thought was right (it turned out it wasn't, because it gives me this error: TypeError: don't know how to handle uri). Here's how I went about it:

import glob
import re

import nltk.data

# Punkt sentence tokenizer from NLTK (the original snippet uses an otherwise
# undefined `tokenizer`; this is the usual way to load it)
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

test_filenames = glob.glob('./testing/*.txt')

print("Found corpus of %s safety/incident reports:" % len(test_filenames))

# Read every test file into one big string
test_corpus_raw = u""
for text_file in test_filenames:
    with open(text_file, 'r') as txt_file:
        test_corpus_raw += txt_file.read()
print("Test Corpus is now {0} characters long".format(len(test_corpus_raw)))

# Split the raw text into sentences
test_raw_sentences = tokenizer.tokenize(test_corpus_raw)

def sentence_to_wordlist(raw):
    # Keep only letters, then split on whitespace
    clean = re.sub("[^a-zA-Z]", " ", raw)
    words = clean.split()
    return words

test_sentences = []
for raw_sentence in test_raw_sentences:
    if len(raw_sentence) > 0:
        test_sentences.append(sentence_to_wordlist(raw_sentence))

test_token_count = sum([len(sentence) for sentence in test_sentences])
print("The test corpus contains {0:,} tokens".format(test_token_count))


####### THIS LAST LINE PRODUCES AN ERROR: TypeError: don't know how to handle uri
# texts2vec is the Word2Vec model trained earlier on the training split
texts2vec.wv.accuracy(test_sentences, case_insensitive=True)

I don't know how to fix this last part. Please help. Thanks in advance!

goj*_*omo 8

The accuracy() method of gensim word-vector models (now deprecated in favor of evaluate_word_analogies()) doesn't take your own texts as input. It expects a file of word-analogy challenges in a specific format, conventionally named questions-words.txt.

This is a popular way of testing general-purpose word vectors, dating back to Google's original Word2Vec paper and code release.
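For example, a minimal sketch of that evaluation, assuming the trained model is the texts2vec object from the question and using the copy of questions-words.txt bundled with gensim's test data:

from gensim.test.utils import datapath

# Score the trained vectors on the standard Google analogy questions.
# texts2vec is assumed to be the Word2Vec model trained on the training split.
overall_score, sections = texts2vec.wv.evaluate_word_analogies(
    datapath('questions-words.txt'), case_insensitive=True)

print("Overall analogy accuracy: {:.2%}".format(overall_score))

Each entry in sections breaks the results down by analogy category (capitals, family relations, grammatical forms, and so on).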

However, this evaluation doesn't necessarily indicate which word-vectors will be best for your needs. (For example, it's possible for a set of word-vectors to score better on these kinds of analogies, but be worse for a specific classification or info-retrieval goal.)

For vectors that are good for your own purposes, you should devise a task-specific evaluation: one that gives a score correlated with success on your final goal.
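As a rough illustration for a corpus of safety/incident reports, such a check could be as small as a hand-picked list of domain word pairs you expect to be close; the pairs below are made-up placeholders, and the scoring just uses gensim's built-in cosine similarity:

# Hypothetical probe pairs that should be similar in a safety-report corpus;
# replace them with pairs that matter for your application.
probe_pairs = [('injury', 'accident'), ('spill', 'leak'), ('forklift', 'truck')]

for w1, w2 in probe_pairs:
    if w1 in texts2vec.wv and w2 in texts2vec.wv:
        print(w1, w2, texts2vec.wv.similarity(w1, w2))
    else:
        print(w1, w2, "not in vocabulary")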

Also, note that as the product of an unsupervised algorithm, word-vectors don't necessarily need a held-out test set to be evaluated. You generally want to use as much data as possible to train the word-vectors, ensuring maximal vocabulary coverage with the most examples per word. Then you might test the word-vectors against some external standard, like the analogy questions, which weren't part of the training set at all.

Or, you'd just use the word-vectors as an additional input to some downstream task you're testing, and on that downstream task you'd withhold a test set from whatever is used to train the supervised algorithm. That ensures your supervised method isn't just memorizing/overfitting the labeled inputs, and gives you an indirect quality signal about whether the word-vector set helped the downstream task or not. (And that word-vector set could be compared against others based on how well they help that other supervised task, rather than by holding data out of their own unsupervised training step.)
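One way that might look in practice, sketched with scikit-learn and hypothetical documents/labels lists (averaging word vectors into a document vector is just one simple choice of feature):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def doc_vector(tokens, wv):
    # Average the vectors of in-vocabulary tokens; zeros if none are known.
    vecs = [wv[w] for w in tokens if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

# documents is a list of token lists and labels their class labels (both
# hypothetical); the word vectors themselves were trained on all the text.
X = np.array([doc_vector(doc, texts2vec.wv) for doc in documents])
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.1)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Held-out accuracy with these word vectors:", clf.score(X_test, y_test))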

  • Thanks for the excellent answer! (2 upvotes)