Calculating tf-idf of strings

python tf-idf scikit-learn

I have two files, doc1.txt and doc2.txt. Their contents are as follows:

 #doc1.txt
 very good, very bad, you are great

 #doc2.txt
 very bad, good restaurent, nice place to visit

I want my corpus to be split on commas, so that my final DocumentTermMatrix becomes:

      terms
 docs       very good      very bad        you are great   good restaurent   nice place to visit
 doc1       tf-idf          tf-idf         tf-idf          0                    0
 doc2       0               tf-idf         0               tf-idf             tf-idf

I know how to compute a DocumentTermMatrix of individual words (using http://scikit-learn.org/stable/modules/feature_extraction.html), but I don't know how to compute a DocumentTermMatrix of whole strings (phrases) in Python.


You can pass a function as the analyzer argument of TfidfVectorizer to extract features in a custom way:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['very good, very bad, you are great',
        'very bad, good restaurent, nice place to visit']

tfidf = TfidfVectorizer(analyzer=lambda d: d.split(', ')).fit(docs)
print(list(tfidf.get_feature_names_out()))  # get_feature_names() in scikit-learn < 1.0

The resulting features are:

['good restaurent', 'nice place to visit', 'very bad', 'very good', 'you are great']

If you really cannot afford to load all the data into memory at once, here is a workaround:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['doc1.txt', 'doc2.txt']

def extract(filename):
    with open(filename) as f:
        features = []
        for line in f:
            features += line.strip().split(', ')
        return features

tfidf = TfidfVectorizer(analyzer=extract).fit(docs)
print(list(tfidf.get_feature_names_out()))  # get_feature_names() in scikit-learn < 1.0

This loads one document at a time, instead of holding all the documents in memory at once.
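Putting the file-based version together end-to-end, the sketch below first writes the two example files (contents taken from the question; writing them inline is only to make the demo self-contained) and then builds the matrix:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Recreate the example files from the question (demo setup only)
with open('doc1.txt', 'w') as f:
    f.write('very good, very bad, you are great\n')
with open('doc2.txt', 'w') as f:
    f.write('very bad, good restaurent, nice place to visit\n')

docs = ['doc1.txt', 'doc2.txt']

def extract(filename):
    # Open one file at a time and split each line on ', '
    with open(filename) as f:
        features = []
        for line in f:
            features += line.strip().split(', ')
        return features

# The analyzer receives each filename and reads the file itself,
# so only one document's text is in memory at any moment
tfidf = TfidfVectorizer(analyzer=extract).fit(docs)
X = tfidf.transform(docs)
print(X.shape)  # two documents by five comma-separated phrases
```

The vocabulary and weights are identical to the in-memory version; only the loading strategy differs.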