I have two files, doc1.txt and doc2.txt. The contents of the two files are as follows:
#doc1.txt
very good, very bad, you are great
#doc2.txt
very bad, good restaurent, nice place to visit
I want my corpus to be split on these comma-separated phrases, so that my final DocumentTermMatrix becomes:
      terms
docs  very good  very bad  you are great  good restaurent  nice place to visit
doc1  tf-idf     tf-idf    tf-idf         0                0
doc2  0          tf-idf    0              tf-idf           tf-idf
I know how to compute the DocumentTermMatrix for individual words in Python (using http://scikit-learn.org/stable/modules/feature_extraction.html), but I don't know how to compute the DocumentTermMatrix for these multi-word strings.
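For example, the word-level matrix I can already build looks roughly like this (a minimal sketch using the default analyzer of TfidfVectorizer):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['very good, very bad, you are great',
        'very bad, good restaurent, nice place to visit']

# The default analyzer tokenizes on word boundaries, so every single word
# becomes a feature instead of the comma-separated phrases I want.
word_tfidf = TfidfVectorizer().fit(docs)
print(word_tfidf.get_feature_names())
# ['are', 'bad', 'good', 'great', 'nice', 'place', 'restaurent', 'to', 'very', 'visit', 'you']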
You can pass the analyzer parameter of TfidfVectorizer a function that extracts features in your own custom way:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['very good, very bad, you are great',
        'very bad, good restaurent, nice place to visit']

# Split each document on ', ' so that whole phrases become the features.
tfidf = TfidfVectorizer(analyzer=lambda d: d.split(', ')).fit(docs)
print(tfidf.get_feature_names())
The resulting features are:
['good restaurent', 'nice place to visit', 'very bad', 'very good', 'you are great']
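To get the full matrix from the question, transform the corpus with the fitted vectorizer; a minimal sketch, assuming pandas is available for the tabular layout:

import pandas as pd

# Rows are the documents, columns are the phrase features listed above;
# the non-zero entries hold the tf-idf weights.
matrix = tfidf.transform(docs)
dtm = pd.DataFrame(matrix.toarray(),
                   index=['doc1', 'doc2'],
                   columns=tfidf.get_feature_names())  # get_feature_names_out() on newer scikit-learn
print(dtm)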
If you really cannot afford to load all the data into memory, here is a workaround:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['doc1.txt', 'doc2.txt']

def extract(filename):
    # Read one file at a time and split each line on ', ' into phrase features.
    with open(filename) as f:
        features = []
        for line in f:
            features += line.strip().split(', ')
        return features

tfidf = TfidfVectorizer(analyzer=extract).fit(docs)
print(tfidf.get_feature_names())
It loads one document at a time instead of keeping all documents in memory at once.
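A short usage sketch: transform() calls extract() again for every filename, so the files are still read one at a time while the matrix is built.

# Two documents by five phrase features; the entries are tf-idf weights.
matrix = tfidf.transform(docs)
print(matrix.shape)
print(matrix.toarray())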