dmi*_*mil 5 python nlp gensim topic-modeling
I'm just curious about how gensim's Dictionary is implemented. I have the following code:
from gensim import corpora

def build_dictionary(documents):
    # Build the token -> id mapping from the tokenized documents
    dictionary = corpora.Dictionary(documents)
    dictionary.save('/tmp/deerwester.dict')  # store the dictionary
    return dictionary
I looked at the file deerwester.dict, and it looks like this:
8002 6367 656e 7369 6d2e 636f 7270 6f72
612e 6469 6374 696f 6e61 7279 0a44 6963
7469 6f6e 6172 790a 7101 2981 7102 7d71
0328 5508 6e75 6d5f 646f 6373 7104 4b09
5508 ...
However, the following code
my_dict = corpora.Dictionary.load('/tmp/deerwester.dict')
print(my_dict.token2id)  # view dictionary
produces this:
{'minors': 30, 'generation': 22, 'testing': 16, 'iv': 29, 'engineering': 15, 'computer': 2, 'relation': 20, 'human': 3, 'measurement': 18, 'unordered': 25, 'binary': 21, 'abc': 0, 'ordering': 31, 'graph': 26, 'system': 10, 'machine': 6, 'quasi': 32, 'random': 23, 'paths': 28, 'error': 17, 'trees': 24, 'lab': 5, 'applications': 1, 'management': 14, 'user': 12, 'interface': 4, 'intersection': 27, 'response': 8, 'perceived': 19, 'widths': 34, 'well': 33, 'eps': 13, 'survey': 9, 'time': 11, 'opinion': 7}
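So my question is: since I don't see the actual words in the .dict file, what are all the hex values stored there? Is this some kind of super-compressed format? I'm curious because, if it is, I feel I should consider using it from now on.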
Take this example:
>>> from gensim import corpora
>>> docs = ["this is a foo bar", "you are a foo"]
>>> texts = [[i for i in doc.lower().split()] for doc in docs]
>>> print(texts)
[['this', 'is', 'a', 'foo', 'bar'], ['you', 'are', 'a', 'foo']]
>>> dictionary = corpora.Dictionary(texts)
>>> dictionary.save('foobar.txtdic')
If you save it with Dictionary.save_as_text() from gensim.corpora.dictionary (see https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/dictionary.py), you should get the following text file:
0 a 2
5 are 1
1 bar 1
2 foo 2
3 is 1
4 this 1
6 you 1
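To make the text layout concrete, here is a minimal sketch that parses the listing above back into a mapping. It assumes only what the listing shows: three whitespace-separated columns (id, token, document frequency). A real file may be tab-separated and, depending on the gensim version, may start with an extra header line, so treat this as an illustration rather than a replacement for gensim's own Dictionary.load_from_text().

```python
# The listing from the answer, copied verbatim.
listing = """\
0 a 2
5 are 1
1 bar 1
2 foo 2
3 is 1
4 this 1
6 you 1
"""

token2id = {}   # token -> integer id
doc_freq = {}   # token -> number of documents containing it

for line in listing.splitlines():
    # Assumed column order: id, token, document frequency
    token_id, token, freq = line.split()
    token2id[token] = int(token_id)
    doc_freq[token] = int(freq)

print(token2id)
print(doc_freq)
```

The point is that the text format is human-readable and trivially parseable, which is the trade-off against the binary file that save() writes.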
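If you use the default Dictionary.save(), it is saved as a pickled binary file instead, so the hex values you see are just the bytes of a pickle, not a special compressed format. See class SaveLoad(object) in https://github.com/piskvorky/gensim/blob/develop/gensim/utils.py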
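You can check this against the hex dump in the question using nothing but the standard library. The sketch below decodes the four complete rows of the dump (the truncated fifth row is left out) and prints the raw bytes; the variable names are mine.

```python
# First four rows of the hex dump quoted in the question.
hex_dump = """
8002 6367 656e 7369 6d2e 636f 7270 6f72
612e 6469 6374 696f 6e61 7279 0a44 6963
7469 6f6e 6172 790a 7101 2981 7102 7d71
0328 5508 6e75 6d5f 646f 6373 7104 4b09
"""

# Drop all whitespace, then turn the hex digits into bytes.
raw = bytes.fromhex("".join(hex_dump.split()))

# 0x80 is the pickle PROTO opcode; the next byte is the protocol version.
is_pickle_header = raw[0] == 0x80
protocol = raw[1]

print(raw)
print(is_pickle_header, protocol)
```

The decoded bytes start with the pickle protocol 2 header, followed by the readable module and class name gensim.corpora.dictionary / Dictionary and the attribute name num_docs. The words themselves sit further along in the file, past the part that was pasted.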
For more information on pickle, see http://docs.python.org/2/library/pickle.html#pickle-example
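As a rough illustration of what pickling does, here is a small standard-library sketch. The plain dict below is a hypothetical stand-in for a Dictionary's state, not gensim's actual internal layout; it only shows that the tokens are stored as ordinary bytes inside the pickle and come back unchanged on load.

```python
import pickle

# Hypothetical stand-in for the state of a gensim Dictionary.
state = {
    'num_docs': 2,
    'token2id': {'a': 0, 'bar': 1, 'foo': 2, 'is': 3, 'this': 4},
}

# Protocol 2 matches the 0x80 0x02 header seen in the question's dump.
blob = pickle.dumps(state, protocol=2)
restored = pickle.loads(blob)

print(blob[:2])            # pickle header bytes
print(b'foo' in blob)      # tokens appear as plain bytes, not compressed
print(restored == state)   # round trip gives back an equal object
```

So pickle is a serialization format rather than a compression scheme: convenient for saving and reloading Python objects, but not something to pick for file size.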