Ada*_*m_G 2 python nltk frequency-distribution
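I'm trying to split a list of words (a tokenized string) into every possible substring. I then want to run a FreqDist over the substrings to find the most common one. The first part works fine. However, when I run FreqDist, I get this error: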
TypeError: unhashable type: 'list'
Here is my code:
import nltk

string = ['This','is','a','sample']
substrings = []
count1 = 0
count2 = 0
for word in string:
    while count2 <= len(string):
        if count1 != count2:
            temp = string[count1:count2]
            substrings.append(temp)
        count2 += 1
    count1 += 1
    count2 = count1
print substrings
fd = nltk.FreqDist(substrings)
print fd
The output of substrings is fine. Here it is:
[['This'], ['This', 'is'], ['This', 'is', 'a'], ['This', 'is', 'a', 'sample'], ['is'], ['is', 'a'], ['is', 'a', 'sample'], ['a'], ['a', 'sample'], ['sample']]
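However, I just can't get FreqDist to run on it. Any insight would be appreciated. In this case each substring would only have a count of 1, but the program is meant to run on a much larger sample of text.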
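I'm not completely sure what you want, but the error message says it is trying to hash the list, which is usually a sign that the list is being put into a set or used as a dictionary key. We can get around this by giving it tuples instead.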
>>> import nltk
>>> import itertools
>>>
>>> sentence = ['This','is','a','sample']
>>> contiguous_subs = [sentence[i:j] for i,j in itertools.combinations(xrange(len(sentence)+1), 2)]
>>> contiguous_subs
[['This'], ['This', 'is'], ['This', 'is', 'a'], ['This', 'is', 'a', 'sample'],
['is'], ['is', 'a'], ['is', 'a', 'sample'], ['a'], ['a', 'sample'],
['sample']]
But we still get:
>>> fd = nltk.FreqDist(contiguous_subs)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/probability.py", line 107, in __init__
    self.update(samples)
  File "/usr/local/lib/python2.7/dist-packages/nltk/probability.py", line 437, in update
    self.inc(sample, count=count)
  File "/usr/local/lib/python2.7/dist-packages/nltk/probability.py", line 122, in inc
    self[sample] = self.get(sample,0) + count
TypeError: unhashable type: 'list'
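The last frame of the traceback, `self[sample] = self.get(sample,0) + count`, is a dictionary-style assignment with the sample as the key. A minimal sketch of the same step using a plain dict, written for Python 3 and independent of NLTK, shows why a list fails where a tuple does not. The names `counts` and `try_count` are made up for illustration.

```python
# A plain dict stands in for the frequency table (illustrative, not NLTK's class).
counts = {}

def try_count(sample):
    """Mimic the counting step from the traceback; return True on success."""
    try:
        counts[sample] = counts.get(sample, 0) + 1
        return True
    except TypeError:
        # Raised for unhashable keys such as lists.
        return False

list_ok = try_count(['This', 'is'])     # a list is mutable, so it cannot be hashed
tuple_ok = try_count(('This', 'is'))    # a tuple of strings is hashable

print(list_ok, tuple_ok, counts)
```

Lists are mutable, so Python does not define a hash for them; a tuple whose elements are all hashable can be used as a key.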
However, if we make the subsequences tuples instead:
>>> contiguous_subs = [tuple(sentence[i:j]) for i,j in itertools.combinations(xrange(len(sentence)+1), 2)]
>>> contiguous_subs
[('This',), ('This', 'is'), ('This', 'is', 'a'), ('This', 'is', 'a', 'sample'), ('is',), ('is', 'a'), ('is', 'a', 'sample'), ('a',), ('a', 'sample'), ('sample',)]
>>> fd = nltk.FreqDist(contiguous_subs)
>>> print fd
<FreqDist: ('This',): 1, ('This', 'is'): 1, ('This', 'is', 'a'): 1, ('This', 'is', 'a', 'sample'): 1, ('a',): 1, ('a', 'sample'): 1, ('is',): 1, ('is', 'a'): 1, ('is', 'a', 'sample'): 1, ('sample',): 1>
Is this what you were looking for?
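For finding the most common subsequence in a larger sample, here is a rough Python 3 sketch of the same tuple approach using only the standard library. It swaps `nltk.FreqDist` for `collections.Counter` (in NLTK 3, `FreqDist` is a subclass of `Counter`, so passing the same tuples to `nltk.FreqDist` should behave similarly). The helper name `contiguous_subsequences` and the sample tokens are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def contiguous_subsequences(tokens):
    """Return every contiguous, non-empty slice of tokens as a tuple."""
    # Each pair (i, j) with i < j picks out the slice tokens[i:j].
    return [tuple(tokens[i:j])
            for i, j in combinations(range(len(tokens) + 1), 2)]

# Hypothetical sample with repeated words, so some counts exceed 1.
tokens = ['a', 'b', 'a', 'b']

counts = Counter(contiguous_subsequences(tokens))

# The three most frequent subsequences with their counts.
for subsequence, count in counts.most_common(3):
    print(subsequence, count)
```

A list of n tokens has n(n+1)/2 contiguous slices, so the number of keys grows quadratically with the length of the text; on a large sample it may be worth capping the slice length.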