Alg*_*Man 30 python sorting word-frequency
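I have to count word frequencies in a text using Python. My plan is to store the words in a dictionary and keep a count for each one.
Now, if I have to sort the words by number of occurrences, can I do it with that same dictionary, instead of building a new dictionary that has the count as the key and a list of words as the value?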
jat*_*ism 54
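Warning: this example requires Python 2.7 or later.
The Counter class in Python's standard collections module is exactly what you need. Counting words is even the first example in the documentation: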
>>> # Tally occurrences of words in a list
>>> from collections import Counter
>>> cnt = Counter()
>>> for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
...     cnt[word] += 1
...
>>> cnt
Counter({'blue': 3, 'red': 2, 'green': 1})
As pointed out in the comments, Counter accepts an iterable, so the example above is purely illustrative and is equivalent to:
>>> mywords = ['red', 'blue', 'red', 'green', 'blue', 'blue']
>>> cnt = Counter(mywords)
>>> cnt
Counter({'blue': 3, 'red': 2, 'green': 1})
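Since the question is about ordering words by how often they occur, it may help to see `Counter.most_common()`, which returns the entries already sorted from most to least frequent. A minimal sketch, reusing the `mywords` list from the example above (the names `ranked` and `top_two` are only illustrative):

```python
from collections import Counter

mywords = ['red', 'blue', 'red', 'green', 'blue', 'blue']
cnt = Counter(mywords)

# most_common() returns (word, count) pairs, highest count first.
ranked = cnt.most_common()
print(ranked)    # [('blue', 3), ('red', 2), ('green', 1)]

# Passing n keeps only the n most frequent entries.
top_two = cnt.most_common(2)
print(top_two)   # [('blue', 3), ('red', 2)]
```

No second dictionary is needed: the Counter itself stays keyed by word, and the sorted view is produced on demand.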
Fré*_*idi 22
You can use the same dictionary:
>>> d = { "foo": 4, "bar": 2, "quux": 3 }
>>> sorted(d.items(), key=lambda item: item[1])
The second line prints:
[('bar', 2), ('quux', 3), ('foo', 4)]
If you only need the sorted list of words, do this:
>>> [pair[0] for pair in sorted(d.items(), key=lambda item: item[1])]
That line prints:
['bar', 'quux', 'foo']
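The calls above sort in ascending order of count. If the most frequent words should come first, one option is to sort on a tuple key, as in the sketch below. It assumes ties should fall back to alphabetical order; the extra `"baz"` entry is hypothetical and is only there to show a tie.

```python
# "baz" is a made-up entry added to demonstrate tie-breaking.
d = {"foo": 4, "bar": 2, "quux": 3, "baz": 3}

# Negating the count sorts highest first; the word itself breaks ties
# alphabetically.
by_count_desc = sorted(d.items(), key=lambda item: (-item[1], item[0]))
print(by_count_desc)   # [('foo', 4), ('baz', 3), ('quux', 3), ('bar', 2)]

# Keep only the words, in the same order.
words_only = [word for word, _ in by_count_desc]
print(words_only)      # ['foo', 'baz', 'quux', 'bar']
```

Using `reverse=True` with `key=lambda item: item[1]` would also put the highest counts first, but it leaves tied words in their original dictionary order rather than alphabetical order.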
I just wrote a similar program, with help from the people on Stack Overflow:
from string import punctuation
from operator import itemgetter

N = 100
words = {}

words_gen = (word.strip(punctuation).lower() for line in open("poi_run.txt")
                                             for word in line.split())

for word in words_gen:
    words[word] = words.get(word, 0) + 1

top_words = sorted(words.items(), key=itemgetter(1), reverse=True)[:N]

for word, frequency in top_words:
    print("%s %d" % (word, frequency))
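For comparison, here is a rough sketch of the same idea written with `Counter`. It is not the program above: it works on an in-memory string instead of a file, and it swaps the split-and-strip step for a regular expression. The function name `top_words`, the sample sentence, and the definition of a "word" as a run of letters, digits, or apostrophes are all assumptions made for illustration.

```python
import re
from collections import Counter


def top_words(text, n):
    """Return the n most frequent lowercase words in text as (word, count) pairs."""
    # Assumption: a "word" is a run of ASCII letters, digits, or apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words).most_common(n)


# Hypothetical sample text, used only to exercise the function.
sample = "The cat saw the dog. The dog saw nothing."
for word, frequency in top_words(sample, 3):
    print("%s %d" % (word, frequency))
```

To run it against a file, the text could be read first with something like `open(path).read()`; for very large files, feeding the Counter line by line would use less memory.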
>>> d = {'a': 3, 'b': 1, 'c': 2, 'd': 5, 'e': 0}
>>> l = list(d.items())  # list() is required on Python 3, where items() returns a view
>>> l.sort(key = lambda item: item[1])
>>> l
[('e', 0), ('b', 1), ('c', 2), ('a', 3), ('d', 5)]
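You can do this in two steps with Counter and defaultdict from the Python 2.7 collections module. First, use Counter to create a dictionary in which each word is a key with its associated frequency count. This part is fairly trivial.
Second, defaultdict can be used to create an inverted (reverse) dictionary, where the keys are the frequencies of occurrence and the associated values are lists of the word or words that occurred that many times. Here is what I mean: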
from collections import Counter, defaultdict

wordlist = ['red', 'yellow', 'blue', 'red', 'green', 'blue', 'blue', 'yellow']

# invert a temporary Counter(wordlist) dictionary so keys are
# frequency of occurrence and values are lists of the words encountered
freqword = defaultdict(list)
for word, freq in Counter(wordlist).items():
    freqword[freq].append(word)

# print in order of occurrence (with sorted list of words)
for freq in sorted(freqword):
    print('count {}: {}'.format(freq, sorted(freqword[freq])))
Output:
count 1: ['green']
count 2: ['red', 'yellow']
count 3: ['blue']
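The two steps above can also be wrapped in a small helper and read from the most frequent group down. This is only a sketch: the function name `words_by_frequency` and the choice to return a plain dict are assumptions, not part of the answer above.

```python
from collections import Counter, defaultdict


def words_by_frequency(words):
    """Map each count to the sorted list of words seen that many times."""
    inverted = defaultdict(list)
    for word, freq in Counter(words).items():
        inverted[freq].append(word)
    # Return a plain dict so that looking up a missing count raises KeyError
    # instead of silently creating an empty list.
    return {freq: sorted(group) for freq, group in inverted.items()}


wordlist = ['red', 'yellow', 'blue', 'red', 'green', 'blue', 'blue', 'yellow']
grouped = words_by_frequency(wordlist)

# Walk the counts from highest to lowest.
for freq in sorted(grouped, reverse=True):
    print('count {}: {}'.format(freq, grouped[freq]))
```

Grouping like this is mainly useful when words that share a count need to be handled together; if a flat ranking is enough, sorting the items of the original dictionary avoids building the second structure.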