CoS*_*CoS 13 python text frequency
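I'm trying to speed up the word-frequency counting in my project. I have 360+ text files, and I need to get the total number of words in each, plus the number of times each word from a separate word list occurs. I know how to do this with a single text file.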
>>> import nltk
>>> import os
>>> import re
>>> os.chdir(r"C:\Users\Cameron\Desktop\PDF-to-txt")
>>> filename = "1976.03.txt"
>>> textfile = open(filename, "r")
>>> inputString = textfile.read()
>>> textfile.close()
>>> word_list = re.split(r'\s+', inputString.lower())
>>> print('Words in text: %d' % len(word_list))
# spits out the number of words in the text file
>>> word_list.count('inflation')
# spits out the number of times 'inflation' occurs in the text file
>>> word_list.count('jobs')
>>> word_list.count('output')
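Getting the individual frequencies of 'inflation', 'jobs', and 'output' one at a time gets too tedious. Can I put these words into a list and find the frequency of every word in the list at once? Basically this, in Python.
Example: instead of: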
>>> word_list.count('inflation')
3
>>> word_list.count('jobs')
5
>>> word_list.count('output')
1
I want to do this (I know this isn't real code; it's what I'm asking for help with):
>>> list1='inflation', 'jobs', 'output'
>>>word_list.count(list1)
'inflation', 'jobs', 'output'
3, 5, 1
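My word list will have 10-20 terms, so I need to be able to point Python at a word list to get the counts. It would also be nice if the output could be copied and pasted into an Excel spreadsheet, with the words as columns and the frequencies as rows.
Example: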
inflation, jobs, output
3, 5, 1
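Finally, can anyone help automate this for all of the text files? I figure I would just point Python at the folder, and it could do the word count above, from the new list, for each of the 360+ text files. It seems easy enough, but I'm a bit stuck. Any help?
Output like this would be great:
Filename1
inflation, jobs, output
3, 5, 1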
Filename2
inflation, jobs, output
7, 2, 4
Filename3
inflation, jobs, output
9, 3, 5
Thanks!
sot*_*pme 18
collections.Counter() has this covered, if I understand your problem.
The example from the documentation seems to match your problem.
# Tally occurrences of words in a list
from collections import Counter
cnt = Counter()
for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
    cnt[word] += 1
print(cnt)
# Find the ten most common words in Hamlet
import re
words = re.findall(r'\w+', open('hamlet.txt').read().lower())
Counter(words).most_common(10)
From the example above you should be able to do:
import re
import collections
words = re.findall(r'\w+', open('1976.03.txt').read().lower())
print(collections.Counter(words))
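To get from the full Counter to the handful of terms in the question, one option is to look up each wanted word in it. The following is a minimal sketch, not a definitive implementation: the sample `text`, the extra term `'wages'`, and the variable names are made up for illustration, and a string stands in for the file contents so that the snippet runs on its own. A Counter returns 0 for a key it has never seen, so absent terms still get a column.

```python
import re
from collections import Counter

# Made-up sample text standing in for the contents of one file.
text = "Inflation rose. Jobs fell, output fell, and inflation stayed high. Jobs, jobs."

# The terms to report, in the column order wanted for the spreadsheet.
# 'wages' is included only to show what happens to a term that never occurs.
wanted = ['inflation', 'jobs', 'output', 'wages']

# Count every word once, then look up only the wanted ones.
counts = Counter(re.findall(r'\w+', text.lower()))
row = [counts[w] for w in wanted]  # missing words come back as 0

print(', '.join(wanted))
print(', '.join(str(n) for n in row))
```

Joining with `'\t'` instead of `', '` should paste more cleanly into separate spreadsheet cells.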
Edit: a naive approach, to show one way.
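import re
from collections import Counter

# split into whole words so that `in` does not match substrings
wanted = set("fish chips steak".split())
cnt = Counter()
words = re.findall(r'\w+', open('1976.03.txt').read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)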
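One detail worth checking in a filter like this is what `in` is tested against. On a string, `in` does substring matching, while on a set or list it compares whole items. A small sketch of the difference, with variable names chosen only for illustration:

```python
wanted_string = "fish chips steak"
wanted_set = set(wanted_string.split())  # {'fish', 'chips', 'steak'}

# 'is' is not one of the wanted words, but it is a substring of 'fish'.
in_string = 'is' in wanted_string  # substring match against the whole string
in_set = 'is' in wanted_set        # whole-word match against the set members

print(in_string, in_set)
```

With the wanted terms held as a set, a word in the text such as 'is' or 'tea' is not counted by accident.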
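One possible implementation (using Counter)...
Rather than printing the output, I think it would be simpler to write to a csv file and import that into Excel. Have a look at http://docs.python.org/2/library/csv.html and replace print_summary.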
import os
from collections import Counter
import glob

def word_frequency(fileobj, words):
    """Build a Counter of specified words in fileobj"""
    # initialise the counter to 0 for each word
    ct = Counter(dict((w, 0) for w in words))
    file_words = (word for line in fileobj for word in line.split())
    filtered_words = (word for word in file_words if word in words)
    # add the observed counts, keeping the zero entries for absent words
    ct.update(filtered_words)
    return ct

def count_words_in_dir(dirpath, words, action=None):
    """For each .txt file in a dir, count the specified words"""
    for filepath in glob.iglob(os.path.join(dirpath, '*.txt')):
        with open(filepath) as f:
            ct = word_frequency(f, words)
        if action:
            action(filepath, ct)

def print_summary(filepath, ct):
    words = sorted(ct.keys())
    counts = [str(ct[k]) for k in words]
    print('{0}\n{1}\n{2}\n\n'.format(
        filepath,
        ', '.join(words),
        ', '.join(counts)))

words = set(['inflation', 'jobs', 'output'])
count_words_in_dir('./', words, action=print_summary)
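As a rough sketch of the csv suggestion, the `action` callback could write one row per file instead of printing. The helper name `make_csv_action`, the second filename, and the sample counts below are hypothetical and not part of the answer's code. The demo writes to an in-memory buffer and uses hand-made Counters so that it runs without any files on disk; it is written for Python 3.

```python
import csv
import io
from collections import Counter

def make_csv_action(writer, words):
    """Hypothetical helper: return a callback that writes one CSV row per file."""
    columns = sorted(words)
    writer.writerow(['filename'] + columns)  # header row, one column per word

    def action(filepath, ct):
        # Counter lookups return 0 for words that never occurred in the file
        writer.writerow([filepath] + [ct[w] for w in columns])

    return action

# Demo with made-up counts in place of real files.
buf = io.StringIO()
writer = csv.writer(buf, lineterminator='\n')
action = make_csv_action(writer, {'inflation', 'jobs', 'output'})
action('1976.03.txt', Counter({'inflation': 3, 'jobs': 5, 'output': 1}))
action('1976.04.txt', Counter({'inflation': 7, 'jobs': 2}))

csv_text = buf.getvalue()
print(csv_text)
```

For real use, the buffer would be replaced by a file opened with `open('counts.csv', 'w', newline='')` (the output filename is an assumption), and the callback passed as `action=make_csv_action(writer, words)` to `count_words_in_dir`.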