aja*_*ahu 4 python word-count python-2.7
I have a dataset like the following:
"485","AlterNet","Statistics","Estimation","Narnia","Two and half men"
"717","I like Sheen", "Narnia", "Statistics", "Estimation"
"633","MachineLearning","AI","I like Cars, but I also like bikes"
"717","I like Sheen","MachineLearning", "regression", "AI"
"136","MachineLearning","AI","TopGear"
and so on.
I want to find the most frequently occurring pairs of words, for example:
(Statistics,Estimation:2)
(Statistics,Narnia:2)
(Narnia,Statistics)
(MachineLearning,AI:3)
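The two words can be in any order and at any distance from each other.
Can someone suggest a possible solution in Python? This is a very large dataset.
Any suggestion is much appreciated.
So here is what I tried after @275365's suggestion.
@275365 I tried the following, reading the input from a file: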
def collect_pairs(file):
    pair_counter = Counter()
    for line in open(file):
        unique_tokens = sorted(set(line))
        combos = combinations(unique_tokens, 2)
        pair_counter += Counter(combos)
    print pair_counter
file = ('myfileComb.txt')
p=collect_pairs(file)
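The text file has the same number of lines as the original, but each line contains only unique tokens. I don't know what I'm doing wrong: when I run this, it splits the input into individual letters instead of outputting combinations of words.
You could start with something like this, depending on the size of your corpus: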
>>> from itertools import combinations
>>> from collections import Counter
>>> def collect_pairs(lines):
...     pair_counter = Counter()
...     for line in lines:
...         unique_tokens = sorted(set(line))  # exclude duplicates in the same line and sort so one word always comes before the other
...         combos = combinations(unique_tokens, 2)
...         pair_counter += Counter(combos)
...     return pair_counter
The result:
>>> t2 = [['485', 'AlterNet', 'Statistics', 'Estimation', 'Narnia', 'Two and half men'], ['717', 'I like Sheen', 'Narnia', 'Statistics', 'Estimation'], ['633', 'MachineLearning', 'AI', 'I like Cars, but I also like bikes'], ['717', 'I like Sheen', 'MachineLearning', 'regression', 'AI'], ['136', 'MachineLearning', 'AI', 'TopGear']]
>>> pairs = collect_pairs(t2)
>>> pairs.most_common(3)
[(('AI', 'MachineLearning'), 3), (('717', 'I like Sheen'), 2), (('Estimation', 'Statistics'), 2)]
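Do you want the numbers included in these combinations? Since you didn't specifically mention excluding them, I have included them here.
Edit: working with a file object
The function you posted in your first attempt above is very close to working. The only thing you need to do is convert each line (which is a string) into a tuple or list; iterating over a string yields its individual characters, which is why you got letters instead of words. Assuming your data looks exactly like what you posted above (with quotes around each term and the terms separated by commas), I would suggest a simple fix: you can use ast.literal_eval. (Otherwise, you may need to use some kind of regular expression.) See the modified version using ast.literal_eval below: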
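If the numeric IDs should be left out, one option is to drop them before building the pairs. The following is only a minimal sketch and not part of the original answer: the name collect_pairs_without_numbers is hypothetical, and it assumes an ID is any token made up entirely of digits.

```python
from itertools import combinations
from collections import Counter


def collect_pairs_without_numbers(lines):
    """Count co-occurring token pairs, skipping digit-only tokens.

    Assumption (not from the original post): an ID is any token that
    consists only of digits, such as '485' or '717'.
    """
    pair_counter = Counter()
    for line in lines:
        # Deduplicate within the line, drop digit-only tokens, and sort so
        # that each pair is always stored in the same order.
        unique_tokens = sorted(set(t for t in line if not t.isdigit()))
        pair_counter.update(combinations(unique_tokens, 2))
    return pair_counter
```

Called on the same list of lists as above, this should count only pairs of non-numeric terms.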
from itertools import combinations
from collections import Counter
import ast
def collect_pairs(file_name):
    pair_counter = Counter()
    for line in open(file_name):  # each line is simply one long string; you need a list or tuple
        unique_tokens = sorted(set(ast.literal_eval(line)))  # literal_eval converts each line into a tuple before the tuple is converted to a set
        combos = combinations(unique_tokens, 2)
        pair_counter += Counter(combos)
    return pair_counter  # return the actual Counter object
Now you can test it like this:
file_name = 'myfileComb.txt'
p = collect_pairs(file_name)
print p.most_common(10) # for example
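As an alternative to ast.literal_eval, the standard csv module can parse quoted, comma-separated lines like the ones in the question. The sketch below is an assumption-laden variant, not the approach from the answer: the name collect_pairs_csv is hypothetical, and it assumes every term is wrapped in double quotes as in the sample data. It also uses Counter.update, which adds counts to the existing Counter in place and may help on a very large file.

```python
import csv
from itertools import combinations
from collections import Counter


def collect_pairs_csv(file_name):
    """Count co-occurring token pairs in a quoted, comma-separated file.

    Assumption: the file looks like the sample in the question, with
    every term wrapped in double quotes.
    """
    pair_counter = Counter()
    with open(file_name) as handle:
        # skipinitialspace=True tolerates the space that follows some
        # commas in the sample data, e.g. "717","I like Sheen", "Narnia"
        reader = csv.reader(handle, skipinitialspace=True)
        for row in reader:
            # Deduplicate and sort so each pair has one canonical order.
            unique_tokens = sorted(set(row))
            # update() adds these pairs to the existing Counter in place.
            pair_counter.update(combinations(unique_tokens, 2))
    return pair_counter
```

Because the csv reader honours the quotes, a term such as "I like Cars, but I also like bikes" stays a single token even though it contains a comma.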