Mas*_*yaf -5 python python-3.x
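I have a text file with 4623 lines, where each entry is a string of 0s and 1s (e.g. 01010111). I compare the entries character by character. I have several datasets, with string lengths of 100, 1000 and 10,000. The 1000-length set takes 25 hours to compute and the 10,000-length set takes 60 hours. Is there a way to speed this up? I tried the multiprocessing library, but it just duplicates values. Maybe I am using it wrong. Code: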
f = open("/path/to/file/file.txt", 'r')
l = [s.strip('\n') for s in f]
f.close()
for a in range(0, len(l)):
    for b in range(0, len(l)):
        if (a < b):
            result = 0
            if (a == b):
                result = 1
            else:
                counter = 0
                for i in range(len(l[a])):
                    if (int(l[a][i]) == int(l[b][i]) == 1):
                        counter += 1
                result = counter / 10000
            print((a + 1), (b + 1), result)
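I am new to Python, so I think this code needs some optimization. Any help would be appreciated. Thanks in advance.
The way you count the positions where both strings have a 1 is very slow. Here is a simple example: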
In [24]: a = '1010' * 2500
In [25]: b = '1100' * 2500
In [27]: def test1():
   ....:     counter = 0
   ....:     for i in range(len(a)):
   ....:         if int(a[i]) == int(b[i]) == 1:
   ....:             counter += 1
   ....:     return counter
In [28]: %timeit test1()
100 loops, best of 3: 4.07 ms per loop
Compare that with using something that represents your strings of 1s and 0s as actual bits, here the third-party bitarray package:
In [29]: aba = bitarray(a)
In [30]: bba = bitarray(b)
In [31]: def test2():
   ....:     return (aba & bba).count()
   ....:
In [32]: %timeit test2()
100000 loops, best of 3: 1.99 µs per loop
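That is about 2045 times faster. So the question is not how to speed up Python, but rather "what data structure should I use?".
Using bitarray on a file with 10,000 lines of 100 ones and zeros each, which is not the worst case, but still (combinations here is itertools.combinations):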
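If installing bitarray is not an option, the same idea can be sketched with Python's built-in integers: parse each string once with `int(s, 2)`, AND the two numbers, and count the set bits. This is only an illustration of the "use real bits" point, not the approach timed above, and `count_common_ones` is a made-up helper name.

```python
def count_common_ones(x: int, y: int) -> int:
    """Count the bit positions where both x and y have a 1."""
    # bin() renders the AND result as a '0b...' string; counting the
    # '1' characters in it is a portable population count.
    return bin(x & y).count("1")


a = "1010" * 2500
b = "1100" * 2500

# Parse each string once, outside of any comparison loop.
a_bits = int(a, 2)
b_bits = int(b, 2)

common = count_common_ones(a_bits, b_bits)
```

On Python 3.10 or later, `(x & y).bit_count()` should give the same count without building the intermediate string.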
In [22]: from bitarray import bitarray
In [23]: testdata = open('teststrs.txt')
In [24]: l = [bitarray(line.rstrip()) for line in testdata]
In [25]: len(l)
Out[25]: 10000
In [26]: len(l[0])
Out[26]: 100
In [27]: combs = combinations(l, 2)
In [28]: %time res = [(a & b[:len(a)]).count() for a, b in combs]
CPU times: user 1min 14s, sys: 396 ms, total: 1min 15s
Wall time: 1min 15s
Or using product, as in your example code:
In [30]: from itertools import product
In [31]: prod = product(l, repeat=2)
In [32]: %time res = [(a & b[:len(a)]).count() for a, b in prod]
CPU times: user 2min 51s, sys: 628 ms, total: 2min 52s
Wall time: 2min 51s
Note:
I skipped your result handling, because you did not explain it and it contains dead code:
if a == b:
will never be True, because the enclosing if has already checked a < b. I think you have an indentation or logic error, and meant:
if a < b:
    result = 0
elif a == b:
    result = 1
else:
    counter = 0
    for i in range(len(l[a])):
        if (int(l[a][i]) == int(l[b][i]) == 1):
            counter += 1
    result = counter / 10000
print((a + 1), (b + 1), result)
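If that reading of the logic is right, only the pairs with a > b need any real work, so the nested loop over all index pairs can be replaced by one pass over unordered pairs. The following is a rough sketch under that assumption. It uses built-in integers instead of bitarray, assumes all strings have the same length, and `pairwise_scores` and its `width` parameter are hypothetical names (the question hard-codes the divisor as 10000).

```python
from itertools import combinations


def pairwise_scores(lines, width):
    """Yield (a, b, score) for every 1-based index pair with a > b.

    score is the number of positions where both strings have a 1,
    divided by width. Pairs with a < b (fixed result 0) and a == b
    (fixed result 1) are left to the caller.
    """
    # Parse each string to an int once, keeping its 1-based index.
    bits = list(enumerate((int(s, 2) for s in lines), start=1))
    # combinations() yields each unordered pair once, lower index first.
    for (b, y), (a, x) in combinations(bits, 2):
        yield a, b, bin(x & y).count("1") / width


scores = list(pairwise_scores(["1111", "1010", "0000"], width=4))
```

This visits n * (n - 1) / 2 pairs rather than n * n, which roughly halves the work before any change of data structure.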
In the worst case, if I understood you correctly:
In [1]: src = map(lambda i: '{:010000b}\n'.format(i), iter(lambda: random.getrandbits(10000), None))
In [2]: import random
In [3]: from itertools import islice
In [4]: with open('teststrs.txt', 'w') as datafile:
   ...:     datafile.writelines(islice(src, 0, 4623))
...
In [35]: testdata = open('teststrs.txt')
In [36]: l = [bitarray(line.rstrip()) for line in testdata]
In [37]: prod = product(l, repeat=2)
In [38]: %time res = [(a & b).count() for a, b in prod]
CPU times: user 52.1 s, sys: 424 ms, total: 52.5 s
Wall time: 52.5 s
In [39]: len(l)
Out[39]: 4623
In [40]: len(l[0])
Out[40]: 10000
Note that I cheated and skipped slicing b. Slicing creates a new copy, and moving all that memory around is very, very expensive:
In [43]: %time res = [(a & b[:len(a)]).count() for a, b in prod]
CPU times: user 29min 40s, sys: 676 ms, total: 29min 41s
Wall time: 29min 40s
So if you know your maximum bit width in advance, or can compute it from your data, I think it would pay off to pad the shorter bit arrays with zeros and then do the whole "count the 1s" step:
In [18]: def test():
   ....:     with open('teststrs.txt') as testdata:
   ....:         lines = [line.strip() for line in testdata]
   ....:     max_len = max(map(len, lines))
   ....:     l = [bitarray(line.ljust(max_len, '0')) for line in lines]
   ....:     prod = product(l, repeat=2)
   ....:     return [(a & b).count() for a, b in prod]
   ....:
In [19]: %timeit test()
1 loops, best of 3: 43.9 s per loop
Here teststrs.txt consists of 4623 strings of 1s and 0s with mixed lengths (randomly chosen from 100, 1000 or 10000).
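The padding step can be tried on a tiny input without bitarray. This sketch assumes the strings are compared from the left, as in the question's loop, so shorter strings are right-padded with '0' via `str.ljust` before being parsed into built-in integers. `pad_and_parse` is a hypothetical helper, and empty lines are assumed not to occur.

```python
def pad_and_parse(lines):
    """Right-pad every bit string with '0' to the longest length, then parse as int."""
    max_len = max(map(len, lines))
    # Padding on the right keeps character position i of every string
    # aligned with the same bit once the strings become integers.
    return [int(line.ljust(max_len, "0"), 2) for line in lines]


lines = ["11", "1011", "1"]
padded = pad_and_parse(lines)  # same as parsing "1100", "1011", "1000"

# Count the shared 1s for every ordered pair, mirroring product(l, repeat=2).
counts = [[bin(x & y).count("1") for y in padded] for x in padded]
```

Without the padding, `int("11", 2)` and `int("1011", 2)` would line up at their last characters rather than their first, which would silently change which positions are compared.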