Man*_*oel 2 python performance numpy range set
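I need to count the number of unique elements covered by a given set of ranges. My input is the start and end coordinates of these ranges, and I do the following.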
>>> coordinates
[[7960383, 7961255],
[15688414, 15689284],
[19247797, 19248148],
[21786109, 21813057],
[21822367, 21840682],
[21815951, 21822369],
[21776839, 21783355],
[21779693, 21786111],
[21813097, 21815959],
[21776839, 21786111],
[21813097, 21819613],
[21813097, 21822369]]
>>> len(set(chain(*[range(i[0], i[1] + 1) for i in coordinates])))  # here chain is from itertools
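The problem is that it is not fast enough. It takes 3.5 ms on my machine (measured with %timeit), and buying a new computer is not an option. Since I need to do this for millions of sets, that is too slow.
Any suggestions on how to improve this?
Edit: The number of rows can vary. In this case there are 12 rows, but I cannot give any upper limit.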
You can just take the difference between the coordinates of each range and subtract the overlaps:
coordinates =[
[ 7960383, 7961255],
[15688414, 15689284],
[19247797, 19248148],
[21776839, 21786111],
[21813097, 21819613],
[21813097, 21822369]
]
# sort by increasing first coordinate, and if equal, by second:
coordinates.sort()
count = 0
prevEnd = 0
for start, end in coordinates:
    if end > prevEnd:  # ignore a range that is a sub-range of the previous one
        count += end - max(start, prevEnd)
        prevEnd = end
print(count)
This is cheap in both space and time.
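As a quick sanity check, the loop above can be wrapped in a function and compared against a brute-force set. This is only a sketch: the name `count_half_open` is made up here, and it assumes each range is treated as half-open, `[start, end)`, which is what the arithmetic `end - start` counts. It also assumes non-negative coordinates, as in the question.

```python
from itertools import chain


def count_half_open(ranges):
    """Count distinct integers covered by half-open ranges [start, end).

    Hypothetical helper name; it runs the same sweep as the loop above.
    """
    count = 0
    prev_end = 0  # assumes non-negative coordinates
    for start, end in sorted(ranges):
        if end > prev_end:  # skip ranges that are already fully covered
            count += end - max(start, prev_end)
            prev_end = end
    return count


coordinates = [
    [7960383, 7961255],
    [15688414, 15689284],
    [19247797, 19248148],
    [21776839, 21786111],
    [21813097, 21819613],
    [21813097, 21822369],
]

# Brute force: materialize every integer and let a set remove duplicates.
brute_force = len(set(chain.from_iterable(range(s, e) for s, e in coordinates)))
assert count_half_open(coordinates) == brute_force
```

The sweep only sorts the ranges and keeps two integers of state, whereas the brute-force version builds a set with one entry per covered integer.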
After your edit, it is clear that you want the second coordinate to be inclusive. In that case, the "corrected" calculation is as follows:
count = 0
prevEnd = -1
for start, end in coordinates:
    if end > prevEnd:  # ignore a range that is a sub-range of the previous one
        count += end - max(start - 1, prevEnd)
        prevEnd = end
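Another way to look at the inclusive case is to merge overlapping ranges first and then sum their lengths. The sketch below is not the code from this answer but an equivalent reformulation. The helper names `merge_inclusive` and `count_inclusive` are hypothetical, and the result is checked against the set-based expression from the question using its 12 rows.

```python
from itertools import chain


def merge_inclusive(ranges):
    """Merge overlapping or adjacent inclusive ranges [start, end]."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the last merged range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def count_inclusive(ranges):
    """Count distinct integers covered by inclusive ranges."""
    return sum(end - start + 1 for start, end in merge_inclusive(ranges))


coordinates = [
    [7960383, 7961255],
    [15688414, 15689284],
    [19247797, 19248148],
    [21786109, 21813057],
    [21822367, 21840682],
    [21815951, 21822369],
    [21776839, 21783355],
    [21779693, 21786111],
    [21813097, 21815959],
    [21776839, 21786111],
    [21813097, 21819613],
    [21813097, 21822369],
]

# The original expression from the question, used as the reference result.
expected = len(set(chain(*[range(i[0], i[1] + 1) for i in coordinates])))
assert count_inclusive(coordinates) == expected
```

Merging costs a little extra memory for the merged list, but it also yields the disjoint ranges themselves if they are needed later.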