n largest elements in a sequence (duplicates must be retained)

Pra*_*ota 8 python sorting algorithm heap sequence

I need to find the n largest elements in a list of tuples. Here is an example for the top 3 elements.

# I have a list of tuples of the form (category-1, category-2, value)
# For each category-1, ***values are already sorted descending by default***
# The list can potentially be approximately a million elements long.
lot = [('a', 'x1', 10), ('a', 'x2', 9), ('a', 'x3', 9), 
       ('a', 'x4',  8), ('a', 'x5', 8), ('a', 'x6', 7),
       ('b', 'x1', 10), ('b', 'x2', 9), ('b', 'x3', 8), 
       ('b', 'x4',  7), ('b', 'x5', 6), ('b', 'x6', 5)]

# This is what I need. 
# A list of tuple with top-3 largest values for each category-1
ans = [('a', 'x1', 10), ('a', 'x2', 9), ('a', 'x3', 9), 
       ('a', 'x4', 8), ('a', 'x5', 8),
       ('b', 'x1', 10), ('b', 'x2', 9), ('b', 'x3', 8)]

I tried using heapq.nlargest. However, it returns only the 3 largest elements and does not include the remaining duplicates. For example,

heapq.nlargest(3, [10, 10, 10, 9, 8, 8, 7, 6])
# returns
[10, 10, 10]
# I need
[10, 10, 10, 9, 8, 8]
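For the flat-list case, one possible sketch is to pick the n largest distinct values first and then keep every element that matches one of them. The helper name `nlargest_with_ties` is made up for illustration, and the sketch assumes the values are hashable.

```python
import heapq

def nlargest_with_ties(n, values):
    """Return all elements whose value is among the n largest distinct values."""
    # Deduplicate first so that nlargest ranks distinct values only.
    top = set(heapq.nlargest(n, set(values)))
    # Keep every occurrence of a top value, largest first.
    return sorted((v for v in values if v in top), reverse=True)

print(nlargest_with_ties(3, [10, 10, 10, 9, 8, 8, 7, 6]))
# [10, 10, 10, 9, 8, 8]
```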

I can only think of a brute-force approach. This is what I have, and it works.

res, prev_t, count = [lot[0]], lot[0], 1
for t in lot[1:]:
    if t[0] == prev_t[0]:
        count = count + 1 if t[2] != prev_t[2] else count
        if count <= 3:
            res.append(t)   
    else:
        count = 1
        res.append(t)
    prev_t = t

print(res)
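The loop above relies on the list being grouped by category-1 and sorted descending within each group. If that could not be guaranteed, a two-pass variant might look like the sketch below. The function name `top_n_per_category` and the sample data are hypothetical; the result keeps the input order.

```python
import heapq
from collections import defaultdict

def top_n_per_category(rows, n=3):
    """Keep rows whose value is among the n largest distinct values of their category."""
    # Pass 1: collect the distinct values seen for each category-1 key.
    distinct = defaultdict(set)
    for category, _, value in rows:
        distinct[category].add(value)
    # Reduce each set to its n largest distinct values.
    keep = {c: set(heapq.nlargest(n, vals)) for c, vals in distinct.items()}
    # Pass 2: filter the rows, preserving their original order.
    return [row for row in rows if row[2] in keep[row[0]]]

# Deliberately unsorted sample input (made up for this example).
sample = [('a', 'x6', 7), ('a', 'x1', 10), ('b', 'x3', 8), ('a', 'x3', 9),
          ('a', 'x2', 9), ('b', 'x1', 10), ('a', 'x4', 8), ('b', 'x4', 7),
          ('b', 'x2', 9)]
print(top_n_per_category(sample))
```

This trades the single pass for two passes plus a dictionary of sets, so it is only worth considering when the ordering assumption does not hold.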

Any other ideas on how to implement this? Thanks!

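Edit: timeit results on a list of 1 million elements show that mhyfritz's solution runs in 1/3 of the time of the brute-force approach. I didn't want to make the question too long, so I added more details in my answer.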
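A comparison of this kind could be set up roughly as follows. This is only a sketch of a harness, not the benchmark that produced the figure above: the generator `make_rows`, the function names, and the data sizes are all assumptions, and the timings will vary by machine and Python version.

```python
import timeit
from itertools import groupby, islice
from operator import itemgetter

def make_rows(n_categories, rows_per_category):
    # Hypothetical synthetic data: within each category the values descend
    # and every value appears twice.
    return [(c, 'x%d' % i, rows_per_category - i // 2)
            for c in range(n_categories)
            for i in range(rows_per_category)]

def brute_force(rows, n=3):
    # Same logic as the loop in the question, wrapped in a function.
    res, prev, count = [rows[0]], rows[0], 1
    for t in rows[1:]:
        if t[0] == prev[0]:
            if t[2] != prev[2]:
                count += 1
            if count <= n:
                res.append(t)
        else:
            count = 1
            res.append(t)
        prev = t
    return res

def with_groupby(rows, n=3):
    # Nested groupby: first by category-1, then by value.
    out = []
    for _, g1 in groupby(rows, itemgetter(0)):
        for _, g2 in islice(groupby(g1, itemgetter(2)), n):
            out.extend(g2)
    return out

rows = make_rows(200, 50)  # 10,000 rows; scale up to approach a million
assert brute_force(rows) == with_groupby(rows)
for fn in (brute_force, with_groupby):
    best = min(timeit.repeat(lambda: fn(rows), number=3, repeat=3))
    print(fn.__name__, best)
```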

mhy*_*itz 7

I take it from your code snippet that lot is grouped with respect to category 1. The following should work:

from itertools import groupby, islice
from operator import itemgetter

ans = []
for x, g1 in groupby(lot, itemgetter(0)):
    for y, g2 in islice(groupby(g1, itemgetter(2)), 0, 3):
        ans.extend(list(g2))

print(ans)
# [('a', 'x1', 10), ('a', 'x2', 9), ('a', 'x3', 9), ('a', 'x4', 8), ('a', 'x5', 8),
#  ('b', 'x1', 10), ('b', 'x2', 9), ('b', 'x3', 8)]
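To reuse the same idea with a different n, the nested groupby could be wrapped in a function, as in the sketch below. The name `top_n_groupby` and its `n` parameter are illustrative. As with the snippet above, it assumes the rows are already grouped by category-1 and sorted descending by value within each group, because groupby only merges adjacent equal keys.

```python
from itertools import groupby, islice
from operator import itemgetter

def top_n_groupby(rows, n=3):
    """Return rows holding the n largest distinct values per category-1 group."""
    result = []
    # Outer groupby: one group per run of equal category-1 keys.
    for _, by_category in groupby(rows, key=itemgetter(0)):
        # Inner groupby: one group per run of equal values; take the first n runs.
        for _, same_value in islice(groupby(by_category, key=itemgetter(2)), n):
            result.extend(same_value)
    return result

rows = [('a', 'x1', 10), ('a', 'x2', 9), ('a', 'x3', 9), ('a', 'x4', 8),
        ('b', 'x1', 10), ('b', 'x2', 9)]
print(top_n_groupby(rows, n=2))
# [('a', 'x1', 10), ('a', 'x2', 9), ('a', 'x3', 9), ('b', 'x1', 10), ('b', 'x2', 9)]
```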

  • And as a one-liner: `list(chain(*(list(g2) for x, g1 in groupby(lot, itemgetter(0)) for y, g2 in islice(groupby(g1, itemgetter(2)), 0, 3))))` (2 upvotes)
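For reference, the one-liner from the comment needs `chain` imported from itertools as well. A runnable sketch using the question's sample data follows; it uses `chain.from_iterable` in place of argument unpacking, which is a small substitution and not what the comment wrote.

```python
from itertools import chain, groupby, islice
from operator import itemgetter

lot = [('a', 'x1', 10), ('a', 'x2', 9), ('a', 'x3', 9),
       ('a', 'x4',  8), ('a', 'x5', 8), ('a', 'x6', 7),
       ('b', 'x1', 10), ('b', 'x2', 9), ('b', 'x3', 8),
       ('b', 'x4',  7), ('b', 'x5', 6), ('b', 'x6', 5)]

# Flatten the first three value-runs of every category-1 group into one list.
ans = list(chain.from_iterable(
    list(g2)
    for _, g1 in groupby(lot, itemgetter(0))
    for _, g2 in islice(groupby(g1, itemgetter(2)), 0, 3)
))
print(ans)
```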