What is itertools.groupby() used for?

chi*_*imo 8 python python-itertools

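While reading the Python documentation, I came across the itertools.groupby() function. It was not very straightforward, so I decided to look up some information here on Stack Overflow. I found something in "How do I use Python's itertools.groupby()?".

There seemed to be little information about it here and in the documentation, so I decided to post my observations for comments.

Thanks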

chi*_*imo 16

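To begin with, you can read the documentation here.

I will put what I consider the most important point first. I hope the reason becomes clear after the examples.

Always sort the items with the same key that is used for grouping, so as to avoid unexpected results.

itertools.groupby(iterable, key=None or some func) takes an iterable of items and groups them based on a specified key. The key is a function applied to each item, and its result is used as the heading under which that item is grouped. Consecutive items with the same key value end up in the same group, which is why the input should be sorted by that key first.

The return value is an iterator that resembles a dictionary in that every element it yields is a (key, group) pair, where group is itself an iterator over the items in that group.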
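To make the shape of the return value concrete, here is a minimal sketch. The sample string is an arbitrary illustration and does not come from the documentation. Each group is turned into a list so that the pairs can be inspected:

```python
from itertools import groupby

# groupby returns a lazy iterator of (key, group) pairs, not a dict.
# Each group is itself an iterator, so it is materialized with list().
pairs = [(key, list(group)) for key, group in groupby("AAABBA")]
print(pairs)
# [('A', ['A', 'A', 'A']), ('B', ['B', 'B']), ('A', ['A'])]
```

Note that 'A' appears twice as a key, because the last 'A' is not adjacent to the first run.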

Example 1

from itertools import groupby

# Note that the tuple counts as one item in this list. No key is
# specified, so each item in the list is its own key.
c = groupby(['goat', 'dog', 'cow', 1, 1, 2, 3, 11, 10, ('persons', 'man', 'woman')])
dic = {}
for k, v in c:
    dic[k] = list(v)
dic

The result is

{1: [1, 1],
 'goat': ['goat'],
 3: [3],
 'cow': ['cow'],
 ('persons', 'man', 'woman'): [('persons', 'man', 'woman')],
 10: [10],
 11: [11],
 2: [2],
 'dog': ['dog']}

Example 2

# Notice that mulato, cow and cat don't show up in the result. Only the last
# group seen for a given key survives, because it replaces the earlier one in
# the dict: the final 'c' group (camel) wipes out the earlier 'c' group (cow, cat).

list_things = ['goat', 'dog', 'donkey', 'mulato', 'cow', 'cat', ('persons', 'man', 'woman'),
               'wombat', 'mongoose', 'malloo', 'camel']
c = groupby(list_things, key=lambda x: x[0])
dic = {}
for k, v in c:
    dic[k] = list(v)
dic

The result is

{'c': ['camel'],
 'd': ['dog', 'donkey'],
 'g': ['goat'],
 'm': ['mongoose', 'malloo'],
 'persons': [('persons', 'man', 'woman')],
 'w': ['wombat']}
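To see why items go missing, it can help to collect the (key, group) pairs in a list instead of a dict. This is only a diagnostic sketch that reuses the same list_things as above. A list keeps every pair, so the repeated keys become visible:

```python
from itertools import groupby

list_things = ['goat', 'dog', 'donkey', 'mulato', 'cow', 'cat', ('persons', 'man', 'woman'),
               'wombat', 'mongoose', 'malloo', 'camel']

# Keep every (key, group) pair instead of overwriting dict entries.
pairs = [(k, list(v)) for k, v in groupby(list_things, key=lambda x: x[0])]
for k, v in pairs:
    print(k, v)
# g ['goat']
# d ['dog', 'donkey']
# m ['mulato']
# c ['cow', 'cat']
# persons [('persons', 'man', 'woman')]
# w ['wombat']
# m ['mongoose', 'malloo']
# c ['camel']
```

The keys 'm' and 'c' each occur twice, so assigning into a dict keeps only the later group for each.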

Now for the sorted version

# Observe the sorted version, where the data is first sorted on the same key used for grouping.
list_things = ['goat', 'dog', 'donkey', 'mulato', 'cow', 'cat', ('persons', 'man', 'woman'),
               'wombat', 'mongoose', 'malloo', 'camel']
sorted_list = sorted(list_things, key=lambda x: x[0])
print(sorted_list)
print()
c = groupby(sorted_list, key=lambda x: x[0])
dic = {}
for k, v in c:
    dic[k] = list(v)
dic

The result is

['cow', 'cat', 'camel', 'dog', 'donkey', 'goat', 'mulato', 'mongoose', 'malloo', ('persons', 'man', 'woman'), 'wombat']
{'c': ['cow', 'cat', 'camel'],
 'd': ['dog', 'donkey'],
 'g': ['goat'],
 'm': ['mulato', 'mongoose', 'malloo'],
 'persons': [('persons', 'man', 'woman')],
 'w': ['wombat']}

Example 3

things = [("animal", "bear"), ("animal", "duck"), ("plant", "cactus"), ("vehicle", "harley"), \
          ("vehicle", "speed boat"), ("vehicle", "school bus")]
dic = {}
f = lambda x: x[0]
for key, group in groupby(sorted(things, key=f), f):
    dic[key] = list(group)
dic

The result is

{'animal': [('animal', 'bear'), ('animal', 'duck')],
 'plant': [('plant', 'cactus')],
 'vehicle': [('vehicle', 'harley'),
  ('vehicle', 'speed boat'),
  ('vehicle', 'school bus')]}

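Now for a version whose input is not already in key order. I changed the tuples to lists here. Either way the result is the same, because the data is sorted before grouping.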

things = [["animal", "bear"], ["animal", "duck"], ["vehicle", "harley"], ["plant", "cactus"], \
          ["vehicle", "speed boat"], ["vehicle", "school bus"]]
dic = {}
f = lambda x: x[0]
for key, group in groupby(sorted(things, key=f), f):
    dic[key] = list(group)
dic

The result is

{'animal': [['animal', 'bear'], ['animal', 'duck']],
 'plant': [['plant', 'cactus']],
 'vehicle': [['vehicle', 'harley'],
  ['vehicle', 'speed boat'],
  ['vehicle', 'school bus']]}
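The sort-then-group pattern used in these examples could be wrapped in a small helper. This is only a sketch: group_by_key is a hypothetical name, not part of itertools, and it assumes the key values can be compared with each other for sorting:

```python
from itertools import groupby

def group_by_key(items, key):
    """Sort items by key, then group them into a dict of lists.

    Hypothetical helper for illustration; not part of itertools.
    """
    # Sorting and grouping must use the same key function.
    return {k: list(g) for k, g in groupby(sorted(items, key=key), key=key)}

things = [["animal", "bear"], ["vehicle", "harley"], ["plant", "cactus"], ["animal", "duck"]]
grouped = group_by_key(things, key=lambda x: x[0])
print(grouped)
# {'animal': [['animal', 'bear'], ['animal', 'duck']], 'plant': [['plant', 'cactus']], 'vehicle': [['vehicle', 'harley']]}
```

Because sorted is stable, items within each group keep their original relative order.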


MSe*_*ert 5

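As always, the documentation of the function should be the first place to check. However, itertools.groupby is certainly one of the trickiest itertools, because it has some possible pitfalls:

  • It only groups items whose key result is the same for consecutive items: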

    from itertools import groupby
    
    for key, group in groupby([1,1,1,1,5,1,1,1,1,4]):
        print(key, list(group))
    # 1 [1, 1, 1, 1]
    # 5 [5]
    # 1 [1, 1, 1, 1]
    # 4 [4]
    

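    One can use sorted beforehand if one wants an overall groupby.

  • It yields two items, and the second one is an iterator (so one needs to iterate over the second item!). I explicitly had to convert these to a list in the previous example.

  • If one advances the groupby iterator, the previously yielded group is discarded: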

    it = groupby([1,1,1,1,5,1,1,1,1,4])
    key1, group1 = next(it)
    key2, group2 = next(it)
    print(key1, list(group1))
    # 1 []
    

    Even though group1 was not empty to begin with!
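One way to sidestep the last pitfall, sketched here with the same sample data, is to turn each group into a list before the outer iterator is advanced:

```python
from itertools import groupby

data = [1, 1, 1, 1, 5, 1, 1, 1, 1, 4]

# Each group is consumed inside the comprehension before groupby
# moves on to the next key, so no items are lost.
groups = [(key, list(group)) for key, group in groupby(data)]
key1, group1 = groups[0]
key2, group2 = groups[1]
print(key1, group1)  # 1 [1, 1, 1, 1]
print(key2, group2)  # 5 [5]
```

The trade-off is memory: every group is held as a list rather than being streamed.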

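As mentioned, one can use sorted to do an overall groupby operation, but that is quite inefficient (and it throws away the memory efficiency if you want to use groupby on generators). If you cannot guarantee that the input is sorted, there are better alternatives that also avoid the O(n log(n)) sorting overhead.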
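One such alternative, offered here as an assumption about what fits this use case rather than as the definitive choice, is to accumulate items in a collections.defaultdict. It makes a single pass, needs no sorting, and only requires the keys to be hashable. The name group_unsorted is hypothetical:

```python
from collections import defaultdict

def group_unsorted(iterable, key=None):
    """Group all items by key in one pass, without sorting first.

    Hypothetical helper for illustration; keys must be hashable.
    """
    groups = defaultdict(list)
    for item in iterable:
        k = item if key is None else key(item)
        groups[k].append(item)
    return dict(groups)

result = group_unsorted([1, 1, 1, 1, 5, 1, 1, 1, 1, 4])
print(result)
# {1: [1, 1, 1, 1, 1, 1, 1, 1], 5: [5], 4: [4]}
```

Unlike groupby, this merges non-adjacent items with the same key, and it holds all items in memory at once.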

However, groupby is great for checking local properties. The itertools recipes section has two recipes that do this:

def all_equal(iterable):
    "Returns True if all the elements are equal to each other"
    g = groupby(iterable)
    return next(g, True) and not next(g, False)

and:

from operator import itemgetter  # needed for itemgetter(1) below

def unique_justseen(iterable, key=None):
    "List unique elements, preserving order. Remember only the element just seen."
    # unique_justseen('AAAABBBCCDAABBB') --> A B C D A B
    # unique_justseen('ABBCcAD', str.lower) --> A B C A D
    return map(next, map(itemgetter(1), groupby(iterable, key)))
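A short usage sketch for both recipes follows. The definitions are repeated so that the snippet runs on its own, and the list inputs are arbitrary examples:

```python
from itertools import groupby
from operator import itemgetter

def all_equal(iterable):
    "Returns True if all the elements are equal to each other"
    g = groupby(iterable)
    return next(g, True) and not next(g, False)

def unique_justseen(iterable, key=None):
    "List unique elements, preserving order. Remember only the element just seen."
    return map(next, map(itemgetter(1), groupby(iterable, key)))

print(all_equal([1, 1, 1]))  # True
print(all_equal([1, 2, 1]))  # False
print(all_equal([]))         # True, since an empty input has no unequal elements

# unique_justseen returns a lazy map object, so wrap it in list() to inspect it.
print(list(unique_justseen('AAAABBBCCDAABBB')))     # ['A', 'B', 'C', 'D', 'A', 'B']
print(list(unique_justseen('ABBCcAD', str.lower)))  # ['A', 'B', 'C', 'A', 'D']
```

Both recipes rely on the same property: groupby starts a new group whenever the key changes between neighboring items.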