Django Queryset Iterator for ordered querysets

Ash*_*pta 2 python django django-models

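I want to use a queryset iterator to iterate over a large data set. Django provides iterator() for this, but it hits the database on every iteration. I found the following code for iterating in chunks: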

import gc

def queryset_iterator(queryset, chunksize=1000):
    '''
    Iterate over a Django Queryset ordered by the primary key.
    This method loads a maximum of chunksize (default: 1000) rows in its
    memory at the same time while django normally would load all rows in its
    memory. Using the iterator() method only causes it to not preload all the
    classes.
    Note that the implementation of the iterator
    does not support ordered query sets.
    '''
    pk = 0
    # Highest primary key at the moment iteration starts
    last_pk = queryset.order_by('-pk').values_list('pk', flat=True).first()
    if last_pk is not None:
        queryset = queryset.order_by('pk')
        while pk < last_pk:
            # Fetch the next chunk of rows whose pk is above the last one seen
            for row in queryset.filter(pk__gt=pk)[:chunksize]:
                pk = row.pk
                yield row
            gc.collect()

This works for unordered querysets. Is there any solution or workaround to do this for ordered querysets?

Igo*_*min 5

Here is mine, with ordering support.

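By the way, the iterator you are using has a problem if queryset items are modified during processing: delete or add even a single item and you can end up with a "loop forever".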
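To illustrate the deletion case, here is a small pure-Python simulation of the question's loop. It is only a sketch, not Django code: the `table` list stands in for the database rows, and `pk_chunks` and its `max_rounds` guard are hypothetical names added so the demonstration stops instead of spinning forever.

```python
def pk_chunks(rows, chunksize=2, max_rounds=10):
    """Mimic the question's loop over a list of primary keys.

    `rows` stands in for the table; `max_rounds` is a safety cap added
    only so that this demonstration terminates.
    """
    pk = 0
    last_pk = max(rows) if rows else None  # computed once, up front
    rounds = 0
    if last_pk is not None:
        while pk < last_pk:
            rounds += 1
            if rounds > max_rounds:
                raise RuntimeError("still looping: pk never reached last_pk")
            # Equivalent of queryset.filter(pk__gt=pk)[:chunksize]
            for row_pk in [r for r in sorted(rows) if r > pk][:chunksize]:
                pk = row_pk
                yield row_pk


table = [1, 2, 3, 4, 5]
seen = []
stuck = False
try:
    for value in pk_chunks(table):
        seen.append(value)
        if value == 1:
            table.remove(5)  # the row with the highest pk disappears mid-iteration
except RuntimeError:
    stuck = True
```

Once the row holding `last_pk` is gone, every later chunk comes back empty, `pk` stops advancing, and `pk < last_pk` stays true. Without the guard, the loop would keep querying indefinitely.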

Also, the iterator below does not issue the useless query for last_pk:

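import gc

def queryset_iterator(queryset, chunksize=10000, key=None):
    # Accept a single field name or a list of names; default to the primary key.
    # (str replaces Python 2's basestring, which does not exist in Python 3.)
    key = [key] if isinstance(key, str) else (key or ['pk'])
    counter = 0
    count = chunksize
    # A chunk shorter than chunksize means the end of the queryset was reached
    while count == chunksize:
        offset = counter - counter % chunksize
        count = 0
        for item in queryset.all().order_by(*key)[offset:offset + chunksize]:
            count += 1
            yield item
        counter += count
        gc.collect()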
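As a rough usage sketch, the offset-based iterator can be exercised without a database. `FakeQuerySet` below is a hypothetical stand-in (not part of Django) that supports only the three operations the iterator relies on, `all()`, `order_by()` and slicing, and records each slice so the chunk boundaries are visible. The sample rows are made up for illustration.

```python
import gc


class FakeQuerySet:
    """Hypothetical stand-in for a Django QuerySet, backed by a list of dicts."""

    def __init__(self, rows, slices=None):
        self.rows = list(rows)
        # Shared log of (start, stop) pairs, one per slice taken
        self.slices = slices if slices is not None else []

    def all(self):
        return FakeQuerySet(self.rows, self.slices)

    def order_by(self, *keys):
        rows = self.rows
        # Sort by the last key first so the first key ends up with highest priority
        for key in reversed(keys):
            descending = key.startswith('-')
            field = key.lstrip('-')
            rows = sorted(rows, key=lambda r, f=field: r[f], reverse=descending)
        return FakeQuerySet(rows, self.slices)

    def __getitem__(self, index):
        self.slices.append((index.start, index.stop))
        return self.rows[index]


def queryset_iterator(queryset, chunksize=10000, key=None):
    # Same logic as the iterator above
    key = [key] if isinstance(key, str) else (key or ['pk'])
    counter = 0
    count = chunksize
    while count == chunksize:
        offset = counter - counter % chunksize
        count = 0
        for item in queryset.all().order_by(*key)[offset:offset + chunksize]:
            count += 1
            yield item
        counter += count
        gc.collect()


rows = [{'pk': i, 'name': n} for i, n in enumerate(['d', 'b', 'a', 'c', 'e'], start=1)]
qs = FakeQuerySet(rows)
names = [item['name'] for item in queryset_iterator(qs, chunksize=2, key='name')]
```

Against a real database, each slice becomes a separate LIMIT/OFFSET query. The ordering key therefore needs to be deterministic: if several rows share the same value, chunks may repeat or skip rows. Passing something like `key=['name', 'pk']` as a tie-breaker is one way to guard against that, though that is an assumption about your data rather than part of the original answer.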