OFFSET vs. ROW_NUMBER()

zzz*_*eek 31 postgresql

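As we know, PostgreSQL's OFFSET requires it to scan through all the rows up to the position you requested. That makes it close to useless for paginating through huge result sets, since it gets slower and slower as the OFFSET goes up.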

PG 8.4 now supports window functions. Instead of saying:

SELECT * FROM table ORDER BY somecol LIMIT 10 OFFSET 500

You can say:

SELECT * FROM (SELECT *, ROW_NUMBER() OVER (ORDER BY somecol ASC) AS rownum FROM table) AS foo
WHERE rownum > 500 AND rownum <= 510
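To check that the two queries select the same page, here is a minimal sketch that runs both forms against a small in-memory SQLite table. The `items` table, its contents, and the helper names are hypothetical stand-ins, and SQLite (3.25 or later, for window functions) is used only so the example runs without a PostgreSQL server. It says nothing about PostgreSQL performance. Note that the sketch adds an outer `ORDER BY rownum`, since the outer query as written above does not by itself guarantee row order.

```python
import sqlite3


def page_with_offset(conn, limit, offset):
    """Fetch one page using LIMIT/OFFSET."""
    return conn.execute(
        "SELECT id, somecol FROM items ORDER BY somecol LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()


def page_with_row_number(conn, limit, offset):
    """Fetch the same page by filtering on ROW_NUMBER()."""
    return conn.execute(
        "SELECT id, somecol FROM ("
        "  SELECT id, somecol,"
        "         ROW_NUMBER() OVER (ORDER BY somecol ASC) AS rownum"
        "  FROM items"
        ") AS foo "
        # rownum is 1-based, so OFFSET n corresponds to rownum > n
        "WHERE rownum > ? AND rownum <= ? ORDER BY rownum",
        (offset, offset + limit),
    ).fetchall()


def build_demo_db(rows=1000):
    """Build a hypothetical in-memory table standing in for the real one."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, somecol INTEGER)")
    conn.executemany(
        "INSERT INTO items (somecol) VALUES (?)",
        [(n * 7,) for n in range(rows)],  # distinct values, so the order is unambiguous
    )
    return conn


conn = build_demo_db()
same = page_with_offset(conn, 10, 500) == page_with_row_number(conn, 10, 500)
print("same page:", same)
```

Both helpers take `limit` and `offset` in the same order, so either can be swapped in for the other when comparing results.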

Does the latter approach help us at all? Or do we have to keep using identity columns and temp tables for large pagination?
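For context, one common reading of the "identity column" approach is to remember the last id of the previous page and filter on it, rather than counting rows. The sketch below assumes that reading. The `items` table, the `fetch_after` helper, and the use of in-memory SQLite are all hypothetical choices made so the example is runnable on its own.

```python
import sqlite3


def fetch_after(conn, last_id, limit):
    """Return up to `limit` rows whose id is greater than `last_id`."""
    return conn.execute(
        "SELECT id, somecol FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, limit),
    ).fetchall()


# Hypothetical demo data: ids 1..1000 assigned by the INTEGER PRIMARY KEY.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, somecol TEXT)")
conn.executemany(
    "INSERT INTO items (somecol) VALUES (?)",
    [("data %d" % n,) for n in range(1, 1001)],
)

first_page = fetch_after(conn, 0, 100)
# The last id seen becomes the starting point for the next page.
next_page = fetch_after(conn, first_page[-1][0], 100)
print(next_page[0][0], next_page[-1][0])
```

The trade-off is that this only supports moving page by page from a known position. Jumping straight to an arbitrary page number still needs OFFSET or a row count.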

zzz*_*eek 24

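I've constructed a test which compares OFFSET, cursors, and ROW_NUMBER(). My impression of ROW_NUMBER(), that it would be consistent in speed regardless of where you are in the result set, is correct. However, that speed is dramatically slower than either OFFSET or CURSOR. Those two, as was also my impression, are pretty much the same in speed, and both get slower the further toward the end of the result set you go.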

Results:

offset(100,100): 0.016359
scroll(100,100): 0.018393
rownum(100,100): 15.535614

offset(100,480000): 1.761800
scroll(100,480000): 1.781913
rownum(100,480000): 15.158601

offset(100,999900): 3.670898
scroll(100,999900): 3.664517
rownum(100,999900): 14.581068
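Each figure above is the total time in seconds for 10 calls, since the test script runs every method 10 times per measurement. As a small worked example, the following sketch converts the totals into per-call times and computes how much slower ROW_NUMBER() is than OFFSET at each position. The numbers are copied from the results above; the helper names are only illustrative.

```python
# Totals copied from the results above: seconds for 10 calls of each method,
# keyed by the starting offset.
CALLS = 10
totals = {
    100:    {"offset": 0.016359, "scroll": 0.018393, "rownum": 15.535614},
    480000: {"offset": 1.761800, "scroll": 1.781913, "rownum": 15.158601},
    999900: {"offset": 3.670898, "scroll": 3.664517, "rownum": 14.581068},
}


def per_call(total, calls=CALLS):
    """Average seconds for a single call."""
    return total / calls


def slowdown(row):
    """How many times slower ROW_NUMBER() is than OFFSET for one measurement."""
    return row["rownum"] / row["offset"]


for start, row in totals.items():
    print("start=%d  offset=%.4fs/call  rownum=%.4fs/call  ratio=%.1fx" % (
        start, per_call(row["offset"]), per_call(row["rownum"]), slowdown(row)))
```

On these numbers ROW_NUMBER() is roughly 950 times slower than OFFSET near the start of the table and still about 4 times slower at the very end.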

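The test script uses sqlalchemy to set up a table and 1000000 rows of test data. It then uses a psycopg2 cursor to execute each SELECT statement and fetch the results with the three different methods.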

import time

from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, inspect)

metadata = MetaData()
engine = create_engine('postgresql://scott:tiger@localhost/test', echo=True)

t1 = Table('t1', metadata,
    Column('id', Integer, primary_key=True),
    Column('d1', String(50)),
    Column('d2', String(50)),
    Column('d3', String(50)),
    Column('d4', String(50)),
    Column('d5', String(50))
)

# Create and populate the table only on the first run.
# Inspector.has_table() needs SQLAlchemy 1.4 or later.
if not inspect(engine).has_table('t1'):
    # engine.begin() commits the transaction when the block exits
    with engine.begin() as conn:
        t1.create(conn)

        # 1000000 rows, inserted as 100 batches of 10000
        for i in range(100):
            conn.execute(t1.insert(), [
                dict(
                    ('d%d' % col, "data data data %d %d" % (col, (i * 10000) + j))
                    for col in range(1, 6)
                ) for j in range(1, 10001)
            ])

def timeit(fn, count, *args):
    # run fn(*args) `count` times and print the total elapsed seconds
    now = time.time()
    for i in range(count):
        fn(*args)
    total = time.time() - now
    print("%s(%s): %f" % (fn.__name__, ",".join(repr(x) for x in args), total))

# this is a raw psycopg2 connection.
conn = engine.raw_connection()

def offset(limit, offset):
    cursor = conn.cursor()
    cursor.execute("select * from t1 order by id limit %d offset %d" % (limit, offset))
    cursor.fetchall()
    cursor.close()

def rownum(limit, offset):
    cursor = conn.cursor()
    # rownum is 1-based: rows offset+1 .. offset+limit match LIMIT/OFFSET above
    cursor.execute("select * from (select *, "
                    "row_number() over (order by id asc) as rownum from t1) as foo "
                    "where rownum>%d and rownum<=%d" % (offset, limit + offset))
    cursor.fetchall()
    cursor.close()

def scroll(limit, offset):
    # passing a name makes this a server-side cursor
    cursor = conn.cursor('foo')
    cursor.execute("select * from t1 order by id")
    cursor.scroll(offset)
    cursor.fetchmany(limit)
    cursor.close()

print()

timeit(offset, 10, 100, 100)
timeit(scroll, 10, 100, 100)
timeit(rownum, 10, 100, 100)

print()

timeit(offset, 10, 100, 480000)
timeit(scroll, 10, 100, 480000)
timeit(rownum, 10, 100, 480000)

print()

timeit(offset, 10, 100, 999900)
timeit(scroll, 10, 100, 999900)
timeit(rownum, 10, 100, 999900)

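  • A psycopg2 server-side cursor is invoked when you call cursor() with a name, as described at http://initd.org/psycopg/docs/usage.html#server-side-cursors. If you remove the name parameter from the call to cursor(), each call to the "scroll" function above takes about 10 seconds. That is because psycopg2 loads the result set fully if you don't open a server-side cursor; in the example above it pulls 1M rows over the wire. (5 upvotes)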