Dav*_*Eyk 9 sql postgresql search
I'm developing a web API to perform RESTful queries against a resource that maps nicely onto a Postgres table. Most of the filtering parameters also map nicely onto parameters of the SQL query. A few of the filtering parameters, however, require a call to my search index (in this case, a Sphinx server).
The simplest thing to do would be to run my search, collect the primary keys from the search results, and stuff those into an IN (...) clause of the SQL query. However, since the search might return a lot of primary keys, I'm not sure this is such a bright idea.
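For concreteness, a minimal sketch of that approach (the resources table, the category filter, and the literal ids are hypothetical stand-ins for whatever the Sphinx search actually returns):
-- Hypothetical resource table; the literal ids stand in for the
-- primary keys returned by the Sphinx search.
select *
from resources
where category = 'books'        -- filter that maps directly onto SQL
  and id in (101, 205, 317);    -- primary keys from the search index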
I expect that most of the time (say, 90%) my search will return results on the order of a few hundred. Perhaps 10% of the time there will be on the order of several thousand results.
Is this a reasonable approach? Is there a better way?
dbe*_*hur 14
I strongly favor an experimental approach to answering performance questions. @Catcall made a nice start, but his experiment is much smaller than many real databases: his 300,000 single-integer rows fit easily in memory, so no IO occurs; in addition, he didn't share the actual timings.
I composed a similar experiment, but sized the sample data at about seven times the memory available on my host (a 7GB dataset on a 1GB, fractional-CPU VM with an NFS-mounted filesystem): 30,000,000 rows consisting of an indexed bigint and a random-length string of between 0 and 400 bytes.
create table t(id bigint primary key, stuff text);
insert into t(id,stuff) select i, repeat('X',(random()*400)::integer)
from generate_series(0,30000000) i;
analyze t;
Here are the EXPLAIN ANALYZE runtimes for IN selections of 10; 100; 1,000; 10,000; and 100,000 random integers in the key domain. Each query is of the following form, with $1 replaced by the set size.
explain analyze
select id from t
where id in (
select (random()*30000000)::integer from generate_series(0,$1)
);
Summary timings (total runtime, taken from the plans below):
 fetch count | total runtime
          10 |       84.6 ms
         100 |     1184.8 ms
       1,000 |    12407.1 ms
      10,000 |   109747.2 ms
     100,000 |  1016842.8 ms
Note that the plan stays the same for every IN-set cardinality: build a hash aggregate of the random integers, then loop and do a single index lookup for each value. Fetch time is near-linear in the cardinality of the IN set, in the 8-12 ms per row range. A faster storage system could no doubt improve these times dramatically, but the experiment shows that Pg handles very large IN sets with aplomb, at least from an execution-speed point of view. Note that if the list is supplied via a bind parameter or by literal interpolation into the SQL statement, you incur additional overhead for transmitting the query to the server and extra parse time, though I suspect both are negligible compared with the IO time of fetching the rows.
# fetch 10
Nested Loop (cost=30.00..2341.27 rows=15002521 width=8) (actual time=0.110..84.494 rows=11 loops=1)
-> HashAggregate (cost=30.00..32.00 rows=200 width=4) (actual time=0.046..0.054 rows=11 loops=1)
-> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=0.036..0.039 rows=11 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=7.672..7.673 rows=1 loops=11)
Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
Total runtime: 84.580 ms
# fetch 100
Nested Loop (cost=30.00..2341.27 rows=15002521 width=8) (actual time=12.405..1184.758 rows=101 loops=1)
-> HashAggregate (cost=30.00..32.00 rows=200 width=4) (actual time=0.095..0.210 rows=101 loops=1)
-> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=0.046..0.067 rows=101 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=11.723..11.725 rows=1 loops=101)
Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
Total runtime: 1184.843 ms
# fetch 1,000
Nested Loop (cost=30.00..2341.27 rows=15002521 width=8) (actual time=14.403..12406.667 rows=1001 loops=1)
-> HashAggregate (cost=30.00..32.00 rows=200 width=4) (actual time=0.609..1.689 rows=1001 loops=1)
-> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=0.128..0.332 rows=1001 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=12.381..12.390 rows=1 loops=1001)
Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
Total runtime: 12407.059 ms
# fetch 10,000
Nested Loop (cost=30.00..2341.27 rows=15002521 width=8) (actual time=21.884..109743.854 rows=9998 loops=1)
-> HashAggregate (cost=30.00..32.00 rows=200 width=4) (actual time=5.761..18.090 rows=9998 loops=1)
-> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=1.004..3.087 rows=10001 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=10.968..10.972 rows=1 loops=9998)
Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
Total runtime: 109747.169 ms
# fetch 100,000
Nested Loop (cost=30.00..2341.27 rows=15002521 width=8) (actual time=110.244..1016781.944 rows=99816 loops=1)
-> HashAggregate (cost=30.00..32.00 rows=200 width=4) (actual time=110.169..253.947 rows=99816 loops=1)
-> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=51.141..77.482 rows=100001 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=10.176..10.181 rows=1 loops=99816)
Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
Total runtime: 1016842.772 ms
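On that last point, one way (not benchmarked above, so treat it as a sketch) to ship the whole id list to Postgres as a single bind parameter is an array compared with = ANY, reusing the experiment's table t:
-- Prepared statement taking the entire id list as one bigint[] parameter,
-- so the statement text stays the same no matter how many ids are passed.
prepare fetch_by_ids (bigint[]) as
  select id, stuff from t where id = any($1);
-- Example call; the literal array stands in for the keys returned by the search.
execute fetch_by_ids(array[42, 4242, 424242]::bigint[]);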
At @Catcall's request, I ran similar queries using a CTE and a temp table. Both approaches had comparably simple nested-loop index-scan plans and ran in comparable (though slightly slower) times than the inline IN query.
-- CTE
EXPLAIN analyze
with ids as (select (random()*30000000)::integer as val from generate_series(0,1000))
select id from t where id in (select ids.val from ids);
Nested Loop (cost=40.00..2351.27 rows=15002521 width=8) (actual time=21.203..12878.329 rows=1001 loops=1)
CTE ids
-> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=0.085..0.306 rows=1001 loops=1)
-> HashAggregate (cost=22.50..24.50 rows=200 width=4) (actual time=0.771..1.907 rows=1001 loops=1)
-> CTE Scan on ids (cost=0.00..20.00 rows=1000 width=4) (actual time=0.087..0.552 rows=1001 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=12.859..12.861 rows=1 loops=1001)
Index Cond: (t.id = ids.val)
Total runtime: 12878.812 ms
(8 rows)
-- Temp table
create table temp_ids as select (random()*30000000)::bigint as val from generate_series(0,1000);
explain analyze select id from t where t.id in (select val from temp_ids);
Nested Loop (cost=17.51..11585.41 rows=1001 width=8) (actual time=7.062..15724.571 rows=1001 loops=1)
-> HashAggregate (cost=17.51..27.52 rows=1001 width=8) (actual time=0.268..1.356 rows=1001 loops=1)
-> Seq Scan on temp_ids (cost=0.00..15.01 rows=1001 width=8) (actual time=0.007..0.080 rows=1001 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=15.703..15.705 rows=1 loops=1001)
Index Cond: (t.id = temp_ids.val)
Total runtime: 15725.063 ms
-- another way: join against the temp table instead of IN
explain analyze select id from t join temp_ids on (t.id = temp_ids.val);
Nested Loop (cost=0.00..24687.88 rows=2140 width=8) (actual time=22.594..16557.789 rows=1001 loops=1)
-> Seq Scan on temp_ids (cost=0.00..31.40 rows=2140 width=8) (actual time=0.014..0.872 rows=1001 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.51 rows=1 width=8) (actual time=16.536..16.537 rows=1 loops=1001)
Index Cond: (t.id = temp_ids.val)
Total runtime: 16558.331 ms
If run again, the temp-table query is dramatically faster, but that's because the set of id values is constant, so the target data is fresh in cache and Pg performs no real IO to execute it the second time.
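To repeat the temp-table timing without that warm-cache advantage, one option (a sketch, not something measured above) is to regenerate temp_ids with a fresh random set before each run, so the target rows are unlikely to be cached already:
-- Rebuild temp_ids with new random keys so the second run hits mostly uncached rows.
drop table if exists temp_ids;
create table temp_ids as
  select (random()*30000000)::bigint as val from generate_series(0,1000);
explain analyze select id from t where t.id in (select val from temp_ids);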
My somewhat naive tests show that using IN (...) is at least an order of magnitude faster than both a join against a temp table and a join against a common table expression. (Frankly, this surprised me.) I tested 3,000 integer values against a table of 300,000 rows.
create table integers (
n integer primary key
);
insert into integers
select generate_series(0, 300000);
-- External ruby program generates 3000 random integers in the range of 0 to 299999.
-- Used Emacs to massage the output into a SQL statement that looks like
explain analyze
select integers.n
from integers where n in (
100109,
100354 ,
100524 ,
...
);
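The external Ruby program isn't shown; an equivalent server-side way to produce the same kind of test (an assumption, mirroring the generate_series approach used earlier) would be:
-- Pick 3,000 random keys in the 0..299999 range directly in SQL,
-- instead of pasting a literal list generated outside the database.
explain analyze
select integers.n
from integers
where n in (
  select (random()*299999)::integer from generate_series(1, 3000)
);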