Asked by ypercubeᵀᴹ (score 28) · Tags: postgresql, performance, index, database-design, query-performance
I am asking this question specifically for Postgres, since it has good support for R-tree/spatial indexes.
We have the following table with a tree structure (nested set model) of words and their frequencies:
lexikon
-------
_id integer PRIMARY KEY
word text
frequency integer
lset integer UNIQUE KEY
rset integer UNIQUE KEY
and the query:
SELECT word
FROM lexikon
WHERE lset BETWEEN @Low AND @High
ORDER BY frequency DESC
LIMIT @N
I think a covering index on (lset, frequency, word) would be useful, but I feel it may perform poorly if there are too many lset values in the (@Low, @High) range.
A simple index on (frequency DESC) may sometimes also be enough, namely when a search using that index yields the @N rows matching the range condition early.
But performance seems to depend heavily on the parameter values.
Is there a way to make it perform fast, regardless of whether the range (@Low, @High) is wide or narrow, and regardless of whether the high-frequency words are luckily within the (narrow) selected range?
Would an R-tree/spatial index help?
Adding indexes, rewriting the query, re-designing the table: there are no restrictions.
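For reference, the two candidate indexes discussed above might be sketched like this (the index names are made up for illustration only):
-- covering index for the range condition, as discussed above:
CREATE INDEX lexikon_covering_idx ON lexikon (lset, frequency, word);
-- simple index on frequency, hoping the first @N rows matching the lset range come early:
CREATE INDEX lexikon_freq_idx ON lexikon (frequency DESC);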
Answer by Jack Douglas (score 30)
You can get better performance by searching the higher-frequency rows first. This can be achieved by 'granulating' the frequencies and then stepping through them procedurally, for example as follows:
testbed and lexikon dummy data:
begin;
set role dba;
create role stack;
grant stack to dba;
create schema authorization stack;
set role stack;
--
create table lexikon( _id serial,
word text,
frequency integer,
lset integer,
width_granule integer);
--
insert into lexikon(word, frequency, lset)
select word, (1000000/row_number() over(order by random()))::integer as frequency, lset
from (select 'word'||generate_series(1,1000000) word, generate_series(1,1000000) lset) z;
--
update lexikon set width_granule=ln(frequency)::integer;
--
create index on lexikon(width_granule, lset);
create index on lexikon(lset);
-- the second index is not used with the function but is added to make the timings 'fair'
granule analysis (mostly for information and tuning):
create table granule as
select width_granule, count(*) as freq,
min(frequency) as granule_start, max(frequency) as granule_end
from lexikon group by width_granule;
--
select * from granule order by 1;
/*
width_granule | freq | granule_start | granule_end
---------------+--------+---------------+-------------
0 | 500000 | 1 | 1
1 | 300000 | 2 | 4
2 | 123077 | 5 | 12
3 | 47512 | 13 | 33
4 | 18422 | 34 | 90
5 | 6908 | 91 | 244
6 | 2580 | 245 | 665
7 | 949 | 666 | 1808
8 | 349 | 1811 | 4901
9 | 129 | 4926 | 13333
10 | 47 | 13513 | 35714
11 | 17 | 37037 | 90909
12 | 7 | 100000 | 250000
13 | 2 | 333333 | 500000
14 | 1 | 1000000 | 1000000
*/
alter table granule drop column freq;
--
function for scanning high frequencies first:
create function f(p_lset_low in integer, p_lset_high in integer, p_limit in integer)
returns setof lexikon language plpgsql set search_path to 'stack' as $$
declare
m integer;
n integer := 0;
r record;
begin
for r in (select width_granule from granule order by width_granule desc) loop
return query( select *
from lexikon
where width_granule=r.width_granule
and lset>=p_lset_low and lset<=p_lset_high );
get diagnostics m = row_count;
n = n+m;
exit when n>=p_limit;
end loop;
end;$$;
Results (the timings should be taken with a pinch of salt, but each query was run twice to counter any caching effects)
First using the function we've written:
\timing on
--
select * from f(20000, 30000, 5) order by frequency desc limit 5;
/*
_id | word | frequency | lset | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 | 7092 | 23237 | 9
246 | word25112 | 4065 | 25112 | 8
275 | word23825 | 3636 | 23825 | 8
409 | word28660 | 2444 | 28660 | 8
418 | word29923 | 2392 | 29923 | 8
Time: 80.452 ms
*/
select * from f(20000, 30000, 5) order by frequency desc limit 5;
/*
_id | word | frequency | lset | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 | 7092 | 23237 | 9
246 | word25112 | 4065 | 25112 | 8
275 | word23825 | 3636 | 23825 | 8
409 | word28660 | 2444 | 28660 | 8
418 | word29923 | 2392 | 29923 | 8
Time: 0.510 ms
*/
and then with a simple index scan:
select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
/*
_id | word | frequency | lset | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 | 7092 | 23237 | 9
246 | word25112 | 4065 | 25112 | 8
275 | word23825 | 3636 | 23825 | 8
409 | word28660 | 2444 | 28660 | 8
418 | word29923 | 2392 | 29923 | 8
Time: 218.897 ms
*/
select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
/*
_id | word | frequency | lset | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 | 7092 | 23237 | 9
246 | word25112 | 4065 | 25112 | 8
275 | word23825 | 3636 | 23825 | 8
409 | word28660 | 2444 | 28660 | 8
418 | word29923 | 2392 | 29923 | 8
Time: 51.250 ms
*/
\timing off
--
rollback;
Depending on your real-world data, you will probably want to vary the number of granules and the function used to put rows into them. The actual distribution of frequencies is key here, as are the expected values of the limit clause and the size of the lset ranges sought.
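As a hypothetical illustration of that tuning knob (not part of the recipe above), a coarser granulation could be obtained by switching the bucketing function from the natural log to, say, a base-4 logarithm, and rebuilding the granule table afterwards:
-- hypothetical coarser bucketing: log base 4 instead of ln();
-- the base is arbitrary here and would have to be tuned to the real frequency distribution
update lexikon set width_granule = (ln(frequency)/ln(4))::integer;
-- rebuild the granule lookup table to match the new buckets
drop table if exists granule;
create table granule as
select width_granule,
       min(frequency) as granule_start,
       max(frequency) as granule_end
from lexikon
group by width_granule;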
Answer by Erwin Brandstetter (score 23)
I am building on @Jack's setup to make it easier for people to follow and compare. Tested with PostgreSQL 9.1.4.
CREATE TABLE lexikon (
lex_id serial PRIMARY KEY
, word text
, frequency int NOT NULL -- we'd need to do more if NULL was allowed
, lset int
);
INSERT INTO lexikon(word, frequency, lset)
SELECT 'w' || g -- shorter with just 'w'
, (1000000 / row_number() OVER (ORDER BY random()))::int
, g
FROM generate_series(1,1000000) g
From here on, I take a different route:
ANALYZE lexikon;
This solution does not add columns to the original table; it just needs a small helper table. I placed it in the schema public; use any schema of your choice.
CREATE TABLE public.lex_freq AS
WITH x AS (
SELECT DISTINCT ON (f.row_min)
f.row_min, c.row_ct, c.frequency
FROM (
SELECT frequency, sum(count(*)) OVER (ORDER BY frequency DESC) AS row_ct
FROM lexikon
GROUP BY 1
) c
JOIN ( -- list of steps in recursive search
VALUES (400),(1600),(6400),(25000),(100000),(200000),(400000),(600000),(800000)
) f(row_min) ON c.row_ct >= f.row_min -- match next greater number
ORDER BY f.row_min, c.row_ct, c.frequency DESC
)
, y AS (
SELECT DISTINCT ON (frequency)
row_min, row_ct, frequency AS freq_min
, lag(frequency) OVER (ORDER BY row_min) AS freq_max
FROM x
ORDER BY frequency, row_min
-- if one frequency spans multiple ranges, pick the lowest row_min
)
SELECT row_min, row_ct, freq_min
, CASE freq_min <= freq_max
WHEN TRUE THEN 'frequency >= ' || freq_min || ' AND frequency < ' || freq_max
WHEN FALSE THEN 'frequency = ' || freq_min
ELSE 'frequency >= ' || freq_min
END AS cond
FROM y
ORDER BY row_min;
The table looks like this:
row_min | row_ct | freq_min | cond
--------+---------+----------+-------------
400 | 400 | 2500 | frequency >= 2500
1600 | 1600 | 625 | frequency >= 625 AND frequency < 2500
6400 | 6410 | 156 | frequency >= 156 AND frequency < 625
25000 | 25000 | 40 | frequency >= 40 AND frequency < 156
100000 | 100000 | 10 | frequency >= 10 AND frequency < 40
200000 | 200000 | 5 | frequency >= 5 AND frequency < 10
400000 | 500000 | 2 | frequency >= 2 AND frequency < 5
600000 | 1000000 | 1 | frequency = 1
Since the column cond is going to be used in dynamic SQL further down, you have to make this table secure. Always schema-qualify the table if you cannot be sure of an appropriate current search_path, and revoke write privileges from public (and any other untrusted role):
REVOKE ALL ON public.lex_freq FROM public;
GRANT SELECT ON public.lex_freq TO public;
The table lex_freq serves three purposes: it drives the automatic creation of the needed partial indexes, it provides the steps for the iterative function, and it carries information for tuning.
This DO statement creates all the needed indexes:
DO
$$
DECLARE
_cond text;
BEGIN
FOR _cond IN
SELECT cond FROM public.lex_freq
LOOP
IF _cond LIKE 'frequency =%' THEN
EXECUTE 'CREATE INDEX ON lexikon(lset) WHERE ' || _cond;
ELSE
EXECUTE 'CREATE INDEX ON lexikon(lset, frequency DESC) WHERE ' || _cond;
END IF;
END LOOP;
END
$$
All of these partial indexes together span the table once. They are about the same size as one basic index on the whole table:
SELECT pg_size_pretty(pg_relation_size('lexikon')); -- 50 MB
SELECT pg_size_pretty(pg_total_relation_size('lexikon')); -- 71 MB
Only 21 MB of indexes for the 50 MB table so far.
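To verify the sizes on your own setup, the individual index sizes can be listed with a catalog query like this (a quick check added here, not part of the original answer):
SELECT c.relname AS index_name
     , pg_size_pretty(pg_relation_size(c.oid)) AS index_size
FROM   pg_class c
JOIN   pg_index i ON i.indexrelid = c.oid
WHERE  i.indrelid = 'lexikon'::regclass
ORDER  BY c.relname;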
I create most of the partial indexes on (lset, frequency DESC). The second column only helps in special cases. But as both columns involved are of type integer, due to the specifics of data alignment in combination with MAXALIGN in PostgreSQL, the second column does not make the index any bigger. It is a small win for hardly any cost.
There is no point in doing this for partial indexes that only span a single frequency; those are just on (lset). The created indexes look like this:
CREATE INDEX ON lexikon(lset, frequency DESC) WHERE frequency >= 2500;
CREATE INDEX ON lexikon(lset, frequency DESC) WHERE frequency >= 625 AND frequency < 2500;
-- ...
CREATE INDEX ON lexikon(lset, frequency DESC) WHERE frequency >= 2 AND frequency < 5;
CREATE INDEX ON lexikon(lset) WHERE frequency = 1;
The function is somewhat similar in style to @Jack's solution:
CREATE OR REPLACE FUNCTION f_search(_lset_min int, _lset_max int, _limit int)
RETURNS SETOF lexikon
AS $func$
DECLARE
_n int;
_rest int := _limit; -- init with _limit param
_cond text;
BEGIN
FOR _cond IN
SELECT l.cond FROM public.lex_freq l ORDER BY l.row_min
LOOP
-- RAISE NOTICE '_cond: %, _limit: %', _cond, _rest; -- for debugging
RETURN QUERY EXECUTE '
SELECT *
FROM public.lexikon
WHERE ' || _cond || '
AND lset >= $1
AND lset <= $2
ORDER BY frequency DESC
LIMIT $3'
USING _lset_min, _lset_max, _rest;
GET DIAGNOSTICS _n = ROW_COUNT;
_rest := _rest - _n;
EXIT WHEN _rest < 1;
END LOOP;
END
$func$ LANGUAGE plpgsql STABLE;
Major differences:
Dynamic SQL with RETURN QUERY EXECUTE.
As we loop through the steps, a different query plan may be beneficial. The query plan for static SQL is generated once and then reused, which can save some overhead. But in this case the queries are simple and the values are very different, so dynamic SQL is a big win.
Dynamic LIMIT for every query step.
This helps in multiple ways: first, rows are only fetched as needed; in combination with dynamic SQL this may also generate different query plans. Second, there is no need for an additional LIMIT in the function call to trim the surplus.
I picked four examples and ran three different tests with each (numbered 1 to 3 below). I compared the best of five runs with a warm cache:
1. The raw SQL query of the form:
SELECT *
FROM lexikon
WHERE lset >= 20000
AND lset <= 30000
ORDER BY frequency DESC
LIMIT 5;
2. The same after creating this index:
CREATE INDEX ON lexikon(lset);
It takes about the same space as all my partial indexes together:
SELECT pg_size_pretty(pg_total_relation_size('lexikon')) -- 93 MB
3. The function:
SELECT * FROM f_search(20000, 30000, 5);
Results:
SELECT * FROM f_search(20000, 30000, 5);
1: Total runtime: 315.458 ms
2: Total runtime: 36.458 ms
3: Total runtime: 0.330 ms
SELECT * FROM f_search(60000, 65000, 100);
1: Total runtime: 294.819 ms
2: Total runtime: 18.915 ms
3: Total runtime: 1.414 ms
SELECT * FROM f_search(10000, 70000, 100);
1: Total runtime: 426.831 ms
2: Total runtime: 217.874 ms
3: Total runtime: 1.611 ms
SELECT * FROM f_search(1, 1000000, 5);
1: Total runtime: 2458.205 ms
2: Total runtime: 2458.205 ms (with a large range of lset, a sequential scan is faster than the index)
3: Total runtime: 0.266 ms
As expected, the benefit of the function grows with larger ranges of lset and smaller LIMIT.
With very small ranges of lset, the raw query in combination with the index is actually faster. You will want to test and maybe branch: the raw query for small ranges of lset, the function call otherwise. You could even build both into the function for a 'best of both worlds' approach, as sketched below; that is what I would do.
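A minimal sketch of such a branching wrapper, assuming a hypothetical name f_search_smart and an arbitrary threshold of 5000 that would have to be tuned against real data:
CREATE OR REPLACE FUNCTION f_search_smart(_lset_min int, _lset_max int, _limit int)
  RETURNS SETOF lexikon AS
$func$
BEGIN
   IF _lset_max - _lset_min < 5000 THEN  -- arbitrary threshold, tune against real data
      -- small lset range: the plain index scan tends to win
      RETURN QUERY
      SELECT *
      FROM   public.lexikon
      WHERE  lset BETWEEN _lset_min AND _lset_max
      ORDER  BY frequency DESC
      LIMIT  _limit;
   ELSE
      -- wide lset range: step through the partial indexes
      RETURN QUERY
      SELECT * FROM f_search(_lset_min, _lset_max, _limit);
   END IF;
END
$func$ LANGUAGE plpgsql STABLE;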
Depending on your data distribution and typical queries, more steps in lex_freq may help performance. Test to find the sweet spot. With the tools presented here, it is easy to test; one way is shown below.
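For example, one could compare plans and timings of the raw query and the function call for a representative range with EXPLAIN (ANALYZE, BUFFERS). Note that the function call shows up as a black box (a single Function Scan node), so this mainly yields the total runtime:
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   lexikon
WHERE  lset BETWEEN 20000 AND 30000
ORDER  BY frequency DESC
LIMIT  5;

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM f_search(20000, 30000, 5);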