Sol*_*oub 12 sql postgresql performance timestamp
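I have a table of phone calls with the following fields (see the table creation script below):
There are 290,000 records loaded into a local PostgreSQL database. I added indexes on ID (unique index), starttime and endtime.
Searching on Stack Overflow, I found some useful SQL and modified it into what I think should logically work. The problem is that the query runs for many hours and never returns: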
SELECT T1.sid, count(*) as CountSimultaneous
FROM calls_nov T1, calls_nov T2
WHERE
T1.StartTime between T2.StartTime and T2.EndTime
and T1.StartTime between '2011-11-02' and '2011-11-03'
GROUP BY
T1.sid
ORDER BY CountSimultaneous DESC;
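Can someone suggest a way to fix the query or the indexes so that it actually works, or suggest another way to calculate concurrent calls?
EDIT:
Explain plan: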
Sort (cost=11796758237.81..11796758679.47 rows=176663 width=35)
Sort Key: (count(*))
-> GroupAggregate (cost=0.00..11796738007.56 rows=176663 width=35)
-> Nested Loop (cost=0.00..11511290152.45 rows=57089217697 width=35)
Table creation script:
CREATE TABLE calls_nov (
sid varchar,
starttime timestamp,
endtime timestamp,
call_to varchar,
call_from varchar,
status varchar);
Index creation:
CREATE UNIQUE INDEX sid_unique_index on calls_nov (sid);
CREATE INDEX starttime_index on calls_nov (starttime);
CREATE INDEX endtime_index on calls_nov (endtime);
Here is what the possible overlaps look like, where 'A' is the "reference" interval. Note that the query below (far below) does not give the same result as any of the answers already posted.
-- A            |------|
-- B |-|
-- C        |---|
-- D          |---|
-- E             |---|
-- F               |---|
-- G                 |---|
-- H                   |---|
-- I                         |---|
"B" does not overlap "A" at all. "C" abuts it. {"D", "E", "F", "G"} overlap it. "H" abuts it. "I" does not overlap it at all.
create table calls_nov (
sid varchar(5) primary key,
starttime timestamp not null,
endtime timestamp not null
);
insert into calls_nov values
('A', '2012-01-04 08:00:00', '2012-01-04 08:00:10'),
('B', '2012-01-04 07:50:00', '2012-01-04 07:50:03'),
('C', '2012-01-04 07:59:57', '2012-01-04 08:00:00'),
('D', '2012-01-04 07:59:57', '2012-01-04 08:00:03'),
('E', '2012-01-04 08:00:01', '2012-01-04 08:00:04'),
('F', '2012-01-04 08:00:07', '2012-01-04 08:00:10'),
('G', '2012-01-04 08:00:07', '2012-01-04 08:00:13'),
('H', '2012-01-04 08:00:10', '2012-01-04 08:00:13'),
('I', '2012-01-04 08:00:15', '2012-01-04 08:00:18');
You can see all the overlapping intervals like this. (I used to_char() only to make all the data easy to read. You can omit it in production.)
select t1.sid, to_char(t1.starttime, 'HH12:MI:SS'),
to_char(t1.endtime, 'HH12:MI:SS'),
t2.sid, to_char(t2.starttime, 'HH12:MI:SS'),
to_char(t2.endtime, 'HH12:MI:SS')
from calls_nov t1
inner join calls_nov t2 on (t2.starttime, t2.endtime)
overlaps (t1.starttime, t1.endtime)
order by t1.sid, t2.sid;
A 08:00:00 08:00:10 A 08:00:00 08:00:10
A 08:00:00 08:00:10 D 07:59:57 08:00:03
A 08:00:00 08:00:10 E 08:00:01 08:00:04
A 08:00:00 08:00:10 F 08:00:07 08:00:10
A 08:00:00 08:00:10 G 08:00:07 08:00:13
B 07:50:00 07:50:03 B 07:50:00 07:50:03
C 07:59:57 08:00:00 C 07:59:57 08:00:00
C 07:59:57 08:00:00 D 07:59:57 08:00:03
D 07:59:57 08:00:03 A 08:00:00 08:00:10
D 07:59:57 08:00:03 C 07:59:57 08:00:00
D 07:59:57 08:00:03 D 07:59:57 08:00:03
D 07:59:57 08:00:03 E 08:00:01 08:00:04
E 08:00:01 08:00:04 A 08:00:00 08:00:10
E 08:00:01 08:00:04 D 07:59:57 08:00:03
E 08:00:01 08:00:04 E 08:00:01 08:00:04
F 08:00:07 08:00:10 A 08:00:00 08:00:10
F 08:00:07 08:00:10 F 08:00:07 08:00:10
F 08:00:07 08:00:10 G 08:00:07 08:00:13
G 08:00:07 08:00:13 A 08:00:00 08:00:10
G 08:00:07 08:00:13 F 08:00:07 08:00:10
G 08:00:07 08:00:13 G 08:00:07 08:00:13
G 08:00:07 08:00:13 H 08:00:10 08:00:13
H 08:00:10 08:00:13 G 08:00:07 08:00:13
H 08:00:10 08:00:13 H 08:00:10 08:00:13
I 08:00:15 08:00:18 I 08:00:15 08:00:18
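You can see from this table that "A" should count 5, including itself. "B" should count 1; it overlaps itself, but no other interval overlaps it. That seems like the right thing to do.
Counting is straightforward, but it runs like a ruptured turtle. That is because evaluating an overlap takes a lot of work.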
select t1.sid, count(t2.sid) as num_concurrent
from calls_nov t1
inner join calls_nov t2 on (t2.starttime, t2.endtime)
overlaps (t1.starttime, t1.endtime)
group by t1.sid
order by num_concurrent desc;
A 5
D 4
G 4
E 3
F 3
H 2
C 2
I 1
B 1
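The counts above can be cross-checked outside the database. The following Python sketch is only an illustration and not part of the original answer. It assumes every interval is non-empty and models OVERLAPS as a half-open comparison (start <= t < end); the names `CALLS`, `overlaps_half_open`, `overlaps_closed` and `count_concurrent` are made up for this example.

```python
from datetime import datetime

def ts(hms):
    """Parse a time of day on 2012-01-04, the date used in the sample data."""
    return datetime.strptime("2012-01-04 " + hms, "%Y-%m-%d %H:%M:%S")

# Same rows as the INSERT statement above: sid -> (starttime, endtime).
CALLS = {
    "A": (ts("08:00:00"), ts("08:00:10")),
    "B": (ts("07:50:00"), ts("07:50:03")),
    "C": (ts("07:59:57"), ts("08:00:00")),
    "D": (ts("07:59:57"), ts("08:00:03")),
    "E": (ts("08:00:01"), ts("08:00:04")),
    "F": (ts("08:00:07"), ts("08:00:10")),
    "G": (ts("08:00:07"), ts("08:00:13")),
    "H": (ts("08:00:10"), ts("08:00:13")),
    "I": (ts("08:00:15"), ts("08:00:18")),
}

def overlaps_half_open(a, b):
    """Assumed model of OVERLAPS for non-empty intervals: abutting does not count."""
    return a[0] < b[1] and b[0] < a[1]

def overlaps_closed(a, b):
    """Closed-interval test (<= and >=): abutting intervals do count."""
    return a[0] <= b[1] and b[0] <= a[1]

def count_concurrent(calls, overlaps):
    """For each sid, count the intervals that overlap it, itself included."""
    return {
        sid: sum(overlaps(iv, other) for other in calls.values())
        for sid, iv in calls.items()
    }

half_open = count_concurrent(CALLS, overlaps_half_open)
closed = count_concurrent(CALLS, overlaps_closed)
print(half_open)
print(closed)
```

With the half-open test the sketch reproduces the counts listed above. With the closed test, "A" also picks up the two abutting intervals "C" and "H" and counts 7 instead of 5, which is one way two queries that look equivalent can report different numbers.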
To get better performance, you can use the "table" above in a common table expression and count based on that.
with interval_table as (
select t1.sid as sid_1, t1.starttime, t1.endtime,
t2.sid as sid_2, t2.starttime, t2.endtime
from calls_nov t1
inner join calls_nov t2 on (t2.starttime, t2.endtime)
overlaps (t1.starttime, t1.endtime)
order by t1.sid, t2.sid
)
select sid_1, count(sid_2) as num_concurrent
from interval_table
group by sid_1
order by num_concurrent desc;
1.) Your query did not catch all overlaps - this was already fixed by the other answers.
2.) The data type of your columns starttime and endtime is timestamp. So your WHERE clause is also slightly wrong:
BETWEEN '2011-11-02' AND '2011-11-03'
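This would include '2011-11-03 00:00'. The upper border has to be excluded.
3.) Removed the mixed-case syntax without double quotes. Unquoted identifiers are folded to lower case automatically. To put it simply: best not to use mixed-case identifiers in PostgreSQL at all.
4.) Transformed the query to use an explicit JOIN, which is always preferable. Actually, I made it a LEFT [OUTER] JOIN, because I also want to count calls that overlap with no other calls.
5.) Simplified the syntax to arrive at this basic query: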
SELECT t1.sid, count(*) AS ct
FROM calls_nov t1
LEFT JOIN calls_nov t2 ON t1.starttime <= t2.endtime
AND t1.endtime >= t2.starttime
WHERE t1.starttime >= '2011-11-02 0:0'::timestamp
AND t1.starttime < '2011-11-03 0:0'::timestamp
GROUP BY 1
ORDER BY 2 DESC;
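Point 2 above can be illustrated with a short Python sketch. It is an illustration only; `in_between` and `in_half_open` are hypothetical helper names that mimic the two WHERE conditions. BETWEEN is inclusive on both ends, so a call starting exactly at midnight on 2011-11-03 would be attributed to 2011-11-02, while the half-open range used in the basic query excludes it.

```python
from datetime import datetime

DAY_START = datetime(2011, 11, 2)
DAY_END = datetime(2011, 11, 3)  # midnight at the start of the next day

def in_between(t, lo=DAY_START, hi=DAY_END):
    """Mimics: t BETWEEN lo AND hi (both bounds inclusive)."""
    return lo <= t <= hi

def in_half_open(t, lo=DAY_START, hi=DAY_END):
    """Mimics: t >= lo AND t < hi (upper bound excluded)."""
    return lo <= t < hi

midnight_next_day = datetime(2011, 11, 3, 0, 0, 0)
last_second = datetime(2011, 11, 2, 23, 59, 59)

# The boundary row is matched by BETWEEN but not by the half-open range.
print(in_between(midnight_next_day), in_half_open(midnight_next_day))
# A row inside the day is matched by both.
print(in_between(last_second), in_half_open(last_second))
```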
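This query is extremely slow for a big table, because every row starting on '2011-11-02' has to be compared with every row in the whole table, which leads to (almost) O(n²) cost.
We can cut the cost drastically by pre-selecting possible candidates. Select only the columns and rows you need. I do this with two CTEs.
CTE x selects the calls that start on the day in question. CTE y selects only the calls that can overlap the total time range covered by x; a subquery in y computes the latest end time in x. The final query then joins these two small sets instead of the whole table.
WITH x AS (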
SELECT sid, starttime, endtime
FROM calls_nov
WHERE starttime >= '2011-11-02 0:0'
AND starttime < '2011-11-03 0:0'
), y AS (
SELECT starttime, endtime
FROM calls_nov
WHERE endtime >= '2011-11-02 0:0'
AND starttime <= (SELECT max(endtime) As max_endtime FROM x)
)
SELECT x.sid, count(*) AS count_overlaps
FROM x
LEFT JOIN y ON x.starttime <= y.endtime
AND x.endtime >= y.starttime
GROUP BY 1
ORDER BY 2 DESC;
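The pre-selection idea can be checked on a toy data set. The Python sketch below is an illustration under assumptions, not the answer's own code: the seven calls in `CALLS` are invented, and `count_basic` and `count_preselected` are hypothetical names that mirror the basic query and the CTE variant. The point is that filtering y by the two cheap conditions drops rows without changing any count.

```python
from datetime import datetime

def closed_overlap(a, b):
    """Same join condition as the queries: a.start <= b.end AND a.end >= b.start."""
    return a[0] <= b[1] and a[1] >= b[0]

def count_basic(calls, day_start, day_end):
    """Basic query: each call starting in the window is compared with all rows."""
    return {
        sid: sum(closed_overlap(iv, other) for other in calls.values())
        for sid, iv in calls.items()
        if day_start <= iv[0] < day_end
    }

def count_preselected(calls, day_start, day_end):
    """CTE variant: build x and y first, then compare x only with y."""
    # x: calls that start inside the window.
    x = {sid: iv for sid, iv in calls.items() if day_start <= iv[0] < day_end}
    if not x:
        return {}
    max_end = max(end for _, end in x.values())
    # y: rows that end on or after the window start and start no later than max_end.
    y = [iv for iv in calls.values() if iv[1] >= day_start and iv[0] <= max_end]
    return {sid: sum(closed_overlap(iv, other) for other in y) for sid, iv in x.items()}

# Invented sample rows: sid -> (starttime, endtime).
CALLS = {
    "a": (datetime(2011, 11, 1, 23, 50), datetime(2011, 11, 2, 0, 10)),
    "b": (datetime(2011, 11, 2, 9, 0), datetime(2011, 11, 2, 9, 30)),
    "c": (datetime(2011, 11, 2, 9, 15), datetime(2011, 11, 2, 9, 20)),
    "d": (datetime(2011, 11, 2, 23, 55), datetime(2011, 11, 3, 0, 5)),
    "e": (datetime(2011, 11, 3, 0, 1), datetime(2011, 11, 3, 0, 10)),
    "f": (datetime(2011, 11, 1, 10, 0), datetime(2011, 11, 1, 10, 5)),
    "g": (datetime(2011, 11, 3, 12, 0), datetime(2011, 11, 3, 12, 5)),
}

basic = count_basic(CALLS, datetime(2011, 11, 2), datetime(2011, 11, 3))
fast = count_preselected(CALLS, datetime(2011, 11, 2), datetime(2011, 11, 3))
print(basic == fast, fast)
```

The filter is safe because any row overlapping a call in x must end no earlier than that call starts (so not before the window start) and must start no later than that call ends (so not after the latest end in x).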
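I have a real-life table of 350,000 rows with overlapping start / end timestamps similar to yours. I used it for a quick benchmark. PostgreSQL 8.4, scarce resources because it is a test DB. Indexes on start and end. (The index on the ID column is irrelevant here.) Tested with EXPLAIN ANALYZE, best of 5.
Basic query:
Total runtime: 476994.774 ms
CTE variant:
Total runtime: 4199.788 ms - that is faster by a factor of more than 100.
After adding a multicolumn index of the form: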
CREATE INDEX start_end_index on calls_nov (starttime, endtime);
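Total runtime: 4159.367 ms
If that is not enough, there is a way to speed it up by another order of magnitude. Instead of the CTEs above, materialize temp tables and - this is the crucial point - create an index on the second one. It could look like this:
Execute as one transaction: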
CREATE TEMP TABLE x ON COMMIT DROP AS
SELECT sid, starttime, endtime
FROM calls_nov
WHERE starttime >= '2011-11-02 0:0'
AND starttime < '2011-11-03 0:0';
CREATE TEMP TABLE y ON COMMIT DROP AS
SELECT starttime, endtime
FROM calls_nov
WHERE endtime >= '2011-11-02 0:0'
AND starttime <= (SELECT max(endtime) FROM x);
CREATE INDEX y_idx ON y (starttime, endtime); -- this is where the magic happens
SELECT x.sid, count(*) AS ct
FROM x
LEFT JOIN y ON x.starttime <= y.endtime
AND x.endtime >= y.starttime
GROUP BY 1
ORDER BY 2 DESC;
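As a rough analogy for why the index on y matters, the Python sketch below keeps y sorted by start time and uses a binary search to find the candidate rows for each call in x. This is only an analogy under assumptions: it does not describe PostgreSQL's executor, and whether the planner actually uses the index depends on statistics and settings. The function name `count_with_sorted_starts` is hypothetical, and times are plain integers (seconds relative to 08:00:00 in the sample data from the other answer).

```python
from bisect import bisect_right

def count_with_sorted_starts(x_rows, y_rows):
    """x_rows: list of (sid, start, end); y_rows: list of (start, end)."""
    # The sorted list plays the role of the index on (starttime, endtime).
    y_sorted = sorted(y_rows)
    starts = [start for start, _ in y_sorted]
    counts = {}
    for sid, x_start, x_end in x_rows:
        # "Range scan": rows with y.starttime <= x.endtime form a prefix.
        hi = bisect_right(starts, x_end)
        # The second condition is checked on those candidates only.
        counts[sid] = sum(1 for _, y_end in y_sorted[:hi] if y_end >= x_start)
    return counts

# Sample intervals A..I expressed in seconds relative to 08:00:00.
Y_ROWS = [(0, 10), (-600, -597), (-3, 0), (-3, 3), (1, 4),
          (7, 10), (7, 13), (10, 13), (15, 18)]
X_ROWS = [("A", 0, 10), ("E", 1, 4), ("H", 10, 13)]

counts = count_with_sorted_starts(X_ROWS, Y_ROWS)
print(counts)
```

Without the sorted structure, every call in x would have to be compared with every row in y; with it, rows that start after the call ends are never touched.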
Read about temporary tables in the manual.
Create a plpgsql function that encapsulates the magic.
Diagnose the typical size of your temp tables. Create them standalone and measure:
SELECT pg_size_pretty(pg_total_relation_size('tmp_tbl'));
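If they are bigger than your setting for temp_buffers, then temporarily set it high enough in your function to hold both temporary tables in RAM. It is a major speedup if you don't have to swap to disk. (The setting must be changed before the first use of temp tables in the session to have any effect.)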
CREATE OR REPLACE FUNCTION f_call_overlaps(date)
RETURNS TABLE (sid varchar, ct integer) AS
$BODY$
DECLARE
_from timestamp := $1::timestamp;
_to timestamp := ($1 +1)::timestamp;
BEGIN
SET temp_buffers = '64MB'; -- example value; more RAM for temp tables
CREATE TEMP TABLE x ON COMMIT DROP AS
SELECT c.sid, starttime, endtime -- avoid naming conflict with OUT param
FROM calls_nov c
WHERE starttime >= _from
AND starttime < _to;
CREATE TEMP TABLE y ON COMMIT DROP AS
SELECT starttime, endtime
FROM calls_nov
WHERE endtime >= _from
AND starttime <= (SELECT max(endtime) FROM x);
CREATE INDEX y_idx ON y (starttime, endtime);
RETURN QUERY
SELECT x.sid, count(*)::int -- AS ct
FROM x
LEFT JOIN y ON x.starttime <= y.endtime AND x.endtime >= y.starttime
GROUP BY 1
ORDER BY 2 DESC;
END;
$BODY$ LANGUAGE plpgsql;
Call:
SELECT * FROM f_call_overlaps('2011-11-02'); -- just name your date
Total runtime: 138.169 ms - that is a factor of 3000.
CLUSTER calls_nov USING starttime_index; -- this also vacuums the table fully
ANALYZE calls_nov;