Calculate number of concurrent events in SQL

Sol*_*oub 12 sql postgresql performance timestamp

I have a table that holds phone calls, with the following fields:

  • ID
  • STARTTIME
  • ENDTIME
  • STATUS
  • CALL_FROM
  • CALL_TO

There are 290,000 records loaded into a local PostgreSQL database. I added indexes on ID (unique index), starttime and endtime.

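Searching on Stack Overflow, I found some useful SQL and modified it into what I think should logically work. The problem is that the query runs for many hours and never returns: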

SELECT T1.sid, count(*) as CountSimultaneous
FROM calls_nov T1, calls_nov T2
WHERE
     T1.StartTime between T2.StartTime and T2.EndTime
     and T1.StartTime between '2011-11-02' and '2011-11-03'
GROUP BY
     T1.sid
ORDER BY CountSimultaneous DESC;

Can someone suggest a way to fix the query/indexes so that it actually works, or suggest another way to calculate concurrent calls?

EDIT:

Explain plan:

Sort  (cost=11796758237.81..11796758679.47 rows=176663 width=35)
  Sort Key: (count(*))
  ->  GroupAggregate  (cost=0.00..11796738007.56 rows=176663 width=35)
        ->  Nested Loop  (cost=0.00..11511290152.45 rows=57089217697 width=35)

Table creation script:

CREATE TABLE calls_nov (
  sid varchar,
  starttime timestamp, 
  endtime timestamp, 
  call_to varchar, 
  call_from varchar, 
  status varchar);

Index creation:

CREATE UNIQUE INDEX sid_unique_index on calls_nov (sid);

CREATE INDEX starttime_index on calls_nov (starttime);

CREATE INDEX endtime_index on calls_nov (endtime);

Mik*_*ll' 8

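Here is what the possible overlaps look like, where 'A' is the "reference" interval. Note that the query below (far, far below) does not give the same result as any of the answers posted so far.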

-- A            |------|
-- B |-|
-- C        |---|
-- D          |---|
-- E             |---|
-- F               |---|
-- G                 |---|
-- H                   |---|
-- I                       |---|

"B" doesn't overlap "A" at all. "C" abuts it. {"D", "E", "F", "G"} overlap it. "H" abuts it. "I" doesn't overlap it at all.

create table calls_nov (
  sid varchar(5) primary key,
  starttime timestamp not null,
  endtime timestamp not null
);  

insert into calls_nov values
('A', '2012-01-04 08:00:00', '2012-01-04 08:00:10'),
('B', '2012-01-04 07:50:00', '2012-01-04 07:50:03'),
('C', '2012-01-04 07:59:57', '2012-01-04 08:00:00'),
('D', '2012-01-04 07:59:57', '2012-01-04 08:00:03'),
('E', '2012-01-04 08:00:01', '2012-01-04 08:00:04'),
('F', '2012-01-04 08:00:07', '2012-01-04 08:00:10'),
('G', '2012-01-04 08:00:07', '2012-01-04 08:00:13'),
('H', '2012-01-04 08:00:10', '2012-01-04 08:00:13'),
('I', '2012-01-04 08:00:15', '2012-01-04 08:00:18');

You can see all the overlapping intervals like this. (I used to_char() only to make all the data easy to read. You can omit it in production.)

select t1.sid, to_char(t1.starttime, 'HH12:MI:SS'), 
               to_char(t1.endtime,   'HH12:MI:SS'), 
       t2.sid, to_char(t2.starttime, 'HH12:MI:SS'), 
               to_char(t2.endtime,   'HH12:MI:SS')
from calls_nov t1
inner join calls_nov t2 on (t2.starttime, t2.endtime) 
                  overlaps (t1.starttime, t1.endtime) 
order by t1.sid, t2.sid;

A   08:00:00   08:00:10   A   08:00:00   08:00:10
A   08:00:00   08:00:10   D   07:59:57   08:00:03
A   08:00:00   08:00:10   E   08:00:01   08:00:04
A   08:00:00   08:00:10   F   08:00:07   08:00:10
A   08:00:00   08:00:10   G   08:00:07   08:00:13
B   07:50:00   07:50:03   B   07:50:00   07:50:03
C   07:59:57   08:00:00   C   07:59:57   08:00:00
C   07:59:57   08:00:00   D   07:59:57   08:00:03
D   07:59:57   08:00:03   A   08:00:00   08:00:10
D   07:59:57   08:00:03   C   07:59:57   08:00:00
D   07:59:57   08:00:03   D   07:59:57   08:00:03
D   07:59:57   08:00:03   E   08:00:01   08:00:04
E   08:00:01   08:00:04   A   08:00:00   08:00:10
E   08:00:01   08:00:04   D   07:59:57   08:00:03
E   08:00:01   08:00:04   E   08:00:01   08:00:04
F   08:00:07   08:00:10   A   08:00:00   08:00:10
F   08:00:07   08:00:10   F   08:00:07   08:00:10
F   08:00:07   08:00:10   G   08:00:07   08:00:13
G   08:00:07   08:00:13   A   08:00:00   08:00:10
G   08:00:07   08:00:13   F   08:00:07   08:00:10
G   08:00:07   08:00:13   G   08:00:07   08:00:13
G   08:00:07   08:00:13   H   08:00:10   08:00:13
H   08:00:10   08:00:13   G   08:00:07   08:00:13
H   08:00:10   08:00:13   H   08:00:10   08:00:13
I   08:00:15   08:00:18   I   08:00:15   08:00:18

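You can see from this table that "A" should count 5, including itself. "B" should count 1; it overlaps itself, but no other intervals overlap it. That seems the right thing to do.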
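The abutting intervals "C" and "H" are missing from A's rows because OVERLAPS treats each period as half-open (start <= time < end), so two periods that share only an endpoint do not overlap. A minimal check with literal values copied from the sample data; the column aliases are just illustrative names:

```sql
SELECT (timestamp '2012-01-04 07:59:57', timestamp '2012-01-04 08:00:00')
       OVERLAPS
       (timestamp '2012-01-04 08:00:00', timestamp '2012-01-04 08:00:10')
           AS c_overlaps_a,          -- false: C and A share only an endpoint
       (timestamp '2012-01-04 07:59:57', timestamp '2012-01-04 08:00:03')
       OVERLAPS
       (timestamp '2012-01-04 08:00:00', timestamp '2012-01-04 08:00:10')
           AS d_overlaps_a,          -- true: D ends inside A
       (timestamp '2012-01-04 07:59:57' <= timestamp '2012-01-04 08:00:10'
        AND timestamp '2012-01-04 08:00:00' >= timestamp '2012-01-04 08:00:00')
           AS c_vs_a_closed_bounds;  -- true: <= / >= counts a shared endpoint
```

The third column shows why a join written with <= and >= comparisons counts abutting calls as concurrent while OVERLAPS does not. Which behaviour is right depends on whether a call ending at 08:00:00 should count as concurrent with one starting at 08:00:00.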

Counting is straightforward, but it runs like a ruptured turtle. That's because evaluating an overlap takes a lot of work.

select t1.sid, count(t2.sid) as num_concurrent
from calls_nov t1
inner join calls_nov t2 on (t2.starttime, t2.endtime) 
                  overlaps (t1.starttime, t1.endtime) 
group by t1.sid
order by num_concurrent desc;

A   5
D   4
G   4
E   3
F   3
H   2
C   2
I   1
B   1

To get better performance, you can use the "table" above in a common table expression and count based on that.

with interval_table as (
select t1.sid as sid_1, t1.starttime, t1.endtime,
       t2.sid as sid_2, t2.starttime, t2.endtime
from calls_nov t1
inner join calls_nov t2 on (t2.starttime, t2.endtime) 
                  overlaps (t1.starttime, t1.endtime) 
order by t1.sid, t2.sid
) 
select sid_1, count(sid_2) as num_concurrent
from interval_table
group by sid_1
order by num_concurrent desc;


Erw*_*ter 6

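1.) Your query did not catch all overlaps - this has already been fixed by the other answers.

2.) The data type of your columns starttime and endtime is timestamp, so your WHERE clause is slightly wrong, too: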

BETWEEN '2011-11-02' AND '2011-11-03'

This would include '2011-11-03 00:00'. The upper bound has to be excluded.
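A quick way to see the difference, using only literals (the column aliases are illustrative):

```sql
SELECT timestamp '2011-11-03 00:00' BETWEEN '2011-11-02' AND '2011-11-03'
           AS between_includes_it,    -- true: BETWEEN is inclusive on both ends
       (timestamp '2011-11-03 00:00' >= '2011-11-02'
        AND timestamp '2011-11-03 00:00' <  '2011-11-03')
           AS half_open_includes_it;  -- false: the upper bound is excluded
```

With the BETWEEN form, a call starting at exactly midnight would be counted for both 2011-11-02 and 2011-11-03 if the query were run once per day.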

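3.) Removed the mixed-case spelling without double quotes. Unquoted identifiers are folded to lower case automatically. To put it simply: it is best not to use mixed-case identifiers in PostgreSQL at all.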
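A small sketch of the folding rule. The table case_demo and its "StartTime" column are made up for this illustration and are not part of the question's schema:

```sql
-- Hypothetical demo table: one quoted mixed-case column, one plain column.
CREATE TEMP TABLE case_demo ("StartTime" timestamp, endtime timestamp);

SELECT EndTime     FROM case_demo;  -- works: EndTime is folded to endtime
SELECT "StartTime" FROM case_demo;  -- works: quoted, so the exact case is kept
-- SELECT StartTime FROM case_demo; -- fails: column "starttime" does not exist

DROP TABLE case_demo;
```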

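4.) Transformed the query to use an explicit JOIN, which is always preferable. Actually, I made it a LEFT [OUTER] JOIN, because I also want to count calls that overlap with no other calls.

5.) Simplified the syntax to arrive at this base query: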

SELECT t1.sid, count(*) AS ct
FROM   calls_nov t1
LEFT   JOIN calls_nov t2 ON t1.starttime <= t2.endtime
                        AND t1.endtime >= t2.starttime
WHERE  t1.starttime >= '2011-11-02 0:0'::timestamp
AND    t1.starttime <  '2011-11-03 0:0'::timestamp
GROUP  BY 1
ORDER  BY 2 DESC;

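This query is extremely slow for a big table, because every row starting on '2011-11-02' has to be compared to every row in the entire table, which leads to (almost) O(n²) cost.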
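On the LEFT JOIN choice in point 4: the following is a variation for illustration, not the base query itself. It assumes you want the number of other calls, so it adds a sid <> sid condition and counts t2.sid. The three rows are copied from the sample data in the other answer, and the CTE name calls is arbitrary:

```sql
WITH calls (sid, starttime, endtime) AS (
    VALUES ('A', timestamp '2012-01-04 08:00:00', timestamp '2012-01-04 08:00:10'),
           ('E', timestamp '2012-01-04 08:00:01', timestamp '2012-01-04 08:00:04'),
           ('I', timestamp '2012-01-04 08:00:15', timestamp '2012-01-04 08:00:18')
)
SELECT t1.sid,
       count(t2.sid) AS other_calls   -- counts only matched rows, so 0 is possible
FROM   calls t1
LEFT   JOIN calls t2 ON  t1.starttime <= t2.endtime
                     AND t1.endtime   >= t2.starttime
                     AND t1.sid <> t2.sid        -- assumption: skip the self-match
GROUP  BY t1.sid
ORDER  BY t1.sid;
-- A | 1
-- E | 1
-- I | 0   <- with a plain [INNER] JOIN this row would disappear
```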


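Faster

We can drastically cut down the cost by pre-selecting possible candidates. Only select the columns and rows you need. I do this with two CTEs.

  1. Select the calls that start on the day in question. -> CTE x
  2. Calculate the latest end of those calls. (Subquery in CTE y)
  3. Select only the calls that overlap with the total range of CTE x. -> CTE y
  4. The final query is much faster than querying the huge underlying table.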

WITH x AS (
    SELECT sid, starttime, endtime
    FROM   calls_nov
    WHERE  starttime >= '2011-11-02 0:0'
    AND    starttime <  '2011-11-03 0:0'
    ), y AS (
    SELECT starttime, endtime
    FROM   calls_nov
    WHERE  endtime >= '2011-11-02 0:0'
    AND    starttime <= (SELECT max(endtime) As max_endtime FROM x)
    )
SELECT x.sid, count(*) AS count_overlaps
FROM   x
LEFT   JOIN y ON x.starttime <= y.endtime
             AND x.endtime >= y.starttime
GROUP  BY 1
ORDER  BY 2 DESC;

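Faster yet

I have a real-life table of 350,000 rows with overlapping start / end timestamps similar to yours. I used it for a quick benchmark. PostgreSQL 8.4, scarce resources because it is a test DB. Indexes on start and end. (The index on the ID column is irrelevant here.) Tested with EXPLAIN ANALYZE, best of 5.

Base query:
Total runtime: 476994.774 ms

CTE variant:
Total runtime: 4199.788 ms - that is a factor of more than 100.

After adding a multicolumn index of the form: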

CREATE INDEX start_end_index on calls_nov (starttime, endtime);

Total runtime: 4159.367 ms
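Timings like these are read from the last lines of EXPLAIN ANALYZE output, which executes the statement and reports the elapsed time (labelled "Total runtime" in older releases such as 8.4, "Execution time" in newer ones). A minimal sketch against a throwaway table; timing_demo and its generated rows are invented for illustration and say nothing about the real data:

```sql
-- Hypothetical test data: 1000 calls, one starting each second, 90 s long.
CREATE TEMP TABLE timing_demo AS
SELECT n AS sid,
       timestamp '2011-11-02 00:00' + n * interval '1 second' AS starttime,
       timestamp '2011-11-02 00:00' + n * interval '1 second'
                                    + interval '90 seconds'   AS endtime
FROM   generate_series(1, 1000) AS g(n);

CREATE INDEX timing_demo_idx ON timing_demo (starttime, endtime);
ANALYZE timing_demo;   -- refresh planner statistics before measuring

EXPLAIN ANALYZE
SELECT t1.sid, count(*) AS ct
FROM   timing_demo t1
LEFT   JOIN timing_demo t2 ON t1.starttime <= t2.endtime
                          AND t1.endtime   >= t2.starttime
GROUP  BY 1
ORDER  BY 2 DESC;
```

Running it several times and taking the best result, as above, evens out caching effects.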


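Ultimate speed

If that is not enough, there is a way to speed it up by yet another order of magnitude. Instead of the CTEs above, materialize temporary tables and - this is the crucial point - create an index on the second one. It could look like this: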

Execute as one transaction:

BEGIN;

CREATE TEMP TABLE x ON COMMIT DROP AS   
    SELECT sid, starttime, endtime
    FROM   calls_nov
    WHERE  starttime >= '2011-11-02 0:0'
    AND    starttime <  '2011-11-03 0:0';

CREATE TEMP TABLE y ON COMMIT DROP AS
    SELECT starttime, endtime
    FROM   calls_nov
    WHERE  endtime >= '2011-11-02 0:0'
    AND    starttime <= (SELECT max(endtime) FROM x);

CREATE INDEX y_idx ON y (starttime, endtime); -- this is where the magic happens

SELECT x.sid, count(*) AS ct
FROM   x
LEFT   JOIN y ON x.starttime <= y.endtime
             AND x.endtime >= y.starttime
GROUP  BY 1
ORDER  BY 2 DESC;

COMMIT;  -- both temp tables are dropped here

Read about temporary tables in the manual.
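The ON COMMIT DROP clause is the reason everything has to run inside one transaction: the temporary table disappears at COMMIT. A tiny sketch with a made-up table name (demo_tmp):

```sql
BEGIN;

CREATE TEMP TABLE demo_tmp ON COMMIT DROP AS
    SELECT 1 AS n;

SELECT count(*) FROM demo_tmp;     -- visible inside the transaction

COMMIT;                            -- demo_tmp is dropped here

-- SELECT count(*) FROM demo_tmp;  -- would now fail: relation does not exist
```

In autocommit mode each statement is its own transaction, so a table created with ON COMMIT DROP would be gone before the next statement could use it.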


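Ultimate solution

  • Create a plpgsql function that encapsulates the magic.

  • Diagnose the typical size of your temp tables. Create them standalone and measure: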

    SELECT pg_size_pretty(pg_total_relation_size('tmp_tbl'));
    
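  • If they are bigger than your setting for temp_buffers, temporarily set it high enough in your function to hold both temporary tables in RAM. It is a major speedup if you don't have to swap to disk. (This must be the first use of temp tables in the session to have an effect.)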
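As a rough sketch of that diagnosis, the following compares the current temp_buffers setting with the size of a stand-in temporary table. size_probe and its 100000 generated rows are placeholders; in practice you would create x and y from your real data instead:

```sql
SHOW temp_buffers;   -- current setting; the default is 8MB

-- Placeholder temp table standing in for x or y.
CREATE TEMP TABLE size_probe AS
SELECT n FROM generate_series(1, 100000) AS g(n);

SELECT pg_size_pretty(pg_total_relation_size('size_probe'));

DROP TABLE size_probe;
```

Because temp_buffers can no longer be changed once the session has touched a temporary table, do this measurement in a separate session from the one that runs the function.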

CREATE OR REPLACE FUNCTION f_call_overlaps(date)
  RETURNS TABLE (sid varchar, ct integer) AS
$BODY$
DECLARE
    _from timestamp := $1::timestamp;
    _to   timestamp := ($1 +1)::timestamp;
BEGIN

SET temp_buffers = '64MB'; -- example value; more RAM for temp tables

CREATE TEMP TABLE x ON COMMIT DROP AS   
    SELECT c.sid, starttime, endtime  -- avoid naming conflict with OUT param
    FROM   calls_nov c
    WHERE  starttime >= _from
    AND    starttime <  _to;

CREATE TEMP TABLE y ON COMMIT DROP AS
    SELECT starttime, endtime
    FROM   calls_nov
    WHERE  endtime >= _from
    AND    starttime <= (SELECT max(endtime) FROM x);

CREATE INDEX y_idx ON y (starttime, endtime);

RETURN QUERY
SELECT x.sid, count(*)::int -- AS ct
FROM   x
LEFT   JOIN y ON x.starttime <= y.endtime AND x.endtime >= y.starttime
GROUP  BY 1
ORDER  BY 2 DESC;

END;
$BODY$   LANGUAGE plpgsql;

Call:

SELECT * FROM f_call_overlaps('2011-11-02') -- just name your date

Total runtime: 138.169 ms - that is a factor of more than 3000.


What else can you do to speed it up?

General performance optimization.

CLUSTER calls_nov USING starttime_index; -- this also vacuums the table fully

ANALYZE calls_nov;

  • @Sologoub: I added some more to my answer. (2 upvotes)