How to optimize a query to compute row-dependent datetime relationships?

shu*_*son 6 sql postgresql performance exists data-modeling

Say I have a simplified model in which a patient can have zero or more events. An event has a category and a date. I want to support questions like:

Find all patients that were given a medication after an operation and 
the operation happened after an admission. 

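Medication, operation and admission are all event categories. There are about 100 possible categories.

I am expecting on the order of 1,000 patients, each with about 10 events per category.

The naive solution I came up with was to have two tables, a patient table and an event table, create an index on event.category, and then query using inner joins, like: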

SELECT COUNT(DISTINCT(patient.id)) FROM patient
INNER JOIN event AS medication
    ON  medication.patient_id = patient.id
    AND medication.category = 'medication'
INNER JOIN event AS operation
    ON  operation.patient_id = patient.id
    AND operation.category = 'operation'
INNER JOIN event AS admission
    ON  admission.patient_id = patient.id
    AND admission.category = 'admission'
WHERE medication.date > operation.date
    AND operation.date > admission.date;

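However, this solution does not scale well as more categories/filters are added. With 1,000 patients and 45,000 events I see the following performance behaviour: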

| number of inner joins | approx. query response |
| --------------------- | ---------------------- |
| 2                     | 100ms                  |
| 3                     | 500ms                  |
| 4                     | 2000ms                 |
| 5                     | 8000ms                 | 

Explain: (linked EXPLAIN output)

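Does anyone have suggestions on how to optimize this query or the data model?

Extra info:

  • Postgres 10.6
  • In the Explain output, project_result is equivalent to patient in the simplified model.

Advanced use case: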

Find all patients that were given a medication within 30 days after an 
operation and the operation happened within 7 days after an admission.

Erw*_*ter 4

First, if referential integrity is enforced with FK constraints, you can drop the patient table from the query completely:

SELECT COUNT(DISTINCT patient_id)  -- still not optimal
FROM   event a
JOIN   event o USING (patient_id)
JOIN   event m USING (patient_id)
WHERE  a.category = 'admission'
AND    o.category = 'operation'
AND    m.category = 'medication'
AND    m.date > o.date
AND    o.date > a.date;

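Next, get rid of the repeated multiplication of rows, and of the DISTINCT in the outer SELECT that is needed to counter it, by using EXISTS semi-joins instead: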

SELECT COUNT(*)
FROM   event a
WHERE  EXISTS (
   SELECT FROM event o
   WHERE  o.patient_id = a.patient_id
   AND    o.category = 'operation'
   AND    o.date > a.date
   AND    EXISTS (
      SELECT FROM event m
      WHERE  m.patient_id = a.patient_id
      AND    m.category = 'medication'
      AND    m.date > o.date
      )
   )
AND    a.category = 'admission';

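Note that there can still be duplicates among admissions: a patient with several qualifying admissions is counted once per admission. But that is probably a principal problem in your data model / query design, and would need clarification as discussed in the comments.

If you do want to lump all cases of the same patient together for some reason, there are various ways to get the earliest admission for each patient in an initial step, and then repeat a similar approach for every additional step. This is probably the fastest for your case (re-introducing the patient table to the query):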

SELECT count(*)
FROM   patient p
CROSS  JOIN LATERAL ( -- get earliest admission
   SELECT e.date
   FROM   event e
   WHERE  e.patient_id = p.id 
   AND    e.category = 'admission'
   ORDER  BY e.date
   LIMIT  1
   ) a
CROSS  JOIN LATERAL ( -- get earliest operation after that
   SELECT e.date
   FROM   event e
   WHERE  e.patient_id = p.id 
   AND    e.category = 'operation'
   AND    e.date > a.date
   ORDER  BY e.date
   LIMIT  1
   ) o
WHERE EXISTS (  -- the *last* step can still be a plain EXISTS
      SELECT FROM event m
      WHERE  m.patient_id = p.id
      AND    m.category = 'medication'
      AND    m.date > o.date
      );

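You might optimize your table design by shortening the lengthy (and redundant) category names. Use a lookup table and only store an integer (or even an int2 or "char" value) as FK.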
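
A minimal sketch of such a lookup table, under assumptions: the names category, category_id and name are hypothetical, the identity column on event is my own choice (available since Postgres 10), and patient.id is assumed to be an integer primary key. Adapt it to your actual schema.

```sql
-- Hypothetical lookup table: one row per category (about 100 in the question).
CREATE TABLE category (
   category_id smallint PRIMARY KEY   -- int2 is plenty for ~100 categories
 , name        text NOT NULL UNIQUE   -- 'admission', 'operation', 'medication', ...
);

-- event stores the small key instead of the category name.
-- Assumes a table patient with an integer primary key "id" already exists.
CREATE TABLE event (
   id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
 , patient_id  int      NOT NULL REFERENCES patient (id)
 , category_id smallint NOT NULL REFERENCES category
 , date        date     NOT NULL
);
```

With this layout the queries would filter on category_id rather than on the category name, resolving the name to its key once up front.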

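For best performance (and this is crucial) have a multicolumn index on (patient_id, category, date DESC) and make sure all three columns are defined NOT NULL. The order of index expressions is important. DESC is mostly optional here: Postgres can use an index with the default ASC sort order almost as efficiently in your case.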
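
Spelled out as DDL, that could look like the following sketch. It assumes the column names used in the queries above (patient_id, category, date); the index name is arbitrary.

```sql
-- Make sure none of the three indexed columns can be NULL.
ALTER TABLE event
   ALTER COLUMN patient_id SET NOT NULL
 , ALTER COLUMN category   SET NOT NULL
 , ALTER COLUMN date       SET NOT NULL;

-- Equality columns first (patient_id, category), the range column (date) last.
CREATE INDEX event_patient_category_date_idx
   ON event (patient_id, category, date DESC);
```

The single-column index on event.category from the question is then likely redundant for these queries.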

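If VACUUM (preferably in the form of autovacuum) can keep up with write operations, or if you have a read-only situation to begin with, you will get very fast index-only scans out of this.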
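
One way to check this, as a sketch: vacuum the table manually once, then look at the plan. It assumes the multicolumn index described above exists. In the output, look for "Index Only Scan" nodes and a low "Heap Fetches" number.

```sql
-- Updates the visibility map (needed for index-only scans) and the statistics.
-- VACUUM cannot run inside a transaction block.
VACUUM (ANALYZE) event;

-- Shows the actual plan, timings and buffer usage for a two-step variant.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM   event a
WHERE  a.category = 'admission'
AND    EXISTS (
   SELECT FROM event o
   WHERE  o.patient_id = a.patient_id
   AND    o.category = 'operation'
   AND    o.date > a.date
   );
```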

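To implement the additional time frames (your "advanced use case"), build on the second query, since we have to consider all events again.

You should really have a case ID or something more definitive to tie an operation to its admission, a medication to its operation, and so on. (It could simply be the id of the related event!) Dates / timestamps alone are error-prone.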

SELECT COUNT(*)                    -- to count cases
   --  COUNT(DISTINCT patient_id)  -- to count patients
FROM   event a
WHERE  EXISTS (
   SELECT FROM event o
   WHERE  o.patient_id = a.patient_id
   AND    o.category = 'operation'
   AND    o.date >= a.date      -- or ">"
   AND    o.date <  a.date + 7  -- based on data type "date"!
   AND    EXISTS (
      SELECT FROM event m
      WHERE  m.patient_id = a.patient_id
      AND    m.category = 'medication'
      AND    m.date >= o.date       -- or ">"
      AND    m.date <  o.date + 30  -- syntax for timestamp is different
      )
   )
AND    a.category = 'admission';
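
To illustrate the remark about an explicit link between events, here is a rough sketch. Everything in it is an assumption rather than part of your model: it presumes event has an integer primary key id, and parent_event_id is a made-up column name.

```sql
-- Hypothetical self-reference: each follow-up event points to the event it belongs to.
ALTER TABLE event
   ADD COLUMN parent_event_id int REFERENCES event (id);

-- The chain admission -> operation -> medication then follows the link
-- instead of being inferred from dates alone.
SELECT count(*)
FROM   event a
WHERE  a.category = 'admission'
AND    EXISTS (
   SELECT FROM event o
   WHERE  o.parent_event_id = a.id
   AND    o.category = 'operation'
   AND    EXISTS (
      SELECT FROM event m
      WHERE  m.parent_event_id = o.id
      AND    m.category = 'medication'
      )
   );
```

Date conditions like the ones above can still be added on top where a time frame matters.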

About date / timestamp arithmetic:
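
A short sketch of the difference, assuming the date column were of type timestamp instead of date. With date you can add an integer number of days; with timestamp you add an interval.

```sql
-- Type date: integer arithmetic counts days.
SELECT date '2019-01-01' + 7;                             -- 2019-01-08

-- Type timestamp: add an interval instead.
SELECT timestamp '2019-01-01 10:00' + interval '7 days';  -- 2019-01-08 10:00:00

-- The time-frame query rewritten for a timestamp column (assumption).
SELECT count(*)
FROM   event a
WHERE  a.category = 'admission'
AND    EXISTS (
   SELECT FROM event o
   WHERE  o.patient_id = a.patient_id
   AND    o.category = 'operation'
   AND    o.date >= a.date
   AND    o.date <  a.date + interval '7 days'
   AND    EXISTS (
      SELECT FROM event m
      WHERE  m.patient_id = a.patient_id
      AND    m.category = 'medication'
      AND    m.date >= o.date
      AND    m.date <  o.date + interval '30 days'
      )
   );
```

Adding an interval to a date also works, but the result is a timestamp, so the integer form is the simpler choice for a pure date column.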