shu*_*son 6 sql postgresql performance exists data-modeling
Suppose I have a simplified model in which a patient can have zero or more events. An event has a category and a date. I want to support questions like:
Find all patients that were given a medication after an operation and
the operation happened after an admission.
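Medication, operation and admission are all event categories. There are about 100 possible categories.
I expect about 1,000 patients, each with roughly 10 events per category.
The naive solution I came up with uses two tables, a patient table and an event table. I created an index on event.category and then query with inner joins, like this: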
SELECT COUNT(DISTINCT(patient.id)) FROM patient
INNER JOIN event AS medication
    ON medication.patient_id = patient.id
    AND medication.category = 'medication'
INNER JOIN event AS operation
    ON operation.patient_id = patient.id
    AND operation.category = 'operation'
INNER JOIN event AS admission
    ON admission.patient_id = patient.id
    AND admission.category = 'admission'
WHERE medication.date > operation.date
    AND operation.date > admission.date;
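For reference, the two-table model described above might look roughly like the following. This is only a sketch: the question does not show its actual DDL, so the column types and the index name `event_category_idx` are assumptions.

```sql
-- Hypothetical DDL; the real table definitions are not shown in the question.
CREATE TABLE patient (
    id int PRIMARY KEY
);

CREATE TABLE event (
    id         int  PRIMARY KEY,
    patient_id int  NOT NULL,   -- no FK constraint assumed at this point
    category   text NOT NULL,   -- e.g. 'admission', 'operation', 'medication'
    date       date NOT NULL
);

-- The single-column index mentioned in the question.
CREATE INDEX event_category_idx ON event (category);
```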
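However, this solution does not scale well as more categories/filters are added. With 1,000 patients and 45,000 events I see the following performance behaviour: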
| number of inner joins | approx. query response |
| --------------------- | ---------------------- |
| 2 | 100ms |
| 3 | 500ms |
| 4 | 2000ms |
| 5 | 8000ms |
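Does anyone have suggestions on how to optimize this query or the data model?
Extra info:
project_result is the equivalent of patient in the simplified model.
Advanced use case: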
Find all patients that were given a medication within 30 days after an
operation and the operation happened within 7 days after an admission.
First off, if referential integrity is enforced with FK constraints, you can drop the patient table from the query completely:
SELECT COUNT(DISTINCT patient_id)  -- still not optimal
FROM   event a
JOIN   event o USING (patient_id)
JOIN   event m USING (patient_id)
WHERE  a.category = 'admission'
AND    o.category = 'operation'
AND    m.category = 'medication'
AND    m.date > o.date
AND    o.date > a.date;
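Dropping the patient table is only safe if every event is guaranteed to point at an existing patient. A minimal sketch of such a constraint follows, assuming `patient.id` is the primary key; the constraint name is made up for illustration.

```sql
-- Assumption: patient.id is the primary key (or at least UNIQUE).
-- With this FK in place and patient_id NOT NULL, an inner join to patient
-- can neither add nor remove rows, so the join is redundant.
ALTER TABLE event
    ALTER COLUMN patient_id SET NOT NULL;

ALTER TABLE event
    ADD CONSTRAINT event_patient_id_fkey   -- hypothetical constraint name
    FOREIGN KEY (patient_id) REFERENCES patient (id);
```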
Next, get rid of the repeated multiplication of rows, and of the DISTINCT in the outer SELECT that counters it, by using EXISTS semi-joins instead:
SELECT COUNT(*)
FROM   event a
WHERE  EXISTS (
   SELECT FROM event o
   WHERE  o.patient_id = a.patient_id
   AND    o.category = 'operation'
   AND    o.date > a.date
   AND    EXISTS (
      SELECT FROM event m
      WHERE  m.patient_id = a.patient_id
      AND    m.category = 'medication'
      AND    m.date > o.date
      )
   )
AND    a.category = 'admission';
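Note that this counts qualifying admissions, so the same patient can still be counted more than once. That is probably a fundamental issue in your data model / query design and needs clarification, as discussed in the comments.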
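If the goal is a patient count rather than an admission count, the simplest variation is to keep the semi-joins and count distinct patients in the outer query. This is a sketch of that variation, using the same table and column names as the queries above:

```sql
SELECT count(DISTINCT a.patient_id)   -- patients, not admissions
FROM   event a
WHERE  a.category = 'admission'
AND    EXISTS (
   SELECT FROM event o
   WHERE  o.patient_id = a.patient_id
   AND    o.category = 'operation'
   AND    o.date > a.date
   AND    EXISTS (
      SELECT FROM event m
      WHERE  m.patient_id = a.patient_id
      AND    m.category = 'medication'
      AND    m.date > o.date
      )
   );
```

The DISTINCT here only has to collapse one row per qualifying admission, not the full product of three joined event sets.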
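If you really want to lump all cases of the same patient together for some reason, there are various ways to get the earliest admission for each patient in an initial step, and then repeat a similar approach for every additional step. This is probably fastest for your case (re-introducing the patient table to the query):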
SELECT count(*)
FROM   patient p
CROSS  JOIN LATERAL (  -- get earliest admission
   SELECT e.date
   FROM   event e
   WHERE  e.patient_id = p.id
   AND    e.category = 'admission'
   ORDER  BY e.date
   LIMIT  1
   ) a
CROSS  JOIN LATERAL (  -- get earliest operation after that
   SELECT e.date
   FROM   event e
   WHERE  e.patient_id = p.id
   AND    e.category = 'operation'
   AND    e.date > a.date
   ORDER  BY e.date
   LIMIT  1
   ) o
WHERE  EXISTS (        -- the *last* step can still be a plain EXISTS
   SELECT FROM event m
   WHERE  m.patient_id = p.id
   AND    m.category = 'medication'
   AND    m.date > o.date
   );
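You can optimize the table design by shortening the lengthy (and redundant) category names. Use a lookup table and store only an integer (or even an int2 or "char" value) as FK.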
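A lookup table along those lines could be sketched as follows. The table name `category`, its column names, and the specific ID values are illustrative assumptions, and the sketch assumes `event.category` is currently a text column.

```sql
-- Hypothetical names; the answer only says "lookup table" and "integer FK".
CREATE TABLE category (
    category_id int2 PRIMARY KEY,
    name        text NOT NULL UNIQUE
);

INSERT INTO category (category_id, name) VALUES
    (1, 'admission'),
    (2, 'operation'),
    (3, 'medication');

-- Add the compact FK column and fill it from the existing text column.
ALTER TABLE event
    ADD COLUMN category_id int2 REFERENCES category (category_id);

UPDATE event e
SET    category_id = c.category_id
FROM   category c
WHERE  c.name = e.category;
```

Queries would then filter on `category_id = 2` instead of `category = 'operation'`, which keeps both the table rows and the index entries smaller.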
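For best performance (and this is crucial), create a multicolumn index on (patient_id, category, date DESC) and make sure all three columns are defined NOT NULL. The order of the index expressions matters. DESC is mostly optional here: in your case Postgres can use an index with the default ASC sort order almost as efficiently.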
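Spelled out as DDL, that advice could look like the sketch below. The index name is an arbitrary choice, and the NOT NULL changes will fail if the table already contains NULLs in those columns.

```sql
-- Make sure all three indexed columns are NOT NULL.
ALTER TABLE event
    ALTER COLUMN patient_id SET NOT NULL,
    ALTER COLUMN category   SET NOT NULL,
    ALTER COLUMN date       SET NOT NULL;

-- Equality columns first (patient_id, category), the range column (date) last.
CREATE INDEX event_patient_category_date_idx   -- hypothetical index name
    ON event (patient_id, category, date DESC);
```

Putting `date` last matters because the queries compare `patient_id` and `category` for equality and `date` by range.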
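If VACUUM (preferably in the form of autovacuum) can keep up with write operations, or you have a read-only situation to begin with, you will get very fast index-only scans out of this.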
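One way to check whether index-only scans are actually used is to vacuum the table once and inspect the plan. This is a sketch, assuming a multicolumn index on (patient_id, category, date) exists; the actual plan depends on your data and settings.

```sql
-- Refresh the visibility map and the planner statistics for the table.
VACUUM (ANALYZE) event;

-- Inspect the plan: look for "Index Only Scan" nodes and a low
-- "Heap Fetches" count in the output.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM   event a
WHERE  a.category = 'admission'
AND    EXISTS (
   SELECT FROM event o
   WHERE  o.patient_id = a.patient_id
   AND    o.category = 'operation'
   AND    o.date > a.date
   );
```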
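To implement the additional time frames (your "advanced use case"), build on the second query, since we have to consider all events again.
You should really have case IDs or something more definitive to tie an operation to its admission, a medication to its operation, and so on. (It could simply be the id of the referenced event!) Dates / timestamps alone are error-prone.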
SELECT COUNT(*)                       -- to count cases
    -- COUNT(DISTINCT patient_id)     -- to count patients
FROM   event a
WHERE  EXISTS (
   SELECT FROM event o
   WHERE  o.patient_id = a.patient_id
   AND    o.category = 'operation'
   AND    o.date >= a.date            -- or ">"
   AND    o.date <  a.date + 7        -- based on data type "date"!
   AND    EXISTS (
      SELECT FROM event m
      WHERE  m.patient_id = a.patient_id
      AND    m.category = 'medication'
      AND    m.date >= o.date         -- or ">"
      AND    m.date <  o.date + 30    -- syntax for timestamp is different
      )
   )
AND    a.category = 'admission';
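The suggestion to link events explicitly, rather than by date alone, could be sketched with a self-referencing column. The column name `parent_event_id` is hypothetical, and the sketch assumes `event.id` is the primary key.

```sql
-- Hypothetical column: each operation points at its admission,
-- each medication points at its operation.
ALTER TABLE event
    ADD COLUMN parent_event_id int REFERENCES event (id);

-- With explicit links the date comparisons are no longer needed
-- to decide which events belong together.
SELECT count(*)
FROM   event a
WHERE  a.category = 'admission'
AND    EXISTS (
   SELECT FROM event o
   WHERE  o.parent_event_id = a.id
   AND    o.category = 'operation'
   AND    EXISTS (
      SELECT FROM event m
      WHERE  m.parent_event_id = o.id
      AND    m.category = 'medication'
      )
   );
```

Time-window conditions such as "within 7 days" can still be added on top of the links where they are part of the question being asked.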
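On date / timestamp arithmetic: adding an integer to a date adds that many days, while a timestamp has to be offset with an interval.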
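If the `date` column is actually a `timestamp` (or `timestamptz`), the integer offsets in the query above do not work. A sketch of the same query with interval arithmetic, which also works for type `date`:

```sql
SELECT count(*)
FROM   event a
WHERE  a.category = 'admission'
AND    EXISTS (
   SELECT FROM event o
   WHERE  o.patient_id = a.patient_id
   AND    o.category = 'operation'
   AND    o.date >= a.date
   AND    o.date <  a.date + interval '7 days'
   AND    EXISTS (
      SELECT FROM event m
      WHERE  m.patient_id = a.patient_id
      AND    m.category = 'medication'
      AND    m.date >= o.date
      AND    m.date <  o.date + interval '30 days'
      )
   );
```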