Optimizing a query on a table with hundreds of millions of rows

Question

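This feels like a "do my homework for me" kind of question, but I'm really stuck trying to make this query run quickly against a table with a lot of rows. Here's a SQLFiddle that shows the schema (more or less).

I've played with the indexes, trying to get something that covers all the required columns, but haven't had much success. Here's the create: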

CREATE TABLE `AuditEvent` (
    `auditEventId` bigint(20) NOT NULL AUTO_INCREMENT,
    `eventTime` datetime NOT NULL,
    `target1Id` int(11) DEFAULT NULL,
    `target1Name` varchar(100) DEFAULT NULL,
    `target2Id` int(11) DEFAULT NULL,
    `target2Name` varchar(100) DEFAULT NULL,
    `clientId` int(11) NOT NULL DEFAULT '1',
    `type` int(11) not null,
    PRIMARY KEY (`auditEventId`),
    KEY `Transactions` (`clientId`,`eventTime`,`target1Id`,`type`),
    KEY `TransactionsJoin` (`auditEventId`, `clientId`,`eventTime`,`target1Id`,`type`)
)

And (a version of) the select:

select ae.target1Id, ae.type, count(*)
from AuditEvent ae
where ae.clientId=4
    and (ae.eventTime between '2011-09-01 03:00:00' and '2012-09-30 23:57:00')
group by ae.target1Id, ae.type;

I end up with 'Using temporary' and 'Using filesort'. I tried dropping the count(*) and using select distinct instead, which does not cause the 'Using filesort'. That would probably be fine if there were a way to join back to get the counts.

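Originally, the decision was made to track the target1Name and target2Name of the targets as they existed when the audit record was created. I need those names as well (the most recent will do).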

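Currently the query (above, missing the target1Name and target2Name columns) runs in about 5 seconds on ~24 million records. Our target is hundreds of millions of rows, and we'd like the query to keep performing along those lines (we hope to keep it under 1-2 minutes, though we'd like it to be much better). My fear is that it won't once we reach that volume of data (work to simulate the additional rows is underway).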

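I'm not sure of the best strategy for getting the additional fields. If I add the columns straight into the select, I lose the 'Using index' on the query. I tried a join back to the table, which keeps the 'Using index' but takes around 20 seconds.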

I did try changing the eventTime column to an int rather than a datetime, but that didn't seem to affect the index use or the time.

Answer 1

new*_*ver 5

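As you probably understand, the problem here is the range condition ae.eventTime between '2011-09-01 03:00:00' and '2012-09-30 23:57:00', which (as it always does) breaks efficient use of the Transactions index. That is, the index is actually used only for the clientId equality and the first part of the range condition, and it is not used for the grouping.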

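Most often, the solution is to replace the range condition with an equality check (in your case, introduce a period column, group eventTime into periods, and replace the BETWEEN clause with period IN (1,2,3,4,5)). But this might become an overhead for your table.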
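A rough sketch of the period idea, assuming one period per calendar month stored as a YYYYMM integer. The column name period, the index name TransactionsPeriod, and the monthly granularity are assumptions made for illustration; none of them come from the question. Because the original range starts at 03:00 on the first day and ends at 23:57 on the last, whole-month periods would count slightly more rows than the BETWEEN version.

```sql
-- Hypothetical column and index; the names are illustrative only.
ALTER TABLE AuditEvent
  ADD COLUMN period INT NOT NULL DEFAULT 0,
  ADD KEY TransactionsPeriod (clientId, period, target1Id, type);

-- Backfill existing rows: EXTRACT(YEAR_MONTH ...) yields e.g. 201109.
-- On a very large table this would likely need to be done in batches.
UPDATE AuditEvent
SET period = EXTRACT(YEAR_MONTH FROM eventTime);

-- The range condition becomes a list of equality checks.
SELECT ae.target1Id, ae.type, COUNT(*) AS cnt
FROM AuditEvent ae
WHERE ae.clientId = 4
  AND ae.period IN (201109, 201110, 201111, 201112,
                    201201, 201202, 201203, 201204, 201205,
                    201206, 201207, 201208, 201209)
GROUP BY ae.target1Id, ae.type;
```

New rows would also have to set period at insert time (in the application or via a trigger), which is part of the overhead mentioned above.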

Another solution you might try is to add another index (possibly replacing Transactions if it is no longer used): (clientId, target1Id, type, eventTime), and use the following query:

SELECT
  ae.target1Id,
  ae.type,
  COUNT(
    NULLIF(ae.eventTime BETWEEN '2011-09-01 03:00:00' 
                            AND '2012-09-30 23:57:00', 0)
  ) as cnt
FROM AuditEvent ae
WHERE ae.clientId=4
GROUP BY ae.target1Id, ae.type;

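This way, you will a) move the range condition to the end, b) allow the index to be used for the grouping, and c) make the index a covering index for the query (that is, the query can be answered from the index alone, without reading the table rows).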
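The index described above could be created along these lines; the name TransactionsGroup is made up for the example. Running EXPLAIN on the query afterwards is one way to check whether 'Using temporary' and 'Using filesort' have actually gone away on your data.

```sql
-- Hypothetical index name; the column order is the one suggested above.
ALTER TABLE AuditEvent
  ADD KEY TransactionsGroup (clientId, target1Id, type, eventTime);

-- Inspect the plan for the conditional-count query.
EXPLAIN
SELECT ae.target1Id,
       ae.type,
       -- NULLIF turns a false (0) comparison into NULL, which COUNT skips.
       COUNT(NULLIF(ae.eventTime BETWEEN '2011-09-01 03:00:00'
                                     AND '2012-09-30 23:57:00', 0)) AS cnt
FROM AuditEvent ae
WHERE ae.clientId = 4
GROUP BY ae.target1Id, ae.type;
```

Two things to keep in mind: this form reads every index entry for the client, not only those inside the date range, and it also returns (target1Id, type) pairs whose count is 0. Adding HAVING cnt > 0 would drop those if they are not wanted.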

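UPD1: Sorry, I did not read your post carefully yesterday and did not notice that your problem is retrieving target1Name and target2Name. First of all, I am not sure you understand Using index correctly. The absence of Using index does not mean that no index is used for the query; Using index means that the index itself contains enough data to execute the subquery (that is, the index is covering). Since target1Name and target2Name are not included in any index, the subquery that fetches them will not have Using index.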
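To see the difference, one could compare EXPLAIN for a query that touches only indexed columns with one that also selects a name column. Which index MySQL picks depends on the data and the server version, so no output is shown here.

```sql
-- Every referenced column is in the Transactions index,
-- so the Extra column can show "Using index".
EXPLAIN
SELECT ae.target1Id, ae.type
FROM AuditEvent ae
WHERE ae.clientId = 4
  AND ae.eventTime BETWEEN '2011-09-01 03:00:00' AND '2012-09-30 23:57:00';

-- target1Name is not in any index, so the rows themselves must be read.
-- "Using index" is not expected here, even if an index is still used
-- to locate the rows.
EXPLAIN
SELECT ae.target1Id, ae.type, ae.target1Name
FROM AuditEvent ae
WHERE ae.clientId = 4
  AND ae.eventTime BETWEEN '2011-09-01 03:00:00' AND '2012-09-30 23:57:00';
```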

If your question is just how to add those two fields to your query (which you consider fast enough), then try the following:

SELECT a1.target1Id, a1.type, cnt, target1Name, target2Name
FROM (
  select ae.target1Id, ae.type, count(*) as cnt, MAX(auditEventId) as max_id
  from AuditEvent ae
  where ae.clientId=4
      and (ae.eventTime between '2011-09-01 03:00:00' and '2012-09-30 23:57:00')
  group by ae.target1Id, ae.type) as a1
JOIN AuditEvent a2 ON a1.max_id = a2.auditEventId
;
