SQLite3 query optimization: join vs. subselect

Question

Gra*_*yer 10 sql database sqlite query-optimization

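I am trying to figure out the best way (it probably doesn't matter in this case) to find the rows of one table based on a flag in that table and on a relational id stored in the rows of another table.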

Here are the schemas:

CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    dirty INTEGER NOT NULL);

CREATE TABLE resume_points (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    scan_file_id INTEGER NOT NULL);

I am using SQLite3.

The files table will be very large, typically 10K-5M rows. resume_points will have fewer than 10K rows, with only 1-2 distinct scan_file_id values.

So my first thought was:

select distinct files.* from resume_points inner join files
on resume_points.scan_file_id=files.id where files.dirty = 1;

A coworker suggested turning the join around:

select distinct files.* from files inner join resume_points
on files.id=resume_points.scan_file_id where files.dirty = 1;

Then I thought that, since we know the number of distinct scan_file_id values will be very small, a subselect might be optimal (in this rare instance):

select * from files where id in (select distinct scan_file_id from resume_points);

The explain output for these three queries has 42, 42, and 48 rows, respectively.
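
A note on reading those row counts: in SQLite, plain EXPLAIN lists the virtual machine instructions of the prepared statement, one per row, so the count reflects program length rather than run time. EXPLAIN QUERY PLAN shows the higher-level strategy instead. Below is a minimal sketch using Python's standard sqlite3 module against empty in-memory copies of the tables above. The helper name explain_summary is made up for illustration, and the exact counts and plan wording depend on the SQLite version.

```python
import sqlite3

SCHEMA = """
CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    dirty INTEGER NOT NULL);
CREATE TABLE resume_points (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    scan_file_id INTEGER NOT NULL);
"""

# The three queries from the question, unchanged.
QUERIES = [
    "select distinct files.* from resume_points inner join files "
    "on resume_points.scan_file_id=files.id where files.dirty = 1",
    "select distinct files.* from files inner join resume_points "
    "on files.id=resume_points.scan_file_id where files.dirty = 1",
    "select * from files where id in "
    "(select distinct scan_file_id from resume_points)",
]

def explain_summary(conn, sql):
    """Return (number of bytecode rows, list of query-plan detail strings)."""
    opcodes = conn.execute("EXPLAIN " + sql).fetchall()
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    return len(opcodes), [row[-1] for row in plan]

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
summaries = [explain_summary(conn, q) for q in QUERIES]
for sql, (n_ops, plan) in zip(QUERIES, summaries):
    print(n_ops, "instructions:", sql)
    for step in plan:
        print("    ", step)
```

A longer instruction listing does not by itself mean a slower query; the plan lines (which table is scanned, which is searched by key) say more about the cost.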

Answer 1

Joh*_*eng 12

TL;DR: The best query and index are:

create index uniqueFiles on resume_points (scan_file_id);
select * from (select distinct scan_file_id from resume_points) d join files on d.scan_file_id = files.id and files.dirty = 1;

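To see what the recommended query returns, here is a small sketch in Python's standard sqlite3 module. The sample rows are made up for illustration; only the schema, the index, and the query come from the post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    dirty INTEGER NOT NULL);
CREATE TABLE resume_points (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    scan_file_id INTEGER NOT NULL);
CREATE INDEX uniqueFiles ON resume_points (scan_file_id);
""")

# Made-up sample data: files 1, 3 and 4 are dirty, file 2 is clean.
conn.executemany("INSERT INTO files (id, dirty) VALUES (?, ?)",
                 [(1, 1), (2, 0), (3, 1), (4, 1)])
# Several resume points, but only two distinct scan_file_id values (1 and 2).
conn.executemany("INSERT INTO resume_points (scan_file_id) VALUES (?)",
                 [(1,), (2,), (1,), (2,), (1,)])

rows = conn.execute("""
    select * from (select distinct scan_file_id from resume_points) d
    join files on d.scan_file_id = files.id and files.dirty = 1
""").fetchall()
print(rows)  # [(1, 1, 1)]  columns: scan_file_id, id, dirty
```

Note that `select *` here also returns the derived table's scan_file_id column in front of the files columns; selecting `files.*` instead would give the same shape as the other queries.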

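Since I typically use SQL Server, at first I thought that the query optimizer would surely find the optimal execution plan for such a simple query, regardless of how you write these equivalent SQL statements. So I downloaded SQLite and started playing around. Much to my surprise, the differences in performance were huge.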

Here is the setup code:

CREATE TABLE files (
id INTEGER PRIMARY KEY autoincrement,
dirty INTEGER NOT NULL);

CREATE TABLE resume_points (
id INTEGER PRIMARY KEY  AUTOINCREMENT  NOT NULL ,
scan_file_id INTEGER NOT NULL );

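-- Seed one row, then double the table 23 times (2^23 = 8,388,608 rows).
-- random() returns a signed 64-bit integer, so about half the rows get dirty = 1.
insert into files (dirty) values (0);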
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;

insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000;

insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000;

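The setup relies on repeatedly inserting the table into itself, which doubles the row count on every pass. The following is a scaled-down sketch of the same idea in Python's standard sqlite3 module, so it finishes instantly. The function name build_test_db and its parameters are assumptions made for this example, not part of the original script.

```python
import sqlite3

def build_test_db(doublings=10, points_per_batch=100, id_range=1000):
    """Scaled-down version of the setup: 2**doublings rows in files."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE files (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        dirty INTEGER NOT NULL);
    CREATE TABLE resume_points (
        id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        scan_file_id INTEGER NOT NULL);
    INSERT INTO files (dirty) VALUES (0);
    """)
    # Each pass copies the table onto itself, doubling the row count.
    for _ in range(doublings):
        conn.execute(
            "INSERT INTO files (dirty) "
            "SELECT (CASE WHEN random() < 0 THEN 1 ELSE 0 END) FROM files")
    # Two batches of resume points, as in the original script.
    for _ in range(2):
        conn.execute(
            "INSERT INTO resume_points (scan_file_id) "
            "SELECT (SELECT abs(random() % ?)) FROM files LIMIT ?",
            (id_range, points_per_batch))
    conn.commit()
    return conn

conn = build_test_db()
n_files = conn.execute("SELECT count(*) FROM files").fetchone()[0]
n_points = conn.execute("SELECT count(*) FROM resume_points").fetchone()[0]
n_distinct = conn.execute(
    "SELECT count(DISTINCT scan_file_id) FROM resume_points").fetchone()[0]
print(n_files, n_points, n_distinct)
```

The last printed number is worth checking on your own SQLite build: the answer reports only two distinct scan_file_id values in its test case, which depends on how the scalar subquery around random() is evaluated.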

I considered the following indexes:

create index dirtyFiles on files (dirty, id);
create index uniqueFiles on resume_points (scan_file_id);
create index fileLookup on files (id);

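One way to check whether an index is actually picked up is to compare EXPLAIN QUERY PLAN output before and after creating it. The sketch below does this for uniqueFiles using Python's standard sqlite3 module on empty in-memory tables; the helper name plan_details is made up, and the plan wording varies between SQLite versions. As a side note, files.id is an INTEGER PRIMARY KEY and therefore an alias for the rowid, so lookups by id can already use the table's own b-tree without fileLookup.

```python
import sqlite3

SCHEMA = """
CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    dirty INTEGER NOT NULL);
CREATE TABLE resume_points (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    scan_file_id INTEGER NOT NULL);
"""

SQL = ("select * from (select distinct scan_file_id from resume_points) d "
       "join files on d.scan_file_id = files.id and files.dirty = 1")

def plan_details(conn, sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

before = plan_details(conn, SQL)
conn.execute("create index uniqueFiles on resume_points (scan_file_id)")
after = plan_details(conn, SQL)

print("without uniqueFiles:")
for step in before:
    print("    ", step)
print("with uniqueFiles:")
for step in after:
    print("    ", step)
```

If the index is used, its name appears in one of the plan lines after it has been created.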

Below are the queries I tried and their execution times on my i5 laptop. The database file is only about 200MB because it doesn't hold any other data.

select distinct files.* from resume_points inner join files on resume_points.scan_file_id=files.id where files.dirty = 1;
-- 4.3 - 4.5 ms with and without index

select distinct files.* from files inner join resume_points on files.id=resume_points.scan_file_id where files.dirty = 1;
-- 4.4 - 4.7 ms with and without index

select * from (select distinct scan_file_id from resume_points) d join files on d.scan_file_id = files.id and files.dirty = 1;
-- 2.0 - 2.5 ms with uniqueFiles
-- 2.6 - 2.9 ms without uniqueFiles

select * from files where id in (select distinct scan_file_id from resume_points) and dirty = 1;
-- 2.1 - 2.5 ms with uniqueFiles
-- 2.6 - 3 ms without uniqueFiles

SELECT f.* FROM resume_points rp INNER JOIN files f on rp.scan_file_id = f.id
WHERE f.dirty = 1 GROUP BY f.id;
-- 4500 - 6190 ms with uniqueFiles
-- 8.8 - 9.5 ms without uniqueFiles
-- 14000 ms with uniqueFiles and fileLookup

select * from files where exists (
select * from resume_points where files.id = resume_points.scan_file_id) and dirty = 1;
-- 8400 ms with uniqueFiles
-- 7400 ms without uniqueFiles

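A rough way to reproduce this kind of comparison is a small harness that runs each query, times it, and confirms that all variants return the same files. The sketch below uses Python's standard sqlite3 module with a tiny, made-up data set, so its timings are not comparable to the numbers above; the labels in QUERIES and the helper name timed are assumptions for this example.

```python
import sqlite3
import time

QUERIES = {
    "join, resume_points first":
        "select distinct files.* from resume_points inner join files "
        "on resume_points.scan_file_id=files.id where files.dirty = 1",
    "join, files first":
        "select distinct files.* from files inner join resume_points "
        "on files.id=resume_points.scan_file_id where files.dirty = 1",
    "derived table":
        "select * from (select distinct scan_file_id from resume_points) d "
        "join files on d.scan_file_id = files.id and files.dirty = 1",
    "in subselect":
        "select * from files where id in "
        "(select distinct scan_file_id from resume_points) and dirty = 1",
    "group by":
        "SELECT f.* FROM resume_points rp INNER JOIN files f "
        "on rp.scan_file_id = f.id WHERE f.dirty = 1 GROUP BY f.id",
    "exists":
        "select * from files where exists (select * from resume_points "
        "where files.id = resume_points.scan_file_id) and dirty = 1",
}

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    dirty INTEGER NOT NULL);
CREATE TABLE resume_points (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    scan_file_id INTEGER NOT NULL);
""")
# Made-up data: odd file ids are dirty; resume points reference files 7 and 10.
conn.executemany("INSERT INTO files (id, dirty) VALUES (?, ?)",
                 [(i, i % 2) for i in range(1, 2001)])
conn.executemany("INSERT INTO resume_points (scan_file_id) VALUES (?)",
                 [(7 if i % 2 == 0 else 10,) for i in range(500)])
conn.commit()

def timed(conn, sql):
    """Run sql once; return (elapsed milliseconds, sorted (id, dirty) rows)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    # Every query ends with the files columns (id, dirty); the derived-table
    # query also returns d.scan_file_id in front, so keep the last two only.
    return elapsed_ms, sorted(tuple(r[-2:]) for r in rows)

results = {label: timed(conn, sql) for label, sql in QUERIES.items()}
for label, (ms, rows) in results.items():
    print(f"{label:28s} {ms:8.3f} ms  {rows}")
```

For meaningful numbers, the same harness would need the full-size tables from the setup script and several repetitions per query.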

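It looks like SQLite's query optimizer isn't very advanced. The best queries first reduce resume_points to a small number of rows (two in the test case; the OP said it will be 1-2), and then look up each file to see whether it is dirty. The dirtyFiles index didn't make much of a difference for any of the queries. I think this may be because of the way the data is arranged in the test tables, and it may make a difference with production tables. However, the difference is not large, because there will be no more than a handful of lookups. uniqueFiles does make a difference, since it can reduce the 10000 rows of resume_points to 2 rows without scanning most of them. fileLookup made some queries slightly faster, but not enough to change the results significantly. Notably, it made the group by query very slow. In conclusion, reduce the result set as early as possible to make the biggest difference.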
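
The point about reducing the result set early can be illustrated by counting rows. In the sketch below (Python's standard sqlite3 module, made-up data), the plain join produces one row per resume point before DISTINCT would collapse them, while the derived table hands only the distinct ids to the join. These are logical row counts, not a trace of what SQLite does internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    dirty INTEGER NOT NULL);
CREATE TABLE resume_points (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    scan_file_id INTEGER NOT NULL);
CREATE INDEX uniqueFiles ON resume_points (scan_file_id);
""")
# Made-up data: 100 dirty files, 10000 resume points that reference
# only two distinct files (ids 1 and 2).
conn.executemany("INSERT INTO files (id, dirty) VALUES (?, 1)",
                 [(i,) for i in range(1, 101)])
conn.executemany("INSERT INTO resume_points (scan_file_id) VALUES (?)",
                 [(1 + i % 2,) for i in range(10000)])
conn.commit()

# Rows produced by the plain join, before any DISTINCT is applied.
joined = conn.execute("""
    select count(*) from resume_points
    join files on resume_points.scan_file_id = files.id
    where files.dirty = 1
""").fetchone()[0]

# Rows produced when duplicates are removed before the join.
reduced = conn.execute("""
    select count(*) from (select distinct scan_file_id from resume_points) d
    join files on d.scan_file_id = files.id and files.dirty = 1
""").fetchone()[0]

print(joined, reduced)  # 10000 2
```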

Asked: 12 years, 2 months ago
Viewed: 3757 times
Last active: 12 years, 2 months ago