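I'm trying to work out the best way (it may not matter in this case) to find the rows of one table based on a flag in that table and on a related ID stored in the rows of another table.
Here is the schema: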
CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    dirty INTEGER NOT NULL
);
CREATE TABLE resume_points (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    scan_file_id INTEGER NOT NULL
);
I'm using sqlite3.
The files table will be large, typically 10K-5M rows.
resume_points will have fewer than 10K rows, with only 1-2 distinct scan_file_id values.
So my first thought was:
select distinct files.* from resume_points inner join files on resume_points.scan_file_id=files.id where files.dirty = 1;
A colleague suggested turning the join around:
select distinct files.* from files inner join resume_points on files.id=resume_points.scan_file_id where files.dirty = 1;
Then I thought that, since we know the number of distinct scan_file_id values will be small, a subselect might be optimal (in this rare instance):
select * from files where id in (select distinct scan_file_id from resume_points);
The EXPLAIN output for these three queries has 42, 42, and 48 rows respectively.
TL;DR: the best query and index are:
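For reference, row counts like these can be obtained by counting the rows that EXPLAIN returns, one per virtual-machine instruction. Below is a minimal sketch using Python's built-in sqlite3 module; the choice of client is an assumption, since the post does not say how the numbers were collected. The counts depend on the SQLite version, so they will not necessarily match the figures quoted above.

```python
import sqlite3

# In-memory database with the schema from the question. No data is needed:
# EXPLAIN only compiles the statement and lists its bytecode.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (id INTEGER PRIMARY KEY, dirty INTEGER NOT NULL);
    CREATE TABLE resume_points (
        id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        scan_file_id INTEGER NOT NULL
    );
""")

# The three candidate queries, copied from the question.
queries = {
    "join from resume_points":
        "select distinct files.* from resume_points inner join files "
        "on resume_points.scan_file_id=files.id where files.dirty = 1",
    "join from files":
        "select distinct files.* from files inner join resume_points "
        "on files.id=resume_points.scan_file_id where files.dirty = 1",
    "subselect":
        "select * from files where id in "
        "(select distinct scan_file_id from resume_points)",
}


def explain_row_count(sql):
    """Return the number of bytecode rows EXPLAIN reports for a statement."""
    return len(conn.execute("EXPLAIN " + sql).fetchall())


counts = {name: explain_row_count(sql) for name, sql in queries.items()}
for name, n in counts.items():
    print(f"{name}: {n} rows")
```

A smaller EXPLAIN listing only means a shorter program, not necessarily a faster one, so these counts are a weak proxy for actual run time.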
create index uniqueFiles on resume_points (scan_file_id);
select * from (select distinct scan_file_id from resume_points) d join files on d.scan_file_id = files.id and files.dirty = 1;
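To see what this query returns, here is a small sketch on toy data using Python's sqlite3 module. The six files and ten resume points are made up for illustration, and the select list is narrowed to the files columns so the output is easier to read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (id INTEGER PRIMARY KEY, dirty INTEGER NOT NULL);
    CREATE TABLE resume_points (
        id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        scan_file_id INTEGER NOT NULL
    );
    CREATE INDEX uniqueFiles ON resume_points (scan_file_id);
""")

# Toy data (hypothetical): files 1..6, odd ids are dirty.
conn.executemany(
    "INSERT INTO files (id, dirty) VALUES (?, ?)",
    [(i, i % 2) for i in range(1, 7)],
)
# Ten resume points but only two distinct scan_file_id values:
# 3 (a dirty file) and 4 (a clean file).
conn.executemany(
    "INSERT INTO resume_points (scan_file_id) VALUES (?)",
    [(3,)] * 5 + [(4,)] * 5,
)

# Collapse resume_points to its distinct ids first, then look each one up
# in files by primary key and keep only the dirty ones.
rows = conn.execute("""
    SELECT files.id, files.dirty
    FROM (SELECT DISTINCT scan_file_id FROM resume_points) d
    JOIN files ON d.scan_file_id = files.id AND files.dirty = 1
""").fetchall()

print(rows)  # [(3, 1)]
```

Only file 3 comes back: file 4 is referenced by resume points but is not dirty, and the duplicates in resume_points are removed before the join.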
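Since I normally use SQL Server, at first I assumed the query optimizer would surely find the optimal execution plan for such a simple query, no matter which of these equivalent SQL statements you write. So I downloaded SQLite and started playing around. To my surprise, there were large differences in performance.
Here is the setup code: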
CREATE TABLE files ( id INTEGER PRIMARY KEY autoincrement,dirty INTEGER NOT NULL);
CREATE TABLE resume_points ( id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,scan_file_id INTEGER NOT NULL );
-- Seed one row, then double the table 23 times (2^23 = 8,388,608 rows).
-- random() returns a signed 64-bit integer, so about half the rows get dirty = 1.
insert into files (dirty) values (0);
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
-- Two batches of 5000 resume points each (10000 rows in total).
insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000;
insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000;
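The repeated inserts above double the files table each time. For quick experiments the same idea can be written as a loop. This is a scaled-down sketch using Python's sqlite3 module; the DOUBLINGS constant and the in-memory database are assumptions chosen so the example runs instantly, not part of the original setup.

```python
import sqlite3

# Assumption: 10 doublings (1,024 rows) instead of the full-size test table.
DOUBLINGS = 10

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (id INTEGER PRIMARY KEY AUTOINCREMENT, dirty INTEGER NOT NULL);
    CREATE TABLE resume_points (
        id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        scan_file_id INTEGER NOT NULL
    );
""")

# Seed a single clean row, then repeatedly insert one new row per existing
# row, which doubles the table on every pass.
conn.execute("INSERT INTO files (dirty) VALUES (0)")
for _ in range(DOUBLINGS):
    conn.execute(
        "INSERT INTO files (dirty) "
        "SELECT (CASE WHEN random() < 0 THEN 1 ELSE 0 END) FROM files"
    )

file_count = conn.execute("SELECT count(*) FROM files").fetchone()[0]
dirty_count = conn.execute(
    "SELECT count(*) FROM files WHERE dirty = 1"
).fetchone()[0]
print(f"{file_count} files, {dirty_count} dirty")
```

Because random() yields a signed 64-bit integer, roughly half of the generated rows end up dirty; the exact number varies from run to run.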
I considered the following indexes:
create index dirtyFiles on files (dirty,id);
create index uniqueFiles on resume_points (scan_file_id);
create index fileLookup on files (id);
Below are the queries I tried and their execution times on my i5 laptop. The database file is only about 200MB, because it contains no other data.
select distinct files.* from resume_points inner join files on resume_points.scan_file_id=files.id where files.dirty = 1;
    4.3 - 4.5 ms with and without index
select distinct files.* from files inner join resume_points on files.id=resume_points.scan_file_id where files.dirty = 1;
    4.4 - 4.7 ms with and without index
select * from (select distinct scan_file_id from resume_points) d join files on d.scan_file_id = files.id and files.dirty = 1;
    2.0 - 2.5 ms with uniqueFiles
    2.6 - 2.9 ms without uniqueFiles
select * from files where id in (select distinct scan_file_id from resume_points) and dirty = 1;
    2.1 - 2.5 ms with uniqueFiles
    2.6 - 3 ms without uniqueFiles
SELECT f.* FROM resume_points rp INNER JOIN files f on rp.scan_file_id = f.id WHERE f.dirty = 1 GROUP BY f.id
    4500 - 6190 ms with uniqueFiles
    8.8 - 9.5 ms without uniqueFiles
    14000 ms with uniqueFiles and fileLookup
select * from files where exists ( select * from resume_points where files.id = resume_points.scan_file_id) and dirty = 1;
    8400 ms with uniqueFiles
    7400 ms without uniqueFiles
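A comparison like this can be reproduced with a small timing harness. The sketch below uses Python's sqlite3 module and time.perf_counter on a toy data set (20,000 files, 1,000 resume points, both made up for illustration), so the absolute numbers it prints are not comparable to the ones above. It also checks that the different formulations return the same rows; the derived-table query selects files.* instead of * so the results line up.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (id INTEGER PRIMARY KEY, dirty INTEGER NOT NULL);
    CREATE TABLE resume_points (
        id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        scan_file_id INTEGER NOT NULL
    );
    CREATE INDEX uniqueFiles ON resume_points (scan_file_id);
""")

# Toy data (hypothetical): odd file ids are dirty; resume points reference
# only two files, 101 (dirty) and 202 (clean).
conn.executemany(
    "INSERT INTO files (id, dirty) VALUES (?, ?)",
    ((i, i % 2) for i in range(1, 20001)),
)
conn.executemany(
    "INSERT INTO resume_points (scan_file_id) VALUES (?)",
    [(101,)] * 500 + [(202,)] * 500,
)

queries = {
    "distinct join":
        "select distinct files.* from resume_points inner join files "
        "on resume_points.scan_file_id=files.id where files.dirty = 1",
    "derived table":
        "select files.* from (select distinct scan_file_id from resume_points) d "
        "join files on d.scan_file_id = files.id and files.dirty = 1",
    "in subquery":
        "select * from files where id in "
        "(select distinct scan_file_id from resume_points) and dirty = 1",
    "exists":
        "select * from files where exists (select * from resume_points "
        "where files.id = resume_points.scan_file_id) and dirty = 1",
}

results = {}
for name, sql in queries.items():
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    elapsed_ms = (time.perf_counter() - start) * 1000
    results[name] = sorted(rows)
    print(f"{name}: {elapsed_ms:.2f} ms, {len(rows)} row(s)")
```

A single run per query is noisy; for anything beyond a rough comparison, repeat each query several times and report the range, as the timings above do.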
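It looks like SQLite's query optimizer is not very advanced at all. The best queries first reduce resume_points to a small number of rows (two in the test case; the OP said it would be 1-2), and then look up each file to see whether it is dirty.
The dirtyFiles index did not make much of a difference for any of the queries. I think that may be because of the way the data is arranged in the test tables. It might matter on production tables, but the difference would not be large, since only a handful of lookups are involved. uniqueFiles does make a difference, because it lets SQLite reduce the 10000 rows of resume_points to 2 rows without scanning most of them. fileLookup made some queries slightly faster, but not enough to change the results significantly. Notably, it made the GROUP BY query very slow.
In conclusion, reducing the result set as early as possible makes the biggest difference.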
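One way to check how an index changes the plan is EXPLAIN QUERY PLAN. Here is a rough sketch using Python's sqlite3 module that prints the plan for the derived-table query before and after creating uniqueFiles. The tables are empty and the exact plan text varies between SQLite versions, so treat the output as something to inspect rather than a fixed result.

```python
import sqlite3

BEST_QUERY = (
    "select * from (select distinct scan_file_id from resume_points) d "
    "join files on d.scan_file_id = files.id and files.dirty = 1"
)


def query_plan(conn):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + BEST_QUERY)]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (id INTEGER PRIMARY KEY, dirty INTEGER NOT NULL);
    CREATE TABLE resume_points (
        id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        scan_file_id INTEGER NOT NULL
    );
""")

# Plan with no secondary indexes at all.
plan_without = query_plan(conn)

# Plan after adding the index on resume_points (scan_file_id).
conn.execute("CREATE INDEX uniqueFiles ON resume_points (scan_file_id)")
plan_with = query_plan(conn)

print("without uniqueFiles:")
for step in plan_without:
    print("   ", step)
print("with uniqueFiles:")
for step in plan_with:
    print("   ", step)
```

On a populated database, running ANALYZE first gives the planner table statistics, which can change which plan it picks.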