Nie*_*ein 32 python sqlite performance fetchall
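I have been trying to speed up a query for about a week now and have asked several questions about it here (How can I speed up fetching the results after running an sqlite query?, Is it normal that sqlite.fetchall() is so slow?, How to use min() and max() in an efficient way?).
With the very useful help from the answers given there, I managed to get the time down to 100.95 seconds for the sqlite query and 1485.43 seconds for the fetchall. This was still not enough, so after trying out some different indexes I got the query time down to 0.08 seconds for one sample and the fetchall time down to 54.97 seconds. So I thought I had finally managed to speed things up enough.
Then the query ran for the next sample, taking 0.58 seconds, and the fetchall took 3952.80 seconds. For the third sample the query took 1.01 seconds and the fetchall took 1970.67 seconds.
The first sample fetched 12951 rows, the second sample 24972 rows and the third 6470 rows. I am very curious why the first sample is so much faster at fetching its rows, when it only had about half as many rows to fetch as the second sample.
Code (spectrumFeature_InputValues is (1,), (2,) and (3,), from the 3 samples used.):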
self.cursor.execute('begin')
self.cursor.execute("EXPLAIN QUERY PLAN "+
"SELECT precursor_id, feature_table_id "+
"FROM `MSMS_precursor` "+
"INNER JOIN `spectrum` ON spectrum.spectrum_id = MSMS_precursor.spectrum_spectrum_id "+
"INNER JOIN `feature` ON feature.msrun_msrun_id = spectrum.msrun_msrun_id "+
"WHERE spectrum.scan_start_time BETWEEN feature.rtMin AND feature.rtMax "+
"AND MSMS_precursor.ion_mz BETWEEN feature.mzMin AND feature.mzMax "+
"AND feature.msrun_msrun_id = ?", spectrumFeature_InputValues)
print 'EXPLAIN QUERY PLAN: '
print self.cursor.fetchall()
import time
time0 = time.time()
self.cursor.execute("SELECT precursor_id, feature_table_id "+
"FROM `MSMS_precursor` "+
"INNER JOIN `spectrum` ON spectrum.spectrum_id = MSMS_precursor.spectrum_spectrum_id "+
"INNER JOIN `feature` ON feature.msrun_msrun_id = spectrum.msrun_msrun_id "+
"WHERE spectrum.scan_start_time BETWEEN feature.rtMin AND feature.rtMax "+
"AND MSMS_precursor.ion_mz BETWEEN feature.mzMin AND feature.mzMax "+
"AND feature.msrun_msrun_id = ?", spectrumFeature_InputValues)
print 'query took:',time.time()-time0,'seconds'
time0 = time.time()
precursorFeatureIds = self.cursor.fetchall()
print 'it fetched:',len(precursorFeatureIds),'rows'
print 'fetchall took',time.time()-time0,'seconds'
time0 = time.time()
for precursorAndFeatureID in precursorFeatureIds:
    feature_has_MSMS_precursor_inputValues = (precursorAndFeatureID[0], precursorAndFeatureID[1])
    self.cursor.execute("INSERT INTO `feature_has_MSMS_precursor` VALUES(?,?)", feature_has_MSMS_precursor_inputValues)
print 'inserting took',time.time()-time0,'seconds'
self.connection.commit()
Results:
EXPLAIN QUERY PLAN:
[(0, 0, 2, u'SCAN TABLE feature (~100000 rows)'), (0, 1, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 2, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.0754859447479 seconds
it fetched: 12951 rows
fetchall took 54.2855291367 seconds
inserting took 0.602859973907 seconds
It took 54.9704811573 seconds
EXPLAIN QUERY PLAN:
[(0, 0, 2, u'SCAN TABLE feature (~100000 rows)'), (0, 1, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 2, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.579694032669 seconds
it fetched: 24972 rows
fetchall took 3950.08093309 seconds
inserting took 2.11575508118 seconds
It took 3952.80745602 seconds
EXPLAIN QUERY PLAN:
[(0, 0, 2, u'SCAN TABLE feature (~100000 rows)'), (0, 1, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 2, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 1.01185703278 seconds
it fetched: 6470 rows
fetchall took 1970.622962 seconds
inserting took 0.673867940903 seconds
It took 1972.31343699 seconds
SQLite create statements:
-- -----------------------------------------------------
-- Table `feature`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `feature` (
`feature_table_id` INT PRIMARY KEY NOT NULL ,
`feature_id` VARCHAR(40) NOT NULL ,
`intensity` DOUBLE NOT NULL ,
`overallquality` DOUBLE NOT NULL ,
`charge` INT NOT NULL ,
`content` VARCHAR(45) NOT NULL ,
`intensity_cutoff` DOUBLE NOT NULL,
`mzMin` DOUBLE NULL ,
`mzMax` DOUBLE NULL ,
`rtMin` DOUBLE NULL ,
`rtMax` DOUBLE NULL ,
`msrun_msrun_id` INT NOT NULL ,
CONSTRAINT `fk_feature_msrun1`
FOREIGN KEY (`msrun_msrun_id` )
REFERENCES `msrun` (`msrun_id` )
ON DELETE NO ACTION
ON UPDATE NO ACTION);
CREATE INDEX `fk_mzMin_feature` ON `feature` (`mzMin` ASC);
CREATE INDEX `fk_mzMax_feature` ON `feature` (`mzMax` ASC);
CREATE INDEX `fk_rtMin_feature` ON `feature` (`rtMin` ASC);
CREATE INDEX `fk_rtMax_feature` ON `feature` (`rtMax` ASC);
DROP TABLE IF EXISTS `spectrum`;
-- -----------------------------------------------------
-- Table `spectrum`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `spectrum` (
`spectrum_id` INT PRIMARY KEY NOT NULL ,
`spectrum_index` INT NOT NULL ,
`ms_level` INT NOT NULL ,
`base_peak_mz` DOUBLE NOT NULL ,
`base_peak_intensity` DOUBLE NOT NULL ,
`total_ion_current` DOUBLE NOT NULL ,
`lowest_observes_mz` DOUBLE NOT NULL ,
`highest_observed_mz` DOUBLE NOT NULL ,
`scan_start_time` DOUBLE NOT NULL ,
`ion_injection_time` DOUBLE,
`binary_data_mz` BLOB NOT NULL,
`binary_data_rt` BLOB NOT NULL,
`msrun_msrun_id` INT NOT NULL ,
CONSTRAINT `fk_spectrum_msrun1`
FOREIGN KEY (`msrun_msrun_id` )
REFERENCES `msrun` (`msrun_id` )
ON DELETE NO ACTION
ON UPDATE NO ACTION);
CREATE INDEX `fk_spectrum_spectrum_id_1` ON `spectrum` (`spectrum_id` ASC);
CREATE INDEX `fk_spectrum_scahn_start_time_1` ON `spectrum` (`scan_start_time` ASC);
DROP TABLE IF EXISTS `feature_has_MSMS_precursor`;
-- -----------------------------------------------------
-- Table `spectrum_has_feature`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `feature_has_MSMS_precursor` (
`MSMS_precursor_precursor_id` INT NOT NULL ,
`feature_feature_table_id` INT NOT NULL ,
CONSTRAINT `fk_spectrum_has_feature_spectrum1`
FOREIGN KEY (`MSMS_precursor_precursor_id` )
REFERENCES `MSMS_precursor` (`precursor_id` )
ON DELETE NO ACTION
ON UPDATE NO ACTION,
CONSTRAINT `fk_spectrum_has_feature_feature1`
FOREIGN KEY (`feature_feature_table_id` )
REFERENCES `feature` (`feature_table_id` )
ON DELETE NO ACTION
ON UPDATE NO ACTION);
CREATE INDEX `fk_feature_has_MSMS_precursor_feature1` ON `feature_has_MSMS_precursor` (`feature_feature_table_id` ASC);
CREATE INDEX `fk_feature_has_MSMS_precursor_precursor1` ON `feature_has_MSMS_precursor` (`MSMS_precursor_precursor_id` ASC);
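As you can see, I have created indexes on the mz and rt values in both spectrum and feature, because I figured that most of the time is spent comparing those numbers.
So why is the first sample so much faster than the second and third? How does the query time relate to the fetchall time? And most importantly, is there a way I can speed this up?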
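On how the query time relates to the fetchall time: with Python's sqlite3 module, execute() on a SELECT does very little of the work, and the rows are computed as they are fetched. The following is a minimal sketch under stated assumptions (Python 3, an in-memory database; the table `t` and the function `probe` are made up for illustration and are not part of the schema above). It counts how many rows SQLite has examined after each call:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
calls = {"n": 0}

def probe(value):
    # Hypothetical helper: counts how often SQLite evaluates the WHERE clause.
    calls["n"] += 1
    return 1

conn.create_function("probe", 1, probe)
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])

cur = conn.cursor()
cur.execute("SELECT x FROM t WHERE probe(x)")
calls_after_execute = calls["n"]    # little or nothing has been examined yet
rows = cur.fetchall()
calls_after_fetchall = calls["n"]   # the remaining rows are examined here

print("rows examined after execute(): ", calls_after_execute)
print("rows examined after fetchall():", calls_after_fetchall)
print("rows returned:", len(rows))
```

So a small "query took" figure next to a large "fetchall took" figure does not mean that copying rows into Python is slow; the join itself is being evaluated during the fetch.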
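After talking with a colleague, it is probably because comparing a point to a 2D range (rtMin, rtMax, mzMin, mzMax) takes n^2 time. This roughly corresponds to the second fetchall taking a bit more than 60^2 seconds (60 seconds being the approximate time the first fetchall took) while retrieving a little less than twice the number of rows. This doesn't answer any of my questions though.
I tried using an R*Tree as suggested in the comments. I made a new table: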
CREATE VIRTUAL TABLE convexhull_edges USING rtree(
feature_feature_table_id,
rtMin, rtMax,
mzMin, mzMax
);
And changed my query to:
self.cursor.execute("SELECT precursor_id, feature_table_id "+
"FROM `MSMS_precursor` "+
"INNER JOIN `spectrum` ON spectrum.spectrum_id = MSMS_precursor.spectrum_spectrum_id "+
"INNER JOIN `feature` ON feature.msrun_msrun_id = spectrum.msrun_msrun_id "+
"INNER JOIN `convexhull_edges` ON convexhull_edges.feature_feature_table_id = feature.feature_table_id "+
"WHERE spectrum.scan_start_time BETWEEN convexhull_edges.rtMin AND convexhull_edges.rtMax "+
"AND MSMS_precursor.ion_mz BETWEEN convexhull_edges.mzMin AND convexhull_edges.mzMax "+
"AND feature.msrun_msrun_id = ?", spectrumFeature_InputValues)
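For comparison, an R*Tree is normally searched by putting the range constraints directly on the virtual table's min/max columns. The following is a minimal sketch, assuming Python 3 and an SQLite build that includes the R*Tree module; the three rectangles and the helper `features_containing` are made up for illustration. It looks up which feature rectangles contain a single (rt, mz) point:

```python
import sqlite3

def features_containing(rt, mz):
    """Return ids of rectangles containing the point, or None without R*Tree."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE VIRTUAL TABLE convexhull_edges USING rtree("
            "feature_feature_table_id, rtMin, rtMax, mzMin, mzMax)")
    except sqlite3.OperationalError:
        return None  # assumption: this SQLite build lacks the R*Tree module
    # Illustrative data only: (id, rtMin, rtMax, mzMin, mzMax)
    conn.executemany(
        "INSERT INTO convexhull_edges VALUES (?,?,?,?,?)",
        [(1, 10.0, 20.0, 400.0, 401.0),
         (2, 15.0, 30.0, 400.5, 402.0),
         (3, 50.0, 60.0, 700.0, 701.0)])
    rows = conn.execute(
        "SELECT feature_feature_table_id FROM convexhull_edges "
        "WHERE rtMin <= ? AND rtMax >= ? AND mzMin <= ? AND mzMax >= ? "
        "ORDER BY feature_feature_table_id",
        (rt, rt, mz, mz)).fetchall()
    return [row[0] for row in rows]

result = features_containing(16.0, 400.75)
print(result)
```

The point (16.0, 400.75) falls inside rectangles 1 and 2 but not 3. Note that the R*Tree stores coordinates as 32-bit floats, so boundary values may be rounded slightly outward.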
This gave the following results:
EXPLAIN QUERY PLAN:
[(0, 0, 3, u'SCAN TABLE convexhull_edges VIRTUAL TABLE INDEX 2: (~0 rows)'), (0, 1, 2, u'SEARCH TABLE feature USING INDEX sqlite_autoindex_feature_1 (feature_table_id=?) (~1 rows)'), (0, 2, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 3, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.0572800636292 seconds
it fetched: 13140 rows
fetchall took 34.4445540905 seconds
EXPLAIN QUERY PLAN:
[(0, 0, 3, u'SCAN TABLE convexhull_edges VIRTUAL TABLE INDEX 2: (~0 rows)'), (0, 1, 2, u'SEARCH TABLE feature USING INDEX sqlite_autoindex_feature_1 (feature_table_id=?) (~1 rows)'), (0, 2, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 3, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.819370031357 seconds
it fetched: 25402 rows
fetchall took 3625.72873998 seconds
EXPLAIN QUERY PLAN:
[(0, 0, 3, u'SCAN TABLE convexhull_edges VIRTUAL TABLE INDEX 2: (~0 rows)'), (0, 1, 2, u'SEARCH TABLE feature USING INDEX sqlite_autoindex_feature_1 (feature_table_id=?) (~1 rows)'), (0, 2, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 3, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.878498077393 seconds
it fetched: 6761 rows
fetchall took 1419.34246588 seconds
inserting took 0.340960025787 seconds
It took 1420.56637716 seconds
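So this is a bit faster than my previous approach, but still not fast enough. Next I'm going to try web_bod's solution.
Using web_bod's solution I got the following times: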
EXPLAIN QUERY PLAN:
[(0, 0, 2, u'SCAN TABLE feature (~100000 rows)'), (0, 1, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 2, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.0521960258484 seconds
it fetched: 13052 rows
fetchall took 90.5810132027 seconds
EXPLAIN QUERY PLAN:
[(0, 0, 2, u'SCAN TABLE feature (~100000 rows)'), (0, 1, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 2, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.278959989548 seconds
it fetched: 25195 rows
fetchall took 4310.6012361 seconds
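The third one sadly didn't finish because of a reboot. So this is a bit faster than my first solution, but slower than using the R*Tree.
While working on a different, very slow query I saw that it was going into an uninterruptible sleep (see this question). So I checked top while running this query, and the process switches between the R and D states, with CPU usage dropping from 100% to 50%. This could be why it runs so slowly with all of the proposed solutions.
I migrated to MySQL, but I'm getting the same results.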
The execution time is geometrically proportional to the number of rows in each table rather than arithmetically, e.g.
3 tables with 10 rows each => 1,000 comparisons
3 tables with 10, 10 and 40 rows => 4,000 comparisons
3 tables with 20 rows each => 8,000 comparisons
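The figures above are simply the product of the row counts. As a small sketch (Python 3.8+; the function name `worst_case_comparisons` is made up for illustration), assuming a join that has to compare every combination of rows:

```python
from math import prod  # available from Python 3.8

def worst_case_comparisons(row_counts):
    # With no usable index, every row of each table may be paired
    # with every row of the others, so the counts multiply.
    return prod(row_counts)

examples = [(10, 10, 10), (10, 10, 40), (20, 20, 20)]
results = [worst_case_comparisons(counts) for counts in examples]

for counts, total in zip(examples, results):
    print(counts, "=>", total, "comparisons")
```

Doubling every table therefore multiplies the work by eight, which is why a sample with roughly twice the rows can take far more than twice as long.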
You could maybe refactor your query to avoid some of the joins/cursors - when do you need the answer?
Could you do something like this:
SELECT precursor_id, feature_table_id
FROM MSMS_precursor
INNER JOIN
(
SELECT mzMin, mzMax, rtMin, rtMax, spectrum_id, feature_table_id, msrun_msrun_id
FROM spectrum
INNER JOIN
(select feature_table_id, mzMin, mzMax, rtMin, rtMax, msrun_msrun_id
from feature
where feature.msrun_msrun_id = 'value'
) subquery
ON subquery.msrun_msrun_id = spectrum.msrun_msrun_id
WHERE
spectrum.scan_start_time BETWEEN subquery.rtMin AND subquery.rtMax
) subquery
ON subquery.spectrum_id = MSMS_precursor.spectrum_spectrum_id
WHERE
MSMS_precursor.ion_mz BETWEEN subquery.mzMin AND subquery.mzMax
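Using the subqueries enables you to reduce the number of comparisons between the tables - you can quickly filter out the unwanted features, then the unrelated spectra, before searching for suitable precursors.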
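As a sanity check on the rewrite, here is a minimal sketch (Python 3, in-memory SQLite; the tables are cut down to the columns the query uses and the handful of rows is made up) showing that the nested form returns the same rows as the original three-way join. It says nothing about speed: whether the nested form is actually faster depends on the query planner, which may flatten subqueries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (feature_table_id INT PRIMARY KEY, mzMin DOUBLE, mzMax DOUBLE,
                      rtMin DOUBLE, rtMax DOUBLE, msrun_msrun_id INT);
CREATE TABLE spectrum (spectrum_id INT PRIMARY KEY, scan_start_time DOUBLE,
                       msrun_msrun_id INT);
CREATE TABLE MSMS_precursor (precursor_id INT PRIMARY KEY, ion_mz DOUBLE,
                             spectrum_spectrum_id INT);
-- Illustrative rows only
INSERT INTO feature VALUES (1, 400.0, 401.0, 10.0, 20.0, 1);
INSERT INTO feature VALUES (2, 500.0, 501.0, 10.0, 20.0, 1);
INSERT INTO feature VALUES (3, 400.0, 401.0, 10.0, 20.0, 2);
INSERT INTO spectrum VALUES (100, 15.0, 1);
INSERT INTO spectrum VALUES (101, 25.0, 1);
INSERT INTO spectrum VALUES (102, 15.0, 2);
INSERT INTO MSMS_precursor VALUES (1000, 400.5, 100);
INSERT INTO MSMS_precursor VALUES (1001, 500.5, 101);
INSERT INTO MSMS_precursor VALUES (1002, 400.5, 102);
""")

JOIN_SQL = """
SELECT precursor_id, feature_table_id
FROM MSMS_precursor
INNER JOIN spectrum ON spectrum.spectrum_id = MSMS_precursor.spectrum_spectrum_id
INNER JOIN feature ON feature.msrun_msrun_id = spectrum.msrun_msrun_id
WHERE spectrum.scan_start_time BETWEEN feature.rtMin AND feature.rtMax
  AND MSMS_precursor.ion_mz BETWEEN feature.mzMin AND feature.mzMax
  AND feature.msrun_msrun_id = ?
"""

# Same shape as the nested query above, with distinct aliases (f, sf)
# and a bound parameter instead of the literal 'value'.
SUBQUERY_SQL = """
SELECT precursor_id, feature_table_id
FROM MSMS_precursor
INNER JOIN (
    SELECT mzMin, mzMax, spectrum_id, feature_table_id
    FROM spectrum
    INNER JOIN (
        SELECT feature_table_id, mzMin, mzMax, rtMin, rtMax, msrun_msrun_id
        FROM feature
        WHERE msrun_msrun_id = ?
    ) AS f ON f.msrun_msrun_id = spectrum.msrun_msrun_id
    WHERE spectrum.scan_start_time BETWEEN f.rtMin AND f.rtMax
) AS sf ON sf.spectrum_id = MSMS_precursor.spectrum_spectrum_id
WHERE MSMS_precursor.ion_mz BETWEEN sf.mzMin AND sf.mzMax
"""

join_rows = sorted(conn.execute(JOIN_SQL, (1,)).fetchall())
subquery_rows = sorted(conn.execute(SUBQUERY_SQL, (1,)).fetchall())
print(join_rows)
print(subquery_rows)
```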
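I don't use SQLite - but the principle should still apply.
UPDATED: fixed a bug in the SQL
Notes:
You don't have to worry about the ANDs, you'll only get:
Update 18/May:
It's the indexing!!! You have indexes on the search fields, but not on the fields participating in the joins - foreign key indexes really boost performance: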
CREATE INDEX `fk_msrun_msrun_id_feature` ON `feature` (`msrun_msrun_id` ASC);
CREATE INDEX `fk_spectrum_spectrum_id_feature` ON `feature` (`msrun_msrun_id` ASC);
CREATE INDEX `fk_spectrum_spectrum_id_MSMS_precursor` ON `MSMS_precursor` (`spectrum_spectrum_id` ASC);
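One way to check whether such an index is actually picked up is to compare EXPLAIN QUERY PLAN before and after creating it. A minimal sketch, assuming Python 3 and an in-memory database with a cut-down `feature` table (only the columns needed here; the helper `plan` is made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feature (feature_table_id INT PRIMARY KEY, "
             "rtMin DOUBLE, rtMax DOUBLE, msrun_msrun_id INT NOT NULL)")

def plan(sql, params=()):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(str(row[-1]) for row in rows)

query = ("SELECT feature_table_id, rtMin, rtMax "
         "FROM feature WHERE msrun_msrun_id = ?")

plan_before = plan(query, (1,))
conn.execute("CREATE INDEX fk_msrun_msrun_id_feature "
             "ON feature (msrun_msrun_id ASC)")
plan_after = plan(query, (1,))

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan differs between SQLite versions, but after the index is created the detail should name `fk_msrun_msrun_id_feature` instead of reporting a scan of the whole table.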