Nie*_*ein 4 python sql sqlite comparison performance
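I have asked two related questions ("How can I speed up fetching the results after running an sqlite query?" and "Is it normal that sqlite.fetchall() is so slow?"). I have changed some things around and got some speed-up, but the select statement still takes over an hour to finish.
I have a table feature which contains rtMin, rtMax, mzMin and mzMax values. Together these values are the corners of a rectangle (if you read my old questions: I now store these values separately instead of getting the min() and max() from the convexhull table, which works faster).
I also have a table spectrum with an rt and an mz value, and a table that links features to spectra whenever the rt and mz values of the spectrum fall inside the rectangle of the feature.
To do this I use the following SQL and Python code to retrieve the ids of the spectrum and the feature: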
self.cursor.execute("SELECT spectrum_id, feature_table_id "+
                    "FROM `spectrum` "+
                    "INNER JOIN `feature` "+
                    "ON feature.msrun_msrun_id = spectrum.msrun_msrun_id "+
                    "WHERE spectrum.scan_start_time >= feature.rtMin "+
                    "AND spectrum.scan_start_time <= feature.rtMax "+
                    "AND spectrum.base_peak_mz >= feature.mzMin "+
                    "AND spectrum.base_peak_mz <= feature.mzMax")
spectrumAndFeature_ids = self.cursor.fetchall()

for spectrumAndFeature_id in spectrumAndFeature_ids:
    spectrum_has_feature_inputValues = (spectrumAndFeature_id[0], spectrumAndFeature_id[1])
    self.cursor.execute("INSERT INTO `spectrum_has_feature` VALUES (?,?)", spectrum_has_feature_inputValues)
I timed the execute, the fetchall and the inserts, and got the following:
query took: 74.7989799976 seconds
5888.845541 seconds since fetchall
returned a length of: 10822
inserting all values took: 3.29669690132 seconds
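So this query takes more than an hour and a half, and most of that time is spent in fetchall(). How can I speed this up? Should I do the rt and mz comparison in Python code instead?
To show which indexes I have, here are the create statements for the tables: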
CREATE TABLE IF NOT EXISTS `feature` (
`feature_table_id` INT PRIMARY KEY NOT NULL ,
`feature_id` VARCHAR(40) NOT NULL ,
`intensity` DOUBLE NOT NULL ,
`overallquality` DOUBLE NOT NULL ,
`charge` INT NOT NULL ,
`content` VARCHAR(45) NOT NULL ,
`intensity_cutoff` DOUBLE NOT NULL,
`mzMin` DOUBLE NULL ,
`mzMax` DOUBLE NULL ,
`rtMin` DOUBLE NULL ,
`rtMax` DOUBLE NULL ,
`msrun_msrun_id` INT NOT NULL ,
CONSTRAINT `fk_feature_msrun1`
FOREIGN KEY (`msrun_msrun_id` )
REFERENCES `msrun` (`msrun_id` )
ON DELETE NO ACTION
ON UPDATE NO ACTION);
CREATE UNIQUE INDEX `id_UNIQUE` ON `feature` (`feature_table_id` ASC);
CREATE INDEX `fk_feature_msrun1` ON `feature` (`msrun_msrun_id` ASC);
CREATE TABLE IF NOT EXISTS `spectrum` (
`spectrum_id` INT PRIMARY KEY NOT NULL ,
`spectrum_index` INT NOT NULL ,
`ms_level` INT NOT NULL ,
`base_peak_mz` DOUBLE NOT NULL ,
`base_peak_intensity` DOUBLE NOT NULL ,
`total_ion_current` DOUBLE NOT NULL ,
`lowest_observes_mz` DOUBLE NOT NULL ,
`highest_observed_mz` DOUBLE NOT NULL ,
`scan_start_time` DOUBLE NOT NULL ,
`ion_injection_time` DOUBLE,
`binary_data_mz` BLOB NOT NULL,
`binaray_data_rt` BLOB NOT NULL,
`msrun_msrun_id` INT NOT NULL ,
CONSTRAINT `fk_spectrum_msrun1`
FOREIGN KEY (`msrun_msrun_id` )
REFERENCES `msrun` (`msrun_id` )
ON DELETE NO ACTION
ON UPDATE NO ACTION);
CREATE INDEX `fk_spectrum_msrun1` ON `spectrum` (`msrun_msrun_id` ASC);
CREATE TABLE IF NOT EXISTS `spectrum_has_feature` (
`spectrum_spectrum_id` INT NOT NULL ,
`feature_feature_table_id` INT NOT NULL ,
CONSTRAINT `fk_spectrum_has_feature_spectrum1`
FOREIGN KEY (`spectrum_spectrum_id` )
REFERENCES `spectrum` (`spectrum_id` )
ON DELETE NO ACTION
ON UPDATE NO ACTION,
CONSTRAINT `fk_spectrum_has_feature_feature1`
FOREIGN KEY (`feature_feature_table_id` )
REFERENCES `feature` (`feature_table_id` )
ON DELETE NO ACTION
ON UPDATE NO ACTION);
CREATE INDEX `fk_spectrum_has_feature_feature1` ON `spectrum_has_feature` (`feature_feature_table_id` ASC);
CREATE INDEX `fk_spectrum_has_feature_spectrum1` ON `spectrum_has_feature` (`spectrum_spectrum_id` ASC);
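I have 20938 spectra, 305742 features and 2 msruns. The query results in 10822 matches.
Update: using a new index (CREATE INDEX fk_spectrum_msrun1_2 ON spectrum (msrun_msrun_id, base_peak_mz);) saves about 20 seconds: query took: 76.4599349499 seconds, 5864.15418601 seconds since fetchall.
Output of EXPLAIN QUERY PLAN: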
(0, 0, 0, u'SCAN TABLE spectrum (~1000000 rows)'), (0, 1, 1, u'SEARCH TABLE feature USING INDEX fk_feature_msrun1 (msrun_msrun_id=?) (~2 rows)')
Answer (score: 5)
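You are correlating two large tables. Some quick math: 300k x 20k = 6 billion candidate row pairs. If it were just a matter of returning all of those rows, you would certainly be I/O bound (but really only on the (O)utput side). However, your WHERE clause filters out almost everything, since only about 10k rows are returned, so you are certainly CPU bound here.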
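The quick math above can be redone with the exact row counts reported in the question. This is only a back-of-the-envelope sketch: it treats every spectrum/feature pair as a candidate, which is an upper bound, because the msrun equality in the join removes some pairs before the range checks run.

```python
# Row counts reported in the question.
n_spectra = 20938
n_features = 305742
n_matches = 10822

# Upper bound on the pairs the join has to consider if every spectrum is
# compared with every feature.
candidate_pairs = n_spectra * n_features

# Fraction of candidate pairs that survive the WHERE clause.
selectivity = n_matches / float(candidate_pairs)

print("candidate pairs: %d" % candidate_pairs)
print("fraction kept:   %.2e" % selectivity)
```

Fewer than two pairs in a million are kept, which is why the time goes into evaluating comparisons rather than into returning rows.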
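SQLite uses at most one index per table in the FROM clause of a query, with the exception of what are called "OR optimizations". Moreover, you do not get any performance gain from writing inner joins, because they "are converted into additional terms of the WHERE clause".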
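The point about inner joins can be checked on a toy database. The sketch below uses Python's standard sqlite3 module and a simplified schema with a handful of made-up rows (all values are illustrative assumptions, not data from the question). It runs the same query once with INNER JOIN ... ON and once with a comma join plus a WHERE term, then prints both query plans for comparison.

```python
import sqlite3

def build_demo_db():
    """Create a tiny in-memory database with made-up rows (illustrative only)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE feature (feature_id INTEGER PRIMARY KEY,
                              mzMin DOUBLE, mzMax DOUBLE,
                              rtMin DOUBLE, rtMax DOUBLE, lnk_feature INT);
        CREATE TABLE spectrum (spectrum_id INTEGER PRIMARY KEY,
                               mz DOUBLE, rt DOUBLE, lnk_spectrum INT);
    """)
    con.executemany("INSERT INTO feature VALUES (?,?,?,?,?,?)",
                    [(1, 0.0, 1.0, 0.0, 1.0, 1),
                     (2, 2.0, 3.0, 2.0, 3.0, 1),
                     (3, 0.0, 1.0, 0.0, 1.0, 2)])
    con.executemany("INSERT INTO spectrum VALUES (?,?,?,?)",
                    [(1, 0.5, 0.5, 1), (2, 2.5, 2.5, 1), (3, 5.0, 5.0, 2)])
    return con

RANGE_TERMS = ("rt >= rtMin AND rt <= rtMax "
               "AND mz >= mzMin AND mz <= mzMax")

# The same query written with an explicit join and with a comma join.
JOIN_FORM = ("SELECT feature_id, spectrum_id FROM spectrum "
             "INNER JOIN feature ON lnk_feature = lnk_spectrum "
             "WHERE " + RANGE_TERMS)
WHERE_FORM = ("SELECT feature_id, spectrum_id FROM spectrum, feature "
              "WHERE lnk_feature = lnk_spectrum AND " + RANGE_TERMS)

def run_both(con):
    """Return the sorted result of each query form."""
    return (sorted(con.execute(JOIN_FORM).fetchall()),
            sorted(con.execute(WHERE_FORM).fetchall()))

con = build_demo_db()
join_rows, where_rows = run_both(con)
print(join_rows == where_rows, join_rows)

# The plan text differs between SQLite versions, so it is only printed here.
for form in (JOIN_FORM, WHERE_FORM):
    print(con.execute("EXPLAIN QUERY PLAN " + form).fetchall())
```

Both forms return the same rows, so rewriting the join syntax alone should not be expected to change the running time.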
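The bottom line is that SQLite will not be able to execute your query as efficiently as, say, postgresql et al.
I played around with your scenario because I was curious to see by how much your query could be optimized. Ultimately, it seems that the best optimization is to remove all explicit indexes (!). SQLite appears to build an index on the fly that yields better performance than the different approaches I tried.
As a demonstration, consider this schema derived from yours: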
CREATE TABLE feature ( -- 300k
feature_id INTEGER PRIMARY KEY,
mzMin DOUBLE,
mzMax DOUBLE,
rtMin DOUBLE,
rtMax DOUBLE,
lnk_feature INT);
CREATE TABLE spectrum ( -- 20k
spectrum_id INTEGER PRIMARY KEY,
mz DOUBLE,
rt DOUBLE,
lnk_spectrum INT);
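feature has 300k rows and spectrum has 20k (the Python code that populates them is further below). No explicit index is specified, only the implicit ones that come from the INTEGER PRIMARY KEY definition. The SQLite documentation says:
"INTEGER PRIMARY KEY columns aside, both UNIQUE and PRIMARY KEY constraints are implemented by creating an index in the database (in the same way as a "CREATE UNIQUE INDEX" statement would). Such an index is used like any other index in the database to optimize queries. As a result, there is often no advantage (but significant overhead) in creating an index on a set of columns that are already collectively subject to a UNIQUE or PRIMARY KEY constraint."
With the schema above, SQLite reports that it will create an index on lnk_feature for the lifetime of the query: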
sqlite> EXPLAIN QUERY PLAN SELECT feature_id, spectrum_id FROM spectrum, feature
...> WHERE lnk_feature = lnk_spectrum
...> AND rt >= rtMin AND rt <= rtMax
...> AND mz >= mzMin AND mz <= mzMax;
0|0|0|SCAN TABLE spectrum (~20000 rows)
0|1|1|SEARCH TABLE feature USING AUTOMATIC COVERING INDEX (lnk_feature=?) (~7 rows)
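I also tested with an explicit index on that column and on other columns, but the fastest way to run that query seems to be without any of those indexes.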
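To repeat this comparison on your own machine, the query plan can be printed with and without an explicit index. The sketch below assumes Python's standard sqlite3 module and the simplified schema from this answer. The index name idx_feature_lnk is a hypothetical name chosen for illustration. The tables are left empty because only the planner output is inspected, so the plan may differ from what you get on populated and analyzed tables.

```python
import sqlite3

SCHEMA = """
    CREATE TABLE feature (feature_id INTEGER PRIMARY KEY,
                          mzMin DOUBLE, mzMax DOUBLE,
                          rtMin DOUBLE, rtMax DOUBLE, lnk_feature INT);
    CREATE TABLE spectrum (spectrum_id INTEGER PRIMARY KEY,
                           mz DOUBLE, rt DOUBLE, lnk_spectrum INT);
"""

QUERY = """
    SELECT feature_id, spectrum_id FROM spectrum, feature
    WHERE lnk_feature = lnk_spectrum
      AND rt >= rtMin AND rt <= rtMax
      AND mz >= mzMin AND mz <= mzMax"""

def query_plan(extra_ddl=""):
    """Return the detail column of EXPLAIN QUERY PLAN for QUERY.

    extra_ddl is optional DDL (for example a CREATE INDEX statement)
    that is executed after the schema has been created.
    """
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA + extra_ddl)
    rows = con.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    con.close()
    # The human-readable description is the last column of each plan row.
    return [row[-1] for row in rows]

plan_without = query_plan()
# idx_feature_lnk is a hypothetical index name used only for this comparison.
plan_with = query_plan("CREATE INDEX idx_feature_lnk ON feature (lnk_feature);")

print("without explicit index:")
for step in plan_without:
    print("   ", step)
print("with explicit index:")
for step in plan_with:
    print("   ", step)
```

The exact wording of the plan depends on the SQLite version, so treat the printed text as a hint about what the planner intends rather than as a timing result. Only a timed run on real data shows which variant is faster.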
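The fastest I got the query above to run from Python is 20 minutes, and that includes the completion of .fetchall(). You mention that at some point you will have 150 times more rows. I would start looking into postgresql if I were you ;-) ... Note that you can split the work across threads and potentially divide the time needed to complete the query by the number of threads that can run concurrently (i.e. by the number of CPUs available).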
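One way to split the work across threads is to give each worker its own connection and its own range of spectrum_id values. The sketch below is a minimal illustration under several assumptions: it uses Python's standard sqlite3 module and concurrent.futures rather than apsw, a small temporary database file with random rows, and helper names (populate, run_chunk, parallel_select, demo) invented for this example. Whether it actually runs faster depends on the SQLite build, the Python version and the number of CPUs, so measure before relying on it.

```python
import os
import random
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Same query as above, restricted to one slice of spectrum ids.
QUERY = """
    SELECT feature_id, spectrum_id FROM spectrum, feature
    WHERE lnk_feature = lnk_spectrum
      AND rt >= rtMin AND rt <= rtMax
      AND mz >= mzMin AND mz <= mzMax
      AND spectrum_id BETWEEN ? AND ?"""

def populate(path, n_features=2000, n_spectra=200, seed=0):
    """Fill a small test database with random rectangles and points."""
    rng = random.Random(seed)
    con = sqlite3.connect(path)
    con.executescript("""
        CREATE TABLE feature (feature_id INTEGER PRIMARY KEY,
                              mzMin DOUBLE, mzMax DOUBLE,
                              rtMin DOUBLE, rtMax DOUBLE, lnk_feature INT);
        CREATE TABLE spectrum (spectrum_id INTEGER PRIMARY KEY,
                               mz DOUBLE, rt DOUBLE, lnk_spectrum INT);
    """)
    features = []
    for _ in range(n_features):
        mz = sorted((rng.random(), rng.random()))
        rt = sorted((rng.random(), rng.random()))
        features.append((mz[0], mz[1], rt[0], rt[1], rng.randint(1, 2)))
    spectra = [(rng.random(), rng.random(), rng.randint(1, 2))
               for _ in range(n_spectra)]
    con.executemany("INSERT INTO feature VALUES (NULL,?,?,?,?,?)", features)
    con.executemany("INSERT INTO spectrum VALUES (NULL,?,?,?)", spectra)
    con.commit()
    con.close()

def run_chunk(path, lo, hi):
    """Run the query for spectrum ids lo..hi on a private connection."""
    con = sqlite3.connect(path)
    try:
        return con.execute(QUERY, (lo, hi)).fetchall()
    finally:
        con.close()

def parallel_select(path, n_spectra, workers=4):
    """Split the id range into one slice per worker and merge the results."""
    step = -(-n_spectra // workers)  # ceiling division
    bounds = [(lo, min(lo + step - 1, n_spectra))
              for lo in range(1, n_spectra + 1, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = pool.map(lambda b: run_chunk(path, *b), bounds)
        return sorted(row for chunk in chunks for row in chunk)

def demo(n_spectra=200):
    """Compare the single-query result with the merged per-slice results."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        populate(path, n_spectra=n_spectra)
        serial = sorted(run_chunk(path, 1, n_spectra))
        parallel = parallel_select(path, n_spectra)
        return serial, parallel
    finally:
        os.remove(path)

serial, parallel = demo()
print(len(serial), serial == parallel)
```

Slicing on spectrum_id keeps the slices disjoint, so the merged result is the same set of rows as the single query. Each worker still builds its own transient index on feature, so the gain will be smaller than the thread count suggests.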
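In any case, here is the code I used. Can you run it yourself and report back how fast the query ran in your environment? Note that I am using apsw, so if you cannot use it, you will need to adapt the code to the standard sqlite3 module.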
#!/usr/bin/python
import apsw, random as rand, time

def populate(cu):
    # Create both tables, then fill them inside a single transaction.
    cu.execute("""
        CREATE TABLE feature (    -- 300k
            feature_id INTEGER PRIMARY KEY,
            mzMin DOUBLE, mzMax DOUBLE,
            rtMin DOUBLE, rtMax DOUBLE,
            lnk_feature INT);
        CREATE TABLE spectrum (   -- 20k
            spectrum_id INTEGER PRIMARY KEY,
            mz DOUBLE, rt DOUBLE,
            lnk_spectrum INT);""")
    cu.execute("BEGIN")
    for i in range(300000):
        ((mzMin, mzMax), (rtMin, rtMax)) = (get_min_max(), get_min_max())
        cu.execute("INSERT INTO feature VALUES (NULL,%s,%s,%s,%s,%s)"
                   % (mzMin, mzMax, rtMin, rtMax, get_lnk()))
    for i in range(20000):
        cu.execute("INSERT INTO spectrum VALUES (NULL,%s,%s,%s)"
                   % (get_in_between(), get_in_between(), get_lnk()))
    cu.execute("COMMIT")
    # Collect statistics for the query planner.
    cu.execute("ANALYZE")

def get_lnk():
    # Stands in for the msrun id: every row belongs to run 1 or run 2.
    return rand.randint(1, 2)

def get_min_max():
    # Two nearby random values, sorted, used as the (min, max) of one axis.
    return sorted((rand.normalvariate(0.5, 0.004),
                   rand.normalvariate(0.5, 0.004)))

def get_in_between():
    # A widely spread random value used for a spectrum's mz or rt.
    return rand.normalvariate(0.5, 0.49)

def select(cu):
    sql = """
        SELECT feature_id, spectrum_id FROM spectrum, feature
        WHERE lnk_feature = lnk_spectrum
            AND rt >= rtMin AND rt <= rtMax
            AND mz >= mzMin AND mz <= mzMax"""
    start = time.time()
    cu.execute(sql)
    # The timing includes fetching every result row.
    print ("%s rows; %.2f seconds" % (len(cu.fetchall()), time.time() - start))

cu = apsw.Connection('foo.db').cursor()
populate(cu)
select(cu)
The output I get:
54626 rows; 1210.96 seconds
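If apsw is not available, the same experiment can be adapted to Python's standard sqlite3 module. The sketch below is one possible adaptation, not the code that produced the timing above: it uses an in-memory database, parameter binding with executemany, a seed argument, and much smaller default row counts so that it finishes quickly. All of those are assumptions made for illustration. Pass 300000 and 20000 to populate to approximate the original scale.

```python
import random
import sqlite3
import time

SELECT_SQL = """
    SELECT feature_id, spectrum_id FROM spectrum, feature
    WHERE lnk_feature = lnk_spectrum
      AND rt >= rtMin AND rt <= rtMax
      AND mz >= mzMin AND mz <= mzMax"""

def min_max(rng):
    """Two nearby random values, sorted, used as (min, max) of one axis."""
    return sorted((rng.normalvariate(0.5, 0.004),
                   rng.normalvariate(0.5, 0.004)))

def populate(con, n_features=3000, n_spectra=200, seed=0):
    """Create the simplified schema and fill it with random rows."""
    rng = random.Random(seed)
    con.executescript("""
        CREATE TABLE feature (feature_id INTEGER PRIMARY KEY,
                              mzMin DOUBLE, mzMax DOUBLE,
                              rtMin DOUBLE, rtMax DOUBLE, lnk_feature INT);
        CREATE TABLE spectrum (spectrum_id INTEGER PRIMARY KEY,
                               mz DOUBLE, rt DOUBLE, lnk_spectrum INT);
    """)
    features = []
    for _ in range(n_features):
        (mz_min, mz_max), (rt_min, rt_max) = min_max(rng), min_max(rng)
        features.append((mz_min, mz_max, rt_min, rt_max, rng.randint(1, 2)))
    spectra = [(rng.normalvariate(0.5, 0.49),
                rng.normalvariate(0.5, 0.49),
                rng.randint(1, 2)) for _ in range(n_spectra)]
    # Bound parameters avoid formatting floats into the SQL text.
    con.executemany("INSERT INTO feature VALUES (NULL,?,?,?,?,?)", features)
    con.executemany("INSERT INTO spectrum VALUES (NULL,?,?,?)", spectra)
    con.commit()
    con.execute("ANALYZE")

def timed_select(con):
    """Run the join and return (row count, elapsed seconds incl. fetchall)."""
    start = time.time()
    rows = con.execute(SELECT_SQL).fetchall()
    return len(rows), time.time() - start

con = sqlite3.connect(":memory:")
populate(con)
n_rows, seconds = timed_select(con)
print("%s rows; %.2f seconds" % (n_rows, seconds))
```

At the reduced default size very few rows match, so the run is only useful for checking that the script works. Timings only become comparable with the figure above at the full row counts.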