Ben*_*Ben 7 mysql caching subquery memcachedb
由于这是我的第一篇文章,似乎我只能发布1个链接,所以我列出了我在底部指的网站.简而言之,我的目标是让数据库更快地返回结果,我试图包含尽可能多的相关信息,以帮助构建帖子底部的问题.
8 processors
model name : Intel(R) Xeon(R) CPU E5440 @ 2.83GHz
cache size : 6144 KB
cpu cores : 4
top - 17:11:48 up 35 days, 22:22, 10 users, load average: 1.35, 4.89, 7.80
Tasks: 329 total, 1 running, 328 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 87.4%id, 12.5%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8173980k total, 5374348k used, 2799632k free, 30148k buffers
Swap: 16777208k total, 6385312k used, 10391896k free, 2615836k cached
Run Code Online (Sandbox Code Playgroud)
但是,我们正在考虑将mysql安装移动到具有256 GB RAM的群集中的其他计算机
我的MySQL表看起来像
CREATE TABLE ClusterMatches
(
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
cluster_index INT,
matches LONGTEXT,
tfidf FLOAT,
INDEX(cluster_index)
);Run Code Online (Sandbox Code Playgroud)
它有大约18M行,有1M个唯一的cluster_index和6K唯一匹配.我在PHP中生成的SQL查询看起来像.
$sql_query="SELECT `matches`,sum(`tfidf`) FROM
(SELECT * FROM Test2_ClusterMatches WHERE `cluster_index` in (".$clusters."))
AS result GROUP BY `matches` ORDER BY sum(`tfidf`) DESC LIMIT 0, 10;";Run Code Online (Sandbox Code Playgroud)
其中$ cluster包含一个大约3,000个逗号分隔的cluster_index的字符串.此查询使用大约50,000行并运行大约15秒,当再次运行相同的查询时,运行大约需要1秒.
基于这篇文章[stackoverflow:缓存/重新使用MySQL中的子查询] [1]以及查询时间的改进我相信我的子查询可以被编入索引.
mysql> EXPLAIN EXTENDED SELECT `matches`,sum(`tfidf`) FROM
(SELECT * FROM ClusterMatches WHERE `cluster_index` in (1,2,...,3000)
AS result GROUP BY `matches` ORDER BY sum(`tfidf`) ASC LIMIT 0, 10;
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+
| 1 | PRIMARY | derived2 | ALL | NULL | NULL | NULL | NULL | 48528 | Using temporary; Using filesort |
| 2 | DERIVED | ClusterMatches | range | cluster_index | cluster_index | 5 | NULL | 53689 | Using where |
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+
Run Code Online (Sandbox Code Playgroud)
根据这篇旧文章[优化MySQL:查询和索引] [2]的额外信息 - 这里看到的不好的是"使用临时"和"使用filesort"
查询缓存可用,但由于当前大小设置为零,因此有效关闭
mysqladmin variables;
+---------------------------------+----------------------+
| Variable_name | Value |
+---------------------------------+----------------------+
| bdb_cache_size | 8384512 |
| binlog_cache_size | 32768 |
| expire_logs_days | 0 |
| have_query_cache | YES |
| flush | OFF |
| flush_time | 0 |
| innodb_additional_mem_pool_size | 1048576 |
| innodb_autoextend_increment | 8 |
| innodb_buffer_pool_awe_mem_mb | 0 |
| innodb_buffer_pool_size | 8388608 |
| join_buffer_size | 131072 |
| key_buffer_size | 8384512 |
| key_cache_age_threshold | 300 |
| key_cache_block_size | 1024 |
| key_cache_division_limit | 100 |
| max_binlog_cache_size | 18446744073709547520 |
| sort_buffer_size | 2097144 |
| table_cache | 64 |
| thread_cache_size | 0 |
| query_cache_limit | 1048576 |
| query_cache_min_res_unit | 4096 |
| query_cache_size | 0 |
| query_cache_type | ON |
| query_cache_wlock_invalidate | OFF |
| read_rnd_buffer_size | 262144 |
+---------------------------------+----------------------+
Run Code Online (Sandbox Code Playgroud)
基于这篇关于[Mysql数据库性能转换] [3]的文章,我认为我需要调整的值是
matches更快地发生,基于语句["你应该为你选择,分组,排序或加入的任何字段创建索引."] [5]要调整执行我计划使用
目标是构建一个系统,该系统可以具有1M个唯一的cluster_index值1M唯一匹配值,大约3,000,000,000个表行,对查询的响应时间约为0.5秒(我们可以根据需要添加更多ram并在整个集群中分发数据库)
根据这篇关于如何为排序依据和分组依据查询选择索引的文章中的建议,表现在看起来像这样
CREATE TABLE ClusterMatches
(
cluster_index INT UNSIGNED,
match_index INT UNSIGNED,
id INT NOT NULL AUTO_INCREMENT,
tfidf FLOAT,
PRIMARY KEY (match_index,cluster_index,id,tfidf)
);
CREATE TABLE MatchLookup
(
match_index INT UNSIGNED NOT NULL PRIMARY KEY,
image_match TINYTEXT
);
Run Code Online (Sandbox Code Playgroud)
未按 SUM(tfidf) 对结果进行排序的查询如下所示
SELECT match_index, SUM(tfidf) FROM ClusterMatches
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index LIMIT 10;Run Code Online (Sandbox Code Playgroud)
这消除了使用临时和使用文件排序
explain extended SELECT match_index, SUM(tfidf) FROM ClusterMatches
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index LIMIT 10;
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
| 1 | SIMPLE | ClusterMatches | range | PRIMARY | PRIMARY | 4 | NULL | 14938 | Using where; Using index |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+Run Code Online (Sandbox Code Playgroud)
但是,如果我在中添加 ORDER BY SUM(tfdif)
SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index
ORDER BY total DESC LIMIT 0,10;
+-------------+--------------------+
| match_index | total |
+-------------+--------------------+
| 868 | 0.11126546561718 |
| 4182 | 0.0238558370620012 |
| 2162 | 0.0216601379215717 |
| 1406 | 0.0191618576645851 |
| 4239 | 0.0168981291353703 |
| 1437 | 0.0160425212234259 |
| 2599 | 0.0156466849148273 |
| 394 | 0.0155945559963584 |
| 3116 | 0.0151005545631051 |
| 4028 | 0.0149106932803988 |
+-------------+--------------------+
10 rows in set (0.03 sec)Run Code Online (Sandbox Code Playgroud)
在这种规模下,结果相当快,但具有ORDER BY SUM(tfidf) 意味着它使用临时和文件排序
explain extended SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches
WHERE cluster_index IN (1,2,3 ... 3000) GROUP BY match_index
ORDER BY total DESC LIMIT 0,10;
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
| 1 | SIMPLE | ClusterMatches | range | PRIMARY | PRIMARY | 4 | NULL | 65369 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+Run Code Online (Sandbox Code Playgroud)
我正在寻找一种不使用临时或文件排序的解决方案,类似于
SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches
WHERE cluster_index IN (1,2,3 ... 3000) GROUP BY cluster_index, match_index
HAVING total>0.01 ORDER BY cluster_index;Run Code Online (Sandbox Code Playgroud)
我不需要硬编码总数的阈值,有什么想法吗?