Subqueries and MySQL cache for an 18M+ row table

Ben*_*Ben 7 mysql caching subquery memcachedb

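As this is my first post, it seems I can only post one link, so I have listed the sites I refer to at the bottom. In short, my goal is to make the database return results faster, and I have tried to include as much relevant information as I could to help frame the questions at the bottom of the post.

Machine Info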


8 processors
model name      : Intel(R) Xeon(R) CPU           E5440  @ 2.83GHz
cache size      : 6144 KB
cpu cores       : 4 

top - 17:11:48 up 35 days, 22:22, 10 users,  load average: 1.35, 4.89, 7.80
Tasks: 329 total,   1 running, 328 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 87.4%id, 12.5%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8173980k total,  5374348k used,  2799632k free,    30148k buffers
Swap: 16777208k total,  6385312k used, 10391896k free,  2615836k cached

However, we are looking into moving the MySQL installation to a different machine in the cluster with 256 GB of RAM.

Table Info


My MySQL table looks like

CREATE TABLE ClusterMatches 
(
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    cluster_index INT, 
    matches LONGTEXT,
    tfidf FLOAT,
    INDEX(cluster_index)   
);

It has about 18M rows, with 1M unique cluster_index values and 6K unique matches values. The SQL query I generate in PHP looks like this.

SQL Query


$sql_query="SELECT `matches`,sum(`tfidf`) FROM 
(SELECT * FROM Test2_ClusterMatches WHERE `cluster_index` in (".$clusters.")) 
AS result GROUP BY `matches` ORDER BY sum(`tfidf`) DESC LIMIT 0, 10;";

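where $clusters contains a string of about 3,000 comma-separated cluster_index values. This query touches about 50,000 rows and takes about 15 seconds to run; when the same query is run again, it takes about 1 second.

Usage


  1. The content of the table can be assumed to be static.
  2. The number of concurrent users is low.
  3. The query above is currently the only query that will be run on the table.

Subquery


Based on this post [stackoverflow: Cache/Re-Use a Subquery in MySQL][1] and the improvement in query time, I believe my subquery can be indexed.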

mysql> EXPLAIN EXTENDED SELECT `matches`,sum(`tfidf`) FROM 
(SELECT * FROM ClusterMatches WHERE `cluster_index` in (1,2,...,3000)) 
AS result GROUP BY `matches` ORDER BY sum(`tfidf`) ASC LIMIT 0, 10;

+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+
| id | select_type | table                | type  | possible_keys | key           | key_len | ref  | rows  | Extra                           |
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+
|  1 | PRIMARY     | <derived2>           | ALL   | NULL          | NULL          | NULL    | NULL | 48528 | Using temporary; Using filesort | 
|  2 | DERIVED     | ClusterMatches       | range | cluster_index | cluster_index | 5       | NULL | 53689 | Using where                     | 
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+


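According to the notes on the Extra column in this older article [Optimizing MySQL: Queries and Indexes][2], the bad things to see here are "Using temporary" and "Using filesort".

MySQL Configuration Info


The query cache is available, but it is effectively turned off because its size is currently set to zero.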


mysqladmin variables;
+---------------------------------+----------------------+
| Variable_name                   | Value                |
+---------------------------------+----------------------+
| bdb_cache_size                  | 8384512              | 
| binlog_cache_size               | 32768                | 
| expire_logs_days                | 0                    |
| have_query_cache                | YES                  | 
| flush                           | OFF                  |
| flush_time                      | 0                    |
| innodb_additional_mem_pool_size | 1048576              |
| innodb_autoextend_increment     | 8                    |
| innodb_buffer_pool_awe_mem_mb   | 0                    |
| innodb_buffer_pool_size         | 8388608              |
| join_buffer_size                | 131072               |
| key_buffer_size                 | 8384512              |
| key_cache_age_threshold         | 300                  |
| key_cache_block_size            | 1024                 |
| key_cache_division_limit        | 100                  |
| max_binlog_cache_size           | 18446744073709547520 | 
| sort_buffer_size                | 2097144              |
| table_cache                     | 64                   | 
| thread_cache_size               | 0                    | 
| query_cache_limit               | 1048576              |
| query_cache_min_res_unit        | 4096                 |
| query_cache_size                | 0                    |
| query_cache_type                | ON                   |
| query_cache_wlock_invalidate    | OFF                  |
| read_rnd_buffer_size            | 262144               |
+---------------------------------+----------------------+

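The query cache settings listed above can also be inspected and changed from a SQL session. This is a minimal sketch, assuming a MySQL 5.0-era server (the version the linked manual pages cover) and an account with the SUPER privilege; the 64 MB size is an arbitrary illustration, not a recommendation.

```sql
-- List the query cache settings shown in the mysqladmin output above.
SHOW VARIABLES LIKE 'query_cache%';

-- Give the cache a non-zero size (illustrative value: 64 MB).
-- With query_cache_type = ON, byte-for-byte identical SELECT statements
-- can then be answered from the cache.
SET GLOBAL query_cache_size = 67108864;

-- Qcache_hits increases each time a result is served from the cache.
SHOW STATUS LIKE 'Qcache%';
```

While query_cache_size is 0, Qcache_hits should stay at 0, which would suggest that a faster second run comes from data and index pages already being in memory rather than from the query cache. Note that the query cache was removed in MySQL 8.0, so this only applies to older servers.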

Based on this article on [MySQL database performance tuning][3], I think the values I need to tweak are

  1. table_cache
  2. key_buffer
  3. sort_buffer
  4. read_buffer_size
  5. record_rnd_buffer (for GROUP BY and ORDER BY terms)
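As a starting point for experiments, the current values can be read from a SQL session. The sketch below assumes the variable names a MySQL 5.0 server reports, which differ slightly from the short names in the list (for example key_buffer_size and read_rnd_buffer_size); the 4 MB value is an arbitrary illustration.

```sql
-- Current values, using the names a MySQL 5.0 server reports.
-- (table_cache was renamed table_open_cache in MySQL 5.1.)
SHOW VARIABLES LIKE 'table_cache';
SHOW VARIABLES LIKE 'key_buffer_size';
SHOW VARIABLES LIKE 'sort_buffer_size';
SHOW VARIABLES LIKE 'read_buffer_size';
SHOW VARIABLES LIKE 'read_rnd_buffer_size';

-- Per-connection buffers can be changed for the current session only,
-- which allows experimenting without affecting other users.
SET SESSION sort_buffer_size = 4194304;
```

Keep in mind that key_buffer_size only caches MyISAM indexes. If the table is InnoDB, innodb_buffer_pool_size is the setting that governs how much data and index stays in memory.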

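Identified areas for improvement - MySQL query tweaks


  1. Change the data type of matches to an int index pointing to another table [MySQL will indeed use a dynamic row format if the table contains variable-length fields such as TEXT or BLOB, which in this case means sorting has to be done on disk. The solution is not to avoid these data types, but to split such fields off into an associated table.] [4]
  2. Index the new match_index field so that the GROUP BY on matches runs faster, based on the statement ["You should probably create indexes for any field on which you are selecting, grouping, ordering, or joining."] [5]

Tools


To tweak performance I plan to use

  1. [EXPLAIN] [6], with reference to the [output format] [7]
  2. [ab - Apache HTTP server benchmarking tool] [8]
  3. [Profiling] [9] with [log data] [10]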

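Future Database Size


The goal is to build a system that can hold 1M unique cluster_index values, 1M unique match values, and approximately 3,000,000,000 table rows, with a query response time of around 0.5 seconds (we can add more RAM as necessary and distribute the database across the cluster).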
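To get a feel for how much RAM the target size would need, the current footprint can be read from information_schema and scaled up. This is only a rough sketch: it assumes the bytes-per-row ratio stays constant as the table grows, which may not hold once there are 1M distinct match values, and the column aliases current_gib and projected_gib are names made up for this example.

```sql
-- Rough on-disk footprint of the current table and a linear projection
-- to 3,000,000,000 rows. TABLE_ROWS is only an estimate for InnoDB.
SELECT TABLE_NAME,
       TABLE_ROWS,
       (DATA_LENGTH + INDEX_LENGTH) / POW(1024, 3) AS current_gib,
       (DATA_LENGTH + INDEX_LENGTH) / TABLE_ROWS
           * 3000000000 / POW(1024, 3) AS projected_gib
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'ClusterMatches';
```

Comparing projected_gib with the RAM available per machine gives a first indication of whether the data would have to be split across the cluster.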

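Questions


  1. I think we want to keep the entire record set in RAM so that the query never touches disk. If we keep the entire database in the MySQL cache, does that remove the need for memcachedb?
  2. Is trying to keep the entire database in the MySQL cache a bad strategy, since it is not designed to be persistent? Would something like memcachedb or redis be a better approach, and if so, why?
  3. Is the temporary table "result" created by the query destroyed automatically when the query finishes?
  4. Should we switch from InnoDB to MyISAM [since it is good for read-heavy data, whereas InnoDB is better for write-heavy data] [11]?
  5. My query cache appears to be set to zero in my [query cache configuration] [12], so why does the query currently run faster the second time I run it?
  6. Can I restructure my query to eliminate "Using temporary" and "Using filesort"? Should I use a join instead of a subquery?
  7. How do you view the size of the MySQL [data cache] [13]?
  8. What sizes would you suggest as a starting point for table_cache, key_buffer, sort_buffer, read_buffer_size, and record_rnd_buffer?

Links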


  • 1:stackoverflow.com/questions/658937/cache-re-use-a-subquery-in-mysql
  • 2:databasejournal.com/features/mysql/article.php/10897_1382791_4/Optimizing-MySQL-Queries-and-Indexes.htm
  • 3:debianhelp.co.uk/mysqlperformance.htm
  • 4:20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/
  • 5:20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/
  • 6:dev.mysql.com/doc/refman/5.0/en/explain.html
  • 7:dev.mysql.com/doc/refman/5.0/en/explain-output.html
  • 8:httpd.apache.org/docs/2.2/programs/ab.html
  • 9:mtop.sourceforge.net/
  • 10:dev.mysql.com/doc/refman/5.0/en/slow-query-log.html
  • 11:20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/
  • 12:dev.mysql.com/doc/refman/5.0/en/query-cache-configuration.html
  • 13:dev.mysql.com/tech-resources/articles/mysql-query-cache.html

Ben*_*Ben 1

Changing the table


Based on the advice in this post on how to pick indexes for ORDER BY and GROUP BY queries, the table now looks like this

CREATE TABLE ClusterMatches 
(
    cluster_index INT UNSIGNED, 
    match_index INT UNSIGNED,
    id INT NOT NULL AUTO_INCREMENT,
    tfidf FLOAT,
    PRIMARY KEY (match_index,cluster_index,id,tfidf)
);
CREATE TABLE MatchLookup 
(
    match_index INT UNSIGNED NOT NULL PRIMARY KEY,
    image_match TINYTEXT
);

Eliminating the subquery

Without ordering the results by SUM(tfidf), the query looks like

SELECT match_index, SUM(tfidf) FROM ClusterMatches 
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index LIMIT 10;

This eliminates "Using temporary" and "Using filesort"

explain extended SELECT match_index, SUM(tfidf) FROM ClusterMatches 
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index LIMIT 10;
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
| id | select_type | table                | type  | possible_keys | key     | key_len | ref  | rows  | Extra                    |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
|  1 | SIMPLE      | ClusterMatches       | range | PRIMARY       | PRIMARY | 4       | NULL | 14938 | Using where; Using index | 
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+

The problem with sorting

However, if I add ORDER BY SUM(tfidf)

SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index 
ORDER BY total DESC LIMIT 0,10;
+-------------+--------------------+
| match_index | total              |
+-------------+--------------------+
|         868 |   0.11126546561718 | 
|        4182 | 0.0238558370620012 | 
|        2162 | 0.0216601379215717 | 
|        1406 | 0.0191618576645851 | 
|        4239 | 0.0168981291353703 | 
|        1437 | 0.0160425212234259 | 
|        2599 | 0.0156466849148273 | 
|         394 | 0.0155945559963584 | 
|        3116 | 0.0151005545631051 | 
|        4028 | 0.0149106932803988 | 
+-------------+--------------------+
10 rows in set (0.03 sec)

At this scale the results come back fairly quickly, but having ORDER BY SUM(tfidf) means the query uses a temporary table and a filesort

explain extended SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches 
WHERE cluster_index IN (1,2,3 ... 3000) GROUP BY match_index 
ORDER BY total DESC LIMIT 0,10;
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
| id | select_type | table                | type  | possible_keys | key     | key_len | ref  | rows  | Extra                                                     |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
|  1 | SIMPLE      | ClusterMatches       | range | PRIMARY       | PRIMARY | 4       | NULL | 65369 | Using where; Using index; Using temporary; Using filesort | 
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
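"Using temporary" in EXPLAIN does not by itself say whether the temporary table was written to disk. One way to check is to look at the session status counters around the query. This is a minimal sketch, assuming an account with the RELOAD privilege for FLUSH STATUS; the short IN list is a stand-in for the real list of cluster_index values.

```sql
-- Reset the session status counters (requires the RELOAD privilege).
FLUSH STATUS;

-- The query under test; IN (1, 2, 3) is a short stand-in for the real
-- list of about 3,000 cluster_index values.
SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches
WHERE cluster_index IN (1, 2, 3) GROUP BY match_index
ORDER BY total DESC LIMIT 0, 10;

-- Created_tmp_tables counts internal temporary tables created;
-- Created_tmp_disk_tables counts the ones that ended up on disk.
SHOW SESSION STATUS LIKE 'Created_tmp%';

-- Sort_merge_passes above 0 means the sort did not fit in
-- sort_buffer_size and needed merge passes over temporary files.
SHOW SESSION STATUS LIKE 'Sort%';
```

If Created_tmp_disk_tables and Sort_merge_passes both stay at 0, the temporary table and the sort were probably handled in memory, which would be consistent with the 0.03 sec timing shown earlier.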

Possible solutions?

I'm looking for a solution that doesn't use a temporary table or a filesort, along the lines of

SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches 
WHERE cluster_index IN (1,2,3 ... 3000) GROUP BY cluster_index, match_index 
HAVING total>0.01 ORDER BY cluster_index;
where I don't need to hard-code a threshold for total. Any ideas?