Database design for handling millions of rows in MySQL

Question

Jav*_*i M 8 mysql scalability big-data

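The application we run is collecting data much faster than we expected. To cope with that, we are redesigning the database. After reading this, this and this, I am not sure what the best approach for our design is... considering that our hardware is very modest.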

Four main tables are causing the problem:

SCANS
DOMAINS
DOCUMENTS
VALUES

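Currently we have only one table of each kind to store the data. The relationships between them are:

1 scan -> (avg 4) domains -> (avg 3,000 per domain) documents -> (avg 51,000 per domain) values
- 1 scan points, on average, to 4 entries in DOMAINS.
- 4 entries in DOMAINS mean, on average, 12,000 entries in DOCUMENTS.
- 12,000 entries in DOCUMENTS mean, on average, 204,000 entries in VALUES.

We currently run about 100 scans per day. That means roughly 20,400,000 rows inserted into VALUES every day.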
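As a quick sanity check on those figures, the arithmetic can be sketched in a few lines of Python. The constants are the averages quoted above; the per-second rate assumes inserts are spread evenly over 24 hours, which is an assumption of this sketch and not something stated in the question.

```python
# Averages quoted in the question (all are per-scan averages).
DOMAINS_PER_SCAN = 4
DOCUMENTS_PER_DOMAIN = 3_000
VALUES_PER_SCAN = 204_000
SCANS_PER_DAY = 100
SECONDS_PER_DAY = 24 * 60 * 60

documents_per_scan = DOMAINS_PER_SCAN * DOCUMENTS_PER_DOMAIN
values_per_document = VALUES_PER_SCAN / documents_per_scan
values_per_day = VALUES_PER_SCAN * SCANS_PER_DAY

# Assumption: inserts arrive uniformly during the day (real load is likely bursty).
inserts_per_second = values_per_day / SECONDS_PER_DAY

print(f"documents per scan:  {documents_per_scan:,}")
print(f"values per document: {values_per_document:.0f}")
print(f"values per day:      {values_per_day:,}")
print(f"inserts per second:  {inserts_per_second:.0f}")
```

Under these assumptions the sustained write rate is in the low hundreds of rows per second; peaks during a scan may of course be much higher.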

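We are considering splitting the VALUES table into one "VALUES table per month":

VALUES_year_month: intended to spread the load across tables. But this mechanism will not scale if we increase the number of scanners.
VALUES_year_month_day: with this we would end up with far too many tables in the same database.

In both cases, neither solution seems scalable if we increase the number of scans per day.

At this point, keeping all the data in a centralized database does not seem to be the best option for scalability reasons... but at the same time, a distributed system would significantly increase loading times.

What would be a reasonable approach? I am sure we are not the first team to run into this problem! :P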

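EDIT

How much data does each query read?

It depends on the scan. Not all scans have the same amount of data. The range varies as follows:

1 scan --> 200 values
1 scan --> 200,000 values

This information is presented to the end user in the frontend. We therefore split the way queries are requested from the backend to avoid overloading the server, but in some cases that is not enough because of the number of VALUES.

When is the data read?

That depends entirely on the end users. Sometimes they read 10 SCANS a day, sometimes none, and sometimes even 100 a day.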

EDIT II

EXPLAIN ANALYZE results for two queries. The first is fast, the second is slow.

EXPLAIN ANALYZE 
SELECT value,
        url,
        filetype, 
        severity,
        COUNT(id_value) AS data_count
FROM VALUES
WHERE (weigth = 150 OR weigth = 100) 
AND id_analysis = 23 
AND is_hidden = 0 
AND is_hidden_by_user = 0 
GROUP BY value 
ORDER BY data_count DESC

Result 1:

-> Sort row IDs: data_count DESC  (actual time=34.016..34.016 rows=0 loops=1)
    -> Table scan on <temporary>  (actual time=34.006..34.006 rows=0 loops=1)
        -> Aggregate using temporary table  (actual time=34.005..34.005 rows=0 loops=1)
            -> Filter: ((VALUES.is_hidden_by_user = 0) and (VALUES.is_hidden = 0) and ((VALUES.weigth = 150) or (VALUES.weigth = 100)))  (cost=1.00 rows=0.05) (actual time=0.024..0.024 rows=0 loops=1)
                -> Index lookup on VALUES using id_analysis (id_analysis=23)  (cost=1.00 rows=1) (actual time=0.024..0.024 rows=0 loops=1)

Result 2:

-> Sort row IDs: data_count DESC  (actual time=187172.159..187172.173 rows=136 loops=1)
    -> Table scan on <temporary>  (actual time=187172.079..187172.111 rows=136 loops=1)
        -> Aggregate using temporary table  (actual time=187172.077..187172.077 rows=136 loops=1)
            -> Filter: ((VALUES.is_hidden_by_user = 0) and (VALUES.is_hidden = 0) and ((VALUES.weigth = 150) or (VALUES.weigth = 100)))  (cost=264956.35 rows=695) (actual time=249.030..186775.012 rows=52289 loops=1)
                -> Index lookup on VALUES using id_analysis (id_analysis=8950)  (cost=264956.35 rows=265154) (actual time=248.979..186696.529 rows=134236 loops=1)

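EDIT III

Consider PARTITION

That is a good suggestion. Kudos! From what I have read so far, it is essentially equivalent to splitting the table the way we were considering.

(weigth = 150 OR weigth = 100) is a rather odd test.

Removing the OR clause improves the timing: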

-> Sort row IDs: data_count DESC  (actual time=101261.260..101261.271 rows=113 loops=1)
    -> Table scan on <temporary>  (actual time=101261.187..101261.216 rows=113 loops=1)
        -> Aggregate using temporary table  (actual time=101261.185..101261.185 rows=113 loops=1)
            -> Filter: ((VALUES.is_hidden_by_user = 0) and (VALUES.is_hidden = 0) and (VALUES.id_analysis = 8950) and (VALUES.weigth = 150))  (cost=79965.29 rows=623) (actual time=83848.835..100942.179 rows=52259 loops=1)
                -> Intersect rows sorted by row ID  (cost=79965.29 rows=62292) (actual time=83848.830..100908.758 rows=52259 loops=1)
                    -> Index range scan on VALUES using id_analysis over (id_analysis = 8950)  (cost=291.66 rows=265154) (actual time=0.100..443.145 rows=134236 loops=1)
                    -> Index range scan on VALUES using weigth over (weigth = 150)  (cost=13492.63 rows=12380386) (actual time=0.043..83511.686 rows=7822871 loops=1)

Please elaborate on value vs id_value

I believe this is probably just "bad naming".

+-------------------+-------------+------+-----+---------+----------------+
| Field             | Type        | Null | Key | Default | Extra          |
+-------------------+-------------+------+-----+---------+----------------+
| id_value          | int         | NO   | PRI | NULL    | auto_increment |
| id_document       | int         | NO   | MUL | NULL    |                |
| id_tag            | int         | YES  | MUL | NULL    |                |
| value             | mediumtext  | YES  |     | NULL    |                |
| weigth            | int         | YES  | MUL | NULL    |                |
| id_analysis       | int         | YES  | MUL | NULL    |                |
| url               | text        | YES  |     | NULL    |                |
| domain            | varchar(64) | YES  |     | NULL    |                |
| filetype          | varchar(16) | YES  |     | NULL    |                |
| severity_name     | varchar(16) | YES  |     | NULL    |                |
| id_domain         | int         | YES  | MUL | NULL    |                |
| id_city           | int         | YES  | MUL | NULL    |                |
| city_name         | varchar(32) | YES  |     | NULL    |                |
| is_hidden         | tinyint     | NO   |     | 0       |                |
| id_company        | int         | YES  |     | NULL    |                |
| is_hidden_by_user | tinyint(1)  | NO   |     | 0       |                |
+-------------------+-------------+------+-----+---------+----------------+

Answer 1

Ric*_*mes 8

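Do not split a table just because it is big. Consider PARTITIONing the big table if you need to delete "old" data. Consider "sharding" when the number of writes is too high for a single machine.

250 rows inserted per second on an SSD device does not, by itself, trigger any of the above reasons for splitting.

If your retention period is, say, 2 months, then I recommend PARTITION BY RANGE(TO_DAYS(...)) with a monthly DROP PARTITION + REORGANIZE PARTITION. More discussion: Partition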
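To make the monthly rotation concrete, here is a minimal sketch that only generates the MySQL statements as text. It assumes a DATE or DATETIME column, here called `created_at`, which is hypothetical: the table shown in the question has no such column. Note also that MySQL requires the partitioning column to be part of every unique key, including the primary key, so the key on `id_value` would have to be widened. The partition names (`pYYYYMM`, `pfuture`) and the helper functions are likewise illustrative.

```python
from datetime import date


def first_of_next_month(d: date) -> date:
    """Return the first day of the month following d."""
    return date(d.year + (d.month == 12), d.month % 12 + 1, 1)


def monthly_partitions(start: date, months: int) -> list[str]:
    """Build one RANGE partition clause per month, plus a catch-all."""
    clauses = []
    current = date(start.year, start.month, 1)
    for _ in range(months):
        upper = first_of_next_month(current)
        clauses.append(
            f"PARTITION p{current:%Y%m} "
            f"VALUES LESS THAN (TO_DAYS('{upper.isoformat()}'))"
        )
        current = upper
    clauses.append("PARTITION pfuture VALUES LESS THAN MAXVALUE")
    return clauses


def initial_ddl(table: str, column: str, start: date, months: int) -> str:
    """One-time statement that partitions the existing table by month."""
    body = ",\n  ".join(monthly_partitions(start, months))
    return (
        f"ALTER TABLE `{table}`\n"
        f"PARTITION BY RANGE (TO_DAYS(`{column}`)) (\n  {body}\n);"
    )


def monthly_rotation(table: str, oldest: date, newest: date) -> list[str]:
    """Statements for the monthly job: drop the oldest month, carve a new one."""
    upper = first_of_next_month(newest)
    return [
        f"ALTER TABLE `{table}` DROP PARTITION p{oldest:%Y%m};",
        f"ALTER TABLE `{table}` REORGANIZE PARTITION pfuture INTO (\n"
        f"  PARTITION p{newest:%Y%m} "
        f"VALUES LESS THAN (TO_DAYS('{upper.isoformat()}')),\n"
        f"  PARTITION pfuture VALUES LESS THAN MAXVALUE\n);",
    ]


# `created_at` is a hypothetical column; VALUES is quoted because it is a reserved word.
print(initial_ddl("VALUES", "created_at", date(2024, 11, 1), 3))
for statement in monthly_rotation("VALUES", date(2024, 11, 1), date(2025, 2, 1)):
    print(statement)
```

Dropping a whole partition is what makes this attractive for a retention policy: it avoids a large DELETE on the big table.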

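(weigth = 150 OR weigth = 100) is a rather odd test. Are there no values between 100 and 150, or are you deliberately filtering them out? I ask because OR complicates optimization.

The query you presented needs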

INDEX(id_analysis, is_hidden, is_hidden_by_user, weigth)
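The effect of such a composite index can be illustrated with a small stand-in. The sketch below uses Python's built-in sqlite3 module instead of MySQL, so the planner and its output differ from what EXPLAIN ANALYZE shows above; the table name `vals`, the index name, and the reduced column list are assumptions made for the illustration. The point is only that one index covering all four filter columns lets the engine locate the matching rows directly instead of reading every row of an analysis and filtering afterwards.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vals (
        id_value          INTEGER PRIMARY KEY,
        id_analysis       INTEGER,
        weigth            INTEGER,
        is_hidden         INTEGER NOT NULL DEFAULT 0,
        is_hidden_by_user INTEGER NOT NULL DEFAULT 0,
        value             TEXT
    );
    -- Same column order as the index suggested in the answer.
    CREATE INDEX idx_analysis_hidden_weigth
        ON vals (id_analysis, is_hidden, is_hidden_by_user, weigth);
""")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT value, COUNT(id_value) AS data_count
    FROM vals
    WHERE (weigth = 150 OR weigth = 100)
      AND id_analysis = 23
      AND is_hidden = 0
      AND is_hidden_by_user = 0
    GROUP BY value
    ORDER BY data_count DESC
""").fetchall()

# The last column of each plan row is the human-readable description.
plan_text = "\n".join(row[3] for row in plan)
print(plan_text)
```

On MySQL the equivalent check would be to add the index and rerun EXPLAIN ANALYZE on the slow query.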

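The query is improperly written because of ONLY_FULL_GROUP_BY. I doubt that url, filetype, and severity are "dependent" on value.

Please elaborate on value vs id_value. That smells like another bug in the query.

Please explain why Documents and Values are separate. That smells like "over-normalization".

Or perhaps I am just confused by the name VALUES, since it contains url, filetype, and severity.

Please provide SHOW CREATE TABLE for each table.

In a data warehouse situation, summary tables are often the answer to performance. Can you summarize the counts for each day, then sum up those subtotals?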
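The summary-table idea can be sketched as follows. This is an illustration using Python's sqlite3 module as a stand-in for MySQL; the `scan_day` column and the `vals_daily_summary` table are hypothetical names, since the schema in the question has no date column. Rows are counted once per day into the small table, and the report then sums those subtotals instead of scanning the raw rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Raw table, reduced to the columns needed here; scan_day is hypothetical.
    CREATE TABLE vals (
        id_value    INTEGER PRIMARY KEY,
        id_analysis INTEGER,
        scan_day    TEXT,
        value       TEXT
    );
    -- One row per (analysis, day, value) holding a precomputed count.
    CREATE TABLE vals_daily_summary (
        id_analysis INTEGER,
        scan_day    TEXT,
        value       TEXT,
        data_count  INTEGER,
        PRIMARY KEY (id_analysis, scan_day, value)
    );
""")
conn.executemany(
    "INSERT INTO vals (id_analysis, scan_day, value) VALUES (?, ?, ?)",
    [
        (8950, "2024-01-01", "a"),
        (8950, "2024-01-01", "a"),
        (8950, "2024-01-02", "a"),
        (8950, "2024-01-02", "b"),
    ],
)

# Daily job: summarize the raw rows (here, all days at once for brevity).
conn.execute("""
    INSERT INTO vals_daily_summary (id_analysis, scan_day, value, data_count)
    SELECT id_analysis, scan_day, value, COUNT(*)
    FROM vals
    GROUP BY id_analysis, scan_day, value
""")

# Report: sum the daily subtotals instead of counting raw rows.
report = conn.execute("""
    SELECT value, SUM(data_count) AS data_count
    FROM vals_daily_summary
    WHERE id_analysis = ?
    GROUP BY value
    ORDER BY data_count DESC
""", (8950,)).fetchall()

# Same figures computed the slow way, directly from the raw table.
direct = conn.execute("""
    SELECT value, COUNT(*) AS data_count
    FROM vals
    WHERE id_analysis = ?
    GROUP BY value
    ORDER BY data_count DESC
""", (8950,)).fetchall()

print(report)
```

The summary table is typically orders of magnitude smaller than the raw table, which is where the speedup comes from; whether the hidden flags and weigth need to be part of its key depends on the reports required.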

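@JaviM, I did not get a chance to reply earlier, but I agree that "OR" in predicates ("JOIN", "WHERE", "HAVING" clauses) is notorious for complicating execution plans. Often a more *efficient* workaround is to run the same query twice with "UNION", once for each value in the "OR" condition. Beyond that, Rick is right about better indexing, table design, and query tuning, all of which will help you get the performance you are after without having to worry about splitting data across tables, sharding, or even partitioning. (2 upvotes)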
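A minimal sketch of that rewrite, run against Python's sqlite3 module as a stand-in for MySQL (the table name `vals`, the sample rows, and the reduced column list are assumptions of the illustration). Because the query aggregates, the two branches are combined in a derived table and grouped afterwards. UNION ALL is used here instead of plain UNION: a row cannot have weigth 100 and 150 at the same time, so the branches never overlap and no deduplication is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE vals (
        id_value    INTEGER PRIMARY KEY,
        id_analysis INTEGER,
        weigth      INTEGER,
        value       TEXT
    )
""")
conn.executemany(
    "INSERT INTO vals (id_analysis, weigth, value) VALUES (?, ?, ?)",
    [
        (8950, 150, "a"),
        (8950, 100, "a"),
        (8950, 150, "b"),
        (8950, 120, "b"),  # filtered out: weigth is neither 100 nor 150
        (23, 150, "a"),    # filtered out: different analysis
    ],
)

# Original shape of the query, with OR in the WHERE clause.
with_or = conn.execute("""
    SELECT value, COUNT(id_value) AS data_count
    FROM vals
    WHERE (weigth = 150 OR weigth = 100) AND id_analysis = ?
    GROUP BY value
    ORDER BY data_count DESC, value
""", (8950,)).fetchall()

# Rewrite: one branch per weigth value, aggregated after combining.
with_union = conn.execute("""
    SELECT value, COUNT(id_value) AS data_count
    FROM (
        SELECT id_value, value FROM vals
        WHERE weigth = 150 AND id_analysis = ?
        UNION ALL
        SELECT id_value, value FROM vals
        WHERE weigth = 100 AND id_analysis = ?
    ) AS both_weights
    GROUP BY value
    ORDER BY data_count DESC, value
""", (8950, 8950)).fetchall()

print(with_or)
print(with_union)
```

Whether this is actually faster on MySQL depends on the available indexes; each branch can only benefit if an index covers its equality conditions.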

Asked:	3 years, 2 months ago
Viewed:	4,718 times
Last active:	3 years, 2 months ago