Why does Redshift do a full table scan to find the max value of the DIST/SORT key?

Ost*_*cts 2 sql amazon-redshift

I'm running some simple tests on Redshift to try to speed up inserting data into a Redshift table. One thing I noticed today is that doing something like this

CREATE TABLE a (x int) DISTSTYLE key DISTKEY (x) SORTKEY (x);
INSERT INTO a (x) VALUES (1), (2), (3), (4);
VACUUM a; ANALYZE a;

EXPLAIN SELECT MAX(x) FROM a;

yields

QUERY PLAN
XN Aggregate  (cost=0.05..0.05 rows=1 width=4)
  ->  XN Seq Scan on a  (cost=0.00..0.04 rows=4 width=4)

I know it's only 4 rows, but it still shouldn't need a full table scan to find the maximum of a pre-sorted column. Shouldn't the metadata collected by ANALYZE cover this?

As a sanity check, the EXPLAIN for SELECT x FROM a WHERE x > 3 shows a scan of only 2 rows rather than the whole table.

Edit: I inserted just over 1,000,000 rows into the table, with random values from 1 to 10,000, then ran VACUUM and ANALYZE. The query plan still says it has to scan all 1,000,004 rows.
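For anyone reproducing this, one way to reach that row count without a client-side loop is to let the table double itself (a sketch, not from the original post; generate_series() only runs on the Redshift leader node, so it cannot feed an INSERT):

```sql
-- Hypothetical sketch: starting from the 4-row table in the question,
-- each INSERT doubles the row count, so ~18 runs reaches just over 1M rows.
-- random() returns a float in [0, 1), so this produces values in roughly 1..10,000.
INSERT INTO a (x) SELECT (random() * 10000)::int + 1 FROM a;

-- After the last run:
VACUUM a; ANALYZE a;
```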

Big*_*Kid 5

Analyzing query plans on a tiny data set does not yield any practical insight into how the database will execute a query.

The optimizer has thresholds, and when the cost difference between plans is small enough it stops considering alternatives. The idea is that for simple queries, the time spent searching for the "perfect" execution plan could exceed the total execution time of a less optimal plan.

Redshift was developed from the code base of ParAccel DB. ParAccel has literally hundreds of parameters that can be adjusted to optimize the database for different workloads/situations.

Since Redshift is a "managed" offering, it has these settings preset at levels deemed optimal by Amazon engineers given an "expected" workload.

In general, Redshift and ParAccel are not that great for single-slice queries. These queries tend to be run on all slices anyway, even if the data they need lives in a single slice.

Once a query is executing in a slice, the minimum amount of data read is a block. Depending on the block size, this can mean hundreds of thousands of rows.

Remember, Redshift does not have indexes. So you are not going to have a simple record lookup that will read a few entries off an index and then go laser focused on a single page on the disk. It will always read at least an entire block for that table, and it will do that in every slice.
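That said, Redshift does keep per-block min/max metadata ("zone maps"), which is how the WHERE x > 3 query in the question can skip blocks. You can inspect these yourself via the STV_BLOCKLIST system table (a sketch, assuming the table a from the question):

```sql
-- Per-block zone map entries for table "a": the min/max value stored in each
-- 1MB block of column x. col = 0 is the first user column; higher column
-- numbers include hidden system columns.
SELECT b.slice, b.col, b.blocknum, b.num_values, b.minvalue, b.maxvalue
FROM stv_blocklist b
JOIN stv_tbl_perm p
  ON p.id = b.tbl AND p.slice = b.slice
WHERE p.name = 'a'
  AND b.col = 0
ORDER BY b.slice, b.blocknum;
```

With only a handful of blocks per slice, pruning on these ranges saves essentially nothing, which is consistent with the plan still showing a full scan.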


How to have a meaningful data set to be able to evaluate a query plan?

The short answer is that your table needs a "large number" of data blocks per slice.

How many blocks per slice is my table going to require? That depends on several factors:

  1. Number of nodes in your cluster
  2. Type of node in the cluster - Number of slices per node
  3. Data Type - How many bytes each value requires.
  4. The type of compression encoding for the column involved in the query. The optimal encoding depends on data demographics

So let's start at the top.

Redshift is an MPP database, where processing is spread across multiple nodes. See Redshift's architecture here.

Each node is further sub-divided in slices, which are dedicated data partitions and corresponding hardware resources to process queries on that partition of the data.

When a table is created in Redshift, and data is inserted, Redshift will allocate a minimum of one block per slice.


Here is a simple example:

If you created a cluster with two ds1.8xlarge nodes, you would have 16 slices per node times two nodes for a total of 32 slices.

Let's say we are running a query, and the column in the WHERE clause is something like ITEM_COUNT, an integer. An integer consumes 4 bytes.

Redshift uses a block size of 1MB.

So in this scenario, your ITEM_COUNT column would have available to it a minimum of 32 blocks times the 1MB block size, which equates to 32MB of storage.

With 32MB of storage and each entry consuming only 4 bytes, you can store more than 8 million entries, and each slice's share would still fit inside a single block.

In this example in the Amazon Redshift documentation they load close to 40 million rows to evaluate and compare different encoding techniques. Read it here.


But wait.....

There is also compression. With a 75% compression rate, even 32 million records would still fit into that single block per slice.
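Under the stated assumptions (32 slices, 1MB blocks, 4-byte integers), the arithmetic can be checked directly:

```sql
-- Minimum allocation for one column: 32 slices x 1 block x 1MB each.
-- Raw 4-byte integers:            33,554,432 bytes / 4 = 8,388,608 entries.
-- At a 75% compression rate each entry effectively takes 1 byte,
-- so 4x as many entries fit:      8,388,608 * 4 = 33,554,432 (~32 million).
SELECT 32 * 1024 * 1024 / 4     AS raw_int_entries,
       32 * 1024 * 1024 / 4 * 4 AS entries_at_75pct_compression;
```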

What is the bottom line?

In order to analyze your query plan you need tables whose columns span several blocks. In our example above, 32 million rows would still be a single block per slice.

This means that in the configuration above, with all the assumptions, a table with a single record would most likely have the same query plan as a table with 32 million records, because in both cases the database only needs to read a single block per slice.


If you want to understand how your data is distributed across slices and how many blocks are being used you can use the queries below:

How many rows per slice:

Select trim(name) as table_name, id, slice, sorted_rows, rows
from stv_tbl_perm
where name like '<<your-tablename>>'
order by slice;

How to count how many blocks:

select trim(name) as table_name, col, b.slice, b.num_values, count(b.slice)
from stv_tbl_perm a, stv_blocklist b
where a.id = b.tbl
  and a.slice = b.slice
  and name like '<<your-tablename>>'
group by 1,2,3,4
order by col, slice;
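For a quicker overall picture, the SVV_TABLE_INFO system view summarizes size per table (a sketch; its size column is reported in 1MB blocks):

```sql
-- Total 1MB blocks and row count for the table, across all slices.
SELECT "table", size AS total_blocks_1mb, tbl_rows
FROM svv_table_info
WHERE "table" = '<<your-tablename>>';
```

If total_blocks_1mb is close to the number of slices in your cluster, you are at the minimum allocation and the plans you see will not distinguish small tables from large ones.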