I'm running some simple tests on Redshift to try to speed up inserting data into a Redshift table. One thing I noticed today is that doing something like this
CREATE TABLE a (x int) DISTSTYLE key DISTKEY (x) SORTKEY (x);
INSERT INTO a (x) VALUES (1), (2), (3), (4);
VACUUM a; ANALYZE a;
EXPLAIN SELECT MAX(x) FROM a;
yields
QUERY PLAN
XN Aggregate (cost=0.05..0.05 rows=1 width=4)
-> XN Seq Scan on a (cost=0.00..0.04 rows=4 width=4)
I know it's only 4 rows, but it still shouldn't need a full table scan to find the max of a pre-sorted column. Isn't that metadata part of what ANALYZE collects?
Just as a sanity check, the EXPLAIN for SELECT x FROM a WHERE x > 3 shows only 2 rows scanned instead of the whole table.
Edit: I inserted over 1,000,000 rows into the table with random values from 1 to 10,000, then ran a vacuum and analyze. The query plan still says it has to scan all 1,000,004 rows.
Analyzing query plans on a tiny data set does not yield any practical insight into how the database would perform a query.
The optimizer has thresholds, and when the cost difference between plans is small enough it stops considering alternatives. The idea is that for simple queries, the time spent searching for the "perfect" execution plan could exceed the total execution time of a less optimal plan.
Redshift was developed from the code base of ParAccel DB. ParAccel has literally hundreds of parameters that can be adjusted to tune the database for different workloads and situations.
Since Redshift is a "managed" offering, it has these settings preset at levels deemed optimal by Amazon engineers given an "expected" workload.
In general, Redshift and ParAccel are not that great at single-slice queries. These queries tend to run on all slices anyway, even if the data they need lives on a single slice.
Once a query is executing on a slice, the minimum amount of data read is one block. Depending on the block size, that can mean hundreds of thousands of rows.
Remember, Redshift does not have indexes, so there is no such thing as a record lookup that reads a few index entries and then goes straight to a single page on disk. It will always read at least an entire block for that table, and it will do so on every slice.
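As a quick illustration using the table a from the question (the plan below is just the expected shape; exact cost figures will vary):
-- Even an equality lookup on the sort key is planned as a scan:
-- with no index to probe, every slice reads at least one full block.
EXPLAIN SELECT x FROM a WHERE x = 2;
-- Expected shape of the output:
-- XN Seq Scan on a (cost=0.00..0.05 rows=1 width=4)
--   Filter: (x = 2)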
How do you get a data set meaningful enough to evaluate a query plan?
The short answer is that your table needs a "large number" of data blocks per slice.
How many blocks per slice is my table going to require? The answer depends on several factors, which are walked through below.
So let's start at the top.
When a table is created in Redshift and data is inserted, Redshift allocates a minimum of one block per column, per slice.
Here is a simple example:
If you created a cluster with two ds1.8xlarge nodes, you would have 16 slices per node times two nodes for a total of 32 slices.
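You can verify the slice count on your own cluster with the STV_SLICES system table, which has one row per slice (a quick sketch):
-- Total slices in the cluster
select count(*) as total_slices from stv_slices;
-- Or broken down per node
select node, count(*) as slices from stv_slices group by node order by node;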
Let's say we are querying a table, and the column in the WHERE clause is something like ITEM_COUNT, an integer. An integer consumes 4 bytes.
Redshift uses a block size of 1MB.
So in this scenario, your ITEM_COUNT column would have a minimum of 32 blocks available to it (32 slices times a 1MB block size), which equates to 32MB of storage.
If you have 32MB of storage and each entry only consumes 4 bytes, you can store more than 8 million entries, and the column would still fit inside that minimum allocation of one block per slice.
But wait.....
There is also compression. If you have a 75% compression rate, each entry effectively takes 1 byte instead of 4, which means even 32 million records would still fit into that one-block-per-slice minimum.
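A quick sanity check of that arithmetic, runnable as a plain select (the 75% compression rate is the assumption from the example above):
-- 32 slices x 1MB blocks, 4-byte integers
select
  32 * 1024 * 1024 / 4     as entries_uncompressed,          -- 8,388,608
  32 * 1024 * 1024 / 4 * 4 as entries_at_75pct_compression;  -- 33,554,432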
What is the bottom line?
In order to analyze your query plan, you need tables whose columns span several blocks per slice. In our example above, 32 million rows would still occupy only a single block per slice.
This means that in the configuration above, with all the assumptions, a table with a single record would most likely have the same query plan as a table with 32 million records, because in both cases the database only needs to read a single block per slice.
If you want to understand how your data is distributed across slices and how many blocks are being used, you can use the queries below:
How many rows per slice:
select trim(name) as table_name, id, slice, sorted_rows, rows
from stv_tbl_perm
where name like '<<your-tablename>>'
order by slice;
How to count the number of blocks:
select trim(name) as table_name, col, b.slice, b.num_values, count(b.slice)
from stv_tbl_perm a
join stv_blocklist b
  on a.id = b.tbl
 and a.slice = b.slice
where name like '<<your-tablename>>'
group by 1, 2, 3, 4
order by col, slice;
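If you only need per-column block totals rather than the per-slice breakdown, a variant of the same join works (a sketch; the svv_table_info view also reports a table's overall size in 1MB blocks):
select trim(a.name) as table_name, b.col, count(*) as total_blocks
from stv_tbl_perm a
join stv_blocklist b
  on a.id = b.tbl
 and a.slice = b.slice
where a.name like '<<your-tablename>>'
group by 1, 2
order by b.col;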