val*_*tis 6 postgresql performance execution-plan query-performance
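I have imported a copy of the ip2location_db11 lite database, which contains 3,319,097 rows, and I want to optimize a numeric range query where the low and high values are in separate columns of the table (ip_from, ip_to).
Importing the database: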
CREATE TABLE ip2location_db11
(
ip_from bigint NOT NULL, -- First IP address in netblock.
ip_to bigint NOT NULL, -- Last IP address in netblock.
country_code character(2) NOT NULL, -- Two-character country code based on ISO 3166.
country_name character varying(64) NOT NULL, -- Country name based on ISO 3166.
region_name character varying(128) NOT NULL, -- Region or state name.
city_name character varying(128) NOT NULL, -- City name.
latitude real NOT NULL, -- City latitude. Default to capital city latitude if city is unknown.
longitude real NOT NULL, -- City longitude. Default to capital city longitude if city is unknown.
zip_code character varying(30) NOT NULL, -- ZIP/Postal code.
time_zone character varying(8) NOT NULL, -- UTC time zone (with DST supported).
CONSTRAINT ip2location_db11_pkey PRIMARY KEY (ip_from, ip_to)
);
\copy ip2location_db11 FROM 'IP2LOCATION-LITE-DB11.CSV' WITH CSV QUOTE AS '"';
My first naive indexing attempt was to create a separate index on each column, which resulted in a sequential scan with a query time of 400 ms:
account=> CREATE INDEX ip_from_db11_idx ON ip2location_db11 (ip_from);
account=> CREATE INDEX ip_to_db11_idx ON ip2location_db11 (ip_to);
account=> EXPLAIN ANALYZE VERBOSE SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.ip2location_db11 (cost=0.00..48930.99 rows=43111 width=842) (actual time=286.714..401.805 rows=1 loops=1)
Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
Filter: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
Rows Removed by Filter: 3319096
Planning time: 0.155 ms
Execution time: 401.834 ms
(6 rows)
account=> \d ip2location_db11
Table "public.ip2location_db11"
Column | Type | Modifiers
--------------+------------------------+-----------
ip_from | bigint | not null
ip_to | bigint | not null
country_code | character(2) | not null
country_name | character varying(64) | not null
region_name | character varying(128) | not null
city_name | character varying(128) | not null
latitude | real | not null
longitude | real | not null
zip_code | character varying(30) | not null
time_zone | character varying(8) | not null
Indexes:
"ip2location_db11_pkey" PRIMARY KEY, btree (ip_from, ip_to)
"ip_from_db11_idx" btree (ip_from)
"ip_to_db11_idx" btree (ip_to)
My second attempt was to create a multi-column btree index, which resulted in an index scan with a query time of 290 ms:
account=> CREATE INDEX ip_range_db11_idx ON ip2location_db11 (ip_from,ip_to);
account=> EXPLAIN ANALYZE VERBOSE SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using ip_to_db11_idx on public.ip2location_db11 (cost=0.43..51334.91 rows=756866 width=69) (actual time=1.109..289.143 rows=1 loops=1)
Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
Index Cond: ('2538629520'::bigint <= ip2location_db11.ip_to)
Filter: ('2538629520'::bigint >= ip2location_db11.ip_from)
Rows Removed by Filter: 1160706
Planning time: 0.324 ms
Execution time: 289.172 ms
(7 rows)
n4l_account=> \d ip2location_db11
Table "public.ip2location_db11"
Column | Type | Modifiers
--------------+------------------------+-----------
ip_from | bigint | not null
ip_to | bigint | not null
country_code | character(2) | not null
country_name | character varying(64) | not null
region_name | character varying(128) | not null
city_name | character varying(128) | not null
latitude | real | not null
longitude | real | not null
zip_code | character varying(30) | not null
time_zone | character varying(8) | not null
Indexes:
"ip2location_db11_pkey" PRIMARY KEY, btree (ip_from, ip_to)
"ip_from_db11_idx" btree (ip_from)
"ip_range_db11_idx" btree (ip_from, ip_to)
"ip_to_db11_idx" btree (ip_to)
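Update: as requested in the comments, I have redone the above query. Timings of the first 15 queries after re-creating the table (165ms, 65ms, 86ms, 83ms, 86ms, 64ms, 85ms, 811ms, 868ms, 845ms, 810ms, 781ms, 797ms, 860ms, 800ms):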
account=> EXPLAIN (ANALYZE, VERBOSE, BUFFERS, TIMING) SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.ip2location_db11 (cost=28200.29..76843.12 rows=368789 width=842) (actual time=64.866..64.866 rows=1 loops=1)
Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
Recheck Cond: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
Heap Blocks: exact=1
Buffers: shared hit=8273
-> Bitmap Index Scan on ip_range_db11_idx (cost=0.00..28108.09 rows=368789 width=0) (actual time=64.859..64.859 rows=1 loops=1)
Index Cond: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
Buffers: shared hit=8272
Planning time: 0.099 ms
Execution time: 64.907 ms
(10 rows)
account=> EXPLAIN (ANALYZE, VERBOSE, BUFFERS, TIMING) SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.ip2location_db11 (cost=0.00..92906.18 rows=754776 width=69) (actual time=577.234..811.757 rows=1 loops=1)
Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
Filter: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
Rows Removed by Filter: 3319096
Buffers: shared hit=33 read=43078
Planning time: 0.667 ms
Execution time: 811.783 ms
(7 rows)
Sample rows from the imported CSV file:
"0","16777215","-","-","-","-","0.000000","0.000000","-","-"
"16777216","16777471","AU","Australia","Queensland","Brisbane","-27.467940","153.028090","4000","+10:00"
"16777472","16778239","CN","China","Fujian","Fuzhou","26.061390","119.306110","350004","+08:00"
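Is there a better way to index this table that would improve the query, or is there a more efficient query that would give me the same result?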
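This is a slightly different solution from the ones already offered, which involved using spatial indexes to do some tricks.
Instead, it is worth remembering that with IP addresses you cannot have overlapping ranges. That is, A -> B cannot intersect X -> Y in any way. Knowing this, you can change your SELECT query slightly and take advantage of this trait. When you do, you don't need any "clever" indexing at all. In fact, you only need to index your ip_from column.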
Previously, the query being analyzed was:
SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
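Let's assume that the range 2538629520 falls into happens to be 2538629512 to 2538629537.
Note: it really doesn't matter what the range is; this is just to help demonstrate the pattern we can take advantage of.
From this we can assume that the next ip_from value is 2538629538. We don't actually need to worry about any records above this ip_from value. In fact, all we really care about is the range where ip_from equals 2538629512.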
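Knowing this, our query actually becomes (in plain English):
Find the largest ip_from value that is less than or equal to my IP address. Show me the record where you found this value.
Or in other words: find the ip_from value just before my IP address and give me that record.
Because the ranges from ip_from to ip_to never overlap, this holds true and allows us to write the query as: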
SELECT *
FROM ip2location
WHERE ip_from = (
SELECT MAX(ip_from)
FROM ip2location
WHERE ip_from <= 2538629520
)
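One caveat worth hedging: the rewrite assumes every address falls inside some range. If the table had gaps between ranges, the MAX(ip_from) lookup would return the preceding range even though the address lies past its ip_to. Below is a minimal sketch of a variant that guards against this. It assumes the same ip2location table and an index on ip_from, and it uses ORDER BY ... LIMIT 1 in place of the MAX subquery, which is a swapped-in way of finding the same candidate row:

```sql
-- Sketch only: table name ip2location is taken from the query above.
-- Step 1: seek to the single row with the largest ip_from at or below the address.
-- Step 2: keep it only if the address is also within that row's ip_to.
SELECT candidate.*
FROM (
    SELECT *
    FROM ip2location
    WHERE ip_from <= 2538629520
    ORDER BY ip_from DESC
    LIMIT 1
) AS candidate
WHERE candidate.ip_to >= 2538629520;
```

If the address falls in a gap, this returns no rows rather than the wrong range. When the data has no gaps, it should return the same row as the MAX(ip_from) form.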
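Back to the index, to make use of all this. All we are actually looking at is ip_from, and we are doing integer comparisons. MAX(ip_from) lets PostgreSQL seek straight to the last index entry at or below the address. This is good because we can seek right to it and then not worry about any other records at all.
All we really need is an index like this: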
CREATE UNIQUE INDEX CONCURRENTLY ix_ip2location_ipFrom ON public.ip2location(ip_from)
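We can make the index unique because we will not have overlapping records. I would even make this column the primary key myself.
With this index and this query, the explain plan is: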
Index Scan using ix_ip2location_ipfrom on public.ip2location (cost=0.90..8.92 rows=1 width=69) (actual time=0.530..0.533 rows=1 loops=1)
Output: ip2location.ip_from, ip2location.ip_to, ip2location.country_code, ip2location.country_name, ip2location.region_name, ip2location.city_name, ip2location.latitude, ip2location.longitude, ip2location.zip_code, ip2location.time_zone
Index Cond: (ip2location.ip_from = $1)
InitPlan 2 (returns $1)
-> Result (cost=0.46..0.47 rows=1 width=8) (actual time=0.452..0.452 rows=1 loops=1)
Output: $0
InitPlan 1 (returns $0)
-> Limit (cost=0.43..0.46 rows=1 width=8) (actual time=0.443..0.444 rows=1 loops=1)
Output: ip2location_1.ip_from
-> Index Only Scan using ix_ip2location_ipfrom on public.ip2location ip2location_1 (cost=0.43..35440.79 rows=1144218 width=8) (actual time=0.438..0.438 rows=1 loops=1)
Output: ip2location_1.ip_from
Index Cond: ((ip2location_1.ip_from IS NOT NULL) AND (ip2location_1.ip_from >= '2538629520'::bigint))
Heap Fetches: 0
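To give you an idea of the performance gain with this approach, I tested it on my Raspberry Pi. The original approach took about 4 seconds. This approach takes about 120 ms. The biggest gain comes from seeking to individual rows rather than scanning. The original query suffers on low range values, because more of the table has to be considered for the result. This query will show consistent performance across the whole range of values.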
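To try this against the table from the question, something like the following might work. It is a sketch under assumptions: it uses the question's table, constraint and index names (ip2location_db11, ip2location_db11_pkey, and the three secondary indexes), and it assumes ip_from really is unique in your copy of the data, otherwise ADD PRIMARY KEY will fail.

```sql
-- Sketch only: names are taken from the question's \d output; adjust to your schema.
BEGIN;
-- The composite primary key and the extra btree indexes are not needed for this lookup.
ALTER TABLE ip2location_db11 DROP CONSTRAINT ip2location_db11_pkey;
DROP INDEX IF EXISTS ip_from_db11_idx, ip_to_db11_idx, ip_range_db11_idx;
-- ip_from alone identifies a row when ranges do not overlap.
ALTER TABLE ip2location_db11 ADD PRIMARY KEY (ip_from);
COMMIT;

-- Refresh planner statistics, then inspect the plan and timing.
ANALYZE ip2location_db11;

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM ip2location_db11
WHERE ip_from = (
    SELECT MAX(ip_from)
    FROM ip2location_db11
    WHERE ip_from <= 2538629520
);
```

Running the EXPLAIN a few times, for both low and high addresses, is a reasonable way to check the claim of consistent timings on your own hardware.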
I hope this helps and that my explanation makes sense to you all.