Slow query in PostgreSQL selecting a single row from between a range defined in two columns

val*_*tis 6 postgresql performance execution-plan query-performance

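I imported a copy of the ip2location_db11 lite database, which contains 3,319,097 rows, and I would like to optimize a numeric range query where the low and high values are stored in separate columns of the table (ip_from, ip_to).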

Importing the database:

CREATE TABLE ip2location_db11
(
  ip_from bigint NOT NULL, -- First IP address in netblock.
  ip_to bigint NOT NULL, -- Last IP address in netblock.
  country_code character(2) NOT NULL, -- Two-character country code based on ISO 3166.
  country_name character varying(64) NOT NULL, -- Country name based on ISO 3166.
  region_name character varying(128) NOT NULL, -- Region or state name.
  city_name character varying(128) NOT NULL, -- City name.
  latitude real NOT NULL, -- City latitude. Default to capital city latitude if city is unknown.
  longitude real NOT NULL, -- City longitude. Default to capital city longitude if city is unknown.
  zip_code character varying(30) NOT NULL, -- ZIP/Postal code.
  time_zone character varying(8) NOT NULL, -- UTC time zone (with DST supported).
  CONSTRAINT ip2location_db11_pkey PRIMARY KEY (ip_from, ip_to)
);
\copy ip2location_db11 FROM 'IP2LOCATION-LITE-DB11.CSV' WITH CSV QUOTE AS '"';

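My first naive indexing attempt was to create a separate index on each column, which resulted in a sequential scan with a query time of about 400 ms: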

account=> CREATE INDEX ip_from_db11_idx ON ip2location_db11 (ip_from);
account=> CREATE INDEX ip_to_db11_idx ON ip2location_db11 (ip_to);

account=> EXPLAIN ANALYZE VERBOSE SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;

                                                          QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.ip2location_db11  (cost=0.00..48930.99 rows=43111 width=842) (actual time=286.714..401.805 rows=1 loops=1)
   Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
   Filter: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
   Rows Removed by Filter: 3319096
 Planning time: 0.155 ms
 Execution time: 401.834 ms
(6 rows)

account=> \d ip2location_db11
          Table "public.ip2location_db11"
    Column    |          Type          | Modifiers
--------------+------------------------+-----------
 ip_from      | bigint                 | not null
 ip_to        | bigint                 | not null
 country_code | character(2)           | not null
 country_name | character varying(64)  | not null
 region_name  | character varying(128) | not null
 city_name    | character varying(128) | not null
 latitude     | real                   | not null
 longitude    | real                   | not null
 zip_code     | character varying(30)  | not null
 time_zone    | character varying(8)   | not null
Indexes:
    "ip2location_db11_pkey" PRIMARY KEY, btree (ip_from, ip_to)
    "ip_from_db11_idx" btree (ip_from)
    "ip_to_db11_idx" btree (ip_to)

My second attempt was to create a multi-column btree index, which resulted in an index scan with a query time of about 290 ms:

account=> CREATE INDEX ip_range_db11_idx ON ip2location_db11 (ip_from,ip_to);

account=> EXPLAIN ANALYZE VERBOSE SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using ip_to_db11_idx on public.ip2location_db11 (cost=0.43..51334.91 rows=756866 width=69) (actual time=1.109..289.143 rows=1 loops=1)
   Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
   Index Cond: ('2538629520'::bigint <= ip2location_db11.ip_to)
   Filter: ('2538629520'::bigint >= ip2location_db11.ip_from)
   Rows Removed by Filter: 1160706
 Planning time: 0.324 ms
 Execution time: 289.172 ms
(7 rows)

account=> \d ip2location_db11
          Table "public.ip2location_db11"
    Column    |          Type          | Modifiers
--------------+------------------------+-----------
 ip_from      | bigint                 | not null
 ip_to        | bigint                 | not null
 country_code | character(2)           | not null
 country_name | character varying(64)  | not null
 region_name  | character varying(128) | not null
 city_name    | character varying(128) | not null
 latitude     | real                   | not null
 longitude    | real                   | not null
 zip_code     | character varying(30)  | not null
 time_zone    | character varying(8)   | not null
Indexes:
    "ip2location_db11_pkey" PRIMARY KEY, btree (ip_from, ip_to)
    "ip_from_db11_idx" btree (ip_from)
    "ip_range_db11_idx" btree (ip_from, ip_to)
    "ip_to_db11_idx" btree (ip_to)

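Update: As requested in the comments, I re-ran the query above. The timings of the first 15 queries after re-creating the table were 165ms, 65ms, 86ms, 83ms, 86ms, 64ms, 85ms, 811ms, 868ms, 845ms, 810ms, 781ms, 797ms, 860ms, 800ms. The two plans below show a fast run and a slow run: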

account=> EXPLAIN (ANALYZE, VERBOSE, BUFFERS, TIMING) SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.ip2location_db11  (cost=28200.29..76843.12 rows=368789 width=842) (actual time=64.866..64.866 rows=1 loops=1)
   Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
   Recheck Cond: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
   Heap Blocks: exact=1
   Buffers: shared hit=8273
   ->  Bitmap Index Scan on ip_range_db11_idx  (cost=0.00..28108.09 rows=368789 width=0) (actual time=64.859..64.859 rows=1 loops=1)
         Index Cond: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
         Buffers: shared hit=8272
 Planning time: 0.099 ms
 Execution time: 64.907 ms
(10 rows)

account=> EXPLAIN (ANALYZE, VERBOSE, BUFFERS, TIMING) SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.ip2location_db11  (cost=0.00..92906.18 rows=754776 width=69) (actual time=577.234..811.757 rows=1 loops=1)
   Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
   Filter: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
   Rows Removed by Filter: 3319096
   Buffers: shared hit=33 read=43078
 Planning time: 0.667 ms
 Execution time: 811.783 ms
(7 rows)

Sample rows from the imported CSV file:

"0","16777215","-","-","-","-","0.000000","0.000000","-","-"
"16777216","16777471","AU","Australia","Queensland","Brisbane","-27.467940","153.028090","4000","+10:00"
"16777472","16778239","CN","China","Fujian","Fuzhou","26.061390","119.306110","350004","+08:00"

Is there a better way to index this table to improve the query, or is there a more efficient query that returns the same result?

Ken*_*ery 5

This is a slightly different solution from the ones already offered, which involved using spatial indexes to do some tricks.

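Instead, it is worth remembering that with IP addresses you cannot have overlapping ranges. That is, A -> B cannot intersect X -> Y in any way. Knowing this, you can change your SELECT query slightly and take advantage of this trait. Once you do, you don't need any "clever" indexing at all. In fact, you only need an index on your ip_from column.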
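
Since everything that follows relies on the ranges not overlapping, it may be worth checking that assumption against your own import first. The following is a minimal sketch, not something I ran for this answer; it assumes the table is called ip2location (as in my examples below) and that ip_from <= ip_to holds on every row. The column alias max_prev_ip_to is just an illustrative name.

```sql
-- Lists every range that starts at or before the end of some earlier range
-- (earlier meaning a smaller ip_from). An empty result means no overlaps.
SELECT ip_from, ip_to, max_prev_ip_to
FROM (
    SELECT ip_from,
           ip_to,
           -- Highest ip_to among all rows that sort before the current one.
           -- NULL for the very first row, which the outer filter then skips.
           max(ip_to) OVER (
               ORDER BY ip_from, ip_to
               ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
           ) AS max_prev_ip_to
    FROM ip2location
) AS ordered
WHERE max_prev_ip_to >= ip_from;
```

This is a one-off full scan with a sort, so it is only meant as a sanity check after loading the data, not as part of the lookup path.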

Previously, the query being analyzed was:

SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;

Let's assume the range that 2538629520 falls into happens to be 2538629512 to 2538629537.

Note: it doesn't matter what the range actually is; this is just to help demonstrate the pattern we can take advantage of.

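From this we can assume that the next ip_from value is 2538629538. We don't actually need to worry about any record at or above that ip_from value. In fact, all we really care about is the range whose ip_from equals 2538629512.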

Knowing this, our query actually becomes (in plain English):

Find the largest ip_from value that is less than or equal to my IP address. Then show me the record that has that ip_from value.

Or in other words: find the ip_from value at or just before my IP address and give me that record.

Because the ip_from to ip_to ranges never overlap, this holds true and allows us to write the query as:

SELECT * 
FROM ip2location
WHERE ip_from = (
    SELECT MAX(ip_from)
    FROM ip2location
    WHERE ip_from <= 2538629520
    )
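
One caveat with the MAX(ip_from) form: it never looks at ip_to, so if the data had a gap between two ranges, an address inside the gap would still return the preceding range. I don't know whether this dataset has gaps (the sample rows in the question are contiguous), but a variant that also guards against that case could look like the sketch below. It assumes the same ip2location table and the index on ip_from; the subquery alias candidate is just an illustrative name.

```sql
-- Pick the single range with the largest ip_from at or below the address,
-- then keep it only if the address is also within its upper bound.
SELECT *
FROM (
    SELECT *
    FROM ip2location
    WHERE ip_from <= 2538629520
    ORDER BY ip_from DESC
    LIMIT 1
) AS candidate
WHERE ip_to >= 2538629520;
```

With an index on ip_from, I would expect the planner to satisfy the inner ORDER BY ... LIMIT 1 with a backward index scan that stops after one row, but check the plan with EXPLAIN on your own system rather than taking that for granted.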

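Back to the index, to take advantage of all this. All we are really looking at is ip_from, and we are doing integer comparisons. MAX(ip_from) gets PostgreSQL to find the first qualifying record in the index. This is good because we can seek right to it and then not worry about any other records at all.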

All we really need is an index like this:

CREATE UNIQUE INDEX CONCURRENTLY ix_ip2location_ipFrom ON public.ip2location(ip_from)

We can make the index unique because we won't have overlapping records. I would even make this column the primary key myself.
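
If you do want ip_from as the primary key, a rough sketch of the change is below. It uses the table, constraint, and index names from the question (ip2location_db11, ip2location_db11_pkey, and the three experimental indexes), so adjust it to your own schema. I have not run this against the asker's database, and it assumes no two rows share the same ip_from.

```sql
BEGIN;

-- Replace the composite (ip_from, ip_to) key with a key on ip_from alone.
-- ADD PRIMARY KEY fails if any ip_from value is duplicated.
ALTER TABLE ip2location_db11 DROP CONSTRAINT ip2location_db11_pkey;
ALTER TABLE ip2location_db11 ADD PRIMARY KEY (ip_from);

-- The indexes created during the earlier experiments are not needed
-- by the MAX(ip_from) lookup once the primary key exists.
DROP INDEX IF EXISTS ip_from_db11_idx;
DROP INDEX IF EXISTS ip_range_db11_idx;
DROP INDEX IF EXISTS ip_to_db11_idx;

COMMIT;
```

The primary key already creates a unique btree index on ip_from, so with this change the separate CREATE UNIQUE INDEX statement above is not required as well.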

With this index and this query, the explain plan is:

Index Scan using ix_ip2location_ipfrom on public.ip2location  (cost=0.90..8.92 rows=1 width=69) (actual time=0.530..0.533 rows=1 loops=1)
Output: ip2location.ip_from, ip2location.ip_to, ip2location.country_code, ip2location.country_name, ip2location.region_name, ip2location.city_name, ip2location.latitude, ip2location.longitude, ip2location.zip_code, ip2location.time_zone
Index Cond: (ip2location.ip_from = $1)
InitPlan 2 (returns $1)
    ->  Result  (cost=0.46..0.47 rows=1 width=8) (actual time=0.452..0.452 rows=1 loops=1)
        Output: $0
        InitPlan 1 (returns $0)
            ->  Limit  (cost=0.43..0.46 rows=1 width=8) (actual time=0.443..0.444 rows=1 loops=1)
                Output: ip2location_1.ip_from
                ->  Index Only Scan using ix_ip2location_ipfrom on public.ip2location ip2location_1  (cost=0.43..35440.79 rows=1144218 width=8) (actual time=0.438..0.438 rows=1 loops=1)
                        Output: ip2location_1.ip_from
                        Index Cond: ((ip2location_1.ip_from IS NOT NULL) AND (ip2location_1.ip_from >= '2538629520'::bigint))
                        Heap Fetches: 0

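To give you an idea of the performance improvement with this approach, I tested it on my Raspberry Pi. The original approach took about 4 seconds. This approach takes about 120 ms. The big win comes from individual row seeks versus scans. The original query suffers badly for low range values, because more of the table has to be considered for the result. This query will exhibit consistent performance across the whole range of values.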
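
If you want to check the "consistent across the range" claim on your own hardware, one way is to compare the plans for an address near the start of the table and one further in. This is only a sketch: 16777300 is an address I picked because it falls inside the second sample row from the question (16777216 to 16777471), and 2538629520 is the address used throughout. The table name ip2location is the one from my examples.

```sql
-- Address near the low end of the table.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM ip2location
WHERE ip_from = (
    SELECT MAX(ip_from)
    FROM ip2location
    WHERE ip_from <= 16777300
    );

-- Address further into the table.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM ip2location
WHERE ip_from = (
    SELECT MAX(ip_from)
    FROM ip2location
    WHERE ip_from <= 2538629520
    );
```

If the approach behaves as described, both plans should touch only a handful of buffers. Run each statement a few times so that caching effects do not dominate the comparison.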

Hope this helps and that my explanation makes sense to you all.