Why is Postgres so slow at lookups on a jsonb column?

Question


sat*_*shi 5 postgresql indexing indices jsonb postgresql-9.4

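I have a table targeting with a column marital_status of type text[] and another column data of type jsonb. The two columns hold the same content, just in different formats (for demonstration purposes only). Sample data: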

 id |      marital_status      |                        data                       
----+--------------------------+---------------------------------------------------
  1 | null                     | {}
  2 | {widowed}                | {"marital_status": ["widowed"]}
  3 | {never_married,divorced} | {"marital_status": ["never_married", "divorced"]}
...
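
For anyone who wants to reproduce the setup, here is a minimal sketch of a matching table. The question does not show the original DDL, so the column constraints and the serial primary key are assumptions; only the column names, types, and three sample rows come from the data above.

```sql
-- Hypothetical reconstruction: the real DDL is not given in the question.
CREATE TABLE targeting (
    id             serial PRIMARY KEY,  -- assumed key type
    marital_status text[],
    data           jsonb
);

-- The three sample rows shown above, in both representations.
INSERT INTO targeting (marital_status, data) VALUES
    (NULL,
     '{}'::jsonb),
    ('{widowed}'::text[],
     '{"marital_status": ["widowed"]}'::jsonb),
    ('{never_married,divorced}'::text[],
     '{"marital_status": ["never_married", "divorced"]}'::jsonb);
```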


The table holds more than 690K records with random combinations of values.

Lookup on the text[] column

EXPLAIN ANALYZE SELECT marital_status
FROM targeting
WHERE marital_status @> '{widowed}'::text[]


Without an index

Without any index, this typically takes < 900 ms:

Seq Scan on targeting  (cost=0.00..172981.38 rows=159061 width=28) (actual time=0.017..840.084 rows=158877 loops=1)
  Filter: (marital_status @> '{widowed}'::text[])
  Rows Removed by Filter: 452033
Planning time: 0.150 ms
Execution time: 845.731 ms


With an index

With an index, it typically takes < 200 ms (a 75% improvement):

CREATE INDEX targeting_marital_status_idx ON targeting ("marital_status");


Result:

Index Only Scan using targeting_marital_status_idx on targeting  (cost=0.42..23931.35 rows=159061 width=28) (actual time=3.528..143.848 rows=158877 loops=1)
  Filter: (marital_status @> '{widowed}'::text[])
  Rows Removed by Filter: 452033
  Heap Fetches: 0
Planning time: 0.217 ms
Execution time: 148.506 ms


Lookup on the jsonb column

EXPLAIN ANALYZE SELECT data
FROM targeting
WHERE (data -> 'marital_status') @> '["widowed"]'::jsonb


Without an index

Without any index, this typically takes < 5,700 ms (more than 6 times slower!):

Seq Scan on targeting  (cost=0.00..174508.65 rows=611 width=403) (actual time=0.095..5399.112 rows=158877 loops=1)
  Filter: ((data -> 'marital_status'::text) @> '["widowed"]'::jsonb)
  Rows Removed by Filter: 452033
Planning time: 0.172 ms
Execution time: 5408.326 ms


With an index

With an index, it typically takes < 3,700 ms (a 35% improvement):

CREATE INDEX targeting_data_marital_status_idx ON targeting USING GIN ((data->'marital_status'));


Result:

Bitmap Heap Scan on targeting  (cost=144.73..2482.75 rows=611 width=403) (actual time=85.966..3694.834 rows=158877 loops=1)
  Recheck Cond: ((data -> 'marital_status'::text) @> '["widowed"]'::jsonb)
  Rows Removed by Index Recheck: 201080
  Heap Blocks: exact=33723 lossy=53028
  ->  Bitmap Index Scan on targeting_data_marital_status_idx  (cost=0.00..144.58 rows=611 width=0) (actual time=78.851..78.851 rows=158877 loops=1)
        Index Cond: ((data -> 'marital_status'::text) @> '["widowed"]'::jsonb)
Planning time: 0.257 ms
Execution time: 3703.492 ms
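
One thing worth checking in the plan above: "Heap Blocks: ... lossy=53028" means the bitmap did not fit in work_mem and fell back to tracking whole pages, so every row on those pages had to be rechecked (hence "Rows Removed by Index Recheck: 201080"). A minimal sketch for testing whether that is a factor, assuming the session is allowed to change work_mem; the 64MB figure is an arbitrary illustration, not a value taken from the question:

```sql
-- Assumption: 64MB is only an example value; pick one that fits the server.
SET work_mem = '64MB';

-- Re-run the same query; BUFFERS adds page-access counts to the plan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT data
FROM targeting
WHERE (data -> 'marital_status') @> '["widowed"]'::jsonb;

-- Return to the configured default for the rest of the session.
RESET work_mem;
```

If the lossy count drops to zero and the recheck line disappears, the bitmap was previously being truncated; whether that closes the gap to the text[] query is not something the figures in the question can confirm.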


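Questions

Why does the text[] column perform better, even without an index?
Why does adding an index on the jsonb column improve performance by only 35%?
Is there a more performant way to do lookups on the jsonb column?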

Answer 1

drs*_*drs 0

This may be a matter of using jsonb_ops (the default GIN operator class) rather than jsonb_path_ops.
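
As a rough sketch of what that change could look like for the index in the question: the operator class goes after the indexed expression. The index name below is made up for illustration, and the effect on this particular data set has not been measured here.

```sql
-- Optional: drop the default-opclass index so the planner cannot pick it
-- and the comparison is unambiguous.
DROP INDEX IF EXISTS targeting_data_marital_status_idx;

-- Hypothetical index name; same expression as in the question,
-- but with the jsonb_path_ops operator class.
CREATE INDEX targeting_data_marital_status_path_idx
    ON targeting USING GIN ((data -> 'marital_status') jsonb_path_ops);

-- Same query as in the question, to compare plans and timings.
EXPLAIN ANALYZE
SELECT data
FROM targeting
WHERE (data -> 'marital_status') @> '["widowed"]'::jsonb;
```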

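According to the documentation: https://www.postgresql.org/docs/9.6/static/datatype-json.html

Although the jsonb_path_ops operator class supports only queries with the @> operator, it has notable performance advantages over the default operator class jsonb_ops. A jsonb_path_ops index is usually much smaller than a jsonb_ops index over the same data, and the specificity of searches is better, particularly when queries contain keys that appear frequently in the data. Therefore search operations typically perform better than with the default operator class.

The technical difference between a jsonb_ops and a jsonb_path_ops GIN index is that the former creates independent index items for each key and value in the data, while the latter creates index items only for each value in the data. [1] Basically, each jsonb_path_ops index item is a hash of the value and the key(s) leading to it; for example, to index {"foo": {"bar": "baz"}}, a single index item would be created incorporating all three of foo, bar, and baz into the hash value. Thus a containment query looking for this structure would result in an extremely specific index search; but there is no way at all to find out whether foo appears as a key. On the other hand, a jsonb_ops index would create three index items representing foo, bar, and baz separately; then to do the containment query, it would look for rows containing all three of these items. While GIN indexes can perform such an AND search fairly efficiently, it will still be less specific and slower than the equivalent jsonb_path_ops search, especially if there are a very large number of rows containing any single one of the three index items.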
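
To make the trade-off concrete, here is a small sketch against the targeting table from the question. It indexes the whole data column with jsonb_path_ops, so the containment query has to start from the top-level key. The index name is hypothetical, and which plan the planner actually chooses depends on the data and statistics.

```sql
-- Hypothetical index name; indexes the whole document with jsonb_path_ops.
CREATE INDEX targeting_data_path_idx
    ON targeting USING GIN (data jsonb_path_ops);

-- Containment from the document root: eligible for the index above,
-- because the key path and the value are hashed into a single index item.
EXPLAIN ANALYZE
SELECT data
FROM targeting
WHERE data @> '{"marital_status": ["widowed"]}'::jsonb;

-- Key existence: jsonb_path_ops does not support the ? operator,
-- so this query cannot use that index.
EXPLAIN ANALYZE
SELECT data
FROM targeting
WHERE data ? 'marital_status';
```

If key-existence queries matter as well, a default jsonb_ops index would be needed for them, at the cost of the larger index described above.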

Asked:	9 years, 1 month ago
Viewed:	1122 times
Last active:	8 years, 9 months ago