Why is full-text search on a table with a GIN index still very slow?

Adj*_*con 5 postgresql index full-text-search

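From what I have gathered so far, if you need to run full-text search on a table in a PostgreSQL database (psql 9.6.2, server 9.6.5) that contains a lot of entries (say, on the order of 1.2M+ rows), the recommended approach is to create an index on that table (in this case we created a GIN index), which should let you run queries like the following: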

SELECT * FROM speech WHERE speech_tsv @@ plainto_tsquery('a text string')

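Aside from the fact that the results of this query sometimes contain nothing relevant to the search string, it typically takes 8 to 10 seconds.

The database is deployed on a fairly large multi-core EC2 instance, so I am wondering whether there is anything else we can do to the database to make these queries run faster.

Or, given the large number of documents and the amount of text we are asking it to search (even with an index), is this execution time roughly what should be expected?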

The table looks like this:

                                         Table "public.speech"
        Column     |            Type             |                      Modifiers                      
    ---------------+-----------------------------+-----------------------------------------------------
     speech_id     | integer                     | not null default nextval('speech_id_seq'::regclass)
     speechtype_id | smallint                    | not null
     title         | character varying           | not null default ''::character varying
     speechdate    | date                        | default now()
     location      | character varying           | not null default ''::character varying
     source        | character varying           | not null default ''::character varying
     speechtext    | text                        | not null
     url           | character varying           | not null default ''::character varying
     release_id    | smallint                    | 
     created       | timestamp without time zone | 
     modified      | timestamp without time zone | 
     speech_tsv    | tsvector                    | 
     key           | boolean                     | 
     summary       | text                        | 
     quote         | text                        | 
    Indexes:
        "speech_pk" PRIMARY KEY, btree (speech_id)
        "speech__release_id" btree (release_id)
        "speech__speech_tsv" gin (speech_tsv)
        "speech__speechdate" btree (speechdate)
        "speech__speechtype_id" btree (speechtype_id)

Foreign-key constraints:
    "speech__release_id_fk" FOREIGN KEY (release_id) REFERENCES release(release_id) MATCH FULL ON DELETE RESTRICT DEFERRABLE INITIALLY DEFERRED
    "speech__speechtype_id_fk" FOREIGN KEY (speechtype_id) REFERENCES speechtype(speechtype_id) MATCH FULL DEFERRABLE INITIALLY DEFERRED
Referenced by:
    TABLE "factcheck_speech" CONSTRAINT "factcheck_speech_speech_id_fkey" FOREIGN KEY (speech_id) REFERENCES speech(speech_id) MATCH FULL ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    TABLE "speech_candidate" CONSTRAINT "speech_candidate__speech_id_fk" FOREIGN KEY (speech_id) REFERENCES speech(speech_id) MATCH FULL ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    TABLE "speech_category" CONSTRAINT "speech_category__speech_id_fk" FOREIGN KEY (speech_id) REFERENCES speech(speech_id) MATCH FULL ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    TABLE "speech_tag" CONSTRAINT "speech_tag__speech_fk" FOREIGN KEY (speech_id) REFERENCES speech(speech_id) MATCH FULL ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
    TABLE "speechlocking" CONSTRAINT "speechlocking__fkey" FOREIGN KEY (speech_id) REFERENCES speech(speech_id) MATCH FULL ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
Triggers:
    speech_updated BEFORE INSERT OR UPDATE ON speech FOR EACH ROW EXECUTE PROCEDURE pvs_speech_updated()
    update_speech_created BEFORE INSERT ON speech FOR EACH ROW EXECUTE PROCEDURE update_created_column()
    update_speech_modified BEFORE UPDATE ON speech FOR EACH ROW EXECUTE PROCEDURE update_modified_column()

(Obviously, the speechtext column contains all of the text to be searched.)

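Below is an example query run with EXPLAIN (ANALYZE, BUFFERS) directly on the server (the queries are actually executed from a Python application, so this one runs a bit faster here, without network latency and the like):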

                          QUERY PLAN                                                               
-------------------------------------------------------------------------
 Bitmap Heap Scan on speech  (cost=294.85..7931.12 rows=6142 width=1058) (actual time=400.623..67768.222 rows=27267 loops=1)
   Recheck Cond: (speech_tsv @@ plainto_tsquery('gun'::text))
   Heap Blocks: exact=23582
   Buffers: shared hit=2413 read=21424
   ->  Bitmap Index Scan on speech__speech_tsv  (cost=0.00..293.31 rows=6142 width=0) (actual time=279.709..279.709 rows=30535 loops=1)
         Index Cond: (speech_tsv @@ plainto_tsquery('gun'::text))
         Buffers: shared hit=241 read=14
 Planning time: 0.187 ms
 Execution time: 67778.684 ms
(9 rows)

Mad*_*ist 1

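If you look at the EXPLAIN output, the actual index scan is not slow at all, at roughly 280 ms. The slow part is the second step, the Bitmap Heap Scan, which fetches all of the data you asked for.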
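To make the split between the two steps concrete, here is a minimal sketch that pulls the "actual time" figures out of the two plan node lines quoted in the question. The helper name `node_end_times` and the regular expression are illustrative assumptions, not part of any PostgreSQL tooling; the numbers are copied from the plan above.

```python
import re

# The two plan node lines, copied from the EXPLAIN (ANALYZE, BUFFERS) output above.
PLAN = """\
Bitmap Heap Scan on speech  (cost=294.85..7931.12 rows=6142 width=1058) (actual time=400.623..67768.222 rows=27267 loops=1)
  ->  Bitmap Index Scan on speech__speech_tsv  (cost=0.00..293.31 rows=6142 width=0) (actual time=279.709..279.709 rows=30535 loops=1)
"""

# Hypothetical helper: matches "<node> on <relation>  (cost=...) (actual time=start..end".
NODE_RE = re.compile(
    r"(\w[\w ]*?) on \S+\s+\(cost=[^)]*\) \(actual time=(\d+\.\d+)\.\.(\d+\.\d+)"
)

def node_end_times(plan_text):
    """Return {node name: time in ms at which the node returned its last row}."""
    times = {}
    for line in plan_text.splitlines():
        match = NODE_RE.search(line)
        if match:
            times[match.group(1)] = float(match.group(3))
    return times

times = node_end_times(PLAN)
index_ms = times["Bitmap Index Scan"]
total_ms = times["Bitmap Heap Scan"]   # a parent node's time includes its child
heap_only_ms = total_ms - index_ms     # time spent fetching rows from the table
index_share = index_ms / total_ms      # fraction of the run spent in the index

print(f"index scan: {index_ms:.0f} ms, heap fetch: {heap_only_ms:.0f} ms")
print(f"index share of total: {index_share:.2%}")
```

On these figures the index lookup accounts for well under one percent of the run; nearly all of the time is spent visiting the table itself.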

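You are doing SELECT * here, which asks for every column in the table. Judging by the EXPLAIN output, this is a fairly wide table with many columns or large ones. Your query fetches about 27,000 large rows.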

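The "hit" and "read" parts of the Buffers line tell you that 21424 blocks were not cached in PostgreSQL's shared buffers and had to be read in, most likely from the hard drive or SSD. Reading that much data from disk takes time.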
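As a rough illustration of how much data that is, the sketch below converts the block counts from the top-level Buffers line into a size and a cache hit ratio. It assumes the default PostgreSQL block size of 8 kB (the question does not say whether the server was built with a different one), and `parse_buffers` is a hypothetical helper written for this example.

```python
import re

# Assumption: the server uses the default 8 kB block size.
BLOCK_SIZE = 8192

def parse_buffers(line):
    """Extract (hit, read) block counts from an EXPLAIN 'Buffers:' line."""
    match = re.search(r"shared hit=(\d+) read=(\d+)", line)
    if match is None:
        raise ValueError(f"not a shared-buffers line: {line!r}")
    return int(match.group(1)), int(match.group(2))

# Top-level Buffers line from the plan above; it includes the index scan's blocks.
hit, read = parse_buffers("Buffers: shared hit=2413 read=21424")

read_mib = read * BLOCK_SIZE / 2**20   # data that was not in shared buffers
hit_ratio = hit / (hit + read)         # share of blocks served from shared buffers

print(f"read from outside shared buffers: {read_mib:.1f} MiB")
print(f"shared-buffer hit ratio: {hit_ratio:.1%}")
```

Under that assumption the query pulled in roughly 167 MiB, and only about one block in ten was already in shared buffers.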

Another factor is that all of the requested data has to be transferred to the client, which also takes time.

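You are asking the database for a lot of data, but I suspect you do not need all of it. So be more specific in your query: select only the columns you actually need, and add a LIMIT clause unless you really do want all 27267 rows.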
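Since the question mentions a Python application, here is a minimal sketch of what the narrower query could look like from that side. The choice of columns (`speech_id`, `title`, `speechdate`), the default limit of 50, and the `%s` placeholder style (as used by drivers such as psycopg2) are assumptions for illustration; the question does not say which columns or driver the application uses. `build_search_query` is a hypothetical helper that only builds the statement and does not connect to a database.

```python
# Column names taken from the table definition in the question.
SPEECH_COLUMNS = {
    "speech_id", "speechtype_id", "title", "speechdate", "location", "source",
    "speechtext", "url", "release_id", "created", "modified", "speech_tsv",
    "key", "summary", "quote",
}

def build_search_query(search_text, columns=("speech_id", "title", "speechdate"),
                       limit=50):
    """Return (sql, params) for a full-text search that fetches only the
    listed columns and at most `limit` rows."""
    unknown = [c for c in columns if c not in SPEECH_COLUMNS]
    if unknown:
        # Column names are interpolated into the SQL, so only known names pass.
        raise ValueError(f"unknown column(s): {unknown}")
    sql = (
        f"SELECT {', '.join(columns)} FROM speech "
        "WHERE speech_tsv @@ plainto_tsquery(%s) "
        "LIMIT %s"
    )
    # The search text and the limit are passed as parameters, never interpolated.
    return sql, (search_text, limit)

sql, params = build_search_query("gun")
print(sql)
print(params)
# With a DB-API cursor this would be run as: cursor.execute(sql, params)
```

Without an ORDER BY, LIMIT returns an arbitrary subset of the matching rows, so add an ordering if the application needs a particular one.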