Best index for similarity function

bl0*_*l0b 11 postgresql index full-text-search pattern-matching postgresql-9.3

I have a table with 6.2 million records, and I have to run search queries with similarity on a column. A query can look like this:

 SELECT  "lca_test".* FROM "lca_test"
 WHERE (similarity(job_title, 'sales executive') > 0.6)
 AND worksite_city = 'los angeles' 
 ORDER BY salary ASC LIMIT 50 OFFSET 0

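More conditions can be added to the WHERE clause (year = X, worksite_state = N, status = 'certified', visa_class = Z).

Some of these queries take a very long time to run: more than 30 seconds, sometimes more than a minute.

EXPLAIN ANALYZE for the query above gives me this: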

Limit  (cost=0.43..42523.04 rows=50 width=254) (actual time=9070.268..33487.734 rows=2 loops=1)
  ->  Index Scan using index_lca_test_on_salary on lca_test  (cost=0.43..23922368.16 rows=28129 width=254) (actual time=9070.265..33487.727 rows=2 loops=1)
        Filter: (((worksite_city)::text = 'los angeles'::text) AND (similarity((job_title)::text, 'sales executive'::text) > 0.6::double precision))
        Rows Removed by Filter: 6330130
Total runtime: 33487.802 ms

I don't know how I should index my columns to make this run fast.

EDIT: Here is the Postgres version:

PostgreSQL 9.3.5 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.7.2-5) 4.7.2, 64-bit

Here is the table definition:

                                                         Table "public.lca_test"
         Column         |       Type        |                       Modifiers                       | Storage  | Stats target | Description
------------------------+-------------------+-------------------------------------------------------+----------+--------------+-------------
 id                     | integer           | not null default nextval('lca_test_id_seq'::regclass) | plain    |              |
 raw_id                 | integer           |                                                       | plain    |              |
 year                   | integer           |                                                       | plain    |              |
 company_id             | integer           |                                                       | plain    |              |
 visa_class             | character varying |                                                       | extended |              |
 employement_start_date | character varying |                                                       | extended |              |
 employement_end_date   | character varying |                                                       | extended |              |
 employer_name          | character varying |                                                       | extended |              |
 employer_address1      | character varying |                                                       | extended |              |
 employer_address2      | character varying |                                                       | extended |              |
 employer_city          | character varying |                                                       | extended |              |
 employer_state         | character varying |                                                       | extended |              |
 employer_postal_code   | character varying |                                                       | extended |              |
 employer_phone         | character varying |                                                       | extended |              |
 employer_phone_ext     | character varying |                                                       | extended |              |
 job_title              | character varying |                                                       | extended |              |
 soc_code               | character varying |                                                       | extended |              |
 naic_code              | character varying |                                                       | extended |              |
 prevailing_wage        | character varying |                                                       | extended |              |
 pw_unit_of_pay         | character varying |                                                       | extended |              |
 wage_unit_of_pay       | character varying |                                                       | extended |              |
 worksite_city          | character varying |                                                       | extended |              |
 worksite_state         | character varying |                                                       | extended |              |
 worksite_postal_code   | character varying |                                                       | extended |              |
 total_workers          | integer           |                                                       | plain    |              |
 case_status            | character varying |                                                       | extended |              |
 case_no                | character varying |                                                       | extended |              |
 salary                 | real              |                                                       | plain    |              |
 salary_max             | real              |                                                       | plain    |              |
 prevailing_wage_second | real              |                                                       | plain    |              |
 lawyer_id              | integer           |                                                       | plain    |              |
 citizenship            | character varying |                                                       | extended |              |
 class_of_admission     | character varying |                                                       | extended |              |
Indexes:
    "lca_test_pkey" PRIMARY KEY, btree (id)
    "index_lca_test_on_id_and_salary" btree (id, salary)
    "index_lca_test_on_id_and_salary_and_year" btree (id, salary, year)
    "index_lca_test_on_id_and_salary_and_year_and_wage_unit_of_pay" btree (id, salary, year, wage_unit_of_pay)
    "index_lca_test_on_id_and_visa_class" btree (id, visa_class)
    "index_lca_test_on_id_and_worksite_state" btree (id, worksite_state)
    "index_lca_test_on_lawyer_id" btree (lawyer_id)
    "index_lca_test_on_lawyer_id_and_company_id" btree (lawyer_id, company_id)
    "index_lca_test_on_raw_id_and_visa_and_pw_second" btree (raw_id, visa_class, prevailing_wage_second)
    "index_lca_test_on_raw_id_and_visa_class" btree (raw_id, visa_class)
    "index_lca_test_on_salary" btree (salary)
    "index_lca_test_on_visa_class" btree (visa_class)
    "index_lca_test_on_wage_unit_of_pay" btree (wage_unit_of_pay)
    "index_lca_test_on_worksite_state" btree (worksite_state)
    "index_lca_test_on_year_and_company_id" btree (year, company_id)
    "index_lca_test_on_year_and_company_id_and_case_status" btree (year, company_id, case_status)
    "index_lcas_job_title_trigram" gin (job_title gin_trgm_ops)
    "lca_test_company_id" btree (company_id)
    "lca_test_employer_name" btree (employer_name)
    "lca_test_id" btree (id)
    "lca_test_on_year_and_companyid_and_wage_unit_and_salary" btree (year, company_id, wage_unit_of_pay, salary)
Foreign-key constraints:
    "fk_rails_8a90090fe0" FOREIGN KEY (lawyer_id) REFERENCES lawyers(id)
Has OIDs: no

Erw*_*ter 20

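It would have been worth mentioning that you have the additional module pg_trgm installed, which provides the similarity() function.

Similarity operator %

Whatever you do, use the similarity operator % instead of the expression (similarity(job_title, 'sales executive') > 0.6). Index support in Postgres is bound to operators, not to functions.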
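To see how the two forms relate, here is a minimal sketch that evaluates both for a single pair of strings. It assumes pg_trgm is installed; the second string is an arbitrary example, not taken from your data.

```sql
-- similarity() returns a number; % returns a boolean that compares
-- that number against the current similarity threshold.
SELECT similarity('sales executive', 'sales exec') AS sim,
       'sales executive' % 'sales exec'           AS passes_threshold;
```

Only the operator form in a WHERE clause gives the planner a chance to use a trigram index.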

To get the desired minimum similarity of 0.6, set the GUC parameter:

SET pg_trgm.similarity_threshold = 0.6;  -- once per session

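(In Postgres 9.5 or older, including your 9.3, use the deprecated SELECT set_limit(0.6); instead.)
The setting stays in effect for the rest of your session unless it is reset. Check with: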

SHOW pg_trgm.similarity_threshold;

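(Formerly SELECT show_limit();)

Simple case

If you only wanted the best matches in column job_title for a given string, that would be a simple case of "nearest neighbor" search. It can be solved with a GiST index using the trigram operator class gist_trgm_ops (but not with a GIN index):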

CREATE INDEX trgm_idx ON lca_test USING gist (job_title gist_trgm_ops);

To also cover the equality condition on worksite_city, you need the additional module btree_gist. Run (once per database):

CREATE EXTENSION btree_gist;

Then:

CREATE INDEX lcas_trgm_gist_idx ON lca_test USING gist (worksite_city, job_title gist_trgm_ops);

Query:

SET pg_trgm.similarity_threshold = 0.6;  -- once per session

SELECT *
FROM   lca_test
WHERE  job_title % 'sales executive'
AND    worksite_city = 'los angeles' 
ORDER  BY (job_title <-> 'sales executive')
LIMIT  50;

<-> is the "distance" operator: it returns one minus the similarity() value.
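A quick way to convince yourself of that relationship is the sketch below. It assumes pg_trgm is installed, and the second string is an arbitrary example.

```sql
-- dist and one_minus_sim should show the same value for any pair of strings.
SELECT similarity('sales executive', 'sales exec')     AS sim,
       'sales executive' <-> 'sales exec'              AS dist,
       1 - similarity('sales executive', 'sales exec') AS one_minus_sim;
```

Sorting by the distance ascending is therefore the same as sorting by similarity descending, but only the <-> form can be served by the GiST index.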

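Postgres can also combine two separate indexes, a plain btree index on worksite_city and a separate GiST index on job_title, but the multicolumn index should be fastest when you combine the two columns in queries like you do.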
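For comparison, the two-index alternative might look like the following sketch. Both index names are hypothetical, and whether the planner actually combines them (typically through bitmap index scans) depends on your data and statistics.

```sql
-- Plain btree index for the equality condition (hypothetical name).
CREATE INDEX lca_test_worksite_city_idx ON lca_test (worksite_city);

-- Separate trigram GiST index for the similarity condition (hypothetical name).
CREATE INDEX lca_test_job_title_gist_idx ON lca_test USING gist (job_title gist_trgm_ops);
```

This variant does not need btree_gist, which can be a reason to try it first and compare plans with EXPLAIN ANALYZE.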

Your case

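However, your query sorts by salary, not by distance or similarity, which changes the nature of the problem completely. Now we can use either a GIN or a GiST index, and GIN will be faster. (Even more so in later versions, which brought major improvements to GIN indexes - hint to upgrade!)

Similar story for the additional equality check on worksite_city: install the additional module btree_gin. Run (once per database):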

CREATE EXTENSION btree_gin;

Then:

CREATE INDEX lcas_trgm_gin_idx ON lca_test USING gin (worksite_city, job_title gin_trgm_ops);

Query:

SET pg_trgm.similarity_threshold = 0.6;  -- once per session

SELECT *
FROM   lca_test
WHERE  job_title % 'sales executive'
AND    worksite_city = 'los angeles' 
ORDER  BY salary 
LIMIT  50; -- OFFSET 0

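Again, this should also work (less efficiently) with the simpler index you already have ("index_lcas_job_title_trigram"), possibly in combination with other indexes. The best solution depends on the complete picture.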
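To find out which index the planner actually picks for your data, you could inspect the plan along these lines. This is only a sketch of the check, not a prediction of the resulting plan.

```sql
-- On Postgres 9.5 or older, replace the SET with: SELECT set_limit(0.6);
SET pg_trgm.similarity_threshold = 0.6;

-- BUFFERS adds block hit/read counts, which helps when comparing index variants.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   lca_test
WHERE  job_title % 'sales executive'
AND    worksite_city = 'los angeles'
ORDER  BY salary
LIMIT  50;
```

If the plan still shows an Index Scan on index_lca_test_on_salary with a Filter, the trigram index is not being used for this query.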


Asides
  • You have a lot of indexes. Are you sure they are all in use and pay for their maintenance cost?

  • You have some dubious data types:

      employement_start_date | character varying
      employement_end_date   | character varying
    

It seems like those should be date. And so on.
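Both asides can be followed up with something like the sketch below. The first statement reads the standard statistics view pg_stat_user_indexes; the second assumes that every stored value is either NULL or a string the date input parser accepts, which I cannot verify from here.

```sql
-- Indexes with idx_scan = 0 since the last statistics reset are candidates for removal.
SELECT indexrelname,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM   pg_stat_user_indexes
WHERE  relname = 'lca_test'
ORDER  BY idx_scan, indexrelname;

-- Assumption: all values cast cleanly to date; otherwise the statement fails
-- and nothing is changed. Rewrites the table under an exclusive lock.
ALTER TABLE lca_test
  ALTER COLUMN employement_start_date TYPE date USING employement_start_date::date,
  ALTER COLUMN employement_end_date   TYPE date USING employement_end_date::date;
```

Test the type change on a copy of the table first. With 6.2 million rows the rewrite takes a while and blocks concurrent access.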