bl0*_*l0b 11 postgresql index full-text-search pattern-matching postgresql-9.3
I have a table with 6.2 million records, and I have to run similarity search queries on one of its columns. A query can look like this:
SELECT "lca_test".* FROM "lca_test"
WHERE (similarity(job_title, 'sales executive') > 0.6)
AND worksite_city = 'los angeles'
ORDER BY salary ASC LIMIT 50 OFFSET 0
More conditions can be added to the WHERE clause (year = X, worksite_state = N, status = 'certified', visa_class = Z).
Some of these queries take a very long time to run: over 30 seconds, sometimes more than a minute.
EXPLAIN ANALYZE for the query above gives me this:
Limit  (cost=0.43..42523.04 rows=50 width=254) (actual time=9070.268..33487.734 rows=2 loops=1)
  ->  Index Scan using index_lca_test_on_salary on lca_test  (cost=0.43..23922368.16 rows=28129 width=254) (actual time=9070.265..33487.727 rows=2 loops=1)
        Filter: (((worksite_city)::text = 'los angeles'::text) AND (similarity((job_title)::text, 'sales executive'::text) > 0.6::double precision))
        Rows Removed by Filter: 6330130
Total runtime: 33487.802 ms
I don't know how I should index my columns to make this run fast.
EDIT: Here is the Postgres version:
PostgreSQL 9.3.5 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.7.2-5) 4.7.2, 64-bit
And here is the table definition:
Table "public.lca_test"
Column | Type | Modifiers | Storage | Stats target | Description
------------------------+-------------------+-------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('lca_test_id_seq'::regclass) | plain | |
raw_id | integer | | plain | |
year | integer | | plain | |
company_id | integer | | plain | |
visa_class | character varying | | extended | |
employement_start_date | character varying | | extended | |
employement_end_date | character varying | | extended | |
employer_name | character varying | | extended | |
employer_address1 | character varying | | extended | |
employer_address2 | character varying | | extended | |
employer_city | character varying | | extended | |
employer_state | character varying | | extended | |
employer_postal_code | character varying | | extended | |
employer_phone | character varying | | extended | |
employer_phone_ext | character varying | | extended | |
job_title | character varying | | extended | |
soc_code | character varying | | extended | |
naic_code | character varying | | extended | |
prevailing_wage | character varying | | extended | |
pw_unit_of_pay | character varying | | extended | |
wage_unit_of_pay | character varying | | extended | |
worksite_city | character varying | | extended | |
worksite_state | character varying | | extended | |
worksite_postal_code | character varying | | extended | |
total_workers | integer | | plain | |
case_status | character varying | | extended | |
case_no | character varying | | extended | |
salary | real | | plain | |
salary_max | real | | plain | |
prevailing_wage_second | real | | plain | |
lawyer_id | integer | | plain | |
citizenship | character varying | | extended | |
class_of_admission | character varying | | extended | |
Indexes:
"lca_test_pkey" PRIMARY KEY, btree (id)
"index_lca_test_on_id_and_salary" btree (id, salary)
"index_lca_test_on_id_and_salary_and_year" btree (id, salary, year)
"index_lca_test_on_id_and_salary_and_year_and_wage_unit_of_pay" btree (id, salary, year, wage_unit_of_pay)
"index_lca_test_on_id_and_visa_class" btree (id, visa_class)
"index_lca_test_on_id_and_worksite_state" btree (id, worksite_state)
"index_lca_test_on_lawyer_id" btree (lawyer_id)
"index_lca_test_on_lawyer_id_and_company_id" btree (lawyer_id, company_id)
"index_lca_test_on_raw_id_and_visa_and_pw_second" btree (raw_id, visa_class, prevailing_wage_second)
"index_lca_test_on_raw_id_and_visa_class" btree (raw_id, visa_class)
"index_lca_test_on_salary" btree (salary)
"index_lca_test_on_visa_class" btree (visa_class)
"index_lca_test_on_wage_unit_of_pay" btree (wage_unit_of_pay)
"index_lca_test_on_worksite_state" btree (worksite_state)
"index_lca_test_on_year_and_company_id" btree (year, company_id)
"index_lca_test_on_year_and_company_id_and_case_status" btree (year, company_id, case_status)
"index_lcas_job_title_trigram" gin (job_title gin_trgm_ops)
"lca_test_company_id" btree (company_id)
"lca_test_employer_name" btree (employer_name)
"lca_test_id" btree (id)
"lca_test_on_year_and_companyid_and_wage_unit_and_salary" btree (year, company_id, wage_unit_of_pay, salary)
Foreign-key constraints:
"fk_rails_8a90090fe0" FOREIGN KEY (lawyer_id) REFERENCES lawyers(id)
Has OIDs: no
Erw*_*ter 20
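It is worth mentioning that you have the additional module pg_trgm installed, which provides the similarity() function.
Whatever else you do, use the similarity operator % instead of the expression (similarity(job_title, 'sales executive') > 0.6). Index support is bound to operators in Postgres, not to functions.
To get the desired minimum similarity of 0.6, set the GUC parameter: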
SET pg_trgm.similarity_threshold = 0.6; -- once per session
(In Postgres 9.5 or older, including your 9.3, this parameter does not exist; use the deprecated SELECT set_limit(0.6); instead.)
The setting stays for the rest of your session unless reset. Check with:
SHOW pg_trgm.similarity_threshold;
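(Formerly SELECT show_limit();)
If you only wanted the best matches in the job_title column for the given string, that would be a simple case of "nearest neighbor" search. It can be solved with a GiST index using the trigram operator class gist_trgm_ops (but not with a GIN index):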
CREATE INDEX trgm_idx ON lca_test USING gist (job_title gist_trgm_ops);
To also include the equality condition on worksite_city, you need the additional module btree_gist. Run (once per database):
CREATE EXTENSION btree_gist;
Then:
CREATE INDEX lcas_trgm_gist_idx ON lca_test USING gist (worksite_city, job_title gist_trgm_ops);
Query:
SET pg_trgm.similarity_threshold = 0.6;  -- once per session
SELECT *
FROM lca_test
WHERE job_title % 'sales executive'
AND worksite_city = 'los angeles'
ORDER BY (job_title <-> 'sales executive')
LIMIT 50;
<-> is the "distance" operator: one minus the similarity() value.
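To make the relation between similarity() and <-> concrete, here is a rough Python sketch of how pg_trgm scores two strings, as I read the module's documentation: lowercase the text, split it into words, pad each word with two leading blanks and one trailing blank, collect the distinct three-character sequences, then divide the number of shared trigrams by the number of distinct trigrams overall. The function names are made up for illustration, and the ASCII-only word splitting is a simplification (the real module follows the database locale).

```python
import re


def trigrams(text):
    """Return the set of distinct trigrams, roughly the way pg_trgm builds them."""
    grams = set()
    # Simplification: only ASCII letters and digits count as word characters.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two blanks in front, one behind
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def distance(a, b):
    """What the <-> operator reports: one minus the similarity."""
    return 1.0 - similarity(a, b)


if __name__ == "__main__":
    # 'word' vs 'two words': 4 shared trigrams out of 11 distinct ones.
    print(similarity("word", "two words"))
    print(distance("sales executive", "senior sales executive"))
```

Sorting by the distance in ascending order is the same as sorting by similarity in descending order, which is why ORDER BY job_title <-> 'sales executive' returns the best matches first.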
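Postgres can also combine two separate indexes, a plain btree index on worksite_city and a separate GiST index on job_title, but the multicolumn index should be fastest when you combine the two columns in queries like you do.
However, your query sorts by salary, not by distance or similarity, which changes the nature of the game completely. Now both GIN and GiST indexes can be used, and GIN will be faster. (Even more so in later versions, which brought major improvements to GIN indexes. Hint: upgrade!)
Similar story for the additional equality check on worksite_city: install the additional module btree_gin. Run (once per database):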
CREATE EXTENSION btree_gin;
Then:
CREATE INDEX lcas_trgm_gin_idx ON lca_test USING gin (worksite_city, job_title gin_trgm_ops);
Query:
SET pg_trgm.similarity_threshold = 0.6; -- once per session
SELECT *
FROM lca_test
WHERE job_title % 'sales executive'
AND worksite_city = 'los angeles'
ORDER BY salary
LIMIT 50; -- OFFSET 0
Again, this should also work (less efficiently) with the simpler index you already have ("index_lcas_job_title_trigram"), possibly in combination with other indexes. The best solution depends on the complete picture.
Asides:
You have a lot of indexes. Are you sure they are all in use and worth their maintenance cost?
You also have some dubious data types:
employement_start_date | character varying
employement_end_date | character varying
It seems those should be date. And so on.
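One reason this matters, sketched in Python: text compares character by character, so dates kept as strings do not necessarily sort or range-filter in calendar order. The sample values and the month/day/year format below are assumptions for illustration only; the question does not show what is actually stored in those columns.

```python
from datetime import datetime

# Hypothetical sample values; the real contents of
# employement_start_date are not shown in the question.
raw_dates = ["2/15/2014", "10/01/2014", "9/30/2013"]


def parse(value):
    """Parse an assumed month/day/year string into a real date."""
    return datetime.strptime(value, "%m/%d/%Y").date()


# Sorting the strings compares them character by character.
as_text = sorted(raw_dates)

# Sorting by the parsed value gives calendar order.
as_dates = sorted(raw_dates, key=parse)

if __name__ == "__main__":
    print(as_text)
    print(as_dates)
```

A proper date column gives you the second ordering for free, plus validation on insert and index-friendly range conditions.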