jku*_*lak 2 sql postgresql full-text-search query-optimization trigram
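My tracks table contains about 3 million records (growing by 500 per day) and has about 30 columns, but I only use 15 of them in the WHERE clause. The query takes 4800 ms on average, with no other users or processes using the database. How can I make it faster? I would like to see results closer to 100 ms.
People looking for songs (tracks) fill out a form:
99% of the use cases are SELECT queries: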
SELECT
"public"."tracks"."sys_id",
"public"."tracks"."all_artists",
"public"."tracks"."name",
"public"."tracks"."genres",
"public"."tracks"."release_date",
"public"."tracks"."tempo",
"public"."tracks"."popularity",
"public"."tracks"."danceability",
"public"."tracks"."energy",
"public"."tracks"."speechiness",
"public"."tracks"."acousticness",
"public"."tracks"."instrumentalness",
"public"."tracks"."liveness",
"public"."tracks"."valence",
"public"."tracks"."main_artist_popularity",
"public"."tracks"."main_artist_followers",
"public"."tracks"."key",
"public"."tracks"."preview_url"
FROM
"public"."tracks"
WHERE
(
"public"."tracks"."name" LIKE '%oultec%'
OR "public"."tracks"."all_artists_string" LIKE '%oultec%'
)
AND ("public"."tracks"."genres_string" LIKE '%rum%')
AND "public"."tracks"."tempo" >= '80'
AND "public"."tracks"."tempo" <= '210'
AND "public"."tracks"."popularity" >= '0'
AND "public"."tracks"."popularity" <= '100'
AND "public"."tracks"."main_artist_popularity" >= '1'
AND "public"."tracks"."main_artist_popularity" <= '100'
AND "public"."tracks"."main_artist_followers" >= '1'
AND "public"."tracks"."main_artist_followers" <= '50000000'
AND "public"."tracks"."danceability" >= '0'
AND "public"."tracks"."danceability" <= '1000'
AND "public"."tracks"."energy" >= '0'
AND "public"."tracks"."energy" <= '1000'
AND "public"."tracks"."speechiness" >= '0'
AND "public"."tracks"."speechiness" <= '1000'
AND "public"."tracks"."acousticness" >= '0'
AND "public"."tracks"."acousticness" <= '1000'
AND "public"."tracks"."instrumentalness" >= '0'
AND "public"."tracks"."instrumentalness" <= '1000'
AND "public"."tracks"."liveness" >= '0'
AND "public"."tracks"."liveness" <= '1000'
AND "public"."tracks"."valence" >= '0'
AND "public"."tracks"."valence" <= '1000'
AND "public"."tracks"."release_date" >= '2020-01-01'
AND "public"."tracks"."key" = '10'
ORDER BY
"public"."tracks"."release_date" DESC,
"public"."tracks"."popularity" DESC,
"public"."tracks"."sys_id" ASC
LIMIT 5 OFFSET 0;
Indexes:
PRIMARY sys_id
UNIQUE main_artist, name, duration_ms
INDEX energy
INDEX tempo, popularity, main_artist_popularity, main_artist_followers, danceability, energy, speechiness, acousticness, instrumentalness, liveness, valence, name, all_artists_string, genres_string, release_date, key
EXPLAIN / ANALYZE:
PostgreSQL is running from the "official" postgres:14.1-alpine image.
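Table structure:
The website that runs the query (through an API/backend; there are more fields with min/max integers, but they are not shown here):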
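Your query combines LIKE '%something%' substring searches with range scans on dates and numbers. But a BTREE index (the default type) can only handle one range scan efficiently, and it can't handle LIKE '%something%' at all, because the pattern is not anchored at the start of the string. So you're doing a full table scan for every query. Considering that, 4.8 seconds for three megarows isn't bad.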
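To illustrate the point about BTREE indexes, here is a minimal sketch, not a tuned recommendation. It assumes an index with the equality column (key) first and a single range column (release_date) second; the index name tracks_key_release_date is made up for this example, and the literals are quoted as in the question because the column types are not shown. Any further range predicates, such as the one on tempo, cannot narrow the index scan much and would be checked row by row.

```sql
-- Hypothetical index: equality column first, then the one range column.
CREATE INDEX CONCURRENTLY tracks_key_release_date
    ON tracks (key, release_date DESC);

-- Inspect the plan. If the planner chooses the index above, key and
-- release_date become index conditions, while tempo shows up as a filter
-- applied to the rows fetched through the index.
EXPLAIN (ANALYZE, BUFFERS)
SELECT sys_id, name, release_date
FROM tracks
WHERE key = '10'
  AND release_date >= '2020-01-01'
  AND tempo >= '80'
  AND tempo <= '210'
ORDER BY release_date DESC
LIMIT 5;
```

Whether this helps depends on how selective key and release_date are in the real data, so compare the plan and timing with and without the index.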
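For your column LIKE '%something%' searches, you could try trigram indexes, a PostgreSQL feature provided by the pg_trgm extension. This code creates trigram indexes on name, all_artists_string, and genres_string. They may narrow the selection so that you have to scan less data.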
CREATE EXTENSION IF NOT EXISTS pg_trgm; -- skipped if the extension is already installed
CREATE INDEX CONCURRENTLY tracks_name
ON tracks
USING GIN (name gin_trgm_ops);
CREATE INDEX CONCURRENTLY tracks_all_artists_string
ON tracks
USING GIN (all_artists_string gin_trgm_ops);
CREATE INDEX CONCURRENTLY tracks_genres_string
ON tracks
USING GIN (genres_string gin_trgm_ops);
But you still have to scan all the matching tracks.
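One way to gauge how much these indexes can help is to look at the trigrams involved. According to the pg_trgm documentation, the more trigrams a search string contains, the more effective the index search is. The sketch below uses show_trgm(), which also reports space-padded trigrams at word boundaries. A LIKE '%...%' pattern has no such boundaries, so roughly only the interior trigrams count: four for 'oultec', but a single one for 'rum'. The cut-down SELECT is an illustration only, not the full query from the question.

```sql
-- Trigrams pg_trgm derives from each string
-- (boundary trigrams are padded with spaces).
SELECT show_trgm('oultec') AS longer_pattern,
       show_trgm('rum')    AS short_pattern;

-- Check whether the planner actually picks a trigram index.
-- If it does, the plan should mention a Bitmap Index Scan on tracks_name.
EXPLAIN (ANALYZE, BUFFERS)
SELECT sys_id, name
FROM tracks
WHERE name LIKE '%oultec%';
```

If the plan still shows a sequential scan, the planner has judged the pattern too unselective for the index to pay off, which is likely for very short patterns such as '%rum%'.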
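If you create these indexes and then restructure the first few conditions of your WHERE clause to use set operations like this, you may (or may not) get better performance.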
WHERE sys_id IN (
(SELECT sys_id FROM tracks WHERE name LIKE '%oultec%'
UNION
SELECT sys_id FROM tracks WHERE all_artists_string LIKE '%oultec%'
)
INTERSECT
SELECT sys_id FROM tracks WHERE genres_string LIKE '%rum%'
)
AND tempo >= '80' ...
But the fact is that SQL isn't well suited to all these range scans.