Postgres full-text search: ranking by position

d3v*_*kit 7 postgresql full-text-search

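I have a table of movies, and I want to search the titles and return the closest matches.

I thought full-text search might help, but it doesn't seem able to order results by the position of the matched words, even though Postgres knows those positions. Is this possible in Postgres?

Here is my query: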

SELECT collectibles.id, collectibles.title, ts_rank_cd(to_tsvector('english', collectibles.title), plainto_tsquery('old school')) AS score
FROM collectibles WHERE to_tsvector('english', collectibles.title) @@ plainto_tsquery('old school')
ORDER BY score DESC;

Here are some of the results (this is the best formatting I could manage, sorry!):

id | title | score

 - 277568 | Wilson Meadows: Live At The 15th Old School & Blues Festival | 0.1
 - 3545 | 5 Film Collection: Will Ferrell: Campaign / Old School (Unrtated Version) / Blades Of Glory / Roxbury / Semi-Pro | 0.1
 - 10366 | Alice Cooper: Old School: 1964-1974 (DVD/CD Combo) | 0.1
 - 13004 | American Classics: Old School (3-Disc Set) | 0.1
 - 13005 | American Classics: Old School: Classic Chevrolets | 0.1
 - 13006 | American Classics: Old School: Classic Travel Trailers | 0.1
 - 13007 | American Classics: Old School: Kings Of Kustomizing | 0.1
 - 14592 | Anchorman: The Legend Of Ron Burgundy (Widescreen/ Extended Edition) / Old School (R-Rated Version) (Back-To-Back) | 0.1
 - 14593 | Anchorman: The Legend Of Ron Burgundy (Widescreen/ Extended Edition) / Old School (R-Rated Version) (Side-By-Side) | 0.1
 - 20242 | Audiovisualize: Mixed By Addictive TV: Snake Worship Island / Corp. Inc. / Old School Futures / These Melodies / Robot War / ... | 0.1
 - 192057 | Old School (DreamWorks/ Widescreen/ Unrated Version/ Special Edition) | 0.1
 - 192058 | Old School (DreamWorks/ Widescreen/ Unrated Version/ Special Edition) / Road Trip (R-Rated) (Back-To-Back) | 0.1
 - 192059 | Old School (DreamWorks/ Widescreen/ Unrated Version/ Special Edition) / Road Trip (R-Rated) (Side-By-Side) | 0.1
 - 192060 | Old School (DreamWorks/ Widescreen/ Unrated Version/ Special Edition) / Road Trip (Unrated) (Back-To-Back) | 0.1
 - 192061 | Old School (DreamWorks/ Widescreen/ Unrated Version/ Special Edition) / Road Trip (Unrated) (Side-By-Side) | 0.1
 - 192062 | Old School (Warner Brothers/ R-Rated Version) | 0.1
 - 192063 | Old School (Warner Brothers/ R-Rated Version/ Blu-ray) | 0.1
 - 192064 | Old School (Warner Brothers/ Unrated Version) | 0.1
 - 192065 | Old School (Warner Brothers/ Unrated Version/ Blu-ray) | 0.1
 - 192066 | Old School Comedy (4-Pack): Atoll K / Jack And The Beanstalk / The Flying Deuces / Africa Screams | 0.1
 - 192067 | Old School Hip Hop Dance #1: Beginner | 0.1
 - 192068 | Old School Hip Hop Greatest | 0.1
 - 192069 | Old School Hip Hop: Run DMC & Flava Flav (2-Disc) | 0.1
 - 192070 | Old School Hits Movie Marathon Collection (3-Disc) | 0.1
 - 192071 | Old School Returns | 0.1

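All of these score 0.1, yet in many of the titles the matched words sit closer to the front of the string. Is there any way to rank those higher? Unfortunately, neither the length of the string nor the id is a good ranking criterion.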

Dmi*_*kov 5

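You need to use normalization with the ts_rank(tsvector, tsquery, normalization) function here. In the snippet below I used normalization = 1 (it divides the rank by 1 + the logarithm of the document length), but you can adjust it to whatever you actually need. Here is an example: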

WITH s(id,tsv) AS ( VALUES
  (1,to_tsvector('english','Alice Cooper: Old School: 1964-1974 (DVD/CD Combo)')),
  (2,to_tsvector('english','American Classics: Old School: Kings Of Kustomizing')),
  (3,to_tsvector('english','Old School Hip Hop Greatest')),
  (4,to_tsvector('english','Old School Returns'))
)
SELECT id,ts_rank(tsv,tsq,1) AS rank
FROM s,to_tsquery('english','old & school') tsq
ORDER BY rank DESC;

Result:

 id |   rank    
----+-----------
  4 | 0.0495516
  3 | 0.0383384
  2 | 0.0353013
  1 | 0.0312636
(4 rows)
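
To see how the normalization argument changes the ordering, the same kind of query can compute several variants side by side. This is a minimal sketch, not part of the original answer; the column aliases (rank_raw, rank_log_len, rank_len) are made up for illustration. The flag values 0, 1 and 2 are the ones documented for ts_rank. Note that these options scale the rank by document length, not by where in the title the match occurs.

```sql
-- Compare ts_rank normalization options on two of the sample titles.
-- Column aliases are illustrative names, not anything built into Postgres.
WITH s(id, tsv) AS ( VALUES
  (3, to_tsvector('english', 'Old School Hip Hop Greatest')),
  (4, to_tsvector('english', 'Old School Returns'))
)
SELECT id,
       ts_rank(tsv, tsq, 0) AS rank_raw,      -- 0: ignore document length
       ts_rank(tsv, tsq, 1) AS rank_log_len,  -- 1: divide by 1 + log(document length)
       ts_rank(tsv, tsq, 2) AS rank_len       -- 2: divide by document length
FROM s, to_tsquery('english', 'old & school') tsq
ORDER BY rank_len DESC;
```

Flags can also be combined with a bitwise OR (for example 2|32) if more than one adjustment is wanted.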

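  • Normalization is better, but I'd like to rank purely by the position of the words. Consider a search for "Silicon Valley": Pirates of Silicon Valley - Silicon Valley: American Experience - Silicon Valley: The Complete First Season. Ideally the titles that start with Silicon Valley should come first, but they are not the shortest strings. If Postgres can't do this, I'll probably use normalization for now and then switch to a different search system. (2 upvotes)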
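
Regarding the comment above, one possible way to order purely by word position is to read the positions out of the tsvector directly. The following is a sketch only, not something proposed in the original thread. It assumes PostgreSQL 9.6 or later, where unnest(tsvector) and tsvector_to_array(tsvector) are available; the alias first_pos and the three sample titles (taken from the comment) are illustrative.

```sql
-- Order matches by the earliest position of any query lexeme in the title.
-- Assumes PostgreSQL 9.6+ (unnest(tsvector), tsvector_to_array).
WITH s(id, title) AS ( VALUES
  (1, 'Pirates of Silicon Valley'),
  (2, 'Silicon Valley: American Experience'),
  (3, 'Silicon Valley: The Complete First Season')
)
SELECT s.id, s.title,
       ( -- smallest position at which any of the search lexemes occurs
         SELECT min(pos)
         FROM unnest(to_tsvector('english', s.title)) AS u(lexeme, positions, weights),
              unnest(u.positions) AS pos
         WHERE u.lexeme = ANY (tsvector_to_array(to_tsvector('english', 'silicon valley')))
       ) AS first_pos
FROM s
WHERE to_tsvector('english', s.title) @@ plainto_tsquery('english', 'silicon valley')
ORDER BY first_pos, s.id;
```

With this ordering, titles in which a search word is the first word should sort ahead of "Pirates of Silicon Valley", regardless of title length. On a large table the per-row unnest is likely to be expensive, so it may be worth applying it only to rows that already passed the @@ filter, as done here.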