MongoDB full-text search score: "What does the score mean?"

Nas*_*tim 6 algorithm full-text-search mongodb

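I am working on a MongoDB project for school. I have a collection of sentences, and I run a normal text search to find the most similar sentence in the collection, based on the score.

I run this query: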

db.sentences.find({$text: {$search: "any text"}}, {score: {$meta: "textScore"}}).sort({score:{$meta:"textScore"}})

When I query for a sentence, I get results like these:

"that kicking a dog causes it pain"
----Matched With
"that kicking a dog causes it pain – is not very controversial."
----Give a Result of:
*score: 2.4*


"This sentence have nothing to do with any other"
----Matched With
"Who is the “He” in this sentence?"
----Give a result of:
*Score: 1.0* 

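What is the score? What does it mean? What if I want to show only results with a similarity of 70% or higher?

How can I interpret the score so that I can display a similarity percentage? I am using C# for this, but don't worry about the implementation. I don't mind a pseudocode solution!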

小智 7

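When you use a MongoDB text index, it generates a score for every matching document. This score indicates how well your search string matches the document. The higher the score, the more likely the document is similar to the search text. The score is calculated as follows: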

Step 1: Let the search text = S
Step 2: Break S into tokens (if you are not doing a phrase search), say T1, T2..Tn. Apply stemming to each token
Step 3: For every search token, calculate score per index field of text index as follows:

score = (weight * data.freq * coeff * adjustment);

Where:
weight = user-defined weight for any field. Default is 1 when no weight is specified
data.freq = how frequently the search token appeared in the text
coeff = (0.5 * data.count / numTokens) + 0.5
data.count = Number of matching token
numTokens = Total number of tokens in the text
adjustment = 1 (by default). If the search token is exactly equal to the document field then adjustment = 1.1
Step 4: Final score of document is calculated by adding all tokens scores per text index field
Total Score = score(T1) + score(T2) + .....score(Tn)

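As we saw above, the score is affected by the following factors:

  1. The number of terms that match the actual search text: the more matches, the higher the score
  2. The number of tokens in the document field
  3. Whether the searched text exactly matches the document field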
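The formula above can be sketched as a small function. This is only an illustration under stated assumptions: `scoreField` and its parameters are hypothetical names, the inputs are arrays of tokens that have already been stemmed and stripped of stop words (MongoDB's own tokenizer and stemmer are not reproduced), duplicate search tokens are assumed to count once, and the 1.1 adjustment is applied when the raw field text equals the token, as described in this answer.

```javascript
// Sketch of the per-field formula: score = weight * data.freq * coeff * adjustment.
// searchTokens / fieldTokens: pre-stemmed tokens with stop words removed (assumption).
// rawField: the original, untokenized field text, used only for the adjustment.
function scoreField(searchTokens, fieldTokens, { weight = 1, rawField = "" } = {}) {
  const numTokens = fieldTokens.length;
  if (numTokens === 0) return 0;

  // Collect data.count and data.freq for every distinct token in the field.
  const stats = new Map();
  for (const token of fieldTokens) {
    const s = stats.get(token) || { count: 0, freq: 0 };
    s.freq += 1 / 2 ** s.count; // i-th occurrence (from 0) adds 1 / 2^i
    s.count += 1;
    stats.set(token, s);
  }

  let total = 0;
  for (const term of new Set(searchTokens)) { // assumption: duplicates count once
    const s = stats.get(term);
    if (!s) continue; // term not in the field: contributes 0
    const coeff = (0.5 * s.count) / numTokens + 0.5;
    const adjustment = rawField === term ? 1.1 : 1;
    total += weight * s.freq * coeff * adjustment;
  }
  return total;
}

const score = scoreField(["sentence", "nothing"], ["sentence"], {
  rawField: "Who is the “He” in this sentence?",
});
console.log(score); // 1
```

With these inputs the sketch returns the 1.0 reported in the question; a field with more non-matching tokens lowers `coeff` and therefore the score.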

Here is the derivation for one of your documents:

Search String = This sentence have nothing to do with any other
Document = Who is the “He” in this sentence?

Score Calculation:
Step 1: Tokenize search string. Apply stemming and remove stop words.
    Token 1: "sentence"
    Token 2: "nothing"
Step 2: For every search token obtained in Step 1, do steps 3-11:

      Step 3: Take sample document and remove stop words
            Input Document: Who is the “He” in this sentence?
            Document after stop word removal: "sentence"
      Step 4: Apply stemming
            Document in Step 3: "sentence"
            After stemming: "sentence"
      Step 5: Calculate data.count per search token
            data.count(sentence) = 1
            data.count(nothing) = 1
      Step 6: Calculate total number of tokens in document
            numTokens = 1
      Step 7: Calculate coefficient per search token
            coeff = (0.5 * data.count / numTokens) + 0.5
            coeff(sentence) = 0.5*(1/1) + 0.5 = 1.0
            coeff(nothing) = 0.5*(1/1) + 0.5 = 1.0
      Step 8: Calculate adjustment per search token (adjustment is 1 by default; only if the search text matches the raw document exactly is adjustment = 1.1)
            adjustment(sentence) = 1
            adjustment(nothing) = 1
      Step 9: Weight of field (1 is default weight)
            weight = 1
      Step 10: Calculate frequency of search token in document (data.freq)
            For every ith occurrence, the data frequency = 1/(2^i). All occurrences are summed.
            a. data.freq(sentence) = 1/(2^0) = 1
            b. data.freq(nothing) = 0
      Step 11: Calculate score per search token per field:
            score = (weight * data.freq * coeff * adjustment);
            score(sentence) = (1 * 1 * 1.0 * 1.0) = 1.0
            score(nothing) = (1 * 0 * 1.0 * 1.0) = 0
Step 12: Add individual score for every token of search string to get total score
Total score = score(sentence) + score(nothing) = 1.0 + 0.0 = 1.0
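Step 10 is the least obvious part of the derivation, so here is a minimal sketch of it in isolation. `dataFreq` is a hypothetical helper name; it only restates the rule given above, where the i-th occurrence of a token (counting from 0) adds 1/(2^i).

```javascript
// data.freq for a token that occurs `occurrences` times in a field.
function dataFreq(occurrences) {
  let freq = 0;
  for (let i = 0; i < occurrences; i++) {
    freq += 1 / 2 ** i; // 1, then 1/2, then 1/4, ...
  }
  return freq;
}

console.log([0, 1, 2, 3].map(dataFreq)); // [ 0, 1, 1.5, 1.75 ]
```

Each repetition adds half as much as the previous one, so under this rule `data.freq` stays below 2 no matter how often a token repeats.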

The other one can be derived in the same way.
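On the 70% part of the question: because the score is a sum of per-token contributions, it is not a percentage and can exceed 1 (2.4 in the question). One possible heuristic, which is an assumption of this sketch and not something MongoDB provides, is to express each score relative to the best score in the same result set and filter on that ratio. `withRelativeScore` and the sample documents are hypothetical.

```javascript
// Heuristic only: ratio of each textScore to the top score in the result set.
// This ranks results against each other; it is not a true similarity measure.
function withRelativeScore(results) {
  const top = Math.max(0, ...results.map((r) => r.score));
  return results.map((r) => ({
    ...r,
    relative: top > 0 ? r.score / top : 0,
  }));
}

// Illustrative result set, shaped like the output of the $text query.
const ranked = withRelativeScore([
  { _id: "a", score: 2.4 },
  { _id: "b", score: 1.0 },
]);

// Keep only results within 70% of the best match.
const shown = ranked.filter((r) => r.relative >= 0.7);
console.log(shown.map((r) => r._id)); // [ 'a' ]
```

The top result always gets 100% under this scheme, even when it is a poor match, so an absolute minimum score may still be needed alongside it.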

For a more detailed analysis of MongoDB scoring, see: Mongo Scoring Algorithm Blog


Dan*_*tta 2

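Text search assigns a score to each document that contains the search term in the indexed fields. The score determines the relevance of a document to a given search query.

For each indexed field in the document, MongoDB multiplies the number of matches by the weight and sums the results. Using this sum, MongoDB then calculates the score for the document.

The default weight of an indexed field is 1.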
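As a rough illustration of how weights are set, the sketch below defines a text index over two fields and gives one of them a higher weight. The field names `text` and `title`, the weight 5, and the index name are assumptions made for the example; only the `sentences` collection comes from the question. A collection can hold a single text index, so an existing one would have to be dropped first.

```javascript
// Hypothetical fields: `text` holds the sentence, `title` an optional heading.
const keys = { text: "text", title: "text" };

// Fields not listed under `weights` keep the default weight of 1.
const options = { weights: { title: 5 }, name: "SentenceTextIndex" };

// `db` exists only inside the MongoDB shell; the guard lets this file load elsewhere.
if (typeof db !== "undefined") {
  db.sentences.createIndex(keys, options);
}
```

With this index, a match in `title` would contribute five times as much to the score as the same match in `text`.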

https://docs.mongodb.com/manual/tutorial/control-results-of-text-search/

  • Rather than copying the documentation, it would be more helpful to explain it with an example. (6 upvotes)