dd9*_*90p 11 python ranking gensim
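I found that gensim has a BM25 ranking function. However, I can't find a tutorial on how to use it.
In my case, I have a query and a few documents retrieved from a search engine. How do I use gensim's BM25 ranking to compare the query with the documents and find the most similar one?
I am new to gensim. Thanks.
Query: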
"experimental studies of creep buckling ."
Document 1:
" the 7 x 7 in . hypersonic wind tunnel at rae farnborough, part 1, design, instrumentation and flow visualization techniques . this is the first of three parts of the calibration report on the r.a.e. some details of the design and lay-out of the plant are given, together with the calculated performance figures, and the major components of the facility are briefly described . the instrumentation provided for the wind-tunnel is described in some detail, including the optical and other methods of flow visualization used in the tunnel . later parts will describe the calibration of the flow in the working-section, including temperature measurements . a discussion of the heater performance will also be included as well as the results of tests to determine starting and running pressure ratios, blockage effects, model starting loads, and humidity of the air flow ."
Document 2:
" the 7 in. x 7 in. hypersonic wind tunnel at r.a.e. farnborough part ii. heater performance . tests on the storage heater, which is cylindrical in form and mounted horizontally, show that its performance is adequate for operation at m=6.8 and probably adequate for flows at m=8.2 with the existing nozzles . in its present state, the maximum design temperature of 680 degrees centigrade for operation at m=9 cannot be realised in the tunnel because of heat loss to the outlet attachments of the heater and quick-acting valve which form, in effect, a large heat sink . because of this heat loss there is rather poor response of stagnation temperature in the working section at the start of a run . it is hoped to cure this by preheating the heater outlet cone and the quick-acting valve . at pressures greater than about 100 p.s.i.g. free convection through the fibrous thermal insulation surrounding the heated core causes the top of the heater shell to become somewhat hotter than the bottom, which results in /hogging/ distortion of the shell . this free convection cools the heater core and a vertical temperature gradient is set up across it after only a few minutes at high pressure . modifications to be incorporated in the heater to improve its performance are described ."
Document 3:
" supersonic flow at the surface of a circular cone at angle of attack . formulas for the inviscid flow properties on the surface of a cone at angle of attack are derived for use in conjunction with the m.i.t. cone tables . these formulas are based upon an entropy distribution on the cone surface which is uniform and equal to that of the shocked fluid in the windward meridian plane . they predict values for the flow variables which may differ significantly from the corresponding values obtained directly from the cone tables . the differences in the magnitudes of the flow variables computed by the two methods tend to increase with increasing free-stream mach number, cone angle and angle of attack ."
Document 4:
" theory of aircraft structural models subjected to aerodynamic heating and external loads . the problem of investigating the simultaneous effects of transient aerodynamic heating and external loads on aircraft structures for the purpose of determining the ability of the structure to withstand flight to supersonic speeds is studied . by dimensional analyses it is shown that .. constructed of the same materials as the aircraft will be thermally similar to the aircraft with respect to the flow of heat through the structure will be similar to those of the aircraft when the structural model is constructed at the same temperature as the aircraft . external loads will be similar to those of the aircraft . subjected to heating and cooling that correctly simulate the aerodynamic heating of the aircraft, except with respect to angular velocities and angular accelerations, without requiring determination of the heat flux at each point on the surface and its variation with time . acting on the aerodynamically heated structural model to those acting on the aircraft is determined for the case of zero angular velocity and zero angular acceleration, so that the structural model may be subjected to the external loads required for simultaneous simulation of stresses and deformations due to external loads ."
    mke*_*rig 17
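Full disclosure: I don't have any experience with BM25 ranking, but I do have quite a bit of experience with gensim's TF-IDF and LSI distributed models, as well as gensim's similarity index.
The author does a really good job of keeping the codebase readable, so if you run into something like this again, I recommend just jumping into the source code.
Looking at the source: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/summarization/bm25.py
So I initialized a `BM25()` object, named `bm25` below, with the documents you pasted above.
It looks like our good friend Radim didn't include a function to compute `average_idf` for us. No big deal, we can borrow line 65 of that file for our purposes: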
average_idf = sum(map(lambda k: float(bm25.idf[k]), bm25.idf.keys())) / len(bm25.idf.keys())
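Then, if I understand the intent of `get_scores` correctly, you should be able to get each document's BM25 score relative to your query by doing this:
scores = bm25.get_scores(query_doc, average_idf)
This returns one score per document. Then, if I understand BM25 ranking correctly from what I read on this Wikipedia page: https://en.wikipedia.org/wiki/Okapi_BM25
you should be able to pick the document with the highest score like this:
best_result = docs[scores.index(max(scores))]
So the top-scoring document should be the one most relevant to your query. I hope this is what you were expecting, and I hope it helps in some capacity. Good luck!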
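To make the scoring step less of a black box, here is a minimal pure-Python sketch of the Okapi BM25 formula described on the Wikipedia page linked above. It is not gensim's implementation: the function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the three toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a plain Okapi BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(d) for d in tokenized_docs) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for d in tokenized_docs for term in set(d))
    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed, non-negative IDF (an assumption; gensim's IDF handling differs).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization: longer-than-average documents are penalized.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus, much shorter than the abstracts in the question.
docs = [
    "wind tunnel heater performance",
    "experimental studies of creep buckling of columns",
    "supersonic flow over a cone",
]
query = "experimental studies of creep buckling"
scores = bm25_scores(query.split(), [d.split() for d in docs])
best_result = docs[scores.index(max(scores))]
print(best_result)
```

Documents that share no term with the query score exactly 0 here, so only the second toy document can win. On a real corpus the scores depend heavily on tokenization and on the choice of `k1` and `b`.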
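Since @mkerrig's answer is now outdated (2020), here is a way to use BM25 with gensim 3.8.3, assuming you have a list `docs` of documents and a query string `query`. This code returns the indices of the 10 best-matching documents, in ascending order of score (best match last).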
from gensim import corpora
from gensim.summarization import bm25
texts = [doc.split() for doc in docs]  # optional preprocessing (e.g. stopword removal) can go here
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
bm25_obj = bm25.BM25(corpus)
query_doc = dictionary.doc2bow(query.split())
scores = bm25_obj.get_scores(query_doc)
best_docs = sorted(range(len(scores)), key=lambda i: scores[i])[-10:]
Note that you no longer need the `average_idf` parameter.
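Slicing the sorted indices with `[-10:]` leaves the best match at the end of the list. If you would rather have the best match first, one option is `heapq.nlargest` from the standard library. The sketch below uses a hypothetical `scores` list standing in for the output of `get_scores`, and `top_k_indices` is a helper name made up for this example.

```python
import heapq

# Hypothetical scores standing in for get_scores() output; position = document index.
scores = [0.0, 2.7, 0.4, 3.1, 0.0, 1.2]

def top_k_indices(scores, k=10):
    """Return the indices of the k highest scores, best first."""
    return heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])

best_docs = top_k_indices(scores, k=3)
print(best_docs)
```

If `k` is larger than the number of documents, the helper simply returns all indices, ordered from highest to lowest score.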
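The answer given by @fonfonx works, but it is not the natural way to use BM25. The BM25 constructor expects a `List[List[str]]`, which means it takes a tokenized corpus.
I think a better example would look like this: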
from gensim.summarization.bm25 import BM25
corpus = ["The little fox ran home",
          "dogs are the best ",
          "Yet another doc ",
          "I see a little fox with another small fox",
          "last doc without animals"]
def simple_tok(sent:str):
    return sent.split()
tok_corpus = [simple_tok(s) for s in corpus]
bm25 = BM25(tok_corpus)
query = simple_tok("a little fox")
scores = bm25.get_scores(query)
best_docs = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:3]
for i, b in enumerate(best_docs):
    print(f"rank {i+1}: {corpus[b]}")
Output:
>> rank 1: I see a little fox with another small fox
>> rank 2: The little fox ran home
>> rank 3: dogs are the best 
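Note that `simple_tok` splits on whitespace only, so "The" and "the" count as different tokens, and a stand-alone "." (as in the query from the question) becomes a token of its own. A slightly more forgiving tokenizer could look like the sketch below. The name `normalizing_tok` and the choice to keep only lowercase letters and digits are assumptions for illustration, not something gensim requires.

```python
import re

def normalizing_tok(sent: str):
    """Lowercase the text and keep only runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", sent.lower())

print(normalizing_tok("The little fox ran home"))
print(normalizing_tok("experimental studies of creep buckling ."))
```

It can be swapped in for `simple_tok` as long as the corpus and the query go through the same function. Be aware that it also splits abbreviations such as "r.a.e." into single letters, which may or may not be what you want.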