How to make a basic inverted index program more Pythonic

sar*_*eph 3 python nlp machine-learning inverted-index

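I have the following code for an inverted index. However, I'm not very happy with it, and I'd like to know how to make it more compact and more Pythonic.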

class invertedIndex(object):

    def __init__(self, docs):
        self.docs, self.termList, self.docLists = docs, [], []

        for index, doc in enumerate(docs):
            for term in doc.split(" "):
                if term in self.termList:
                    i = self.termList.index(term)
                    if index not in self.docLists[i]:
                        self.docLists[i].append(index)
                else:
                    self.termList.append(term)
                    self.docLists.append([index])

    def search(self, term):
        try:
            i = self.termList.index(term)
            return self.docLists[i]
        except ValueError:  # list.index raises ValueError for an unknown term
            return "No results"





docs=["new home sales top forecasts june june june",
                     "home sales rise in july june",
                     "increase in home sales in july",
                     "july new home sales rise"]

i=invertedIndex(docs)
print i.search("sales")

Pet*_*son 5

Store the document indices in Python sets, and use a dictionary to map each term to its "document set".

from collections import defaultdict

class invertedIndex(object):

    def __init__(self, docs):
        self.docSets = defaultdict(set)
        for index, doc in enumerate(docs):
            for term in doc.split():
                self.docSets[term].add(index)

    def search(self, term):
        return self.docSets[term]

docs=["new home sales top forecasts june june june",
                     "home sales rise in july june",
                     "increase in home sales in july",
                     "july new home sales rise"]

i=invertedIndex(docs)
print i.search("sales") # outputs: set([0, 1, 2, 3])

A set works much like a list, but it is unordered and cannot contain duplicate entries.
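As a quick illustration of that point, here is a minimal sketch (the variable names are made up for the example) showing that adding the same document index repeatedly leaves a set unchanged, which is why no explicit duplicate check is needed when building the index:

```python
# Hypothetical data: the same term appears three times in document 0
# and once in document 1.
postings_list = []
postings_set = set()

for doc_index in [0, 0, 0, 1]:
    postings_list.append(doc_index)  # a list keeps every occurrence
    postings_set.add(doc_index)      # a set silently ignores repeats

print(postings_list)         # [0, 0, 0, 1]
print(sorted(postings_set))  # [0, 1]
print(1 in postings_set)     # True
```

Membership tests such as `1 in postings_set` are also typically much faster on a set than on a long list, since a set is hash-based.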

A defaultdict is basically a dict that supplies a default value (in this case an empty set) whenever a key has no data yet.
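The following sketch shows that behaviour in isolation; the name `doc_sets` and the sample terms are assumptions for illustration only. It also demonstrates one caveat worth knowing: looking up a missing key with `[]` inserts that key, so a search for an unknown term would quietly grow the index, whereas `.get()` does not.

```python
from collections import defaultdict

doc_sets = defaultdict(set)  # missing keys start out as an empty set
doc_sets["sales"].add(0)     # no KeyError, no need to create the set first
doc_sets["sales"].add(1)

print(sorted(doc_sets["sales"]))  # [0, 1]

# Caveat: merely looking up a missing key with [] inserts it.
print("unknown" in doc_sets)      # False
doc_sets["unknown"]
print("unknown" in doc_sets)      # True

# .get() returns the fallback without inserting the key.
result = doc_sets.get("missing", set())
print(len(result))                # 0
print("missing" in doc_sets)      # False
```

If that side effect matters, one option is to have `search` return `self.docSets.get(term, set())` instead of indexing directly.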