sar*_*eph 3 python nlp machine-learning inverted-index
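I have the following code for an inverted index. I'm not very happy with it, though, and I'd like to know how to make it more compact and more Pythonic.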
class invertedIndex(object):
    def __init__(self,docs):
        self.docs,self.termList,self.docLists=docs,[],[]
        for index,doc in enumerate(docs):
            for term in doc.split(" "):
                if term in self.termList:
                    i=self.termList.index(term)
                    if index not in self.docLists[i]:
                        self.docLists[i].append(index)
                else:
                    self.termList.append(term)
                    self.docLists.append([index])
    def search(self,term):
        try:
            i=self.termList.index(term)
            return self.docLists[i]
        except:
            return "No results"

docs=["new home sales top forecasts june june june",
      "home sales rise in july june",
      "increase in home sales in july",
      "july new home sales rise"]
i=invertedIndex(docs)
print(i.search("sales"))
Store the document indices in a Python set, and use a dictionary to map each term to its set of documents.
from collections import defaultdict

class invertedIndex(object):
    def __init__(self, docs):
        # Maps each term to the set of indices of the documents containing it.
        self.docSets = defaultdict(set)
        for index, doc in enumerate(docs):
            for term in doc.split():
                self.docSets[term].add(index)

    def search(self, term):
        return self.docSets[term]

docs = ["new home sales top forecasts june june june",
        "home sales rise in july june",
        "increase in home sales in july",
        "july new home sales rise"]
i = invertedIndex(docs)
# Python 2 prints set([0, 1, 2, 3]); Python 3 prints {0, 1, 2, 3}
print(i.search("sales"))
A set works much like a list, but it is unordered and cannot contain duplicate entries.
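As a minimal sketch of that difference (the variable names here are illustrative and not taken from the answer), adding the same document index several times leaves a set unchanged, while a list keeps every copy:

```python
# Illustrative names; not part of the original answer.
postings_list = []
postings_set = set()

# Simulate a term that occurs three times in document 0 and once in document 1.
for doc_index in [0, 0, 1, 0]:
    postings_list.append(doc_index)  # a list keeps every occurrence
    postings_set.add(doc_index)      # a set ignores repeated values

# Sets have no defined order, so sort before showing results to a user.
sorted_postings = sorted(postings_set)
```

This is why the set-based index needs no `if index not in ...` check before adding a document.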
A defaultdict is basically a dict that supplies a default value (in this case an empty set) when a key has no data yet.
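The following sketch illustrates that behaviour, along with one caveat worth knowing: reading a missing key from a defaultdict also inserts it. The standalone `search` helper is a hypothetical read-only variant, not part of the answer; it relies on `dict.get`, which does not trigger the default factory.

```python
from collections import defaultdict

doc_sets = defaultdict(set)   # missing keys get a fresh empty set
doc_sets["sales"].add(0)      # no need to check whether "sales" exists first
doc_sets["sales"].add(2)

# Caveat: merely looking up a missing key creates it.
missing = doc_sets["unknown"]          # returns an empty set and stores the key
keys_after_lookup = sorted(doc_sets)

# Hypothetical read-only lookup: .get() returns the fallback without inserting.
def search(index, term):
    return index.get(term, set())

result = search(doc_sets, "absent")    # empty set; "absent" is not added
```

Whether the side effect matters depends on usage; for an index that is queried with many unknown terms, the `.get` form avoids accumulating empty entries.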