如何在文本文件上执行二进制搜索以在python中搜索关键字？

Question

如何在文本文件上执行二进制搜索以在python中搜索关键字？

App*_*pps 7 python binary search text-files

文本文件包含两列 - 索引号(5个空格)和字符(30个空格).它按字典顺序排列.我想执行二进制搜索来搜索关键字.

Answer 1

使用Python的内置bisect模块,这是一种有趣的方法.

import bisect
import os


class Query(object):

    def __init__(self, query, index=5):
        self.query = query
        self.index = index

    def __lt__(self, comparable):
        return self.query < comparable[self.index:]


class FileSearcher(object):

    def __init__(self, file_pointer, record_size=35):
        self.file_pointer = file_pointer
        self.file_pointer.seek(0, os.SEEK_END)
        self.record_size = record_size + len(os.linesep)
        self.num_bytes = self.file_pointer.tell()
        self.file_size = (self.num_bytes // self.record_size)

    def __len__(self):
        return self.file_size

    def __getitem__(self, item):
        self.file_pointer.seek(item * self.record_size)
        return self.file_pointer.read(self.record_size)


if __name__ == '__main__':
    with open('data.dat') as file_to_search:
        query = raw_input('Query: ')
        wrapped_query = Query(query)

        searchable_file = FileSearcher(file_to_search)
        print "Located @ line: ", bisect.bisect(searchable_file, wrapped_query)

Run Code Online (Sandbox Code Playgroud)

Answer 2

sam*_*ias 6

您需要进行二分搜索吗？如果没有，请尝试将平面文件转换为cdb（常量数据库）。这将为您提供非常快速的哈希查找来查找给定单词的索引：

import cdb

# convert the corpus file to a constant database one time
db = cdb.cdbmake('corpus.db', 'corpus.db_temp')
for line in open('largecorpus.txt', 'r'):
    index, word = line.split()
    db.add(word, index)
db.finish()

Run Code Online (Sandbox Code Playgroud)

在单独的脚本中，对其运行查询：

import cdb
db = cdb.init('corpus.db')
db.get('chaos')
12345

Run Code Online (Sandbox Code Playgroud)

请注意，CDB 不支持大于 4GB 的文件，这对我来说终止了这个选项。考虑到现代架构，如果文件 < 4GB，您最好将其放在 RAM 中。 (2认同)

Answer 3

DTi*_*ing -1

考虑使用集合而不是二分搜索来查找文件中的关键字。

放：

创建 O(n)、查找 O(1)、插入/删除 O(1)

如果您的输入文件由空格分隔，则：

f = open('file')
keywords = set( (line.strip().split(" ")[1] for line in f.readlines()) )
f.close()    

my_word in keywords
<returns True or False>

Run Code Online (Sandbox Code Playgroud)

字典：

f = open('file')
keywords = dict( [ (pair[1],pair[0]) for pair in  [line.strip().split(" ") for line in f.readlines()] ] ) 
f.close()

keywords[my_word]
<returns index of my_word>

Run Code Online (Sandbox Code Playgroud)

二分查找是：

O(n log n) 创建，O(log n) 查找

编辑：对于 5 个字符和 30 个字符的情况，您可以只使用字符串切片

f = open('file')
keywords = set( (line[5:-1] for line in f.readlines()) )
f.close()

myword_ in keywords

or 

f = open('file')
keywords = dict( [(line[5:-1],line[:5]) for line in f.readlines()] )
f.close()

keywords[my_word]

Run Code Online (Sandbox Code Playgroud)

嘿，但是 O(n) 不是比 O(log n) 更昂贵吗？我需要在一个巨大的语料库上运行它。 (2认同)

归档时间：	14 年，10 月前
查看次数：	4922 次
最近记录：	12 年前