Python - fast file search

Question

apl*_*vin 4 python indexing search python-3.x

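I have a file with a large number of lines (0.5 to 1.5 million), each of which is a filename about 50-100 characters long. I need to search these lines quickly for a given query. My code currently looks like this: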

import re

def similarity(haystack, needle):
    haystack = haystack.lower()  # lowercase once so every comparison below is case-insensitive
    words = re.findall(r'\w+', haystack) # replacing by split with separators reduces time by about 4 seconds

    for word in words:
        if word == needle:
            return 10

    for word in words:
        if word.startswith(needle):
            return 10 ** (len(needle) / len(word))

    if needle in haystack:
        return 1

    return 0

def search(text):
    # `lines` is the module-level list of filenames read from the file
    text = text.lower()
    scored = [(similarity(x, text), x) for x in lines]
    return [x[1] for x in sorted(scored, reverse=True)[:15]]


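It takes about 15 seconds on a sample file on my PC (almost all of that time is spent in similarity()), and I want it to run almost immediately, within a couple of seconds. How can this be done?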

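I think an index might help, but I have no idea what its structure could be. Also, if possible, I would like the search to be "more fuzzy", for example using N-grams or something similar. For now, though, the main concern is speed.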
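One possible structure for such an index, as a minimal sketch rather than a tuned implementation: map each word to the line numbers that contain it, and keep the vocabulary sorted so that `bisect` can find every word starting with the needle. The names `build_index` and `lookup` are hypothetical, and the sketch assumes the needle is a single word. It covers only the exact-match and prefix cases; the score-1 case (needle in the middle of a word) would still need a separate scan.

```python
import bisect
import re
from collections import defaultdict


def build_index(lines):
    """Map every lowercased word to the set of line numbers containing it."""
    postings = defaultdict(set)
    for line_no, line in enumerate(lines):
        for word in re.findall(r'\w+', line.lower()):
            postings[word].add(line_no)
    # A sorted vocabulary lets bisect locate all words that share a prefix.
    vocabulary = sorted(postings)
    return postings, vocabulary


def lookup(postings, vocabulary, needle, limit=15):
    """Return up to `limit` (score, line_no) pairs, best score first."""
    needle = needle.lower()
    scores = {}
    position = bisect.bisect_left(vocabulary, needle)
    while position < len(vocabulary):
        word = vocabulary[position]
        if not word.startswith(needle):
            break  # sorted order: no later word can share the prefix
        # An exact match gives 10 ** 1 == 10, as in the original scoring.
        score = 10 ** (len(needle) / len(word))
        for line_no in postings[word]:
            scores[line_no] = max(scores.get(line_no, 0), score)
        position += 1
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(score, line_no) for line_no, score in ranked[:limit]]
```

The index is built once and reused for every query, so each search touches only the words that start with the needle instead of every line. Unlike the loop above, this sketch keeps the best score per line rather than the score of the first matching word.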

Update:

The same lines are searched many times.

needle is always a single word.

"More fuzzy" means that it should find lines even if needle is slightly mistyped.
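A rough sketch of how N-grams could tolerate a mistyped needle, under stated assumptions: words are split into character trigrams, an index maps each trigram to the words containing it, and candidates are ranked by the Jaccard similarity of their trigram sets. The function names, the padding scheme, and the 0.3 threshold are all illustrative choices, not something from the original code.

```python
from collections import defaultdict


def trigrams(word):
    """Return the set of 3-character substrings of a padded, lowercased word."""
    padded = f"  {word.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def build_trigram_index(words):
    """Map each trigram to the set of words that contain it."""
    index = defaultdict(set)
    for word in words:
        for gram in trigrams(word):
            index[gram].add(word)
    return index


def fuzzy_candidates(index, needle, min_similarity=0.3):
    """Rank indexed words by Jaccard similarity of trigram sets with needle."""
    needle_grams = trigrams(needle)
    shared = defaultdict(int)
    for gram in needle_grams:
        for word in index.get(gram, ()):
            shared[word] += 1  # count trigrams the word has in common
    results = []
    for word, common in shared.items():
        union = len(needle_grams) + len(trigrams(word)) - common
        similarity = common / union
        if similarity >= min_similarity:
            results.append((similarity, word))
    return sorted(results, key=lambda item: (-item[0], item[1]))
```

With the words "report", "repost" and "notes" indexed, the mistyped needle "raport" shares four of its seven trigrams with "report" and only one with "repost", so only "report" passes the threshold. The matched words would then be looked up in a word-to-lines mapping to get the actual filenames.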

Answer 1

Len*_*bro 5

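1. This line does nothing:

    10 ** (len(t) / len(word))

2. You need better variable names; as it stands it is unclear what "s" and "t" are. Single-letter variable names are acceptable only in maths and as loop variables. Is s what you are searching for, or is t? The function as it is used now doesn't make much sense to me.

3. Since you only use the first match of whatever you search for, splitting is pointless in some cases, so you could probably move the split to the end. That depends on what you are actually searching for, which is unclear (see 2).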

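Update: to really get the best performance out of this you need to profile, test, profile and test again. But I would suggest this as a first start: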

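def similarity(haystack, needle):
    # needle arrives lowercased from search(), so lowercase haystack before comparing
    haystack = haystack.lower()

    # cheapest test first: most lines do not contain needle at all
    if needle not in haystack:
        return 0

    words = haystack.split()

    if needle in words:
        return 10

    for word in words:
        if word.startswith(needle):
            return 10 ** (len(needle) / len(word))

    # needle occurs somewhere, but not at the start of a word
    return 1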

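To illustrate the "profile and test" advice, here is a minimal sketch that uses the standard library's `cProfile` and `pstats`. `profile_search` is a hypothetical helper name, and `similarity` here follows the function above, lowercasing haystack before any comparison.

```python
import cProfile
import io
import pstats


def similarity(haystack, needle):
    haystack = haystack.lower()
    if needle not in haystack:
        return 0
    words = haystack.split()
    if needle in words:
        return 10
    for word in words:
        if word.startswith(needle):
            return 10 ** (len(needle) / len(word))
    return 1


def profile_search(lines, needle, top=10):
    """Score every line under cProfile; return the scores and the report text."""
    profiler = cProfile.Profile()
    profiler.enable()
    scores = [similarity(line, needle) for line in lines]
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return scores, stream.getvalue()
```

Running this on the real file and printing the returned report shows how the time splits between `lower`, `split` and `startswith`, which is the information needed before deciding what to optimise next.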

Archived:	13 years, 8 months ago
Views:	4291
Last activity:	13 years, 8 months ago