What is the fastest way in Python to check whether a string matches any term in a list of words, phrases and boolean ANDs?

joh*_*ati 5 python regex string algorithm pattern-matching

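I am trying to find a fast way in Python to check whether a list of terms can be matched against strings ranging in size from 50 to 50,000 characters.

A term can be:

A word, e.g. 'apple'
A phrase, e.g. 'cherry pie'
A boolean AND of words and phrases, e.g. 'sweet pie AND savoury pie AND meringue'

A match is where the word or phrase occurs delimited by word boundaries, so: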

match(term='apple', string='An apple a day.') # True
match(term='berry pie', string='A delicious berry pie.') # True
match(term='berry pie', string='A delicious blueberry pie.') # False
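The `match` calls above describe the intended behaviour rather than an existing function. A minimal sketch of such a helper, assuming case-insensitive matching and using `re.escape` (an addition here, since the terms in the question are plain words and phrases), might look like this:

```python
import re

def match(term, string):
    """Return True if `term` occurs in `string` delimited by word boundaries."""
    # re.escape is a precaution in case a term ever contains regex
    # metacharacters; it is an assumption, not part of the question.
    pattern = r'\b%s\b' % re.escape(term)
    return re.search(pattern, string, re.IGNORECASE) is not None
```

With this sketch, 'berry pie' is not found in 'A delicious blueberry pie.' because there is no word boundary between 'blue' and 'berry'.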

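I currently have about 40 terms, most of which are simple words. The number of terms will grow over time, but I don't expect it to exceed 400.

I am not interested in which term(s) matched the string or where in the string the match occurred; I just need a true/false value for each string. It is much more likely that no term matches a given string, so for the 1 in 500 that does match, I can store the string for further processing.

Speed is the most important criterion, and I would rather build on code written by people smarter than me than try to implement a white paper. :)

The fastest solution I have come up with so far is: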

import re

def data():
    return [
        "The apple is the pomaceous fruit of the apple tree, species Malus domestica in the rose family (Rosaceae).",
        "This resulted in early armies adopting the style of hunter-foraging.",
        "Beef pie fillings are popular in Australia. Chicken pie fillings are too."
    ]

def boolean_and(terms):
    # One lookahead per term: every term must appear, in any order.
    return '(%s)' % (''.join(['(?=.*\\b%s\\b)' % (term) for term in terms]))

def run():
    words_and_phrases = ['apple', 'cherry pie']
    booleans = [boolean_and(terms) for terms in [['sweet pie', 'savoury pie', 'meringue'], ['chicken pie', 'beef pie']]]
    # OR all terms together; (?i) makes the whole pattern case-insensitive.
    regex = re.compile(r'(?i)(\b(%s)\b|%s)' % ('|'.join(words_and_phrases), '|'.join(booleans)))
    matched_data = list()
    for d in data():
        if regex.search(d):
            matched_data.append(d)
    return matched_data

The regex ends up as:

(?i)(\b(apple|cherry pie)\b|((?=.*\bsweet pie\b)(?=.*\bsavoury pie\b)(?=.*\bmeringue\b))|((?=.*\bchicken pie\b)(?=.*\bbeef pie\b)))

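So all the terms are ORed together, case is ignored, the words and phrases are wrapped in \b for word boundaries, and the boolean ANDs use lookaheads so that every term must match but the terms do not have to appear in any particular order.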
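To illustrate the order independence of the lookahead-based AND, here is a small sketch. It reuses the `boolean_and` idea from the question with `re.escape` added as a precaution; the sample sentences and the variable names are made up for the illustration.

```python
import re

def boolean_and(terms):
    # One lookahead per term; each scans ahead from the current position,
    # so the terms may appear in any order.
    return ''.join(r'(?=.*\b%s\b)' % re.escape(t) for t in terms)

pattern = re.compile(boolean_and(['chicken pie', 'beef pie']), re.IGNORECASE)

# Both terms present, in the order they were listed.
both_forward = pattern.search("Chicken pie and beef pie.") is not None
# Both terms present, in the opposite order.
both_reversed = pattern.search(
    "Beef pie fillings are popular. Chicken pie fillings are too.") is not None
# Only one of the two terms present.
only_one = pattern.search("Beef pie fillings are popular.") is not None
```

Note that `.` does not match a newline by default, so each lookahead only looks as far as the end of the current line.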

Timing results:

import timeit
print(timeit.Timer('run()', 'from __main__ import run').timeit(number=10000))
1.41534304619

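Without the lookaheads (i.e. the boolean ANDs) this is really fast, but once they are added it slows down considerably.

Does anybody have ideas on how this could be improved? Is there a way to optimise the lookaheads, or maybe an entirely different approach? I don't think stemming will work, as it tends to be a bit greedy with what it matches.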
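One likely contributor to the slowdown is that `search()` retries the pattern at successive start positions, and each `.*` lookahead rescans the rest of the line every time. As one possible different approach, sketched here under assumptions and not benchmarked, the simple terms can stay in a single alternation while each AND group is checked with separate searches. The names `word_regex`, `SIMPLE`, `AND_GROUPS` and `matches` are hypothetical.

```python
import re

def word_regex(terms):
    # Single alternation of escaped terms wrapped in word boundaries.
    alternation = '|'.join(re.escape(t) for t in terms)
    return re.compile(r'\b(?:%s)\b' % alternation, re.IGNORECASE)

# Simple words and phrases: any one of them is enough for a match.
SIMPLE = word_regex(['apple', 'cherry pie'])

# Each AND group is a list of compiled regexes that must all be found.
AND_GROUPS = [
    [word_regex([t]) for t in group]
    for group in (['sweet pie', 'savoury pie', 'meringue'],
                  ['chicken pie', 'beef pie'])
]

def matches(text):
    """Return True if any simple term or any complete AND group matches."""
    if SIMPLE.search(text):
        return True
    # all() stops at the first missing term; any() stops at the first full group.
    return any(all(r.search(text) for r in group) for group in AND_GROUPS)
```

Unlike the lookahead version, this sketch does not require the terms of an AND group to share a line, because each term is searched for across the whole string. The patterns are also compiled once at import time rather than on every call.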

Archived: 15 years, 2 months ago
Views: 5380
Last activity: 15 years, 2 months ago