Speed up millions of regex replacements in Python 3

pda*_*ese 117 python regex string performance replace

I'm using Python 3.5.2

I have two lists:

  • a list of about 750,000 "sentences" (long strings)
  • a list of about 20,000 "words" that I would like to delete from my 750,000 sentences

So, I have to loop through the 750,000 sentences and perform about 20,000 replacements, but only if my words are actually "words" and not part of a larger string.

I'm doing this by pre-compiling my words so that they are flanked by the \b word-boundary metacharacter:

compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]

Then I loop through my "sentences":

import re

for sentence in sentences:
  for word in compiled_words:
    sentence = re.sub(word, "", sentence)
  # put sentence into a growing list

This nested loop processes about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.

  • Is there a way to use the str.replace method (which I believe is faster), but still require that replacements only happen at word boundaries?

  • Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping re.sub when the length of my word is greater than the length of the sentence, but it's not much of an improvement.

Thank you for any advice.

Lit*_*eye 114

One thing you can try is to compile one single pattern like "\b(word1|word2|word3)\b".

Because re relies on C code to do the actual matching, the savings can be dramatic.

As @pvg pointed out in the comments, it also benefits from single-pass matching.

If your words are not regexes, Eric's answer is faster.
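
A minimal sketch of that single-pattern idea (illustrative only; my20000words and sentences are the lists from the question, and re.escape plus a non-capturing group are my additions to keep the example safe if any word contains regex metacharacters):

import re

# Build one alternation of all words, wrapped in word boundaries,
# so each sentence is scanned only once.
big_pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, my20000words)) + r')\b')

cleaned_sentences = [big_pattern.sub('', sentence) for sentence in sentences]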

  • @Liteye your suggestion turned 4 hours of work into 4 minutes! I was able to join all 20,000+ regexes into one giant regex and my laptop didn't bat an eye. Thanks again. (38 upvotes)
  • It's not just the C impl (which makes a big difference), but you also match in a single pass. Variants of this question come up quite often; it's a bit strange that there isn't (or maybe there is, hidden somewhere?) a canonical answer for this very sensible idea. (4 upvotes)
  • @Bakuriu: `s/they actually use/they actually could in theory sometimes use/`. Do you have any reason to believe Python's implementation does anything other than a loop? (2 upvotes)
  • @Bakuriu: I'd be really interested to know if that's the case, but I don't think the regex solution takes linear time. If it doesn't build a Trie out of the union, I don't see how it could. (2 upvotes)
  • @Bakuriu: That's not a reason. I asked whether you have a reason to believe the implementation **actually** behaves that way, not whether you have a reason to believe it *could* behave that way. Personally I have yet to come across a single mainstream programming language's regex implementation that works in linear time the way you'd expect a classical regex to, so if you know Python does this, you should show some evidence. (2 upvotes)
  • Just for fun, I wrote an [answer](http://stackoverflow.com/a/42789508/6419007) using a Trie regexp. It's about 1000 times faster than a regex union. (2 upvotes)

Eri*_*nil 111

TLDR

Use this method (with set lookup) if you want the fastest solution. For a dataset similar to the OP's, it's approximately 2000 times faster than the accepted answer.

If you insist on using a regex for the lookup, use this trie-based version, which is still 1000 times faster than a regex union.

Theory

If your sentences aren't humongous strings, it's probably feasible to process many more than 50 per second.

If you save all the banned words into a set, checking whether another word is in that set will be very fast.

Pack the logic into a function, give this function as an argument to re.sub, and you're done!

import re
with open('/usr/share/dict/american-english') as wordbook:
    banned_words = set(word.strip().lower() for word in wordbook)


def delete_banned_words(matchobj):
    word = matchobj.group(0)
    if word.lower() in banned_words:
        return ""
    else:
        return word

sentences = ["I'm eric. Welcome here!", "Another boring sentence.",
             "GiraffeElephantBoat", "sfgsdg sdwerha aswertwe"] * 250000

word_pattern = re.compile(r'\w+')

for sentence in sentences:
    sentence = word_pattern.sub(delete_banned_words, sentence)

The converted sentences are:

' .  !
  .
GiraffeElephantBoat
sfgsdg sdwerha aswertwe

Notes:

  • the search is case-insensitive (thanks to lower())
  • replacing a word with "" might leave two spaces (as in your code)
  • with python3, \w+ also matches accented characters (e.g. "ångström")
  • any non-word character (tab, space, newline, marks, ...) stays untouched

Performance

With a million sentences and banned_words containing almost 100,000 words, the script runs in less than 7 seconds.

In comparison, Liteye's answer needed 160 seconds for 10 thousand sentences.

With n being the total amount of words and m the amount of banned words, the OP's and Liteye's code are O(n*m).

In comparison, my code should run in O(n+m). Considering that there are many more sentences than banned words, the algorithm becomes O(n).

Regex union test

What's the complexity of a regex search with a '\b(word1|word2|...|wordN)\b' pattern? Is it O(N) or O(1)?

It's pretty hard to grasp the way the regex engine works, so let's write a simple test.

This code extracts 10**i random English words into a list. It creates the corresponding regex union, and tests it with different words:

  • one is obviously not a word (it begins with #)
  • one is the first word in the list
  • one is the last word in the list
  • one looks like a word but isn't


import re
import timeit
import random

with open('/usr/share/dict/american-english') as wordbook:
    english_words = [word.strip().lower() for word in wordbook]
    random.shuffle(english_words)

print("First 10 words :")
print(english_words[:10])

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", english_words[0]),
    ("Last word", english_words[-1]),
    ("Almost a word", "couldbeaword")
]


def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nUnion of %d words" % 10**exp)
    union = re.compile(r"\b(%s)\b" % '|'.join(english_words[:10**exp]))
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %-17s : %.1fms" % (description, time))

It outputs:

First 10 words :
["geritol's", "sunstroke's", 'fib', 'fergus', 'charms', 'canning', 'supervisor', 'fallaciously', "heritage's", 'pastime']

Union of 10 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 0.7ms
  Almost a word     : 0.7ms

Union of 100 words
  Surely not a word : 0.7ms
  First word        : 1.1ms
  Last word         : 1.2ms
  Almost a word     : 1.2ms

Union of 1000 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 9.6ms
  Almost a word     : 10.1ms

Union of 10000 words
  Surely not a word : 1.4ms
  First word        : 1.8ms
  Last word         : 96.3ms
  Almost a word     : 116.6ms

Union of 100000 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 1227.1ms
  Almost a word     : 1404.1ms

So it looks like the search for a single word with a '\b(word1|word2|...|wordN)\b' pattern has:

  • O(1) best case
  • O(n/2) average case, which is still O(n)
  • O(n) worst case

These results are consistent with a simple loop search.

A much faster alternative to a regex union is to create the regex pattern from a trie.

  • Since you removed the misleading "O(1)" claim, your answer definitely deserves an upvote. (3 upvotes)
  • @idmean: True, that wasn't very clear. It just referred to the lookup: "is this word a banned word?". (2 upvotes)
  • @EricDuminil: Great job! I wish I could upvote a second time. (2 upvotes)

Eri*_*nil 94

TLDR

Use this method if you want the fastest regex-based solution. For a dataset similar to the OP's, it's approximately 1000 times faster than the accepted answer.

If you don't care about regex, use this set-based version, which is 2000 times faster than a regex union.

Optimized regex with a Trie

A simple regex union approach becomes slow with many banned words, because the regex engine doesn't do a very good job of optimizing the pattern.

It's possible to create a Trie with all the banned words and write the corresponding regex. The resulting trie or regex aren't really human-readable, but they do allow for very fast lookup and matching.

['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']

Regex union

The list is converted to a trie:

{
    'f': {
        'o': {
            'o': {
                'x': {
                    'a': {
                        'r': {
                            '': 1
                        }
                    }
                },
                'b': {
                    'a': {
                        'r': {
                            '': 1
                        },
                        'h': {
                            '': 1
                        }
                    }
                },
                'z': {
                    'a': {
                        '': 1,
                        'p': {
                            '': 1
                        }
                    }
                }
            }
        }
    }
}

And then to this regex pattern:

r"\bfoo(?:ba[hr]|xar|zap?)\b"

Regex trie

The huge advantage is that to test whether zoo matches, the regex engine only needs to compare the first character (it doesn't match), instead of trying the 5 words. It's preprocessing overkill for 5 words, but it shows promising results for many thousands of words.

Note that (?:) non-capturing groups are used: the alternation needs parentheses for grouping, but a capturing group would only slow matching down and use more memory without any benefit for our purpose.

Here's a slightly modified gist, which we can use as a trie.py library:

import re


class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except:
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())
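
As a quick illustration (my addition, not part of the original answer), the class above reproduces the pattern shown earlier for the small example list. It assumes the class has been saved as trie.py, as suggested:

import re
from trie import Trie

trie = Trie()
for word in ['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']:
    trie.add(word)

print(trie.pattern())
# foo(?:ba[hr]|xar|zap?)

# Wrap it with word boundaries, exactly like the test below does.
union = re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)
print(union.sub("", "foobar says foozap to the zoo"))
# " says  to the zoo"  (both banned words removed, leaving extra spaces)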

Test

Here is a small test (the same as the first one):

# Encoding: utf-8
import re
import timeit
import random
from trie import Trie

with open('/usr/share/dict/american-english') as wordbook:
    banned_words = [word.strip().lower() for word in wordbook]
    random.shuffle(banned_words)

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", banned_words[0]),
    ("Last word", banned_words[-1]),
    ("Almost a word", "couldbeaword")
]

def trie_regex_from_words(words):
    trie = Trie()
    for word in words:
        trie.add(word)
    return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)

def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nTrieRegex of %d words" % 10**exp)
    union = trie_regex_from_words(banned_words[:10**exp])
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %s : %.1fms" % (description, time))

It outputs:

TrieRegex of 10 words
  Surely not a word : 0.3ms
  First word : 0.4ms
  Last word : 0.5ms
  Almost a word : 0.5ms

TrieRegex of 100 words
  Surely not a word : 0.3ms
  First word : 0.5ms
  Last word : 0.9ms
  Almost a word : 0.6ms

TrieRegex of 1000 words
  Surely not a word : 0.3ms
  First word : 0.7ms
  Last word : 0.9ms
  Almost a word : 1.1ms

TrieRegex of 10000 words
  Surely not a word : 0.1ms
  First word : 1.0ms
  Last word : 1.2ms
  Almost a word : 1.2ms

TrieRegex of 100000 words
  Surely not a word : 0.3ms
  First word : 1.2ms
  Last word : 0.9ms
  Almost a word : 1.6ms

For info, the regex begins like this:

(?:a(?:(?:\'s|a(?:\'s|chen|liyah(?:\'s)?|r(?:dvark(?:(?:\'s|s))?|on))|b(?:\'s|a(?:c(?:us(?:(?:\'s|es))?|[ik])|ft|lone(?:(?:\'s|s))?|ndon(?:(?:ed|ing|ment(?:\'s)?|s))?|...

It's really unreadable, but for a list of 100,000 banned words, this Trie regex is 1000 times faster than a simple regex union!

Here is a diagram of the complete trie, exported with trie-python-graphviz and graphviz twopi:

(image: graph of the full trie)

  • @XavierCombelle: You're right, I should have mentioned the capturing group: the answer has been updated. I see it the other way around though: parentheses are needed for regex alternation with `|`, but a capturing group isn't needed at all for our purpose. It would just slow things down and use more memory without any benefit. (3 upvotes)
  • @EricDuminil This post is perfect, thank you so much :) (3 upvotes)

Den*_*loe 14

One thing you might want to try is preprocessing the sentences to encode the word boundaries. Basically turn each sentence into a list of words by splitting on word boundaries.

This should be faster, because to process a sentence you just have to step through each of the words and check whether it's a match.

Currently, the regex search has to go over the whole string again on every pass, looking for word boundaries, and then "discarding" the result of this work before the next pass.
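
A minimal sketch of that preprocessing idea (my illustration, not the answerer's code; banned_words and sentences are hypothetical stand-ins for the question's data):

import re

banned_words = {"word1", "word2"}                        # stand-in for the 20,000 banned words
sentences = ["word1 is here", "keep wordling intact"]    # stand-in for the 750,000 sentences

# Split each sentence once; the capturing group keeps the separators,
# so the sentence can be rebuilt without losing whitespace or punctuation.
splitter = re.compile(r"(\W+)")

cleaned = []
for sentence in sentences:
    tokens = splitter.split(sentence)
    cleaned.append("".join(t for t in tokens if t.lower() not in banned_words))

print(cleaned)
# [' is here', 'keep wordling intact']  (note the leftover space where a word was removed)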


peu*_*feu 8

Well, here's a quick and easy solution, with a test set.

Winning strategy:

re.sub(r"\w+", repl, sentence) searches for words.

"repl" can be a callable. I used a function that performs a dict lookup, and the dict contains the words to search for and replace.

This is the simplest and fastest solution (see function replace4 in the example code below).

Second best:

The idea is to split the sentences into words using re.split, while keeping the separators, in order to reconstruct the sentences later. Then the replacements are done with a simple dict lookup.

(See function replace3 in the example code below.)

Timings for the functions:

replace1: 0.62 sentences/s
replace2: 7.43 sentences/s
replace3: 48498.03 sentences/s
replace4: 61374.97 sentences/s (...and 240.000/s with PyPy)

... and the code:

#! /bin/env python3
# -*- coding: utf-8

import time, random, re

def replace1( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns:
            sentence = re.sub( "\\b"+search+"\\b", repl, sentence )

def replace2( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns_comp:
            sentence = re.sub( search, repl, sentence )

def replace3( sentences ):
    pd = patterns_dict.get
    for n, sentence in enumerate( sentences ):
        #~ print( n, sentence )
        # Split the sentence on non-word characters.
        # Note: () in split patterns ensure the non-word characters ARE kept
        # and returned in the result list, so we don't mangle the sentence.
        # If ALL separators are spaces, use string.split instead or something.
        # Example:
        #~ >>> re.split(r"([^\w]+)", "ab céé? . d2eéf")
        #~ ['ab', ' ', 'céé', '? . ', 'd2eéf']
        words = re.split(r"([^\w]+)", sentence)

        # and... done.
        sentence = "".join( pd(w,w) for w in words )

        #~ print( n, sentence )

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w,w)

    for n, sentence in enumerate( sentences ):
        sentence = re.sub(r"\w+", repl, sentence)



# Build test set
test_words = [ ("word%d" % _) for _ in range(50000) ]
test_sentences = [ " ".join( random.sample( test_words, 10 )) for _ in range(1000) ]

# Create search and replace patterns
patterns = [ (("word%d" % _), ("repl%d" % _)) for _ in range(20000) ]
patterns_dict = dict( patterns )
patterns_comp = [ (re.compile("\\b"+search+"\\b"), repl) for search, repl in patterns ]


def test( func, num ):
    t = time.time()
    func( test_sentences[:num] )
    print( "%30s: %.02f sentences/s" % (func.__name__, num/(time.time()-t)))

print( "Sentences", len(test_sentences) )
print( "Words    ", len(test_words) )

test( replace1, 1 )
test( replace2, 10 )
test( replace3, 1000 )
test( replace4, 1000 )


kar*_*kfa 6

Maybe Python isn't the right tool here. Here's one with the Unix toolchain:

sed G file         |
tr ' ' '\n'        |
grep -vf blacklist |
awk -v RS= -v OFS=' ' '{$1=$1}1'

This assumes your blacklist file is preprocessed with the word boundaries added. The steps are: convert the file to double-spaced, split each sentence into one word per line, mass-delete the blacklisted words from the file, and merge the lines back together.

This should run at least an order of magnitude faster.

To preprocess the blacklist file from words (one word per line):

sed 's/.*/\\b&\\b/' words > blacklist


Lie*_*yan 5

How about this:

#!/usr/bin/env python3

from __future__ import unicode_literals, print_function
import re
import time
import io

def replace_sentences_1(sentences, banned_words):
    # faster on CPython, but does not use \b as the word separator
    # so result is slightly different than replace_sentences_2()
    def filter_sentence(sentence):
        words = WORD_SPLITTER.split(sentence)
        words_iter = iter(words)
        for word in words_iter:
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word
            yield next(words_iter) # yield the word separator

    WORD_SPLITTER = re.compile(r'(\W+)')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


def replace_sentences_2(sentences, banned_words):
    # slower on CPython, uses \b as separator
    def filter_sentence(sentence):
        boundaries = WORD_BOUNDARY.finditer(sentence)
        current_boundary = 0
        while True:
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            yield sentence[last_word_boundary:current_boundary] # yield the separators
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            word = sentence[last_word_boundary:current_boundary]
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word

    WORD_BOUNDARY = re.compile(r'\b')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


corpus = io.open('corpus2.txt').read()
banned_words = [l.lower() for l in open('banned_words.txt').read().splitlines()]
sentences = corpus.split('. ')
output = io.open('output.txt', 'wb')
print('number of sentences:', len(sentences))
start = time.time()
for sentence in replace_sentences_1(sentences, banned_words):
    output.write(sentence.encode('utf-8'))
    output.write(b' .')
print('time:', time.time() - start)

These solutions split on word boundaries and look up each word in a set. They should be faster than re.sub with word alternates (Liteye's solution) because these solutions are O(n), where n is the size of the input, thanks to the amortized O(1) set lookup, while using regex alternates would cause the regex engine to have to check for word matches at every character rather than just at word boundaries. My solutions take extra care to preserve the whitespace that was used in the original text (i.e. they don't compress whitespace and they preserve tabs, newlines, and other whitespace characters), but if you decide you don't care about that, it should be fairly straightforward to strip them from the output.

I tested on corpus.txt, which is a concatenation of multiple eBooks downloaded from the Gutenberg Project, and banned_words.txt is 20000 words randomly picked from Ubuntu's wordlist (/usr/share/dict/american-english). It takes around 30 seconds to process 862462 sentences (and half of that on PyPy). I've defined sentences as anything separated by ". ".

$ # replace_sentences_1()
$ python3 filter_words.py 
number of sentences: 862462
time: 24.46173644065857
$ pypy filter_words.py 
number of sentences: 862462
time: 15.9370770454

$ # replace_sentences_2()
$ python3 filter_words.py 
number of sentences: 862462
time: 40.2742919921875
$ pypy filter_words.py 
number of sentences: 862462
time: 13.1190629005
Run Code Online (Sandbox Code Playgroud)

PyPy benefits particularly from the second approach, while CPython fared better with the first one. The above code should work on both Python 2 and 3.