List of compiled regular expressions in Python

roy*_*att 1 python regex

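I have a lot of replacement patterns that I need for text cleaning. For performance reasons, I load the data from a database and compile the regular expressions up front. Unfortunately, with my approach only the last assignment to the variable "text" seems to take effect, while the others appear to be overwritten: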

# -*- coding: utf-8 -*-
import cx_Oracle
import re

connection = cx_Oracle.connect("SCHEMA", "passWORD", "TNS")
cursor = connection.cursor()
cursor.execute("""select column_1, column_2
from table""")

# Variables for matching
REPLACE_1 = re.compile(r'(sample_pattern_1)')
REPLACE_2 = re.compile(r'(sample_pattern_2)')
# ..
REPLACE_99 = re.compile(r'(sample_pattern_99)')
REPLACE_100 = re.compile(r'(sample_pattern_100)')

def extract_from_db():
    text = ''
    for row in cursor:
        # sidenote: each substitution text has the same name as the corresponding variable, but as a string of course
        text = REPLACE_1.sub(r'REPLACE_1',str(row[0]))
        text = REPLACE_2.sub(r'REPLACE_2',str(row[0]))
        # ..
        text = REPLACE_99.sub(r'REPLACE_99',str(row[0]))
        text = REPLACE_100.sub(r'REPLACE_100',str(row[0]))
        print text

extract_from_db()

Does anyone know an elegant, working way to solve this? Or do I have to handle it with a huge if/elif control structure?

Mar*_*ers 7

You keep overwriting the previous result with a substitution on str(row[0]). Use text instead, so that the substitutions accumulate:

text = REPLACE_1.sub(r'REPLACE_1', str(row[0]))
text = REPLACE_2.sub(r'REPLACE_2', text)
# ..
text = REPLACE_99.sub(r'REPLACE_99', text)
text = REPLACE_100.sub(r'REPLACE_100', text)
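To make the difference concrete, here is a minimal sketch that runs without a database. The two patterns and the sample string are made up for illustration; they stand in for the REPLACE_n variables from the question.

```python
import re

# Hypothetical patterns standing in for the REPLACE_n variables in the question.
REPLACE_1 = re.compile(r'(foo)')
REPLACE_2 = re.compile(r'(bar)')

source = 'foo and bar'

# Every call starts again from the original string,
# so only the result of the last call survives.
overwritten = REPLACE_1.sub(r'REPLACE_1', source)
overwritten = REPLACE_2.sub(r'REPLACE_2', source)

# Every call works on the previous result,
# so the substitutions accumulate.
accumulated = REPLACE_1.sub(r'REPLACE_1', source)
accumulated = REPLACE_2.sub(r'REPLACE_2', accumulated)

print(overwritten)   # foo and REPLACE_2
print(accumulated)   # REPLACE_1 and REPLACE_2
```

Because each substitution now sees the output of the previous one, the order of the calls can matter: a later pattern may match text that an earlier replacement inserted.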

You'd be better off using an actual list:

REPLACEMENTS = [
    (re.compile(r'(sample_pattern_1)'), r'REPLACE_1'),
    (re.compile(r'(sample_pattern_2)'), r'REPLACE_2'),
    # ..
    (re.compile(r'(sample_pattern_99)'), r'REPLACE_99'),
    (re.compile(r'(sample_pattern_100)'), r'REPLACE_100'),
]

And use them in a loop:

text = str(row[0])
for pattern, replacement in REPLACEMENTS:
    text = pattern.sub(replacement, text)
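Put together, the list-and-loop approach might look like the sketch below. The patterns, the replacement strings, the clean() helper, and the rows list (a stand-in for the database cursor) are all assumptions for illustration, since the real patterns are elided in the question.

```python
import re

# Hypothetical (compiled pattern, replacement) pairs.
REPLACEMENTS = [
    (re.compile(r'\d+'), 'NUMBER'),
    (re.compile(r'\s+'), ' '),
]


def clean(value):
    """Apply every (pattern, replacement) pair, in order, to str(value)."""
    text = str(value)
    for pattern, replacement in REPLACEMENTS:
        text = pattern.sub(replacement, text)
    return text


# Stand-in for iterating over the cursor: each row is a tuple of columns.
rows = [('order   42',), (7,)]
cleaned = [clean(row[0]) for row in rows]
print(cleaned)  # ['order NUMBER', 'NUMBER']
```

Keeping the pairs in one list means adding the 101st pattern is a one-line change, and the loop body never has to be touched.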

Or use functools.partial() to simplify the loop further:

from functools import partial

REPLACEMENTS = [
    partial(re.compile(r'(sample_pattern_1)').sub, r'REPLACE_1'),
    partial(re.compile(r'(sample_pattern_2)').sub, r'REPLACE_2'),
    # ..
    partial(re.compile(r'(sample_pattern_99)').sub, r'REPLACE_99'),
    partial(re.compile(r'(sample_pattern_100)').sub, r'REPLACE_100'),
]

And the loop:

text = str(row[0])
for replacement in REPLACEMENTS:
    text = replacement(text)
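This works because a compiled pattern's sub() method takes the replacement first and the string second, so partial() can fix the replacement and leave a one-argument callable. A small runnable sketch, with a hypothetical make_replacer() helper and made-up patterns:

```python
import re
from functools import partial


def make_replacer(pattern, replacement):
    """Return a callable text -> text; pattern.sub(replacement, text) under the hood."""
    return partial(re.compile(pattern).sub, replacement)


# Hypothetical patterns, for illustration only.
REPLACEMENTS = [
    make_replacer(r'colou?r', 'COLOR'),
    make_replacer(r'gr[ae]y', 'GREY'),
]

text = 'grey colour, gray color'
for replacement in REPLACEMENTS:
    text = replacement(text)

print(text)  # GREY COLOR, GREY COLOR
```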

Or, using the list of partial() objects above together with reduce():

# On Python 3, import it first: from functools import reduce
text = reduce(lambda txt, repl: repl(txt), REPLACEMENTS, str(row[0]))
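For reference, a self-contained version of the reduce() variant could look like this. The two patterns and the clean() wrapper are assumptions for illustration. Importing reduce from functools works on Python 2.6+ as well as Python 3, where it is no longer a builtin.

```python
import re
from functools import partial, reduce

# Hypothetical replacers: strip simple tags, then collapse whitespace.
REPLACEMENTS = [
    partial(re.compile(r'<[^>]+>').sub, ''),
    partial(re.compile(r'\s+').sub, ' '),
]


def clean(value):
    """Thread str(value) through every replacer, left to right."""
    return reduce(lambda txt, repl: repl(txt), REPLACEMENTS, str(value))


result = clean('<b>hello</b>   world')
print(result)  # hello world
```

reduce() here is just the explicit loop folded into one expression; the plain for loop does the same work and may be easier to read.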