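I have a lot of replacement patterns that I need for text cleaning. For performance reasons I load the data from a database and compile the regular expressions up front. Unfortunately, with my approach only the last assignment to the variable "text" seems to take effect, while the others appear to be overwritten: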
    # -*- coding: utf-8 -*-
    import cx_Oracle
    import re

    connection = cx_Oracle.connect("SCHEMA", "passWORD", "TNS")
    cursor = connection.cursor()
    cursor.execute("""select column_1, column_2
    from table""")

    # Variables for matching
    REPLACE_1 = re.compile(r'(sample_pattern_1)')
    REPLACE_2 = re.compile(r'(sample_pattern_2)')
    # ..
    REPLACE_99 = re.compile(r'(sample_pattern_99)')
    REPLACE_100 = re.compile(r'(sample_pattern_100)')

    def extract_from_db():
        text = ''
        for row in cursor:
            # sidenote: each substitution text has the same name as the corresponding variable name, but as a string of course
            text = REPLACE_1.sub(r'REPLACE_1',str(row[0]))
            text = REPLACE_2.sub(r'REPLACE_2',str(row[0]))
            # ..
            text = REPLACE_99.sub(r'REPLACE_99',str(row[0]))
            text = REPLACE_100.sub(r'REPLACE_100',str(row[0]))
            print text

    extract_from_db()
Does anyone know how to solve this in an elegant, working way? Or do I have to solve it with a huge if/elif control structure?
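You keep overwriting the previous result with a substitution applied to str(row[0]). Pass text instead, so that the substitutions accumulate: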
    text = REPLACE_1.sub(r'REPLACE_1', str(row[0]))
    text = REPLACE_2.sub(r'REPLACE_2', text)
    # ..
    text = REPLACE_99.sub(r'REPLACE_99', text)
    text = REPLACE_100.sub(r'REPLACE_100', text)
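To make the difference concrete, here is a minimal sketch that runs without a database. The two patterns and the sample string are made-up stand-ins for the REPLACE_n patterns in the question, not anything taken from it.

```python
import re

# Hypothetical stand-ins for the REPLACE_n patterns in the question.
DIGITS = re.compile(r'\d+')
SPACES = re.compile(r'\s+')

def overwrite(source):
    # Each call starts again from `source`, so only the last result survives.
    text = DIGITS.sub('N', source)
    text = SPACES.sub('_', source)
    return text

def accumulate(source):
    # Each call works on the previous result, so every substitution is kept.
    text = DIGITS.sub('N', source)
    text = SPACES.sub('_', text)
    return text

print(overwrite('room 101  floor 3'))   # room_101_floor_3
print(accumulate('room 101  floor 3'))  # room_N_floor_N
```

In the first function the digit substitution is computed and then thrown away, which is the behaviour described in the question.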
You would be better off using an actual list:
    REPLACEMENTS = [
        (re.compile(r'(sample_pattern_1)'), r'REPLACE_1'),
        (re.compile(r'(sample_pattern_2)'), r'REPLACE_2'),
        # ..
        (re.compile(r'(sample_pattern_99)'), r'REPLACE_99'),
        (re.compile(r'(sample_pattern_100)'), r'REPLACE_100'),
    ]
and apply them in a loop:
    text = str(row[0])
    for pattern, replacement in REPLACEMENTS:
        text = pattern.sub(replacement, text)
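As a self-contained sketch of the list-plus-loop approach, the example below wraps the loop in a helper and feeds it in-memory rows instead of a cursor. The patterns, the `clean` helper and the `rows` data are assumptions chosen for illustration. Note that the order of the list matters, because each pattern sees the output of the one before it.

```python
import re

# Hypothetical patterns; the date pattern must run before the generic
# number pattern, otherwise the digits of the date would already be gone.
REPLACEMENTS = [
    (re.compile(r'\b\d{4}-\d{2}-\d{2}\b'), 'DATE'),
    (re.compile(r'\d+'), 'NUMBER'),
    (re.compile(r'\s+'), ' '),
]

def clean(value):
    # Start from the column value and feed each result into the next pattern.
    text = str(value)
    for pattern, replacement in REPLACEMENTS:
        text = pattern.sub(replacement, text)
    return text

# Stand-in for rows coming from a database cursor.
rows = [('paid 42   on 2013-05-01',), (7,)]
for row in rows:
    print(clean(row[0]))
# paid NUMBER on DATE
# NUMBER
```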
Or use functools.partial() to simplify the loop further:
    from functools import partial

    REPLACEMENTS = [
        partial(re.compile(r'(sample_pattern_1)').sub, r'REPLACE_1'),
        partial(re.compile(r'(sample_pattern_2)').sub, r'REPLACE_2'),
        # ..
        partial(re.compile(r'(sample_pattern_99)').sub, r'REPLACE_99'),
        partial(re.compile(r'(sample_pattern_100)').sub, r'REPLACE_100'),
    ]
and the loop becomes:
    text = str(row[0])
    for replacement in REPLACEMENTS:
        text = replacement(text)
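One possible benefit of the partial() form is that every entry is just a callable taking a string and returning a string, so ordinary functions can sit in the same list. The sketch below illustrates this with made-up patterns; `strip_tags`, `collapse_ws`, `STEPS` and `clean` are hypothetical names, not part of the original code.

```python
import re
from functools import partial

# partial(pattern.sub, repl) binds the replacement, leaving a function of
# the input string only: step(text) == pattern.sub(repl, text).
strip_tags = partial(re.compile(r'<[^>]+>').sub, '')
collapse_ws = partial(re.compile(r'\s+').sub, ' ')

# Any str -> str callable fits here, including plain methods like str.strip.
STEPS = [strip_tags, collapse_ws, str.strip]

def clean(value):
    text = str(value)
    for step in STEPS:
        text = step(text)
    return text

print(clean('<b>hello</b>   <i>world</i> '))  # hello world
```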
Or take the list of patterns wrapped in partial() objects from above and combine it with reduce():
    from functools import reduce  # a builtin on Python 2; the import is required on Python 3
    text = reduce(lambda txt, repl: repl(txt), REPLACEMENTS, str(row[0]))
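For completeness, a runnable sketch of the reduce() variant follows. The two patterns and the `clean` wrapper are assumptions made up for the example. It imports reduce from functools, which works on Python 2.6+ and is required on Python 3.

```python
import re
from functools import partial, reduce

# Hypothetical patterns wrapped in partial(), as in the list above.
REPLACEMENTS = [
    partial(re.compile(r'colou?r').sub, 'COLOR'),
    partial(re.compile(r'gr[ae]y').sub, 'GRAY'),
]

def clean(value):
    # reduce threads the text through every callable, left to right,
    # starting from str(value) as the initial accumulator.
    return reduce(lambda text, repl: repl(text), REPLACEMENTS, str(value))

print(clean('grey colour, gray color'))  # GRAY COLOR, GRAY COLOR
```

This is equivalent to the explicit loop. Which of the two reads better is largely a matter of taste.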