sni*_*erd 4 python regex performance python-3.x
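I have a fairly simple set of requirements. I have a list (2 million items long) of objects, and each object has 2 attributes that need to be recoded (the other attributes are left unchanged).
The values ZERO ONE TWO ... TEN need to be changed to their numeric equivalents: 0 1 2 ... 10
Examples: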
ONE MAIN STREET -> 1 MAIN STREET
BONE ROAD -> BONE ROAD
BUILDING TWO, THREE MAIN ROAD -> BUILDING 2, 3 MAIN ROAD
ELEVEN MAIN ST -> ELEVEN MAIN ST
ONE HUNDRED FUNTOWN -> 1 HUNDRED FUNTOWN
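Obviously some numbers won't be changed, and some results come out looking odd. That is entirely expected.
I can get all of this done with the code below. My question is: is there a clever way to make it run faster? I thought about building a list of dictionaries where the key is the number word and the value is the digit, but I don't think that would help performance. Or should I re.compile each regex and pass them into this function? Any clever ideas for making this run faster?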
import re

def update_word_to_numeric(entrylist):
    updated_entrylist = []
    for theentry in entrylist:
        # One re.sub call per number word, first for addr_ln_1 ...
        theentry.addr_ln_1 = re.sub(r"\bZERO\b", "0", theentry.addr_ln_1)
        theentry.addr_ln_1 = re.sub(r"\bONE\b", "1", theentry.addr_ln_1)
        theentry.addr_ln_1 = re.sub(r"\bTWO\b", "2", theentry.addr_ln_1)
        theentry.addr_ln_1 = re.sub(r"\bTHREE\b", "3", theentry.addr_ln_1)
        theentry.addr_ln_1 = re.sub(r"\bFOUR\b", "4", theentry.addr_ln_1)
        theentry.addr_ln_1 = re.sub(r"\bFIVE\b", "5", theentry.addr_ln_1)
        theentry.addr_ln_1 = re.sub(r"\bSIX\b", "6", theentry.addr_ln_1)
        theentry.addr_ln_1 = re.sub(r"\bSEVEN\b", "7", theentry.addr_ln_1)
        theentry.addr_ln_1 = re.sub(r"\bEIGHT\b", "8", theentry.addr_ln_1)
        theentry.addr_ln_1 = re.sub(r"\bNINE\b", "9", theentry.addr_ln_1)
        theentry.addr_ln_1 = re.sub(r"\bTEN\b", "10", theentry.addr_ln_1)
        # ... then the same eleven substitutions for addr_ln_2.
        theentry.addr_ln_2 = re.sub(r"\bZERO\b", "0", theentry.addr_ln_2)
        theentry.addr_ln_2 = re.sub(r"\bONE\b", "1", theentry.addr_ln_2)
        theentry.addr_ln_2 = re.sub(r"\bTWO\b", "2", theentry.addr_ln_2)
        theentry.addr_ln_2 = re.sub(r"\bTHREE\b", "3", theentry.addr_ln_2)
        theentry.addr_ln_2 = re.sub(r"\bFOUR\b", "4", theentry.addr_ln_2)
        theentry.addr_ln_2 = re.sub(r"\bFIVE\b", "5", theentry.addr_ln_2)
        theentry.addr_ln_2 = re.sub(r"\bSIX\b", "6", theentry.addr_ln_2)
        theentry.addr_ln_2 = re.sub(r"\bSEVEN\b", "7", theentry.addr_ln_2)
        theentry.addr_ln_2 = re.sub(r"\bEIGHT\b", "8", theentry.addr_ln_2)
        theentry.addr_ln_2 = re.sub(r"\bNINE\b", "9", theentry.addr_ln_2)
        theentry.addr_ln_2 = re.sub(r"\bTEN\b", "10", theentry.addr_ln_2)
        updated_entrylist.append(theentry)
    return updated_entrylist
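Maybe this is already a perfectly good approach. A "this is good enough" comment would be fine with me too :)
Using one regular expression instead of ten is much faster (I measured roughly a 3x speedup):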
import re

def replace(match):
    # Look up the digit string for the matched number word.
    return {
        "ZERO": "0",
        "ONE": "1",
        "TWO": "2",
        "THREE": "3",
        "FOUR": "4",
        "FIVE": "5",
        "SIX": "6",
        "SEVEN": "7",
        "EIGHT": "8",
        "NINE": "9",
        "TEN": "10",
    }[match.group(1)]

# A single precompiled pattern that matches any of the eleven whole words.
pattern = re.compile(r"\b(ZERO|ONE|TWO|THREE|FOUR|FIVE|SIX|SEVEN|EIGHT|NINE|TEN)\b")

def update_word_to_numeric(entrylist):
    updated_entrylist = []
    for theentry in entrylist:
        theentry.addr_ln_1 = pattern.sub(replace, theentry.addr_ln_1)
        theentry.addr_ln_2 = pattern.sub(replace, theentry.addr_ln_2)
        updated_entrylist.append(theentry)
    return updated_entrylist
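I'm using a lesser-known feature of re.sub: its second argument can be a function, which takes the match object and returns the replacement string. That lets us look up the replacement string for whichever word matched.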
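As a variation on the callable-replacement idea, here is a minimal sketch that keeps the mapping in a module-level dict and builds the alternation from its keys, so the pattern and the mapping cannot drift apart. The names `WORD_TO_DIGIT`, `NUMBER_WORDS` and `words_to_digits` are hypothetical and not part of the code above.

```python
import re

# Hypothetical names; the mapping itself mirrors the eleven words in the question.
WORD_TO_DIGIT = {
    "ZERO": "0", "ONE": "1", "TWO": "2", "THREE": "3", "FOUR": "4",
    "FIVE": "5", "SIX": "6", "SEVEN": "7", "EIGHT": "8", "NINE": "9",
    "TEN": "10",
}

# Build the alternation from the dict keys; \b on both sides restricts
# matches to whole words, so "BONE" or "ELEVEN" are left alone.
NUMBER_WORDS = re.compile(
    r"\b(" + "|".join(map(re.escape, WORD_TO_DIGIT)) + r")\b"
)

def words_to_digits(text):
    # The callable receives each match object and returns its replacement.
    return NUMBER_WORDS.sub(lambda m: WORD_TO_DIGIT[m.group(1)], text)

print(words_to_digits("BUILDING TWO, THREE MAIN ROAD"))  # BUILDING 2, 3 MAIN ROAD
print(words_to_digits("BONE ROAD"))                      # BONE ROAD
```

Building the dict once at import time also avoids recreating it on every match, although I have not measured how much that matters.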
I also used re.compile to precompile the regular expression, which improved the timing as well, though not by nearly as much.
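If you want to check the precompilation effect on your own data, a rough sketch with `timeit` could look like the following. The function names and the sample string are assumptions for illustration, and the numbers will vary by machine. The gain is likely small because the `re` module keeps a cache of recently compiled patterns, so `re.sub` with a pattern string mostly pays for a cache lookup rather than a recompile.

```python
import re
import timeit

PATTERN_TEXT = r"\b(ZERO|ONE|TWO|THREE|FOUR|FIVE|SIX|SEVEN|EIGHT|NINE|TEN)\b"
COMPILED = re.compile(PATTERN_TEXT)
DIGITS = {
    "ZERO": "0", "ONE": "1", "TWO": "2", "THREE": "3", "FOUR": "4",
    "FIVE": "5", "SIX": "6", "SEVEN": "7", "EIGHT": "8", "NINE": "9",
    "TEN": "10",
}

def lookup(match):
    return DIGITS[match.group(1)]

def with_module_function(text):
    # Passes the pattern string each time; re consults its internal cache.
    return re.sub(PATTERN_TEXT, lookup, text)

def with_compiled_pattern(text):
    # Calls the method on the precompiled pattern object directly.
    return COMPILED.sub(lookup, text)

sample = "BUILDING TWO, THREE MAIN ROAD"  # hypothetical sample input
for func in (with_module_function, with_compiled_pattern):
    seconds = timeit.timeit(lambda: func(sample), number=100_000)
    print(f"{func.__name__}: {seconds:.3f}s")
```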