Best way to replace a list of tokens in a text file

ngu*_*uvn 7 python token

I have a text file (no punctuation) that is roughly 100 MB to 1 GB in size. Here are some sample lines:

please check in here
i have a full hd movie
see you again bye bye
press ctrl c to copy text to clipboard
i need your help
...

I also have a list of replacement tokens, like this:

check in -> check_in
full hd -> full_hd
bye bye -> bye_bye
ctrl c -> ctrl_c
...

After the replacement, I want the output to look like this:

please check_in here
i have a full_hd movie
see you again bye_bye
press ctrl_c to copy text to clipboard
i need your help
...

My current approach

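import re

replace_tokens = {'ctrl c': 'ctrl_c', 'full hd': 'full_hd', 'bye bye': 'bye_bye', 'check in': 'check_in'}

with open(filename, 'r') as fi, open(outfile, 'w') as fo:
    for line in fi:
        # One regex substitution per token, per line
        for token in replace_tokens:
            line = re.sub(r'\b{}\b'.format(token), replace_tokens[token], line)
        fo.write(line)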

This solution works, but it is very slow when there are many replacement tokens and the text file is large. Is there a better solution?

Dar*_*ylG 8

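Use binary files and string replacement, as follows:

  • Process the file as binary to reduce the overhead of file conversion (decoding bytes to text)
  • Use string replacement instead of regular expressions

Code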

def process_binary(filename):
    """ Replace strings using binary and string replace
        Processing follows original code flow except using
        binary files and string replace """

    # Map using binary strings
    replace_tokens = {b'ctrl c': b'ctrl_c', b'full hd': b'full_hd', b'bye bye': b'bye_bye', b'check in': b'check_in'}

    outfile = append_id(filename, 'processed')

    with open(filename, 'rb') as fi, open(outfile, 'wb') as fo:
        for line in fi:
            for token in replace_tokens:
                line = line.replace(token, replace_tokens[token])
            fo.write(line)

def append_id(filename, id):
    " Convenience handler for generating name of output file "
    return "{0}_{2}.{1}".format(*filename.rsplit('.', 1) + [id])
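
One caveat worth checking on real data: `bytes.replace` matches plain substrings, whereas the regex versions anchor on `\b` word boundaries. The sketch below shows where the two can differ. The sample line is made up for illustration and does not come from the question.

```python
import re

tokens = {b'check in': b'check_in'}
line = b'please check inside the box\n'   # hypothetical input, not from the question

# Plain substring replacement: also fires inside "check inside"
replaced_plain = line
for old, new in tokens.items():
    replaced_plain = replaced_plain.replace(old, new)

# Word-boundary regex on bytes: leaves "check inside" untouched
pattern = re.compile(b'|'.join(rb'\b' + re.escape(t) + rb'\b' for t in tokens))
replaced_regex = pattern.sub(lambda m: tokens[m.group(0)], line)

print(replaced_plain)  # b'please check_inside the box\n'
print(replaced_regex)  # b'please check inside the box\n'
```

If the token list cannot collide with longer words in the data, the two approaches give the same output and the faster `replace` version is safe to use.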

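Performance comparison

On a 124 MB file (generated by replicating the posted strings):

  • Posted solution: 82.8 seconds
  • Regex that avoids the inner loop (dawg's post): 28.2 seconds
  • Current solution: 9.5 seconds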

The current solution is:

  • about 8.7x faster than the posted solution
  • about 3x faster than the regex version (inner loop avoided)
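
The ratios follow directly from the timings listed above. A quick check, where the dictionary keys are just labels chosen for this sketch:

```python
# Timings in seconds, as reported above for the 124 MB file
timings = {'posted': 82.8, 'regex_no_inner_loop': 28.2, 'binary_replace': 9.5}

baseline = timings['binary_replace']
speedups = {name: round(t / baseline, 1) for name, t in timings.items()}

print(speedups)  # {'posted': 8.7, 'regex_no_inner_loop': 3.0, 'binary_replace': 1.0}
```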

Overall trend

Curves were plotted using Perfplot, which is based on timeit.
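
As a rough way to see the same kind of trend without Perfplot, the sketch below calls `timeit` directly on in-memory data of increasing size. It is an illustration under assumptions, not the benchmark used for the plot: the function names, sizes, and repeat counts are hypothetical, and it processes the whole buffer at once instead of reading a file line by line.

```python
import re
import timeit

SAMPLE = (b"please check in here\n"
          b"i have a full hd movie\n"
          b"see you again bye bye\n"
          b"press ctrl c to copy text to clipboard\n"
          b"i need your help\n")
TOKENS = {b'ctrl c': b'ctrl_c', b'full hd': b'full_hd',
          b'bye bye': b'bye_bye', b'check in': b'check_in'}
REGEX = re.compile(b'|'.join(rb'\b' + re.escape(t) + rb'\b' for t in TOKENS))

def with_replace(data):
    """One bytes.replace pass per token."""
    for old, new in TOKENS.items():
        data = data.replace(old, new)
    return data

def with_regex(data):
    """Single pass with one compiled alternation."""
    return REGEX.sub(lambda m: TOKENS[m.group(0)], data)

def benchmark(sizes=(100, 1000, 10000), repeats=3):
    """Return (copies, best replace time, best regex time) per input size."""
    results = []
    for n in sizes:
        data = SAMPLE * n
        t_replace = min(timeit.repeat(lambda: with_replace(data), number=1, repeat=repeats))
        t_regex = min(timeit.repeat(lambda: with_regex(data), number=1, repeat=repeats))
        results.append((n, t_replace, t_regex))
    return results

if __name__ == '__main__':
    for n, t_replace, t_regex in benchmark():
        print(f'{n:>6} copies: replace {t_replace:.4f}s, regex {t_regex:.4f}s')
```

Absolute numbers will vary by machine and Python version, so only the relative shape of the results is meaningful.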

Test code

# Generate Data by replicating posted string
s = """please check in here
i have a full hd movie
see you again bye bye
press ctrl c to copy text to clipboard
i need your help
"""
with open('test_data.txt', 'w') as fo:
    for i in range(1000000):  # Repeat string 1M times
        fo.write(s)

# Time Posted Solution
from time import time
import re

def posted(filename):
    replace_tokens = {'ctrl c': 'ctrl_c', 'full hd': 'full_hd', 'bye bye': 'bye_bye', 'check in': 'check_in'}

    outfile = append_id(filename, 'posted')
    with open(filename, 'r') as fi, open(outfile, 'w') as fo:
        for line in fi:
            for token in replace_tokens:
                line = re.sub(r'\b{}\b'.format(token), replace_tokens[token], line)
            fo.write(line)

def append_id(filename, id):
    return "{0}_{2}.{1}".format(*filename.rsplit('.', 1) + [id])

t0 = time()
posted('test_data.txt')
print('Elapsed time: ', time() - t0)
# Elapsed time:  82.84100198745728

# Time Current Solution
from time import time

def process_binary(filename):
    replace_tokens = {b'ctrl c': b'ctrl_c', b'full hd': b'full_hd', b'bye bye': b'bye_bye', b'check in': b'check_in'}

    outfile = append_id(filename, 'processed')
    with open(filename, 'rb') as fi, open(outfile, 'wb') as fo:
        for line in fi:
            for token in replace_tokens:
                line = line.replace(token, replace_tokens[token])
            fo.write(line)

def append_id(filename, id):
    return "{0}_{2}.{1}".format(*filename.rsplit('.', 1) + [id])


t0 = time()
process_binary('test_data.txt')
print('Elapsed time: ', time() - t0)
# Elapsed time:  9.593998670578003

# Time Processing using Regex 
# Avoiding inner loop--see dawg posted answer

import re 

def process_regex(filename):
    tokens={"check in":"check_in", "full hd":"full_hd",
    "bye bye":"bye_bye","ctrl c":"ctrl_c"}

    regex=re.compile("|".join([r"\b{}\b".format(t) for t in tokens]))

    outfile = append_id(filename, 'regex')
    with open(filename, 'r') as fi, open(outfile, 'w') as fo:
        for line in fi:
            line = regex.sub(lambda m: tokens[m.group(0)], line)
            fo.write(line)

def append_id(filename, id):
    return "{0}_{2}.{1}".format(*filename.rsplit('.', 1) + [id])

t0 = time()
process_regex('test_data.txt')
print('Elapsed time: ', time() - t0)
# Elapsed time:  28.27900242805481


daw*_*awg 5

You can at least eliminate the complexity of the inner loop by doing the following:

import re 

tokens={"check in":"check_in", "full hd":"full_hd",
"bye bye":"bye_bye","ctrl c":"ctrl_c"}

regex=re.compile("|".join([r"\b{}\b".format(t) for t in tokens]))

with open(your_file) as f:
    for line in f:
        line=regex.sub(lambda m: tokens[m.group(0)], line.rstrip())
        print(line)

Prints:

please check_in here
i have a full_hd movie
see you again bye_bye
press ctrl_c to copy text to clipboard
i need your help
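
The pattern above interpolates the tokens straight into the regex, which is fine for the plain-word tokens in the question. If the list could contain regex metacharacters, or one token is a prefix of another, a more defensive construction is to escape each token and try longer tokens first. A minimal sketch under those assumptions, where the `full hd movie` token is made up for illustration:

```python
import re

# "full hd movie" is a hypothetical token added to show the overlap case
tokens = {"full hd": "full_hd", "full hd movie": "full_hd_movie", "ctrl c": "ctrl_c"}

# Longest tokens first, so the alternation prefers the longest match;
# re.escape guards against tokens containing regex metacharacters
ordered = sorted(tokens, key=len, reverse=True)
regex = re.compile("|".join(r"\b{}\b".format(re.escape(t)) for t in ordered))

def replace_tokens(line):
    return regex.sub(lambda m: tokens[m.group(0)], line)

print(replace_tokens("i have a full hd movie"))  # i have a full_hd_movie
print(replace_tokens("press ctrl c to copy"))    # press ctrl_c to copy
```

Without the sort, the alternation would try `full hd` first and produce `full_hd movie`, because Python's regex alternation takes the first branch that matches, not the longest one.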