Parsing to replace quotation marks

B3t*_*th4 1 python regex quotes parsing nlp

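I am trying to parse a text file so that I can compute some statistics on it in Python. To do that, I want to replace some punctuation with tokens. One example of such a token covers all punctuation that ends a sentence (.!? becomes <EndS>). I managed to do that with regular expressions. Now I am trying to parse quotation marks, so I think I need a way to distinguish opening quotes from closing quotes. I am reading the input file line by line, and I cannot guarantee that the quotes will be balanced.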

For example:

 "Death to the traitors!" cried the exasperated burghers.
 "Go along with you," growled the officer, "you always cry the same thing over again. It is very tiresome."

should become something like:

 [Open] Death to the traitors! [Close] cried the exasperated burghers.
 [Open] Go along with you, [Close] growled the officer, [Open] you always cry the same thing over again. It is very tiresome. [Close]

Is it possible to do this with regular expressions? Is there an easier or better way to do it?

小智 5

You can use the sub method (from the re module):

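import re

def replace_dbquote(match):
    # match.group(0) is the whole quoted span, including both quote characters;
    # strip the quotes and wrap the remaining text in the two tokens.
    return '[OPEN]' + match.group(0).replace('"', '') + '[CLOSE]'

text = '"Death to the traitors!" cried the exasperated burghers. "Go along with you", growled the officer.'
# '"[^"]*"' matches an opening quote, any run of non-quote characters, and a closing quote,
# so only balanced pairs of quotes are replaced.
result = re.sub(r'"[^"]*"', replace_dbquote, text)

print(result)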

https://docs.python.org/3.5/library/re.html#re.sub
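
The pattern above only matches a complete pair of quotes, so a line with a lone quote is left unchanged. Since the question says the quotes may not be balanced, here is a minimal sketch of an alternative that replaces each quote on its own and alternates between the two tokens. The function name tag_quotes and its parameters are hypothetical, and resetting the open/close state at the start of every line is an assumption; a quotation that spans several lines would be tagged incorrectly.

```python
import re

def tag_quotes(line, open_tok="[Open]", close_tok="[Close]"):
    """Replace each double quote in one line with alternating open/close tokens."""
    is_open = False  # assumption: every line starts outside a quotation

    def swap(match):
        nonlocal is_open
        is_open = not is_open
        # A quote that opens a quotation gets the open token followed by a space;
        # a quote that closes one gets a space followed by the close token.
        return open_tok + " " if is_open else " " + close_tok

    # re.sub calls swap once per quote character, left to right.
    return re.sub(r'"', swap, line)

print(tag_quotes('"Death to the traitors!" cried the exasperated burghers.'))
print(tag_quotes('"An unbalanced line'))
```

With this approach an unbalanced line still gets an opening token, which may or may not be what you want for your statistics; counting the quotes first with line.count('"') is one way to detect such lines.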