wim · 6 · python, string, logging, parsing, dictionary
There is a logfile containing text as space-separated key=value pairs, and each line was originally serialized from data in a Python dict, something like:
' '.join([f'{k}={v!r}' for k,v in d.items()])
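As a concrete sketch of that serialization (the dict here is purely illustrative):

```python
d = {'key': 'hello world', 'n': 1234}

# Each value is rendered with repr(), so strings keep their quotes
# and the whole line round-trips as valid Python literals.
line = ' '.join(f'{k}={v!r}' for k, v in d.items())

# With insertion-ordered dicts (Python 3.7+) this yields:
# "key='hello world' n=1234"
```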
The keys are always just strings. The values may be anything that ast.literal_eval can successfully parse, no more and no less.
How can this logfile be processed, turning each line back into a Python dict? Example:
>>> to_dict("key='hello world'")
{'key': 'hello world'}
>>> to_dict("k1='v1' k2='v2'")
{'k1': 'v1', 'k2': 'v2'}
>>> to_dict("s='1234' n=1234")
{'s': '1234', 'n': 1234}
>>> to_dict("""k4='k5="hello"' k5={'k6': ['potato']}""")
{'k4': 'k5="hello"', 'k5': {'k6': ['potato']}}
Here is some additional background information about the data:
Edit: as requested in the comments, here is an MCVE along with an example of code that didn't work correctly:
>>> def to_dict(s):
... s = s.replace(' ', ', ')
... return eval(f"dict({s})")
...
...
>>> to_dict("k1='v1' k2='v2'")
{'k1': 'v1', 'k2': 'v2'} # OK
>>> to_dict("s='1234' n=1234")
{'s': '1234', 'n': 1234} # OK
>>> to_dict("key='hello world'")
{'key': 'hello, world'} # Incorrect, the value was corrupted
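The replace-based approach fails even harder when a value is a container literal with internal spaces; a minimal demonstration:

```python
def to_dict(s):
    # naive approach from above: treat every space as an argument separator
    s = s.replace(' ', ', ')
    return eval(f"dict({s})")

try:
    to_dict("k5={'k6': ['potato']}")
except SyntaxError:
    # the space after the colon became ", ", producing invalid syntax
    print("SyntaxError: the nested value was mangled")
```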
Your input can't be conveniently parsed by something like ast.literal_eval, but it can be tokenized as a series of Python tokens. That makes things easier than they might otherwise be.
The only places an = token can appear in your input are as key-value separators; at least for now, ast.literal_eval doesn't accept anything containing an = token. That means we can use the = tokens to determine where the key-value pairs start and end, and ast.literal_eval can handle most of the rest of the work. Using the tokenize module also avoids problems with = signs or backslash escapes inside string literals.
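To illustrate, here is roughly what the token stream looks like for one of the example lines. The = inside the string literal stays part of the STRING token, so only the two key-value separators show up as bare = tokens (the counting here is just for illustration):

```python
import io
import tokenize

line = "k4='k5=\"hello\"' k5={'k6': ['potato']}"
tokens = list(tokenize.tokenize(io.BytesIO(line.encode('utf8')).readline))

# Only the two key-value separators appear as bare '=' tokens;
# the '=' inside the quoted value is buried in a STRING token.
separators = [tok for tok in tokens if tok.string == '=']
print(len(separators))  # 2
```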
import ast
import io
import tokenize
def todict(logstring):
# tokenize.tokenize wants an argument that acts like the readline method of a binary
# file-like object, so we have to do some work to give it that.
input_as_file = io.BytesIO(logstring.encode('utf8'))
tokens = list(tokenize.tokenize(input_as_file.readline))
eqsign_locations = [i for i, token in enumerate(tokens) if token[1] == '=']
names = [tokens[i-1][1] for i in eqsign_locations]
# Values are harder than keys.
val_starts = [i+1 for i in eqsign_locations]
val_ends = [i-1 for i in eqsign_locations[1:]] + [len(tokens)]
# tokenize.untokenize likes to add extra whitespace that ast.literal_eval
# doesn't like. Removing the row/column information from the token records
# seems to prevent extra leading whitespace, but the documentation doesn't
# make enough promises for me to be comfortable with that, so we call
# strip() as well.
val_strings = [tokenize.untokenize(tok[:2] for tok in tokens[start:end]).strip()
for start, end in zip(val_starts, val_ends)]
vals = [ast.literal_eval(val_string) for val_string in val_strings]
return dict(zip(names, vals))
This behaves correctly on your example inputs, as well as on an example with backslashes:
>>> todict("key='hello world'")
{'key': 'hello world'}
>>> todict("k1='v1' k2='v2'")
{'k1': 'v1', 'k2': 'v2'}
>>> todict("s='1234' n=1234")
{'s': '1234', 'n': 1234}
>>> todict("""k4='k5="hello"' k5={'k6': ['potato']}""")
{'k4': 'k5="hello"', 'k5': {'k6': ['potato']}}
>>> s=input()
a='=' b='"\'' c=3
>>> todict(s)
{'a': '=', 'b': '"\'', 'c': 3}
Incidentally, we could look for NAME tokens instead of = tokens, but that would break if literal_eval ever gained set() support. Looking for = could also break in the future, but it seems less likely to break than looking for NAME tokens.