I have a text file that I need to analyze. Each line in the file has this form:
7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1
7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS ) Albahraj@nabwmps3 (License server system does not support this feature. (-18,327))
7:08:21 (slbfd) OUT: "OFM32" Albahraj@nabwmps3
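I need to skip the timestamp and the (slbfd), and only keep a count of the lines with IN and OUT. Further, depending on the name in quotes, I need to increase a per-name counter if a line starts with OUT and decrease it otherwise. How would I go about doing this in Python?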
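The other answers, which use regular expressions and line splitting, will get the job done, but if you want a fully maintainable solution that can grow with your needs, you should build a grammar. I like pyparsing for this: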
S ='''
7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1
7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS ) Albahraj@nabwmps3 (License server system does not support this feature. (-18,327))
7:08:21 (slbfd) OUT: "OFM32" Albahraj@nabwmps3'''
from collections import defaultdict
from pyparsing import Group, Literal, OneOrMore, QuotedString, Word, alphas, nums, restOfLine

# Define the grammar
num = Word(nums)
marker = Literal(":").suppress()
timestamp = Group(num + marker + num + marker + num)
label = Literal("(slbfd)")
flag = Word(alphas)("flag") + marker
name = QuotedString(quoteChar='"')("name")
line = timestamp + label + flag + name + restOfLine
grammar = OneOrMore(Group(line))

# Now parsing is a piece of cake!
P = grammar.parseString(S)

# Tally per name: +1 for each IN line, -1 for each OUT line;
# lines with any other flag (e.g. UNSUPPORTED) are ignored
counts = defaultdict(int)
for x in P:
    if x.flag == "IN":
        counts[x.name] += 1
    elif x.flag == "OUT":
        counts[x.name] -= 1

for key in counts:
    print(key, counts[key])
This gives the output:
lq_viz_server 1
OFM32 -1
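The result would be more impressive with a longer sample log file. The advantage of a pyparsing solution is that it can adapt to more complex queries in the future (e.g. grabbing and parsing the timestamp, pulling out the email address, parsing the error codes...). The idea is that you write the grammar independently of the query: you simply convert the raw text into a computer-friendly format, which abstracts the parsing implementation away from its usage.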
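As a rough illustration of that last point, here is a sketch of how the grammar might be extended to expose the timestamp and the user@host field as well. The field names (`timestamp`, `user`, `host`), the `note` and `ident` helpers, and the assumption that an optional parenthesized note may sit between the quoted name and the user are my own guesses based on the three sample lines, not part of any documented log format.

```python
from collections import defaultdict

from pyparsing import (Group, OneOrMore, Optional, QuotedString, Suppress,
                       Word, alphanums, alphas, nestedExpr, nums, restOfLine)

S = '''
7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1
7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS ) Albahraj@nabwmps3 (License server system does not support this feature. (-18,327))
7:08:21 (slbfd) OUT: "OFM32" Albahraj@nabwmps3'''

# Grammar: same overall shape as before, but with more fields named.
num = Word(nums).setParseAction(lambda t: int(t[0]))  # digits -> int
colon = Suppress(":")
timestamp = Group(num + colon + num + colon + num)("timestamp")
label = Suppress("(slbfd)")
flag = Word(alphas)("flag") + colon
name = QuotedString('"')("name")
# Assumption: an optional parenthesized note may precede the user field.
note = Optional(Suppress(nestedExpr()))
# Assumption: user and host consist of letters, digits, '.', '_' and '-'.
ident = Word(alphanums + "._-")
user_at_host = ident("user") + Suppress("@") + ident("host")
line = (timestamp + label + flag + name + note + user_at_host
        + Suppress(restOfLine))
grammar = OneOrMore(Group(line))

parsed = grammar.parseString(S)

# Query 1: the same IN/OUT tally as before.
counts = defaultdict(int)
# Query 2: a flat event list with the time as seconds since midnight.
events = []
for x in parsed:
    h, m, s = x["timestamp"]
    seconds = 3600 * h + 60 * m + s
    events.append((seconds, x["flag"], x["name"], x["user"], x["host"]))
    if x["flag"] == "IN":
        counts[x["name"]] += 1
    elif x["flag"] == "OUT":
        counts[x["name"]] -= 1

for event in events:
    print(event)
for key in counts:
    print(key, counts[key])
```

Both queries read from the same parsed records, so adding a third one (say, grouping events by host) would not require touching the grammar again.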