Parsing a structured text file elegantly

rus*_*cle 20 ruby python perl text-parsing

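I need to parse transcripts of live chat conversations. My first thought on seeing the file was to throw regular expressions at the problem, but I was wondering what other approaches people have used.

I put "elegant" in the title because I have found before that this kind of task can become hard to maintain when it relies on regular expressions alone.

The transcripts are generated by www.providesupport.com and emailed to an account; I then extract the plain-text transcript attachment from the email.

The reason for parsing the file is to extract the conversation text for later use, and also to identify the visitor and operator names so that the information can be made available through a CRM.

Here is an example of a transcript file: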

Chat Transcript

Visitor: Random Website Visitor 
Operator: Milton
Company: Initech
Started: 16 Oct 2008 9:13:58
Finished: 16 Oct 2008 9:45:44

Random Website Visitor: Where do i get the cover sheet for the TPS report?
* There are no operators available at the moment. If you would like to leave a message, please type it in the input field below and click "Send" button
* Call accepted by operator Milton. Currently in room: Milton, Random Website Visitor.
Milton: Y-- Excuse me. You-- I believe you have my stapler?
Random Website Visitor: I really just need the cover sheet, okay?
Milton: it's not okay because if they take my stapler then I'll, I'll, I'll set the building on fire...
Random Website Visitor: oh i found it, thanks anyway.
* Random Website Visitor is now off-line and may not reply. Currently in room: Milton.
Milton: Well, Ok. But… that's the last straw.
* Milton has left the conversation. Currently in room:  room is empty.

Visitor Details
---------------
Your Name: Random Website Visitor
Your Question: Where do i get the cover sheet for the TPS report?
IP Address: 255.255.255.255
Host Name: 255.255.255.255
Referrer: Unknown
Browser/OS: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727)

小智 12

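No, and in fact, for the specific type of task you describe, I doubt there is a "cleaner" way to do it than regular expressions. It looks like your files have embedded line breaks, so what we typically do here is make the line the unit of decomposition and apply per-line regexes. Alongside that, you create a small state machine and use the regex matches to trigger transitions in it. That way you know where you are in the file and what kind of character data to expect. Also, consider using named capture groups and loading the regexes from an external file. Then, if the format of your transcript changes, it is a simple matter of tweaking the regexes rather than writing new parse-specific code.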
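The line-by-line state machine described above could look roughly like this in Python. This is only a sketch under a few assumptions: the state names, the `PATTERNS` table and the `parse_transcript` function are made up for illustration, speakers are limited to the Visitor and Operator named in the header, and the sample is an abridged version of the transcript from the question with one invented line break to show how a continuation line is handled.

```python
import re

# One pattern per kind of line, with named capture groups. Keeping them in a
# single table means they could just as well be loaded from an external file.
PATTERNS = {
    "field": re.compile(r"^(?P<name>[^:*][^:]*):\s*(?P<value>.*?)\s*$"),
    "server": re.compile(r"^\*\s+(?P<text>.*?)\s*$"),
}


def parse_transcript(text):
    """Return (header, messages, details) for one transcript."""
    header, messages, details = {}, [], {}
    state = "start"
    for line in text.splitlines():
        stripped = line.strip()
        # Transitions are triggered by marker lines and blank lines.
        if state == "start":
            if stripped == "Chat Transcript":
                state = "header"
            continue
        if stripped == "Visitor Details":
            state = "details"
            continue
        if not stripped:
            if state == "header" and header:
                state = "conversation"
            continue

        server = PATTERNS["server"].match(stripped)
        field = PATTERNS["field"].match(stripped)
        if state == "header" and field:
            header[field["name"]] = field["value"]
        elif state == "details" and field:
            details[field["name"]] = field["value"]
        elif state == "conversation":
            speakers = (header.get("Visitor"), header.get("Operator"))
            if server:
                messages.append(("Server", server["text"]))
            elif field and field["name"] in speakers:
                messages.append((field["name"], field["value"]))
            elif messages:
                # Anything else is treated as a continuation of the previous
                # message (an embedded line break).
                who, said = messages[-1]
                messages[-1] = (who, said + "\n" + stripped)
    return header, messages, details


SAMPLE = """\
Chat Transcript

Visitor: Random Website Visitor
Operator: Milton

Random Website Visitor: I really just need the cover sheet, okay?
* Random Website Visitor is now off-line and may not reply. Currently in room: Milton.
Milton: Well, Ok.
But that's the last straw.

Visitor Details
---------------
Your Name: Random Website Visitor
IP Address: 255.255.255.255
"""

header, messages, details = parse_transcript(SAMPLE)
print(header["Operator"], len(messages), details["IP Address"])
```

Because the current state decides which patterns are even considered, a colon inside a server message or a chat line cannot be mistaken for a header field.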


JDr*_*ago 11

With Perl, you can use Parse::RecDescent.

It is simple, and your grammar will stay maintainable later on.


use*_*714 6

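You might want to consider a full parser generator.

Regular expressions are good for searching text for small substrings, but they are woefully underpowered if you are really interested in parsing the entire file into meaningful data.

They are especially insufficient if the context of the substring matters.

Most people throw regexes at everything because that is what they know. They have never learned any parser-generating tools, and they end up hand-coding a lot of the production-rule composition and semantic-action handling that a parser generator gives you for free.

Regular expressions are great and all, but if you need a parser, they are no substitute.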


jfs*_*jfs 6

Here are two parsers based on the lepl parser generator library. They both produce the same result.

from pprint import pprint
from lepl import AnyBut, Drop, Eos, Newline, Separator, SkipTo, Space

# field = name , ":" , value
name, value = AnyBut(':\n')[1:,...], AnyBut('\n')[::'n',...]    
with Separator(~Space()[:]):
    field = name & Drop(':') & value & ~(Newline() | Eos()) > tuple

header_start   = SkipTo('Chat Transcript' & Newline()[2])
header         = ~header_start & field[1:] > dict
server_message = Drop('* ') & AnyBut('\n')[:,...] & ~Newline() > 'Server'
conversation   = (server_message | field)[1:] > list
footer_start   = 'Visitor Details' & Newline() & '-'*15 & Newline()
footer         = ~footer_start & field[1:] > dict
chat_log       = header & ~Newline() & conversation & ~Newline() & footer

pprint(chat_log.parse_file(open('chat.log')))

A stricter parser:

from functools import reduce  # reduce is not a builtin on Python 3
from pprint import pprint
from lepl import And, Drop, Newline, Or, Regexp, SkipTo

def Field(name, value=Regexp(r'\s*(.*?)\s*?\n')):
    """'name , ":" , value' matcher"""
    return name & Drop(':') & value > tuple

Fields = lambda names: reduce(And, map(Field, names))

header_start   = SkipTo(Regexp(r'^Chat Transcript$') & Newline()[2])
header_fields  = Fields("Visitor Operator Company Started Finished".split())
server_message = Regexp(r'^\* (.*?)\n') > 'Server'
footer_fields  = Fields(("Your Name, Your Question, IP Address, "
                         "Host Name, Referrer, Browser/OS").split(', '))

with open('chat.log') as f:
    # parse header to find Visitor and Operator's names
    headers, = (~header_start & header_fields > dict).parse_file(f)
    # only Visitor, Operator and Server may take part in the conversation
    message = reduce(Or, [Field(headers[name])
                          for name in "Visitor Operator".split()])
    conversation = (message | server_message)[1:]
    messages, footers = ((conversation > list)
                         & Drop('\nVisitor Details\n---------------\n')
                         & (footer_fields > dict)).parse_file(f)

pprint((headers, messages, footers))

Output:

({'Company': 'Initech',
  'Finished': '16 Oct 2008 9:45:44',
  'Operator': 'Milton',
  'Started': '16 Oct 2008 9:13:58',
  'Visitor': 'Random Website Visitor'},
 [('Random Website Visitor',
   'Where do i get the cover sheet for the TPS report?'),
  ('Server',
   'There are no operators available at the moment. If you would like to leave a message, please type it in the input field below and click "Send" button'),
  ('Server',
   'Call accepted by operator Milton. Currently in room: Milton, Random Website Visitor.'),
  ('Milton', 'Y-- Excuse me. You-- I believe you have my stapler?'),
  ('Random Website Visitor', 'I really just need the cover sheet, okay?'),
  ('Milton',
   "it's not okay because if they take my stapler then I'll, I'll, I'll set the building on fire..."),
  ('Random Website Visitor', 'oh i found it, thanks anyway.'),
  ('Server',
   'Random Website Visitor is now off-line and may not reply. Currently in room: Milton.'),
  ('Milton', "Well, Ok. But… that's the last straw."),
  ('Server',
   'Milton has left the conversation. Currently in room:  room is empty.')],
 {'Browser/OS': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727)',
  'Host Name': '255.255.255.255',
  'IP Address': '255.255.255.255',
  'Referrer': 'Unknown',
  'Your Name': 'Random Website Visitor',
  'Your Question': 'Where do i get the cover sheet for the TPS report?'})


Gre*_*reg 5

Build a parser? I can't tell whether your data is regular enough for that, but it may be worth looking into.


dal*_*ons 4

Using multi-line, commented regexes can mitigate the maintenance problem somewhat. Try to avoid the one-line super regex!
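As a rough illustration of a multi-line, commented regex, here is a Python sketch using `re.VERBOSE` (Ruby offers a similar extended mode through the `/x` flag). The `HEADER_FIELD` name and the choice of field names are assumptions taken from the sample transcript, not anything prescribed by the transcript format.

```python
import re

# The same pattern could be written on one line; re.VERBOSE lets each part
# carry its own comment. Whitespace in the pattern is ignored outside of
# character classes, so blanks are matched explicitly with [ \t].
HEADER_FIELD = re.compile(
    r"""
    ^(?P<name>Visitor|Operator|Company|Started|Finished)  # known field name
    :                                                     # separator
    [ \t]*                                                # optional padding
    (?P<value>.*?)                                        # the field value
    [ \t]*$                                               # trailing blanks
    """,
    re.VERBOSE | re.MULTILINE,
)

text = "Visitor: Random Website Visitor \nOperator: Milton\nCompany: Initech\n"
fields = {m["name"]: m["value"] for m in HEADER_FIELD.finditer(text)}
print(fields)
```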

Also, consider breaking the regexes down into individual tasks, one for each "thing" you want to get. For example:

visitor  = text[/Visitor:(.*)/, 1]   # String#[] with an index returns that capture group
operator = text[/Operator:(.*)/, 1]
body     = text[/whatever..../]      # placeholder pattern

instead of

text.match(/Visitor:(.*)\nOperator:(.*)...whatever to giant regex/m) do
  visitor = $1
  operator = $2
  # etc.
end

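This makes it easy to change how any particular item is parsed. As for parsing a file with many "chat blocks", just use one simple regex that matches a single chat block, iterate over the text, and pass the match data from each block on to your group of other matchers.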

This will obviously have a performance cost, but unless you are processing enormous files I wouldn't worry about it.
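The "one regex per chat block, then a group of small matchers" idea might be sketched like this in Python. It assumes that every transcript in the file starts with a line reading exactly `Chat Transcript`; the names `CHAT_BLOCK`, `MATCHERS`, `extract` and `parse_all` are hypothetical, and the two-block sample text is made up.

```python
import re

# Assumption: each block begins with a "Chat Transcript" line. The lookahead
# ends a block just before the next one starts, or at the end of the text.
CHAT_BLOCK = re.compile(
    r"^Chat Transcript$.*?(?=^Chat Transcript$|\Z)",
    re.MULTILINE | re.DOTALL,
)

# One small matcher per item of interest, so each can change independently.
MATCHERS = {
    "visitor": re.compile(r"^Visitor:[ \t]*(.*?)[ \t]*$", re.MULTILINE),
    "operator": re.compile(r"^Operator:[ \t]*(.*?)[ \t]*$", re.MULTILINE),
}


def extract(block):
    """Run every matcher against a single chat block."""
    result = {}
    for key, pattern in MATCHERS.items():
        found = pattern.search(block)
        result[key] = found.group(1) if found else None
    return result


def parse_all(text):
    """Split the text into chat blocks and extract the items from each."""
    return [extract(m.group(0)) for m in CHAT_BLOCK.finditer(text)]


text = (
    "Chat Transcript\n\nVisitor: A Visitor\nOperator: Milton\n\n"
    "Chat Transcript\n\nVisitor: Another Visitor\nOperator: Peter\n"
)
chats = parse_all(text)
print(chats)
```

Each block is scanned once per matcher, which is where the extra cost comes from; for transcript-sized input that should rarely matter.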