bok*_*kov 5 text nlp semantic-markup searchable
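I have some plain-text, loosely structured screenplays, formatted like the example at the end of this post. I would like to parse each one into some format where:
The most obvious approach I can think of is to use sed, perl, or php to wrap each block in a div tag, with classes representing the character, the location, and whether the block is a stage direction or dialogue. I would then open the result as a web page and use jQuery to pull out whatever I'm interested in. But that sounds like a roundabout way of doing it, and maybe it only seems like a good idea because these are the tools I'm accustomed to. I'm sure this is a problem that comes up often, so can anyone recommend a more efficient workflow that can be used on a Linux box? Thanks.
Here is some sample input: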
SOMEWHERE CORPORATION - OPTIONAL COMMENT
A guy named BOB is sitting at his computer.
BOB
Mmmm. Stackoverflow. I like.
Footsteps are heard approaching.
ALICE
Where's that report you said you'd have for me?
Closeup of clock ticking.
BOB (looking up)
Huh? What?
ALICE
Some more dialogue.
Some more stage directions.
Here is what the sample output would look like:
<div class='scene somewhere_corporation'>
<div class='comment'>OPTIONAL COMMENT</div>
<div class='direction'>A guy named BOB is sitting at his computer.</div>
<div class='dialogue bob'>Mmmm. Stackoverflow. I like.</div>
<div class='direction'>Footsteps are heard approaching.</div>
<div class='dialogue alice'>Where's that report you said you'd have for me?</div>
<div class='direction'>Closeup of clock ticking.</div>
<div class='comment bob'>looking up</div>
<div class='dialogue bob'>Huh? What?</div>
<div class='dialogue alice'>Some more dialogue.</div>
<div class='direction'>Some more stage directions.</div>
</div>
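I'm using the DOM as an example, but again, only because it's something I understand. If, as I suspect, rolling your own regular expressions and jQuery is not the best practice, I'm open to whatever the best practice is for this type of text-processing task. Thanks.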
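For illustration, here is a minimal Python sketch of one alternative to the sed-plus-jQuery route: parse the script once into plain records and query those directly, generating the div markup only if it is still wanted. The rules are assumptions inferred from the single sample above, not a screenplay standard: a scene heading is an upper-case line containing " - ", a speaker cue is an upper-case name with an optional parenthetical, and exactly one line of dialogue follows each cue. The names `parse`, `to_html`, and `slug` are hypothetical helpers invented for this sketch.

```python
import re
from html import escape

# Assumed line shapes, inferred from the sample input only.
SCENE_RE = re.compile(r"([A-Z][A-Z0-9 ]*?)\s+-\s+(.+)")       # LOCATION - COMMENT
CUE_RE = re.compile(r"([A-Z][A-Z0-9 ]*?)\s*(?:\((.*)\))?")    # NAME or NAME (note)


def slug(name):
    """Turn 'SOMEWHERE CORPORATION' into 'somewhere_corporation'."""
    return re.sub(r"\W+", "_", name.strip().lower())


def parse(text):
    """Return a list of (kind, who, text) tuples, one per block."""
    records = []
    speaker = None  # set while the next line is expected to be dialogue
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if speaker is not None:
            # Assumption: a cue is followed by exactly one line of dialogue.
            records.append(("dialogue", speaker, line))
            speaker = None
            continue
        scene = SCENE_RE.fullmatch(line)
        if scene:
            records.append(("scene", slug(scene.group(1)), scene.group(2)))
            continue
        cue = CUE_RE.fullmatch(line)
        if cue:
            speaker = slug(cue.group(1))
            if cue.group(2):
                records.append(("comment", speaker, cue.group(2)))
            continue
        records.append(("direction", None, line))
    return records


def to_html(records):
    """Render the records as nested divs shaped like the sample output."""
    out = []
    scene_open = False
    for kind, who, text in records:
        if kind == "scene":
            if scene_open:
                out.append("</div>")
            out.append(f"<div class='scene {who}'>")
            scene_open = True
            out.append(f"<div class='comment'>{escape(text, quote=False)}</div>")
        else:
            cls = kind if who is None else f"{kind} {who}"
            out.append(f"<div class='{cls}'>{escape(text, quote=False)}</div>")
    if scene_open:
        out.append("</div>")
    return "\n".join(out)


if __name__ == "__main__":
    sample = "OFFICE - DAY\nBOB (yawning)\nHello.\nA phone rings.\n"
    records = parse(sample)
    # Query the records directly, with no DOM or jQuery involved.
    print([t for kind, who, t in records if kind == "dialogue" and who == "bob"])
    print(to_html(records))
```

Because the records are ordinary tuples, questions such as "all of Bob's lines" become one-line list comprehensions. Real scripts with multi-line speeches or headings that lack a comment would need looser rules than the ones assumed here.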