Split multi-paragraph documents into paragraph-numbered sentences

use*_*276 4 regex perl nlp text-segmentation

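I have a list of well-parsed, multi-paragraph documents (all paragraphs are separated by \n\n and sentences by "."). I would like to split them into sentences, each prefixed with a number giving the index of its paragraph within the document. For example, the (two-paragraph) input is: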

First sentence of the 1st paragraph. Second sentence of the 1st paragraph. \n\n 

First sentence of the 2nd paragraph. Second sentence of the 2nd paragraph. \n\n

Ideally, the output should be:

1 First sentence of the 1st paragraph. 

1 Second sentence of the 1st paragraph. 

2 First sentence of the 2nd paragraph.

2 Second sentence of the 2nd paragraph.

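I am familiar with the Lingua::Sentences package in Perl, which can split documents into sentences. However, it does not support paragraph numbering. So I am wondering whether there is another way to achieve the above (the documents contain no abbreviations). Any help is greatly appreciated. Thanks!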

TLP*_*TLP 5

If you can rely on the period . as the delimiter, you can do this:

perl -00 -nlwe 'print qq($. $_) for split /(?<=\.)/' yourfile.txt

Explanation:

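  • -00 sets the input record separator to the empty string, i.e. paragraph mode.
  • -l sets the output record separator to the input record separator, which in this case translates to two newlines. Combined with -n, it also chomps each input record.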

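Then we simply split on periods with a lookbehind assertion and print each sentence, preceded by the input record number $. (in paragraph mode, each record is one paragraph, so this is the paragraph number).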