Using the command line and regular expressions to determine which words start sentences

bas*_*sil 0 regex perl grep

I have this text:

 This is a test. This is only a test! If there were an emergency, then Information would be provided for you.

I want to be able to determine which words start sentences. What I have so far is:

 $ cat <FILE> | perl -pe 's/[\s.?!]/\n/g;'

This just strips the whitespace and sentence punctuation, replacing each character with a newline, which gives me:

 This 
 is 
 a 
 test 

 This
 is
 only
 a
 test

 If
 there
 were
 an
 emergency,
 then
 Information
 would
 be
 provided
 for
 you

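From here I could somehow extract the words that are preceded by nothing (the start of the file) or by a blank line, but I'm not sure how to do that.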

sim*_*que 6

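If your Perl is at least version 5.22.1 (or 5.22.0, provided this case is not affected by the bug described here), you can use a sentence boundary in the regular expression.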

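use feature 'say';

# Read the text from files named on the command line, or from standard input.
while (my $line = <>) {
    # \b{sb} matches at a sentence boundary; capture the word that follows it.
    foreach my $word ($line =~ m/\b{sb}(\w+)/g) {
        say $word;
    }
}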

Or, as a one-liner:

perl -nE 'say for /\b{sb}(\w+)/g'

When run on the sample text, the output is:

This
This
If

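It uses \b{sb}, which matches at a sentence boundary. You can read a tutorial about it on brian d foy's blog. The \b{} construct is called a Unicode boundary and is described in perlrebackslash.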