I have this text:
This is a test. This is only a test! If there were an emergency, then Information would be provided for you.
I want to be able to determine which words start a sentence. What I have right now is:
$ cat <FILE> | perl -pe 's/[\s.?!]/\n/g;'
This just strips the whitespace and sentence-ending punctuation and replaces each character with a newline, giving me:
This
is
a
test

This
is
only
a
test

If
there
were
an
emergency,
then
Information
would
be
provided
for
you
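From here I could somehow extract the words that are preceded by either nothing (the start of the file) or a blank line, but I'm not sure how to do that.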
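If your Perl is at least version 5.22.1 (or 5.22.0 and your case is not affected by the bug described here), you can use the sentence boundary in a regular expression.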
use feature 'say';
# The match runs against $_; each capture is the first word after a sentence boundary.
foreach my $word (m/\b{sb}(\w+)/g) {
    say $word;
}
Or, as a one-liner:
perl -nE 'say for /\b{sb}(\w+)/g'
When run on the example text, the output is:
This
This
If
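This uses \b{sb}, which is the sentence boundary. You can read a tutorial about it on brian d foy's blog. The \b{} construct is called a Unicode boundary and is described in perlrebackslash.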