new*_*new 1 perl text-segmentation
I'm programming in Perl. I need to read a paragraph and print each sentence on its own line.
Does anyone know how to do this?
Here is my code:
#! /C:/Perl64/bin/perl.exe
use utf8;
if (! open(INPUT, '< text1.txt')){
    die "cannot open input file: $!";
}
if (! open(OUTPUT, '> output.txt')){
    die "cannot open output file: $!";
}
# Send every following print to output.txt
select OUTPUT;
while (<INPUT>){
    print "$_";
}
close INPUT;
close OUTPUT;
select STDOUT;
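I would not handle the file names myself; I would let Perl do that (the script below reads whatever files are named on the command line through <> and prints to standard output).
The following is pretty rough on several levels, and doing the job completely is undoubtedly hard.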
#!/usr/bin/env perl
use strict;
use warnings;
use Lingua::EN::Sentence qw(get_sentences);

sub normalize
{
    my($str) = @_;
    $str =~ s/\n/ /gm;
    $str =~ s/\s\s+/ /gm;
    return $str;
}

{
    local $/ = "\n\n";
    while (<>)
    {
        chomp;
        print "Para: [[$_]]\n";
        my @sentences = split m/(?<=[.!?])\s+/m, $_;
        foreach my $sentence (@sentences)
        {
            $sentence = normalize $sentence;
            print "Ad Hoc Sentence: $sentence\n";
        }
        my $sref = get_sentences($_);
        foreach my $sentence (@$sref)
        {
            $sentence = normalize $sentence;
            print "Lingua Sentence: $sentence\n";
        }
    }
}
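The split regex looks for one or more whitespace characters preceded by a full stop (period), exclamation mark or question mark, and it matches across multiple lines. The look-behind (?<=[.!?]) means that the punctuation is kept with the sentence. The normalize function simply flattens newlines into spaces and reduces runs of whitespace to a single space. (Note that this does not properly recognize a parenthetical sentence.) The preceding parenthetical would be treated as part of the previous sentence, because its . is followed by a closing parenthesis rather than a space.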
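The parenthetical limitation is easy to reproduce in isolation. The sketch below uses core Perl only; `split_basic` repeats the split from the script, while `split_with_closers` is a hypothetical variant (not part of the original answer) that also breaks when a single closing parenthesis or quote sits between the punctuation and the whitespace. Both sub names and the sample string are made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Ad hoc splitter from the answer: break on whitespace that directly
# follows . ! or ? (the look-behind keeps the punctuation).
sub split_basic {
    my ($text) = @_;
    return split m/(?<=[.!?])\s+/, $text;
}

# Hypothetical variant: also break when exactly one closing parenthesis
# or quote character sits between the punctuation and the whitespace.
# Two separate look-behinds are used so that each has a fixed length.
sub split_with_closers {
    my ($text) = @_;
    return split m/(?:(?<=[.!?])|(?<=[.!?][)"']))\s+/, $text;
}

my $text = q{It works. (Mostly, that is.) This follows.};

print "basic:   $_\n" for split_basic($text);
print "closers: $_\n" for split_with_closers($text);
```

With this input the basic split yields two pieces, because the parenthetical stays glued to the sentence after it, and the variant yields three. The variant is still only a heuristic; nested or repeated closing characters would need further cases.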
Sample input:

This is a paragraph with more than one sentence in it. How many will be
determined later. Mr. A. P. McDowney has been rather busy. This
incomplete sentence will still be counted as one

This is the second paragraph. With three sentences in it, it is a lot
less exciting than the first paragraph, but the middle sentence extends
over multiple lines and there is some wonky spacing too.
But 'tis time to finish.

Sample output:

Para: [[This is a paragraph with more than one sentence in it. How many will be
determined later. Mr. A. P. McDowney has been rather busy. This
incomplete sentence will still be counted as one]]
Ad Hoc Sentence: This is a paragraph with more than one sentence in it.
Ad Hoc Sentence: How many will be determined later.
Ad Hoc Sentence: Mr.
Ad Hoc Sentence: A.
Ad Hoc Sentence: P.
Ad Hoc Sentence: McDowney has been rather busy.
Ad Hoc Sentence: This incomplete sentence will still be counted as one
Lingua Sentence: This is a paragraph with more than one sentence in it.
Lingua Sentence: How many will be determined later.
Lingua Sentence: Mr. A. P. McDowney has been rather busy.
Lingua Sentence: This incomplete sentence will still be counted as one
Para: [[This is the second paragraph. With three sentences in it, it is a lot
less exciting than the first paragraph, but the middle sentence extends
over multiple lines and there is some wonky spacing too.
But 'tis time to finish.
]]
Ad Hoc Sentence: This is the second paragraph.
Ad Hoc Sentence: With three sentences in it, it is a lot less exciting than the first paragraph, but the middle sentence extends over multiple lines and there is some wonky spacing too.
Ad Hoc Sentence: But 'tis time to finish.
Lingua Sentence: This is the second paragraph.
Lingua Sentence: With three sentences in it, it is a lot less exciting than the first paragraph, but the middle sentence extends over multiple lines and there is some wonky spacing too.
Lingua Sentence: But 'tis time to finish.
Note how Lingua::EN::Sentence manages to handle 'Mr. A. P. McDowney' better than the simple regex does.
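If installing the module is not an option, the ad hoc regex can be nudged part of the way with negative look-behinds. The following is a minimal sketch under stated assumptions, not a description of how Lingua::EN::Sentence works internally: the short title list (Mr., Mrs., Dr.) and the single-capital-initial rule are illustrative choices, and `$boundary` and `split_sentences` are hypothetical names.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical boundary pattern: whitespace after . ! or ?, unless the
# text just before it is a listed title or a single capital initial.
# Each negative look-behind has a fixed length.
my $boundary = qr/
    (?<=[.!?])        # keep the punctuation with the sentence
    (?<!\bMr\.)       # do not break after "Mr."
    (?<!\bMrs\.)      # ... or "Mrs."
    (?<!\bDr\.)       # ... or "Dr."
    (?<!\b[A-Z]\.)    # ... or an initial such as "A."
    \s+
/x;

sub split_sentences {
    my ($text) = @_;
    return split $boundary, $text;
}

my $para = 'How many will be determined later. '
         . 'Mr. A. P. McDowney has been rather busy. '
         . 'This incomplete sentence will still be counted as one';

print "Sentence: $_\n" for split_sentences($para);
```

On this paragraph the sketch keeps 'Mr. A. P. McDowney has been rather busy.' in one piece and returns three sentences. The trade-off is that a sentence that really does end in a single capital letter (for example one ending in 'vitamin C.') will no longer be split, which is one reason a maintained module is the safer choice for real text.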