teh*_*ter 19 text-processing nlp heuristics corpus stripping
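I've tried various methods to strip the license from Project Gutenberg texts, for use as a corpus in a language-learning project, but I can't come up with an unsupervised, reliable approach. The best heuristic I've found so far is to strip the first 28 lines and the last 398, which works for a large number of the texts. Any suggestions for ways to strip this text automatically would be very useful (it is very similar across many texts, but differs slightly in each case, and there are a few different templates as well), as would suggestions for how to verify that the text has been stripped accurately.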
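For reference, the fixed-offset heuristic described above might look something like this in Perl. The sub name `strip_fixed` and the sample data are made up for illustration; only the offsets 28 and 398 come from the question. This is a rough sketch, and it returns nothing when the text is too short rather than guessing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Drop the first $head and the last $tail lines of a text.
# Returns an empty list if the text is too short to contain both.
sub strip_fixed {
    my ($lines, $head, $tail) = @_;
    my $last = $#{$lines} - $tail;    # index of the last line to keep
    return () if $last < $head;
    return @{$lines}[$head .. $last];
}

# Small in-memory demo: 2 header lines, 3 body lines, 1 footer line.
my @sample = (
    "header 1\n", "header 2\n",
    "body a\n", "body b\n", "body c\n",
    "footer\n",
);
print strip_fixed(\@sample, 2, 1);

# For a real file, with the offsets from the question:
#   my @lines = <STDIN>;
#   print strip_fixed(\@lines, 28, 398);
```

Because the header and footer lengths vary between templates, a sketch like this can only ever be a baseline to compare smarter approaches against.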
hip*_*ail 11
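For years I'd also wanted a tool to strip Project Gutenberg headers and footers, so that I could play with natural language processing without contaminating the analysis with boilerplate mixed in with the etext. After reading this question I finally pulled my finger out and wrote a Perl filter that you can pipe into any other tool.
It's built as a state machine using per-line regexes. It's written to be easy to understand, since speed is not an issue at the typical size of etexts. So far it works on the couple of dozen etexts I have here, but in the wild there are sure to be many more variations that need to be added. Hopefully the code is clear enough that anybody can add to it: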
#!/usr/bin/perl
# stripgutenberg.pl < in.txt > out.txt
#
# designed for piping
# Written by Andrew Dunbar (hippietrail), released into the public domain, Dec 2010

use strict;
use warnings;

my $debug = 0;              # set to 1 to trace state changes and show skipped lines
my $state = 'beginning';    # current state of the line-by-line state machine
my $print = 0;              # true while we are inside the etext body
my $printed = 0;            # number of lines printed so far

# while (<>) tests definedness, so a final line consisting of "0" is not lost
while (<>) {
    # strip UTF-8 BOM
    if ($. == 1 && index($_, "\xef\xbb\xbf") == 0) {
        $_ = substr($_, 3);
    }

    if ($state eq 'beginning') {
        if (/^(The Project Gutenberg [Ee]Book( of|,)|Project Gutenberg's )/) {
            $state = 'normal pg header';
            $debug && print "state: beginning -> normal pg header\n";
            $print = 0;
        } elsif (/^$/) {
            $state = 'beginning blanks';
            $debug && print "state: beginning -> beginning blanks\n";
        } else {
            die "unrecognized beginning: $_";
        }
    } elsif ($state eq 'normal pg header') {
        if (/^\*\*\*\ ?START OF TH(IS|E) PROJECT GUTENBERG EBOOK,? /) {
            $state = 'end of normal header';
            $debug && print "state: normal pg header -> end of normal header\n";
        } else {
            # body of normal pg header
        }
    } elsif ($state eq 'end of normal header') {
        if (/^(Produced by|Transcribed from)/) {
            $state = 'post header';
            $debug && print "state: end of normal header -> post header\n";
        } elsif (/^$/) {
            # blank lines
        } else {
            $state = 'etext body';
            $debug && print "state: end of normal header -> etext body\n";
            $print = 1;
        }
    } elsif ($state eq 'post header') {
        if (/^$/) {
            $state = 'blanks after post header';
            $debug && print "state: post header -> blanks after post header\n";
        } else {
            # multiline Produced / Transcribed
        }
    } elsif ($state eq 'blanks after post header') {
        if (/^$/) {
            # more blank lines
        } else {
            $state = 'etext body';
            $debug && print "state: blanks after post header -> etext body\n";
            $print = 1;
        }
    } elsif ($state eq 'beginning blanks') {
        # Project Gutenberg of Australia texts start with blank lines
        if (/<!-- #INCLUDE virtual=\"\/include\/ga-books-texth\.html\" -->/) {
            $state = 'header include';
            $debug && print "state: beginning blanks -> header include\n";
        } elsif (/^Title: /) {
            $state = 'aus header';
            $debug && print "state: beginning blanks -> aus header\n";
        } elsif (/^$/) {
            # more blanks
        } else {
            die "unexpected stuff after beginning blanks: $_";
        }
    } elsif ($state eq 'header include') {
        if (/^$/) {
            # blanks after header include
        } else {
            $state = 'aus header';
            $debug && print "state: header include -> aus header\n";
        }
    } elsif ($state eq 'aus header') {
        if (/^To contact Project Gutenberg of Australia go to http:\/\/gutenberg\.net\.au$/) {
            $state = 'end of aus header';
            $debug && print "state: aus header -> end of aus header\n";
        } elsif (/^A Project Gutenberg of Australia eBook$/) {
            $state = 'end of aus header';
            $debug && print "state: aus header -> end of aus header\n";
        }
    } elsif ($state eq 'end of aus header') {
        if (/^((Title|Author): .*)?$/) {
            # title, author, or blank line
        } else {
            $state = 'etext body';
            $debug && print "state: end of aus header -> etext body\n";
            $print = 1;
        }
    } elsif ($state eq 'etext body') {
        # here's the stuff
        if (/^<!-- #INCLUDE virtual="\/include\/ga-books-textf\.html" -->$/) {
            $state = 'footer';
            $debug && print "state: etext body -> footer\n";
            $print = 0;
        } elsif (/^(\*\*\* ?)?end of (the )?project/i) {
            $state = 'footer';
            $debug && print "state: etext body -> footer\n";
            $print = 0;
        }
    } elsif ($state eq 'footer') {
        # nothing more of interest
    } else {
        die "unknown state '$state'";
    }

    if ($print) {
        print;
        ++$printed;
    } else {
        # in debug mode, show the lines that were stripped
        $debug && print "## $_";
    }
}
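
The question also asks how to verify the result. One rough way is to scan the filter's output for phrases that should only occur in the boilerplate. A minimal sketch follows; the sub name `leftover_boilerplate` and the phrase list are assumptions for illustration and are not part of the script above. A hit is only a hint for manual review, since a book can legitimately mention these phrases.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Phrases that normally appear only in Gutenberg boilerplate
# (an assumed list; extend it as new templates turn up).
my @suspicious = (
    qr/Project Gutenberg/i,
    qr/\*\*\* ?(START|END) OF TH(IS|E) PROJECT/i,
    qr/^Produced by /,
);

# Return [line number, line] pairs for every line matching a suspicious phrase.
sub leftover_boilerplate {
    my @lines = @_;
    my @hits;
    for my $i (0 .. $#lines) {
        for my $re (@suspicious) {
            if ($lines[$i] =~ $re) {
                push @hits, [$i + 1, $lines[$i]];
                last;    # one report per line is enough
            }
        }
    }
    return @hits;
}

# Demo on an in-memory sample; for real use, pass in the lines of the stripped file,
# e.g. leftover_boilerplate(<STDIN>).
my @sample = (
    "Chapter 1\n",
    "It was a dark night.\n",
    "End of the Project Gutenberg EBook\n",
);
printf "line %d: %s", @$_ for leftover_boilerplate(@sample);
```

An empty result doesn't prove the strip was exact (body text could have been cut as well), so it is probably worth pairing this with a check that the output is not suspiciously short.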
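You weren't kidding. It's almost as if they were trying to make the job AI-complete. I can only think of two approaches, neither of them perfect.
1) Set up a script in, say, Perl, to tackle the most common patterns (e.g., look for the phrase "produced by", keep going down to the next blank line and cut there), but put in lots of assertions about what's expected (e.g. the next text should be the title or author). That way, when the pattern fails, you'll know it. The first time a pattern fails, do it by hand. The second time, modify the script.
2) Try Amazon's Mechanical Turk.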
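
Approach 1 could be sketched roughly as follows. The sub name `cut_header` and the particular assertions are hypothetical, and the check uses only the `Title:` line of the header, not the author. The point is just that the script dies loudly when a text doesn't match the expected pattern, so you know to handle it by hand or extend the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Find the "Produced by" credit, skip to the next blank line, and cut there.
# Dies whenever the text does not look the way the pattern expects.
sub cut_header {
    my @lines = @_;

    # Locate the credit line.
    my ($credit) = grep { $lines[$_] =~ /^Produced by\b/i } 0 .. $#lines;
    die "no 'Produced by' line found\n" unless defined $credit;

    # Title as recorded in the header metadata (assumed "Title: ..." format).
    my ($title) = map { /^Title: (.+?)\s*$/ ? $1 : () } @lines[0 .. $credit];
    die "no Title: line before the credit\n" unless defined $title;

    # Continue down to the next blank line...
    my $i = $credit;
    $i++ while $i <= $#lines && $lines[$i] !~ /^\s*$/;
    die "no blank line after the credit\n" if $i > $#lines;

    # ...then skip the run of blank lines that follows it.
    $i++ while $i <= $#lines && $lines[$i] =~ /^\s*$/;
    die "nothing left after the header\n" if $i > $#lines;

    # Assertion: the next text should be the title.
    die "body does not start with the title: $lines[$i]"
        unless index(lc($lines[$i]), lc($title)) >= 0;

    return @lines[$i .. $#lines];
}

# In-memory demo with a made-up header.
my @sample = (
    "Title: Example\n", "\n",
    "Produced by A. Volunteer\n", "and friends\n", "\n",
    "EXAMPLE\n", "\n", "CHAPTER I\n",
);
print cut_header(@sample);
```

Each `die` marks a spot where a new template will show up as a failure, not as silently wrong output, which is what makes the "by hand the first time, modify the script the second time" loop workable.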