dar*_*tch 1 scripting perl text lowercase capitalization
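How can I change the following Markov script to treat uppercase and lowercase words as the same?
The whole idea is to improve the quality of the Markov text generator's output.
As it stands, if you feed it 99 lowercase sentences and 1 capitalized sentence, you will almost always find a non-markovized (verbatim) copy of the capitalized sentence in the output.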
# Copyright (C) 1999 Lucent Technologies
# Excerpted from 'The Practice of Programming'
# by Brian W. Kernighan and Rob Pike
# markov.pl: markov chain algorithm for 2-word prefixes
$MAXGEN = 10000;
$NONWORD = "\n";
$w1 = $w2 = $NONWORD; # initial state
while (<>)
{ # read each line of input
    foreach (split)
    {
        push(@{$statetab{$w1}{$w2}}, $_);
        ($w1, $w2) = ($w2, $_); # multiple assignment
    }
}
push(@{$statetab{$w1}{$w2}}, $NONWORD); # add tail
$w1 = $w2 = $NONWORD;
for ($i = 0; $i < $MAXGEN; $i++)
{
    $suf = $statetab{$w1}{$w2}; # array reference
    $r = int(rand @$suf); # @$suf is number of elems
    exit if (($t = $suf->[$r]) eq $NONWORD);
    print "$t\n";
    ($w1, $w2) = ($w2, $t); # advance chain
}
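Nathan Fellman and mobrule are both suggesting a common practice: normalization.
It is often simpler to process data so that it conforms to the expected norms of content and structure before doing the actual computation that is the main goal of the program or subroutine.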
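As a minimal sketch of that idea applied to the original script, the table-building loop can fold each word to lower case before it is stored. The helper names normalize() and build_table() are assumptions made for this illustration; they do not appear in markov.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): fold a word to lower
# case so that "The" and "the" share one state-table entry.
sub normalize {
    my ($word) = @_;
    return lc $word;
}

# Build the 2-word-prefix table from a list of lines, as markov.pl does,
# but normalizing each word first.
sub build_table {
    my @lines   = @_;
    my $NONWORD = "\n";
    my %statetab;
    my ( $w1, $w2 ) = ( $NONWORD, $NONWORD );
    for my $line (@lines) {
        for my $word ( map { normalize($_) } split ' ', $line ) {
            push @{ $statetab{$w1}{$w2} }, $word;
            ( $w1, $w2 ) = ( $w2, $word );
        }
    }
    push @{ $statetab{$w1}{$w2} }, $NONWORD;    # add tail
    return \%statetab;
}

my $table = build_table( "the cat sat", "The cat ran" );

# Both sentences now feed the same ("the", "cat") prefix.
print join( " ", @{ $table->{the}{cat} } ), "\n";    # prints "sat ran"
```

The trade-off is that the generated text is all lower case; restoring capitalization would be a separate step.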
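The Markov chain program was interesting, so I decided to play with it.
Here is a version that lets you control the number of layers in the Markov chain. By changing $DEPTH you can adjust the order of the simulation.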
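To make the effect of the depth concrete, here is a small sketch that lists the "prefix => next word" pairs recorded for a given depth. The helper prefix_pairs() is hypothetical and exists only for this illustration; it uses "-" as the non-word marker, as the script below does.

```perl
use strict;
use warnings;

# Hypothetical helper: list "prefix => next word" pairs for a given depth.
sub prefix_pairs {
    my ( $depth, @words ) = @_;
    my @history = ('-') x $depth;    # '-' is the NONWORD marker
    my @pairs;
    for my $word (@words) {
        push @pairs, join( ' ', @history ) . " => $word";
        push @history, $word;        # slide the window forward by one word
        shift @history;
    }
    return @pairs;
}

print "$_\n" for prefix_pairs( 1, qw(A B C) );
# - => A
# A => B
# B => C

print "$_\n" for prefix_pairs( 3, qw(A B C) );
# - - - => A
# - - A => B
# - A B => C
```

A larger depth gives longer prefixes, so the output follows the corpus more closely but needs more input text to produce any variety.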
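I broke the code into reusable subroutines. You can modify the normalization rules by changing the normalization routines. You can also generate a chain based on a defined set of values.
The code that generates the multi-layer state table was the most interesting bit. I could have used Data::Diver, but I wanted to work it out myself.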
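The trick behind the multi-layer table can be sketched in isolation: keep a reference to the current slot rather than to its value, so that each level autovivifies inside the real table. The name add_path() is an assumption for this sketch; the walk itself mirrors add_word_to_table() in the listing.

```perl
use strict;
use warnings;

# Same walk as add_word_to_table(), written out step by step.
sub add_path {
    my ( $table, $path, $value ) = @_;
    my $slot = \$table;    # reference to the variable holding the hashref
    for my $key (@$path) {
        # Taking a reference to the element creates it, and an undefined
        # $$slot is autovivified into a hash on the way.
        $slot = \${$slot}->{$key};
    }
    push @{$$slot}, $value;    # the leaf is autovivified into an array
    return $table;
}

my %table;
add_path( \%table, [qw(A B)], 'C' );
add_path( \%table, [qw(A B)], 'D' );
add_path( \%table, [qw(A X)], 'Y' );

print join( ',', @{ $table{A}{B} } ), "\n";    # prints "C,D"
```

Copying the value instead (for example `$node = $node->{$key}`) would lose the link to the table as soon as a level is undefined, which is why the slot reference is needed.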
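The word normalization code originally handled only a single word per call; it can now return a list of words to process. Other things, such as serializing the processed corpus and using Getopt::Long for command-line switches, remain to be done. I only did the fun bits.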
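For the two items left undone, a rough sketch might look like the following. The option names (--depth, --maxgen, --save, --load) and the helpers parse_options(), save_table() and load_table() are all assumptions; none of them exist in the script. Storable is my choice of serializer here, not something the answer prescribes; both it and Getopt::Long ship with Perl.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);
use Storable qw(store retrieve);

# Hypothetical command-line switches; the defaults match the script's
# $DEPTH and $MAXGEN values.
sub parse_options {
    my @args = @_;
    my %opt = ( depth => 2, maxgen => 10000 );
    GetOptionsFromArray(
        \@args, \%opt,
        'depth=i', 'maxgen=i', 'save=s', 'load=s',
    ) or die "usage: $0 [--depth N] [--maxgen N] [--save FILE] [--load FILE] [files]\n";
    return ( \%opt, \@args );    # what is left in @args are the corpus files
}

# Save and reload a processed state table so the corpus is parsed only once.
sub save_table {
    my ( $table, $file ) = @_;
    store( $table, $file );
    return;
}

sub load_table {
    my ($file) = @_;
    return retrieve($file);
}

my ( $opt, $files ) = parse_options(@ARGV);
print "depth=$opt->{depth} maxgen=$opt->{maxgen}\n";
```

A saved table is only meaningful for the depth it was built with, so a real implementation would probably store the depth alongside it.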
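Writing this without using objects was a bit of a challenge for me; this really is a good place to make a Markov generator object. I like objects. But I decided to keep the code procedural so that it would retain the spirit of the original.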
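For comparison, an object-based version could be sketched roughly as below. The class name Markov::Sketch and its methods are hypothetical, and for brevity it swaps the nested hashes of the procedural code for a flat hash keyed on the joined prefix, so it is not a drop-in equivalent.

```perl
use strict;
use warnings;

package Markov::Sketch;    # hypothetical class name, not from the answer

my $SEP = "\x1C";          # separator for joined prefixes; assumed absent from words

sub new {
    my ( $class, %args ) = @_;
    my $self = {
        depth   => $args{depth}   // 2,
        nonword => $args{nonword} // "\n",
        table   => {},
    };
    return bless $self, $class;
}

# Record every word of the list under its preceding $depth words.
sub learn {
    my ( $self, @words ) = @_;
    my @history = ( $self->{nonword} ) x $self->{depth};
    for my $word ( @words, $self->{nonword} ) {
        push @{ $self->{table}{ join $SEP, @history } }, $word;
        push @history, $word;
        shift @history;
    }
    return $self;
}

# Return up to $max generated words as a list.
sub generate {
    my ( $self, $max ) = @_;
    my @history = ( $self->{nonword} ) x $self->{depth};
    my @out;
    while ( @out < $max ) {
        my $choices = $self->{table}{ join $SEP, @history } or last;
        my $word = $choices->[ int rand @$choices ];
        last if $word eq $self->{nonword};
        push @out, $word;
        push @history, $word;
        shift @history;
    }
    return @out;
}

package main;

my $markov = Markov::Sketch->new( depth => 2 );
$markov->learn(qw(A B C));
print join( ' ', $markov->generate(10) ), "\n";    # prints "A B C"
```

With a one-sentence corpus every prefix has exactly one continuation, which is why the output above is deterministic.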
Have fun.
#!/usr/bin/perl
use strict;
use warnings;
use IO::Handle;
use constant NONWORD => "-";
my $MAXGEN = 10000;
my $DEPTH = 2;
my %state_table;
process_corpus( \*ARGV, $DEPTH, \%state_table );
generate_markov_chain( \%state_table, $MAXGEN );
sub process_corpus {
    my $fh          = shift;
    my $depth       = shift;
    my $state_table = shift || {};
    # Sliding window holding the last $depth words seen.
    my @history = (NONWORD) x $depth;
    while( my $raw_line = $fh->getline ) {
        my $line = normalize_line($raw_line);
        next unless defined $line;
        # normalize_word() may return zero, one or several words.
        my @words = map normalize_word($_), split /\s+/, $line;
        for my $word ( @words ) {
            next unless defined $word;
            add_word_to_table( $state_table, \@history, $word );
            push @history, $word;
            shift @history;
        }
    }
    # Mark the end of the corpus.
    add_word_to_table( $state_table, \@history, NONWORD );
    return $state_table;
}
# This was the trickiest to write.
# $node has to be a reference to the slot so that
# autovivified items will be retained in the $table.
sub add_word_to_table {
    my $table   = shift;
    my $history = shift;
    my $word    = shift;
    my $node = \$table;
    for( @$history ) {
        $node = \${$node}->{$_};
    }
    push @$$node, $word;
    return 1;
}
# Replace this with anything.
# Return an empty list (or undef) to skip a word.
sub normalize_word {
    my $word = shift;
    # The line has already been upper-cased, so keep only A-Z.
    $word =~ s/[^A-Z]//g;
    return length $word ? $word : ();
}
# Replace this with anything.
# Return undef to skip a line.
sub normalize_line {
    return uc shift;
}
sub generate_markov_chain {
    my $table   = shift;
    my $length  = shift;
    my $history = shift || [];
    my $node = $table;
    # With no starting history, seed it with one NONWORD per table layer.
    unless( @$history ) {
        while(
            ref $node eq ref {}
                and
            exists $node->{NONWORD()}
        ) {
            $node = $node->{NONWORD()};
            push @$history, NONWORD;
        }
    }
    # Use the $length argument rather than the global $MAXGEN.
    for (my $i = 0; $i < $length; $i++) {
        my $word = get_word( $table, $history );
        last if $word eq NONWORD;
        print "$word\n";
        push @$history, $word;
        shift @$history;
    }
    return $history;
}
sub get_word {
    my $table   = shift;
    my $history = shift;
    # Walk down one table layer per word of history.
    for my $step ( @$history ) {
        $table = $table->{$step};
    }
    # Pick one of the recorded successors at random.
    my $word = $table->[ int rand @$table ];
    return $word;
}
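Update:
I fixed the code above to handle multiple words returned from the normalize_word() routine.
To leave case intact and treat punctuation marks as words, replace normalize_line() and normalize_word():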
sub normalize_line {
    return shift;
}
sub normalize_word {
    my $word = shift;
    # Sanitize words to only include letters and ?,.! marks
    $word =~ s/[^A-Z?.,!]//gi;
    # Break the word into multiple words as needed.
    my @words = split /([.?,!])/, $word;
    # return all non-zero length words.
    return grep length, @words;
}
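To see what this replacement does to individual tokens, the short sketch below repeats normalize_word() in condensed form so that it runs on its own, and prints the result for a few sample inputs (the samples are mine).

```perl
use strict;
use warnings;

# Condensed copy of the replacement normalize_word() above.
sub normalize_word {
    my $word = shift;
    $word =~ s/[^A-Z?.,!]//gi;                       # keep letters and ?.,! only
    return grep length, split /([.?,!])/, $word;     # punctuation becomes its own word
}

for my $raw ( 'Hello,', 'world!', "don't", '...' ) {
    print "$raw => ", join( ' | ', normalize_word($raw) ), "\n";
}
# Hello, => Hello | ,
# world! => world | !
# don't => dont
# ... => . | . | .
```

The capturing group in the split pattern is what keeps the punctuation marks in the result; without the parentheses they would be discarded as separators.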
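Another big problem lurking is that I used - as the NONWORD character. If you want to include the hyphen as a punctuation mark, you will need to change the NONWORD constant definition (use constant NONWORD => "-";) on line 8. Just choose something that can never be a word.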
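One way to pick such a value, sketched under the assumption that words always come from splitting a line on whitespace: any string that contains whitespace can never be produced as a word. The original markov.pl uses "\n" for exactly this reason. The helper is_safe_nonword() is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical check, not part of the script: words come from
# split /\s+/, so they never contain whitespace. A marker that does
# contain whitespace therefore cannot collide with a word.
sub is_safe_nonword {
    my ($candidate) = @_;
    return $candidate =~ /\s/ ? 1 : 0;
}

for my $candidate ( [ hyphen => '-' ], [ newline => "\n" ] ) {
    my ( $name, $value ) = @$candidate;
    printf "%-8s %s\n", $name,
        is_safe_nonword($value) ? 'safe' : 'may collide with a word';
}
# hyphen   may collide with a word
# newline  safe
```

In the script this would amount to changing the constant to `use constant NONWORD => "\n";`.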