Perl paragraph n-gram

Gle*_*rry 5 perl n-gram

Suppose I have a sentence of text:

$body = 'the quick brown fox jumps over the lazy dog';

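I want to turn this sentence into a hash of 'keywords', but I want to allow multi-word keywords. I have the following to get the single-word keywords: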

$words{$_}++ for $body =~ m/(\w+)/g;

Once that has run, I have a hash that looks like this:

'the' => 2,
'quick' => 1,
'brown' => 1,
'fox' => 1,
'jumps' => 1,
'over' => 1,
'lazy' => 1,
'dog' => 1

The next step, to get two-word keywords, is the following:

$words{$_}++ for $body =~ m/(\w+ \w+)/g;

But this only gets every "other" pair; the result looks like this:

'the quick' => 1,
'brown fox' => 1,
'jumps over' => 1,
'the lazy' => 1

I also need the pairs that start at a one-word offset:

'quick brown' => 1,
'fox jumps' => 1,
'over the' => 1
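The pairs go missing because each `/g` match resumes where the previous one ended, so a word already consumed as the second half of one pair cannot start the next pair. One possible workaround, sketched here as an illustration and not taken from the original post, is to put the capture inside a zero-width lookahead so that no text is consumed and every word boundary gets a chance to start a pair:

```perl
use strict;
use warnings;

my $body = 'the quick brown fox jumps over the lazy dog';
my %words;

# \b anchors the attempt at the start of a word; the lookahead captures
# "word word" without consuming it, so the next attempt can begin at the
# second word of the pair just captured.
$words{$_}++ for $body =~ m/\b(?=(\w+ \w+))/g;

print "$_ => $words{$_}\n" for sort keys %words;
```

Like the original pattern, this assumes words are separated by exactly one space.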

Is there an easier way than the following?

my $orig_body = $body;
# single word keywords
$words{$_}++ for $body =~ m/(\w+)/g;
# double word keywords
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body = $orig_body;
# triple word keywords
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body = $orig_body;
$body =~ s/^(\w+ \w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;

Grr*_*rrr 5

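Although the task described might be interesting to code by hand, wouldn't it be better to use an existing CPAN module that handles n-grams? It looks like Text::Ngrams (as opposed to Text::Ngram) can handle word-based n-gram analysis.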
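If installing a module is not an option, the hand-coded route can be shortened by tokenizing once and sliding a window over the word list. The following is a minimal sketch using only core Perl; `ngram_counts` and its `$max_n` parameter are hypothetical names introduced for illustration, not part of Text::Ngrams or any other module:

```perl
use strict;
use warnings;

# Hypothetical helper: count every run of 1..$max_n consecutive words.
sub ngram_counts {
    my ($text, $max_n) = @_;
    my @tokens = $text =~ /(\w+)/g;    # split the text into words once
    my %count;
    for my $n (1 .. $max_n) {
        # Slide a window of $n words across the token list; the range is
        # empty when there are fewer than $n tokens.
        for my $start (0 .. @tokens - $n) {
            $count{ join ' ', @tokens[$start .. $start + $n - 1] }++;
        }
    }
    return \%count;
}

my $words = ngram_counts('the quick brown fox jumps over the lazy dog', 3);
printf "%-20s %d\n", $_, $words->{$_} for sort keys %$words;
```

Unlike the regex in the question, this joins adjacent words with a single space whatever separated them in the input, so punctuation or multiple spaces between words will not prevent a pair from being counted.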