如何拆分多个连接的单词？

Question

如何拆分多个连接的单词？

我有一个1000个左右的数组,下面是一些例子:

wickedweather
liquidweather
driveourtrucks
gocompact
slimprojector

Run Code Online (Sandbox Code Playgroud)

我希望能够将这些分成各自的词,如:

wicked weather
liquid weather
drive our trucks
go compact
slim projector

Run Code Online (Sandbox Code Playgroud)

我希望有一个正则表达式,我可以做到这一点.但是,既然没有边界可以停下来,也没有任何我可以关键的大写,我想,有些类型的字典引用可能是必要的吗？

我想它可以手工完成,但为什么 - 什么时候可以用代码完成!=)但这让我很难过.有任何想法吗？

Answer 1

Dar*_*con 75

该Viterbi算法的速度要快得多.它在上面的Dmitry答案中计算与递归搜索相同的分数,但是在O(n)时间内.(德米特里的搜索需要指数时间;维特比通过动态编程来完成.)

import re
from collections import Counter

def viterbi_segment(text):
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                        for j in range(max(0, i - max_word_length), i))
        probs.append(prob_k)
        lasts.append(k)
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]

def word_prob(word): return dictionary[word] / total
def words(text): return re.findall('[a-z]+', text.lower()) 
dictionary = Counter(words(open('big.txt').read()))
max_word_length = max(map(len, dictionary))
total = float(sum(dictionary.values()))

Run Code Online (Sandbox Code Playgroud)

测试它:

>>> viterbi_segment('wickedweather')
(['wicked', 'weather'], 5.1518198982768158e-10)
>>> ' '.join(viterbi_segment('itseasyformetosplitlongruntogetherblocks')[0])
'its easy for me to split long run together blocks'

Run Code Online (Sandbox Code Playgroud)

为了实用,您可能需要一些改进:

添加概率日志,不要乘以概率.这可以避免浮点下溢.
您的输入通常会使用不在您的语料库中的单词.必须将这些子字符串分配为非零概率作为单词,否则最终没有解决方案或不良解决方案.(对于上述指数搜索算法也是如此.)这个概率必须从语料库词的概率中抽走,并在所有其他候选词中合理分布:一般主题在统计语言模型中称为平滑.(但是你可以通过一些非常粗糙的黑客来逃避.)这是O(n)维特比算法打破搜索算法的地方,因为考虑非语料库词会炸掉分支因子.

英语单词列表在这里。https://raw.githubusercontent.com/dwyl/english-words/master/words.txt (3认同)
你是对的,它没有回溯.它确实产生具有最高总分(单词概率的乘积)的分段.我不清楚你遇到了什么具体的问题 - 你在字典中"徘徊"并且你"明白了"吗？单字频率的结果可以以显着更大的计算成本来改进,通过进入二阶模型,您可以看到条件概率:P(word | preceding_word).请参阅http://norvig.com/ngrams/(我应该通过链接更新答案). (2认同)

Answer 2

Squ*_*Cog 33

人类能做到吗？

farsidebag
far sidebag
farside bag
far side bag

你不仅需要使用字典,你可能必须使用统计方法来找出最有可能的(或者,上帝禁止,选择人类语言的真实HMM ......)

对于如何进行可能有用的统计数据,我转向Peter Norvig博士,他解决了21行代码中不同但相关的拼写检查问题:http: //norvig.com/spell-correct.html

(他通过将每个for循环折叠成一行来做作弊...但仍然).

更新这一直困在我的脑海里,所以今天我不得不生下它.此代码与Robert Gamble描述的代码类似,但它会根据提供的字典文件中的字频率对结果进行排序(现在预计会出现一些代表您的域名或一般英语的文本.我使用了大字典.来自Norvig的.txt,与上面相关联,并在其上写了一本词典,以涵盖遗漏的单词).

除非频率差异很大,否则大多数时间的两个单词的组合将击败3个单词的组合.

我在我的博客上发布了一些微小的更改代码

http://squarecog.wordpress.com/2008/10/19/splitting-words-joined-into-a-single-string/ 并且还写了一些关于此代码中的下溢错误.我很想安静地解决它,但想到这可能会帮助一些以前没有看过日志技巧的人:http: //squarecog.wordpress.com/2009/01/10/dealing-with-underflow-in-joint-probability-calculations/

输出你的单词,加上我自己的一些 - 注意"orcore"会发生什么:

perl splitwords.pl big.txt words
answerveal: 2 possibilities
 -  answer veal
 -  answer ve al

wickedweather: 4 possibilities
 -  wicked weather
 -  wicked we at her
 -  wick ed weather
 -  wick ed we at her

liquidweather: 6 possibilities
 -  liquid weather
 -  liquid we at her
 -  li quid weather
 -  li quid we at her
 -  li qu id weather
 -  li qu id we at her

driveourtrucks: 1 possibilities
 -  drive our trucks

gocompact: 1 possibilities
 -  go compact

slimprojector: 2 possibilities
 -  slim projector
 -  slim project or

orcore: 3 possibilities
 -  or core
 -  or co re
 -  orc ore

码:

#!/usr/bin/env perl

use strict;
use warnings;

sub find_matches($);
sub find_matches_rec($\@\@);
sub find_word_seq_score(@);
sub get_word_stats($);
sub print_results($@);
sub Usage();

our(%DICT,$TOTAL);
{
  my( $dict_file, $word_file ) = @ARGV;
  ($dict_file && $word_file) or die(Usage);

  {
    my $DICT;
    ($DICT, $TOTAL) = get_word_stats($dict_file);
    %DICT = %$DICT;
  }

  {
    open( my $WORDS, '<', $word_file ) or die "unable to open $word_file\n";

    foreach my $word (<$WORDS>) {
      chomp $word;
      my $arr = find_matches($word);


      local $_;
      # Schwartzian Transform
      my @sorted_arr =
        map  { $_->[0] }
        sort { $b->[1] <=> $a->[1] }
        map  {
          [ $_, find_word_seq_score(@$_) ]
        }
        @$arr;


      print_results( $word, @sorted_arr );
    }

    close $WORDS;
  }
}


sub find_matches($){
    my( $string ) = @_;

    my @found_parses;
    my @words;
    find_matches_rec( $string, @words, @found_parses );

    return  @found_parses if wantarray;
    return \@found_parses;
}

sub find_matches_rec($\@\@){
    my( $string, $words_sofar, $found_parses ) = @_;
    my $length = length $string;

    unless( $length ){
      push @$found_parses, $words_sofar;

      return @$found_parses if wantarray;
      return  $found_parses;
    }

    foreach my $i ( 2..$length ){
      my $prefix = substr($string, 0, $i);
      my $suffix = substr($string, $i, $length-$i);

      if( exists $DICT{$prefix} ){
        my @words = ( @$words_sofar, $prefix );
        find_matches_rec( $suffix, @words, @$found_parses );
      }
    }

    return @$found_parses if wantarray;
    return  $found_parses;
}


## Just a simple joint probability
## assumes independence between words, which is obviously untrue
## that's why this is broken out -- feel free to add better brains
sub find_word_seq_score(@){
    my( @words ) = @_;
    local $_;

    my $score = 1;
    foreach ( @words ){
        $score = $score * $DICT{$_} / $TOTAL;
    }

    return $score;
}

sub get_word_stats($){
    my ($filename) = @_;

    open(my $DICT, '<', $filename) or die "unable to open $filename\n";

    local $/= undef;
    local $_;
    my %dict;
    my $total = 0;

    while ( <$DICT> ){
      foreach ( split(/\b/, $_) ) {
        $dict{$_} += 1;
        $total++;
      }
    }

    close $DICT;

    return (\%dict, $total);
}

sub print_results($@){
    #( 'word', [qw'test one'], [qw'test two'], ... )
    my ($word,  @combos) = @_;
    local $_;
    my $possible = scalar @combos;

    print "$word: $possible possibilities\n";
    foreach (@combos) {
      print ' -  ', join(' ', @$_), "\n";
    }
    print "\n";
}

sub Usage(){
    return "$0 /path/to/dictionary /path/to/your_words";
}

Run Code Online (Sandbox Code Playgroud)

这个代码显然存在一个缺陷,它只是提出了"专家性别变化"作为第二个最有可能的结果. (4认同)

Answer 3

Rob*_*ble 8

这里工作的最佳工具是递归,而不是正则表达式.基本思路是从字符串的开头开始寻找一个单词,然后取出字符串的其余部分并查找另一个单词,依此类推,直到到达字符串的末尾.递归解决方案很自然,因为当字符串的给定余数不能被分成一组单词时,需要进行回溯.下面的解决方案使用字典来确定什么是单词并在找到它们时打印出解决方案(一些字符串可以分解成多个可能的单词集,例如wickedweather可以被解析为"邪恶的我们在她身上").如果您只想要一组单词,则需要确定选择最佳组的规则,

#!/usr/bin/perl

use strict;

my $WORD_FILE = '/usr/share/dict/words'; #Change as needed
my %words; # Hash of words in dictionary

# Open dictionary, load words into hash
open(WORDS, $WORD_FILE) or die "Failed to open dictionary: $!\n";
while (<WORDS>) {
  chomp;
  $words{lc($_)} = 1;
}
close(WORDS);

# Read one line at a time from stdin, break into words
while (<>) {
  chomp;
  my @words;
  find_words(lc($_));
}

sub find_words {
  # Print every way $string can be parsed into whole words
  my $string = shift;
  my @words = @_;
  my $length = length $string;

  foreach my $i ( 1 .. $length ) {
    my $word = substr $string, 0, $i;
    my $remainder = substr $string, $i, $length - $i;
    # Some dictionaries contain each letter as a word
    next if ($i == 1 && ($word ne "a" && $word ne "i"));

    if (defined($words{$word})) {
      push @words, $word;
      if ($remainder eq "") {
        print join(' ', @words), "\n";
        return;
      } else {
        find_words($remainder, @words);
      }
      pop @words;
    }
  }

  return;
}

Run Code Online (Sandbox Code Playgroud)

Answer 4

kam*_*sar 8

pip 安装 wordninja

>>> import wordninja
>>> wordninja.split('bettergood')
['better', 'good']

Run Code Online (Sandbox Code Playgroud)

最好提及产生上述 wordninja 包的答案：/sf/answers/814988121/ (6认同)

归档时间：	17 年，3 月前
查看次数：	19125 次
最近记录：	6 年，4 月前