如何用空格分隔"句子"中的单词?

Dav*_*vis 11 bash perl awk nlp text-segmentation

背景

希望在JasperServer中自动创建域.域是用于创建临时报告的数据的"视图".列的名称必须以人类可读的方式呈现给用户.

问题

理论上,组织可以在报告中包含2,000多种可能的数据.数据来自非人类友好的名称,例如:

payperiodmatchcode labordistributioncodedesc依赖关系actionendoption actionendoptiondesc addresstype addresstypedesc historytype psaddresstype rolename bankaccountstatus bankaccountstatusdesc bankaccounttype bankaccounttypedesc beneficiaryamount beneficiaryclass beneficiarypercent benefitsubclass beneficiaryclass beneficiaryclassdesc benefitactioncode benefitactioncodedesc benefitagecontrol benefitagecontroldesc ageconrolagelimit ageconrolnotperiperiod

你会如何自动将这些名称更改为:

  • 支付期间匹配代码
  • 劳务分配代码
  • 依赖关系

思路

  • 使用谷歌你的意思是引擎,但我认为它违反了他们的服务条款:

    lynx -dump «url» | grep "Did you mean" | awk ...

语言

任何语言都可以,但像Perl这样的文本解析器可能非常适合.(列名仅限英文.)

不必要的完美

打破单词的目标不是100%完美; 以下结果是可以接受的:

  • enrollmenteffectivedate - >报名生效日期
  • enrollmentenddate - >注册男士日期
  • enrollmentrequirementset - >注册要求集

无论如何,人类都需要仔细检查结果并纠正许多结果.将一组2,000个结果减少到600次编辑将节省大量时间.要注意一些具有多种可能性的病例(例如,治疗师名称),要完全忽略这一点.

Sin*_*nür 14

有时,强制是可以接受的:

#!/usr/bin/perl

use strict; use warnings;
use File::Slurp;

my $dict_file = '/usr/share/dict/words';

my @identifiers = qw(
    payperiodmatchcode labordistributioncodedesc dependentrelationship
    actionendoption actionendoptiondesc addresstype addresstypedesc
    historytype psaddresstype rolename bankaccountstatus
    bankaccountstatusdesc bankaccounttype bankaccounttypedesc
    beneficiaryamount beneficiaryclass beneficiarypercent benefitsubclass
    beneficiaryclass beneficiaryclassdesc benefitactioncode
    benefitactioncodedesc benefitagecontrol benefitagecontroldesc
    ageconrolagelimit ageconrolnoticeperiod
);

my @mydict = qw( desc );

my $pat = join('|',
    map quotemeta,
    sort { length $b <=> length $a || $a cmp $b }
    grep { 2 < length }
    (@mydict, map { chomp; $_ } read_file $dict_file)
);

my $re = qr/$pat/;

for my $identifier ( @identifiers ) {
    my @stack;
    print "$identifier : ";
    while ( $identifier =~ s/($re)\z// ) {
        unshift @stack, $1;
    }
    # mark suspicious cases
    unshift @stack, '*', $identifier if length $identifier;
    print "@stack\n";
}
Run Code Online (Sandbox Code Playgroud)

输出:

payperiodmatchcode : pay period match code
labordistributioncodedesc : labor distribution code desc
dependentrelationship : dependent relationship
actionendoption : action end option
actionendoptiondesc : action end option desc
addresstype : address type
addresstypedesc : address type desc
historytype : history type
psaddresstype : * ps address type
rolename : role name
bankaccountstatus : bank account status
bankaccountstatusdesc : bank account status desc
bankaccounttype : bank account type
bankaccounttypedesc : bank account type desc
beneficiaryamount : beneficiary amount
beneficiaryclass : beneficiary class
beneficiarypercent : beneficiary percent
benefitsubclass : benefit subclass
beneficiaryclass : beneficiary class
beneficiaryclassdesc : beneficiary class desc
benefitactioncode : benefit action code
benefitactioncodedesc : benefit action code desc
benefitagecontrol : benefit age control
benefitagecontroldesc : benefit age control desc
ageconrolagelimit : * ageconrol age limit
ageconrolnoticeperiod : * ageconrol notice period

另请参见Spellchecker曾经是软件工程的主要特色.