Dav*_*vis 11 bash perl awk nlp text-segmentation
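I want to automate the creation of Domains in JasperServer. Domains are "views" of data used for creating ad hoc reports. The column names must be presented to the user in a human-readable form.
In theory, there are over 2,000 possible pieces of data that the organization could include in a report. The data come from column names that are not human-friendly, such as: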
    payperiodmatchcode labordistributioncodedesc dependentrelationship
    actionendoption actionendoptiondesc addresstype addresstypedesc
    historytype psaddresstype rolename bankaccountstatus
    bankaccountstatusdesc bankaccounttype bankaccounttypedesc
    beneficiaryamount beneficiaryclass beneficiarypercent benefitsubclass
    beneficiaryclass beneficiaryclassdesc benefitactioncode
    benefitactioncodedesc benefitagecontrol benefitagecontroldesc
    ageconrolagelimit ageconrolnoticeperiod
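How would you automatically split such names into their component words (for example, "payperiodmatchcode" into "pay period match code")?
One idea is to use Google's "Did you mean" engine, although I think that violates their Terms of Service: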
lynx -dump «url» | grep "Did you mean" | awk ...
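Any language is fine, but a text-processing language such as Perl would probably be well suited. (The column names are English-only.)
The goal in breaking up the words is not 100% perfection; imperfect results are acceptable.
No matter what, a human will need to double-check the results and correct many of them. Whittling a set of 2,000 results down to 600 edits would be a massive time-saver. To fixate on the few cases that have multiple possible readings (e.g., therapistname) is to miss the point altogether.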
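As an aside on the ambiguous cases mentioned above, here is a minimal Python sketch (Python because any language is acceptable here) that enumerates every way an identifier can be split against a word list, so that names with zero or several readings can be routed to the human reviewer. The function name `all_splits` and the tiny `WORDS` set are hypothetical, chosen only for illustration; a real run would load a full dictionary.

```python
from functools import lru_cache


def all_splits(identifier, words):
    """Return every way to split `identifier` into words from `words`."""
    vocabulary = frozenset(words)

    @lru_cache(maxsize=None)
    def splits_from(start):
        if start == len(identifier):
            return [[]]  # exactly one way to split the empty remainder
        found = []
        for end in range(start + 1, len(identifier) + 1):
            piece = identifier[start:end]
            if piece in vocabulary:
                # Extend every split of the remainder with this piece.
                for tail in splits_from(end):
                    found.append([piece] + tail)
        return found

    return splits_from(0)


# Hypothetical miniature dictionary, for illustration only.
WORDS = {"the", "therapist", "rapist", "name", "role"}

for ident in ("rolename", "therapistname"):
    candidates = all_splits(ident, WORDS)
    status = "ok" if len(candidates) == 1 else "needs review"
    print(ident, status, candidates)
```

Only identifiers with exactly one reading are accepted automatically; everything else is left for the manual pass that has to happen anyway.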
Sin*_*nür 14
Sometimes, brute force is acceptable:
    #!/usr/bin/perl
    use strict; use warnings;
    use File::Slurp;

    my $dict_file = '/usr/share/dict/words';

    my @identifiers = qw(
        payperiodmatchcode labordistributioncodedesc dependentrelationship
        actionendoption actionendoptiondesc addresstype addresstypedesc
        historytype psaddresstype rolename bankaccountstatus
        bankaccountstatusdesc bankaccounttype bankaccounttypedesc
        beneficiaryamount beneficiaryclass beneficiarypercent benefitsubclass
        beneficiaryclass beneficiaryclassdesc benefitactioncode
        benefitactioncodedesc benefitagecontrol benefitagecontroldesc
        ageconrolagelimit ageconrolnoticeperiod
    );

    # Domain-specific abbreviations that the system dictionary lacks.
    my @mydict = qw( desc );

    # Build one alternation of all words longer than two characters,
    # longest first.
    my $pat = join('|',
        map quotemeta,
        sort { length $b <=> length $a || $a cmp $b }
        grep { 2 < length }
        (@mydict, map { chomp; $_ } read_file $dict_file)
    );

    my $re = qr/$pat/;

    for my $identifier ( @identifiers ) {
        my @stack;
        print "$identifier : ";
        # Repeatedly strip a dictionary word from the end of the identifier.
        while ( $identifier =~ s/($re)\z// ) {
            unshift @stack, $1;
        }
        # mark suspicious cases
        unshift @stack, '*', $identifier if length $identifier;
        print "@stack\n";
    }
Output:
    payperiodmatchcode : pay period match code
    labordistributioncodedesc : labor distribution code desc
    dependentrelationship : dependent relationship
    actionendoption : action end option
    actionendoptiondesc : action end option desc
    addresstype : address type
    addresstypedesc : address type desc
    historytype : history type
    psaddresstype : * ps address type
    rolename : role name
    bankaccountstatus : bank account status
    bankaccountstatusdesc : bank account status desc
    bankaccounttype : bank account type
    bankaccounttypedesc : bank account type desc
    beneficiaryamount : beneficiary amount
    beneficiaryclass : beneficiary class
    beneficiarypercent : beneficiary percent
    benefitsubclass : benefit subclass
    beneficiaryclass : beneficiary class
    beneficiaryclassdesc : beneficiary class desc
    benefitactioncode : benefit action code
    benefitactioncodedesc : benefit action code desc
    benefitagecontrol : benefit age control
    benefitagecontroldesc : benefit age control desc
    ageconrolagelimit : * ageconrol age limit
    ageconrolnoticeperiod : * ageconrol notice period
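For readers who do not use Perl, the suffix-stripping loop in the script can be sketched in Python roughly as follows. This is an illustrative port, not the answer's code: `split_identifier` and the small `WORDS` set are hypothetical names, and `WORDS` stands in for `/usr/share/dict/words` plus the custom `desc` entry. Because the Perl substitution is anchored with `\z`, it in effect removes the longest suffix that is a dictionary word, which is what the inner loop below does.

```python
def split_identifier(identifier, words, min_len=3):
    """Peel the longest known word off the end until nothing matches."""
    parts = []
    rest = identifier
    while rest:
        # The earliest start position gives the longest matching suffix.
        for start in range(len(rest) - min_len + 1):
            if rest[start:] in words:
                parts.insert(0, rest[start:])
                rest = rest[:start]
                break
        else:
            break  # no suffix of at least min_len characters is a known word
    if rest:
        # Mark suspicious cases the same way the Perl script does.
        parts[:0] = ["*", rest]
    return parts


# Hypothetical stand-in for the system dictionary plus "desc".
WORDS = {"pay", "period", "match", "code", "address", "type",
         "desc", "age", "limit"}

for name in ("payperiodmatchcode", "addresstypedesc",
             "psaddresstype", "ageconrolagelimit"):
    print(name, ":", " ".join(split_identifier(name, WORDS)))
```

Lines whose result starts with `*` are the ones a human should look at first, since part of the identifier could not be matched against the word list.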