Emm*_*dec 17 python perl metadata
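I suppose I could take a text and remove the high-frequency English words from it. By keywords, I mean the words that best characterize the content of the text (tags). It doesn't have to be perfect; a good approximation is perfectly fine for my needs.
Has anyone done something like this? Do you know of a Perl or Python library that does it?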
Lingua::EN::Tagger is exactly what I asked for, but I need a library that also works on French text.
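You could try the Perl module Lingua::EN::Tagger for a quick and easy solution.
A more sophisticated module, Lingua::EN::Semtags::Engine, uses Lingua::EN::Tagger together with the WordNet database to produce more structured output. Both are easy to use; just check the documentation on CPAN, or run perldoc after installing the module.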
To find the most frequently used words in a text, you can do something like this:
#!/usr/bin/perl
use strict;
use warnings;

# Read the whole file into one string (slurp mode):
open my $ifh, '<', 'text.txt'
    or die "Cannot open file: $!";
my $text = do { local $/; <$ifh> };
close $ifh;

# Find all the words, and count how many times each one appears:
my %words;
for my $word (split /\s+/, $text) {
    $word =~ s/[",.]//g;    # strip quotes, commas and periods
    next unless length($word) > 1 && $word =~ /^[\@a-z'-]+$/i;
    $words{$word}++;
}

print "Words, sorted by frequency:\n";

# Report layout: left-justified word (truncated with "..."), right-justified count.
my @data_line;
format FMT =
@<<<<<<<<<<<<<<<<<<<<<<... @########
@data_line
.
local $~ = 'FMT';

# Print the words that appear more than twice, most frequent first:
for my $word (sort { $words{$b} <=> $words{$a} } grep { $words{$_} > 2 } keys %words) {
    @data_line = ($word, $words{$word});
    write;
}
Sample output looks like this:
john@ubuntu-pc1:~/Desktop$ perl frequency.pl
Words, sorted by frequency:
for 32
Jan 27
am 26
of 21
your 21
to 18
in 17
the 17
Get 13
you 13
OTRS 11
today 11
PSM 10
Card 10
me 9
on 9
and 9
Offline 9
with 9
Invited 9
Black 8
get 8
Web 7
Starred 7
All 7
View 7
Obama 7
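The sample output above is dominated by function words such as "for", "of" and "the", which are exactly the high-frequency words the question wants to strip out. Here is a rough sketch of the same counting idea in Python (the other language the question asks about), with a stopword filter added. The stopword set is a tiny hand-written illustration, not a real list, and the names `keywords`, `STOPWORDS_EN`, `min_count` and `top` are made up for this example. Unlike the Perl script, it lowercases the text first, so "Get" and "get" are counted together.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, NOT a complete list).
# Pass a French set instead to apply the same idea to French text.
STOPWORDS_EN = {
    "a", "all", "am", "and", "for", "get", "in", "is", "me", "of", "on",
    "the", "to", "with", "you", "your",
}


def keywords(text, stopwords=STOPWORDS_EN, min_count=1, top=10):
    """Return up to `top` (word, count) pairs for the most frequent non-stopwords."""
    # A word is a run of letters, optionally joined by apostrophes or hyphens.
    words = re.findall(r"[^\W\d_]+(?:['-][^\W\d_]+)*", text.lower())
    # Skip one-letter tokens and stopwords, then count what is left.
    counts = Counter(w for w in words if len(w) > 1 and w not in stopwords)
    return [(w, n) for w, n in counts.most_common(top) if n >= min_count]


if __name__ == "__main__":
    sample = "Get your OTRS card today. The OTRS card is for you, and for me."
    for word, count in keywords(sample):
        print(f"{word:<25} {count:>8}")
```

For real use you would probably want a proper stopword list for each language rather than a hand-written set; the filtering step itself stays the same.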