Determining the word frequency of specific terms

Question

fds*_*yre 14 linux text analysis frequency word-frequency

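I'm a non-computer-science student writing a history thesis that involves determining the frequency of specific terms across a number of texts and then plotting those frequencies over time to identify changes and trends. I have figured out how to determine word frequencies for a given text file, but I am dealing with a (relatively, for me) large number of files (>100), and for consistency's sake I would like to limit the words included in the frequency count to a specific set of terms (sort of the opposite of a "stop list").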

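This should be kept very simple. In the end, all I need is the frequencies of the specific words for each text file I process, preferably in a spreadsheet format (a tab-delimited file), so that I can then create graphs and visualizations from that data.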

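I use Linux day to day, am comfortable with the command line, and would love an open-source solution (or something I could run with WINE). That is not a requirement, however.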

I see two ways to solve this problem:

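Find a way to strip out all the words in a text file except those on the predefined list and then do the frequency count from there, or:
Find a way to do a frequency count using only the terms on the predefined list.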

Any ideas?

Answer 1

Rob*_*ble 7

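I would go with the second idea. Here is a simple Perl program that reads a list of words from the first file provided and prints, in tab-delimited format, a count of each listed word found in the second file provided. The word list in the first file should contain one word per line.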

#!/usr/bin/perl

use strict;
use warnings;

my $word_list_file = shift;
my $process_file = shift;

my %word_counts;

# Open the word list file, read a line at a time, remove the newline,
# add it to the hash of words to track, initialize the count to zero
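open(my $words_fh, '<', $word_list_file)
  or die "Failed to open list file: $!\n";
while (my $listed = <$words_fh>) {
  chomp $listed;
  next if $listed eq '';  # Skip blank lines in the word list
  # Store words in lowercase for case-insensitive match
  $word_counts{lc $listed} = 0;
}
close($words_fh);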

# Read the text file one line at a time, break the text up into words
# based on word boundaries (\b), iterate through each word incrementing
# the word count in the word hash if the word is in the hash
open(my $text_fh, '<', $process_file)
  or die "Failed to open process file: $!\n";

while (<$text_fh>) {
  chomp;
  # If the line ends in a hyphen, remove the hyphen and append the
  # next line; repeat until the joined line no longer ends in a hyphen
  while ( s/-$// ) {
    my $next_line = <$text_fh>;
    last unless defined $next_line;
    chomp $next_line;  # Drop the newline so the next check sees the hyphen
    $_ .= $next_line;
  }

  my @words = split /\b/, lc; # Split the lower-cased version of the string
  foreach my $word (@words) {
    $word_counts{$word}++ if exists $word_counts{$word};
  }
}
close($text_fh);

# Print each word in the hash in alphabetical order along with the
# number of times it was encountered, delimited by tabs (\t)
foreach my $word (sort keys %word_counts)
{
  print "$word\t$word_counts{$word}\n"
}

If the file words.txt contains:

linux
frequencies
science
words

and the file text.txt contains the text of your post, the following command:

perl analyze.pl words.txt text.txt

will print:

frequencies     3
linux   1
science 1
words   3

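Note that breaking on word boundaries using \b may not work the way you want in all cases. For example, if your text files contain words that are hyphenated across lines, you will need to do something a little more intelligent to match them. In that case, you could check whether the last character in a line is a hyphen and, if it is, just remove the hyphen and read another line before splitting the line into words.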
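
If you would rather repair the line breaks before counting, the same joining idea can be sketched as a separate preprocessing step with awk. This is only an illustration, not part of the Perl program: the function name join_hyphenated is made up here, and it assumes that a hyphen at the end of a line always marks a word broken across lines.

```sh
# join_hyphenated [FILE...]
# Hypothetical helper: joins each line that ends in a hyphen with the
# line that follows it, dropping the hyphen. Reads stdin if no file is given.
join_hyphenated() {
  awk '
    # Prepend whatever was carried over from a previous hyphenated line
    { line = pending $0; pending = "" }
    # Line ends in a hyphen: strip it, hold the text, wait for the next line
    line ~ /-$/ { pending = substr(line, 1, length(line) - 1); next }
    { print line }
    # Input ended right after a hyphenated line: print what is left
    END { if (pending != "") print pending }
  ' "$@"
}
```

For example, join_hyphenated text.txt > text-joined.txt writes a copy of the text with the broken words rejoined (text-joined.txt is just a placeholder name), and that copy can then be passed to the counting program.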

Edit: Updated version that matches words case-insensitively and handles words hyphenated across lines.

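Note that if there are hyphenated words, some of which are broken across lines and some of which are not, this will not find them all, because it only removes hyphens at the end of a line. In that case, you may want to remove all hyphens and match the words after the hyphens are gone (the terms in your word list must then be written without hyphens as well). You can do this by adding the following line right before the split function: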

s/-//g;
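
Since the question involves more than 100 files, here is a minimal sketch of one way to get a single tab-delimited table for all of them in one pass. It uses awk rather than the Perl program above, and the function name count_terms is hypothetical. It assumes a non-empty word list of plain ASCII terms, treats any run of characters other than a-z and 0-9 as a separator, and does not join words hyphenated across lines.

```sh
# count_terms WORDLIST FILE...
# Hypothetical helper: prints one line per (file, term) pair as
#   file<TAB>term<TAB>count
# Matching is case-insensitive; terms that never occur get a count of 0.
count_terms() {
  wordlist=$1
  shift
  awk -v OFS='\t' '
    # First input file is the term list, one term per line
    FNR == NR {
      if ($0 != "") terms[tolower($0)] = 1
      next
    }
    # Remember each text file, in order, the first time a line is read from it
    !(FILENAME in seen) { seen[FILENAME] = 1; files[++nfiles] = FILENAME }
    {
      # Lower-case the line and split it on runs of non-alphanumeric characters
      n = split(tolower($0), tok, /[^a-z0-9]+/)
      for (i = 1; i <= n; i++)
        if (tok[i] in terms) count[FILENAME, tok[i]]++
    }
    END {
      # Emit every file/term combination so that zero counts are included
      for (f = 1; f <= nfiles; f++)
        for (t in terms)
          print files[f], t, count[files[f], t] + 0
    }
  ' "$wordlist" "$@" | LC_ALL=C sort
}
```

Something like count_terms words.txt texts/*.txt > counts.tsv (texts/ being a placeholder directory) would then give one file that a spreadsheet can import and pivot by file or by term. Empty text files produce no rows in this sketch.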
