dmc*_*cer 57
One way to do this is to extract words that occur in a document more frequently than you would expect by chance. For example, across a larger collection of documents, the term "Markov" is almost never seen. However, in a particular document from that same collection, "Markov" shows up very frequently. This suggests that Markov might be a good keyword or tag to associate with the document.
To identify keywords like this, you could use the pointwise mutual information of the keyword and the document, given by PMI(term, doc) = log [ P(term, doc) / (P(term) * P(doc)) ]. This roughly tells you how much less (or more) surprised you are to come across the term in the specific document as opposed to coming across it in the larger collection.
To identify the 5 best keywords to associate with a document, you would just sort the terms by their PMI score with the document and pick the 5 with the highest score.
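As a minimal sketch of that scoring (the counts and toy numbers below are made up for illustration, not from any real corpus), using the identity PMI(term, doc) = log [ P(term|doc) / P(term) ]:

import math
from collections import Counter

def pmi(term, doc_counts, corpus_counts):
    # PMI(term, doc) = log [ P(term|doc) / P(term) ]
    p_term_given_doc = doc_counts[term] / sum(doc_counts.values())
    p_term = corpus_counts[term] / sum(corpus_counts.values())
    return math.log(p_term_given_doc / p_term)

# toy counts: one document vs. the whole collection
doc_counts = Counter({"markov": 5, "model": 3, "the": 40})
corpus_counts = Counter({"markov": 6, "model": 50, "the": 10000})

# the 5 best keywords are just the highest-PMI terms in the document
print(sorted(doc_counts, key=lambda t: pmi(t, doc_counts, corpus_counts),
             reverse=True)[:5])
# 'markov' ranks first: rare in the collection, frequent in this document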
If you want to extract multiword tags, see the StackOverflow question How to extract common / significant phrases from a series of text entries.
Borrowing from my answer to that question, the NLTK collocations how-to covers how to extract interesting multiword expressions using n-gram PMI, in about 7 lines of code, e.g.:
import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
# change this to read in your data
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
# return the 5 n-grams with the highest PMI
finder.nbest(bigram_measures.pmi, 5)
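The same pattern extends beyond bigrams; NLTK provides a parallel trigram API, so a sketch for three-word phrases (same corpus and frequency filter as above) would be:

trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))
finder.apply_freq_filter(3)
# return the 5 trigrams with the highest PMI
finder.nbest(trigram_measures.pmi, 5)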
dou*_*oug 10
First, the key python library for computational linguistics is NLTK ("Natural Language Toolkit"). This is a stable, mature library created and maintained by professional computational linguists. It also has an extensive collection of tutorials, FAQs, etc. I recommend it highly.
Below is a simple template, in python code, directed at the problem in your question; although it's a template, it runs--supply any text as a string (as I've done here) and it will return a list of word frequencies as well as a ranked list of those words in order of 'importance' (or suitability as keywords) according to a very simple heuristic.
Keywords for a given document are (obviously) chosen from among the important words in that document--i.e., the words that are likely to distinguish it from another document. If you have no a priori knowledge of the text's subject matter, a common technique is to infer the importance or weight of a given word/term from its frequency, e.g., importance = 1/frequency.
text = """ The intensity of the feeling makes up for the disproportion of the objects. Things are equal to the imagination, which have the power of affecting the mind with an equal degree of terror, admiration, delight, or love. When Lear calls upon the heavens to avenge his cause, "for they are old like him," there is nothing extravagant or impious in this sublime identification of his age with theirs; for there is no other image which could do justice to the agonising sense of his wrongs and his despair! """
BAD_CHARS = ".!?,\'\""
# transform text into a list words--removing punctuation and filtering small words
words = [ word.strip(BAD_CHARS) for word in text.strip().split() if len(word) > 3 ]
word_freq = {}
# generate a 'word histogram' for the text--ie, a list of the frequencies of each word
for word in words:
    word_freq[word] = word_freq.get(word, 0) + 1
# sort the word list by frequency
# (just a DSU sort; the built-in equivalent is sorted(word_freq.items(), key=..., reverse=True))
tx = [ (v, k) for (k, v) in word_freq.items()]
tx.sort(reverse=True)
word_freq_sorted = [ (k, v) for (v, k) in tx ]
# eg, what are the most common words in that text?
print(word_freq_sorted)
# example output (from a larger sample text): [('which', 4), ('other', 4), ('like', 4), ('what', 3), ('upon', 3)]
# obviously using a text larger than 50 or so words will give you more meaningful results
term_importance = lambda word : 1.0/word_freq[word]
# select document keywords from the words at/near the top of this list:
print(list(map(term_importance, word_freq.keys())))  # list() materializes the map in Python 3
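Incidentally, the histogram-plus-sort portion of this template collapses to a couple of lines with collections.Counter from the standard library (a compact sketch of the same logic, not what the template above ran):

from collections import Counter
word_freq = Counter(word.strip(BAD_CHARS) for word in text.strip().split()
                    if len(word) > 3)
print(word_freq.most_common(5))  # (word, count) pairs, most frequent first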
Latent Dirichlet allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) tries to represent each document in a training corpus as a mixture of topics, which in turn are distributions mapping words to probabilities.
I once used it to dissect a corpus of product reviews into the latent ideas being discussed across all the documents, such as "customer service", "product usability", etc. The basic model does not prescribe a way to convert the topic models into a single word describing what a topic is about... but people have come up with all kinds of heuristics for doing that once their model is trained.
I recommend you try playing with http://mallet.cs.umass.edu/ and see whether this model fits your needs.
LDA is a completely unsupervised algorithm, meaning it doesn't require you to hand-annotate anything, which is great, but on the flip side, it might not deliver the topics you were expecting it to give.
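If you'd rather stay in Python than use MALLET's Java toolchain, the gensim library also exposes LDA; a minimal sketch on toy documents (gensim is my suggestion here, not something the answer above used):

from gensim import corpora, models

docs = [["customer", "service", "was", "slow", "today"],
        ["great", "product", "usability", "and", "design"],
        ["usability", "issues", "but", "good", "customer", "service"]]
dictionary = corpora.Dictionary(docs)               # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors
# train a 2-topic model; each topic is a distribution over words
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)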