NLTK newbie, having trouble with conditional frequency

Aph*_*pha 1 python nltk

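I'm new to Python and NLTK (I started two hours ago). This is what I've been asked to do:

Write a function GetAmbiguousWords(corpus, N) that finds the words in the corpus with more than N observed tags. The function should return a ConditionalFreqDist object in which the conditions are the words and each frequency distribution gives the tag frequencies for that word.

This is what I have so far: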

def GetAmbiguousWords(corpus, number):
    conditional_frequency = ConditionalFreqDist()
    word_tag_dict = defaultdict(set)       # Creates a dictionary of sets
    for (word, tag) in corpus:
        word_tag_dict[word].add(tag)

    for taggedWord in word_tag_dict:
        if ( len(word_tag_dict[taggedWord]) >= number ):
            condition = taggedWord
            conditional_frequency[condition] # do something, I don't know what to do

    return conditional_frequency

For example, this is how the function should be called:

GetAmbiguousWords(nltk.corpus.brown.tagged_words(categories='news'), 4)

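I'd like to know whether I'm on the right track or completely off. In particular, I don't really understand conditional frequency.

Thanks in advance.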

Dir*_*irk 5

With a frequency distribution, you collect how often each word occurs in a text:

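# word_tokenize requires NLTK's tokenizer data to be downloaded first
from nltk import FreqDist, word_tokenize

text = "cow cat mouse cat tiger"

fDist = FreqDist(word_tokenize(text))

for word in fDist:
    print("Frequency of", word, fDist.freq(word))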

This prints (the order of the lines may vary):

Frequency of tiger 0.2
Frequency of mouse 0.2
Frequency of cow 0.2
Frequency of cat 0.4
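
The values above are relative frequencies: each word's count divided by the total number of tokens. As a rough illustration of that arithmetic only, here is a minimal sketch that uses the standard library's `Counter` and `str.split` as stand-ins; it is not NLTK's implementation, and the names `counts`, `total` and `rel_freq` are made up for this example.

```python
from collections import Counter

text = "cow cat mouse cat tiger"

counts = Counter(text.split())   # raw count per word, e.g. counts["cat"] == 2
total = sum(counts.values())     # total number of tokens in the text

# Relative frequency of a word = its count / total number of tokens
rel_freq = {word: count / total for word, count in counts.items()}

print(rel_freq["cat"])           # 2 / 5
```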

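A conditional frequency distribution is basically the same thing, except that you add a condition by which the frequencies are grouped. For example, grouping by word length: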

from nltk import ConditionalFreqDist, word_tokenize

# `text` is the string defined in the previous example
cfdist = ConditionalFreqDist()

for word in word_tokenize(text):
    condition = len(word)          # group words by their length
    cfdist[condition][word] += 1

for condition in cfdist:
    for word in cfdist[condition]:
        print("Cond. frequency of", word, cfdist[condition].freq(word), "[condition is word length =", condition, "]")

This prints (the number of digits and the line order may vary by Python version):

Cond. frequency of cow 0.333333333333 [condition is word length = 3 ]
Cond. frequency of cat 0.666666666667 [condition is word length = 3 ]
Cond. frequency of tiger 0.5 [condition is word length = 5 ]
Cond. frequency of mouse 0.5 [condition is word length = 5 ]
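
Applied to the question, the condition would be the word and the samples would be its tags. The following is only a sketch of one way this could look, not a definitive solution: it assumes the corpus is an iterable of (word, tag) pairs, as `nltk.corpus.brown.tagged_words()` provides, and it reads "more than N" strictly, so it tests with `>` where the question's draft used `>=`. The toy data is made up for illustration.

```python
from nltk import ConditionalFreqDist


def GetAmbiguousWords(corpus, number):
    """Return tag frequencies for words seen with more than `number` distinct tags."""
    # Condition = word, sample = tag: one pass counts every (word, tag) pair.
    all_words = ConditionalFreqDist(corpus)

    ambiguous = ConditionalFreqDist()
    for word in all_words.conditions():
        tag_counts = all_words[word]
        # len() of a FreqDist is the number of distinct samples (here: tags).
        if len(tag_counts) > number:
            for tag, count in tag_counts.items():
                ambiguous[word][tag] += count
    return ambiguous


# Made-up toy data standing in for a tagged corpus.
toy = [("run", "VB"), ("run", "NN"), ("run", "VBP"), ("run", "VB"),
       ("cat", "NN"), ("cat", "NN"), ("back", "RB"), ("back", "NN")]

result = GetAmbiguousWords(toy, 2)
for word in result.conditions():
    for tag in result[word]:
        print(word, tag, result[word].freq(tag))
```

With the real corpus, the call from the question, `GetAmbiguousWords(nltk.corpus.brown.tagged_words(categories='news'), 4)`, should work the same way, provided the Brown corpus data has been downloaded.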

Hope that helps.