Computing the symmetric Kullback-Leibler divergence between two documents


I followed the paper here and the code here to compute the KL divergence (KLD) between two text datasets (it uses the symmetric KLD, and the first link proposes a back-off model to implement it). I ended up changing the for loop so that it returns the probability distributions of the two datasets, to test that both sum to 1:
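
For reference, the quantity I am after is, as I read the linked paper, the two-sided (symmetric) KL divergence between the two word distributions p and q, with epsilon/gamma back-off smoothing so that words occurring in only one document still receive a small probability:

D(p \| q) = \sum_{w} p(w) \log \frac{p(w)}{q(w)}
\qquad
D_{\mathrm{sym}}(p, q) = D(p \| q) + D(q \| p)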

import re, math, collections

def tokenize(_str):
    stopwords = ['and', 'for', 'if', 'the', 'then', 'be', 'is',
                 'are', 'will', 'in', 'it', 'to', 'that']
    tokens = collections.defaultdict(float)
    for m in re.finditer(r"(\w+)", _str, re.UNICODE):
        m = m.group(1).lower()
        if len(m) < 2: continue
        if m in stopwords: continue
        tokens[m] += 1

    return tokens
#end of tokenize

def kldiv(_s, _t):
    # Treat an empty document as infinitely far away.
    if len(_s) == 0:
        return 1e33
    if len(_t) == 0:
        return 1e33

    ssum = 0. + sum(_s.values())
    slen = len(_s)

    tsum = 0. + sum(_t.values())
    tlen = len(_t)

    # Words that occur in _s but not in _t.
    vocabdiff = set(_s.keys()).difference(set(_t.keys()))
    lenvocabdiff = len(vocabdiff)

    # epsilon: the back-off probability given to unseen words
    epsilon = min(min(_s.values())/ssum, min(_t.values())/tsum) * 0.001

    # gamma: discount factor keeping the smoothed distribution normalized
    gamma = 1 - lenvocabdiff * epsilon

    # Check that the distribution probabilities sum to 1
    sc = sum([v/ssum for v in _s.values()])
    st = sum([v/tsum for v in _t.values()])

    ps = []
    pt = []
    for t, v in _s.items():
        pts = v / ssum
        ptt = epsilon
        if t in _t:
            ptt = gamma * (_t[t] / tsum)
        ps.append(pts)
        pt.append(ptt)
    return ps, pt

I tested it with:

d1 = """Many research publications want you to use BibTeX, which better organizes the whole process. Suppose for concreteness your source file is x.tex. Basically, you create a file x.bib containing the bibliography, and run bibtex on that file.""" d2 = """In this case you must supply both a \left and a \right because the delimiter height are made to match whatever is contained between the two commands. But, the \left doesn't have to be an actual 'left delimiter', that is you can use '\left)' if there were some reason to do it."""

sum(ps) = 1, but sum(pt) comes out less than 1. Should that be the case?

Is there something incorrect in the code? Thanks!

Update:

To make both pt and ps sum to 1, I had to change the code to:

    # Build the joint vocabulary of the two documents
    # (this needs: from collections import Counter).
    vocab = Counter(_s) + Counter(_t)
    ps = []
    pt = []
    for t, v in vocab.items():
        if t in _s:
            pts = gamma * (_s[t] / ssum)
        else:
            pts = epsilon

        if t in _t:
            ptt = gamma * (_t[t] / tsum)
        else:
            ptt = epsilon

        ps.append(pts)
        pt.append(ptt)

    return ps, pt
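
With this change, rerunning the driver from above (same assumed setup) prints two sums that are both approximately 1:

ps, pt = kldiv(tokenize(d1), tokenize(d2))
print(sum(ps), sum(pt))   # both should now be close to 1.0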

Tom*_*oim answered:

Both sum(ps) and sum(pt) are the total probability mass of _s and _t over the support of s (by "the support of s" I mean all the words that appear in _s, regardless of whether they appear in _t). This means that:

  1. sum(ps) == 1, because the for loop sums over all the words in _s.
  2. sum(pt) <= 1, with equality if the support of t is a subset of the support of s (that is, if all the words in _t also appear in _s). Moreover, sum(pt) can be close to 0 if the overlap between the words of _s and _t is small. Specifically, if the intersection of _s and _t is empty, then sum(pt) == epsilon * len(_s). (A toy example follows this list.)
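
To make point 2 concrete, here is a toy example (hypothetical counts, not taken from the question) in which the two documents share only one word:

# _s's support is {a, b}; _t's support is {a, c}; they share only 'a'.
_s = {'a': 1.0, 'b': 1.0}   # ssum = 2
_t = {'a': 1.0, 'c': 1.0}   # tsum = 2

# The original loop iterates only over _s's support, so:
#   ps = [1/2, 1/2]              -> sum(ps) == 1
#   pt = [gamma * 1/2, epsilon]  -> sum(pt) is about 1/2, well below 1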

So, I don't think there is anything wrong with the code.

Also, contrary to the question's title, kldiv() does not compute the symmetric KL divergence; it computes the KL divergence between _s and a smoothed version of _t.
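
If you do want a symmetric value, one option (a sketch, not the linked paper's exact recipe) is to compute the divergence in both directions over the aligned distributions returned by the updated loop and add them:

import math

def symmetric_kl(ps, pt):
    # D(p||q) + D(q||p); assumes ps and pt are aligned, strictly
    # positive probability vectors, e.g. the output of the updated kldiv().
    kl_pq = sum(p * math.log(p / q) for p, q in zip(ps, pt))
    kl_qp = sum(q * math.log(q / p) for p, q in zip(ps, pt))
    return kl_pq + kl_qp

ps, pt = kldiv(tokenize(d1), tokenize(d2))
print(symmetric_kl(ps, pt))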