
Fast Information Gain computation

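I need to compute Information Gain scores for >100k features in >10k documents for text classification. The code below works correctly but is very slow on the full dataset: it takes more than an hour on a laptop. The dataset is 20newsgroup and I am using scikit-learn. By comparison, the chi2 function provided in scikit-learn runs extremely fast.

Any idea how to compute Information Gain faster for a dataset like this?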

import numpy as np


def information_gain(x, y):

    def _entropy(values):
        counts = np.bincount(values)
        probs = counts[np.nonzero(counts)] / float(len(values))
        return - np.sum(probs * np.log(probs))

    def _information_gain(feature, y):
        feature_set_indices = np.nonzero(feature)[1]
        feature_not_set_indices = [i for i in feature_range if i not in feature_set_indices]
        entropy_x_set = _entropy(y[feature_set_indices])
        entropy_x_not_set = _entropy(y[feature_not_set_indices])

        return entropy_before - (((len(feature_set_indices) / float(feature_size)) * entropy_x_set)
                                 + ((len(feature_not_set_indices) / float(feature_size)) * entropy_x_not_set))

    feature_size = x.shape[0]
    feature_range = range(0, feature_size)
    entropy_before = _entropy(y)
    information_gain_scores …

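For comparison, here is a minimal vectorized sketch of the same quantity. It is not the code from the question. It assumes that x is a document-term matrix of shape (n_samples, n_features), that a feature counts as "set" whenever its value is nonzero (as np.nonzero does above), and that y holds one class label per document. The names information_gain_vectorized and _row_entropy are hypothetical. The idea is to get all per-feature class counts from a single sparse matrix product rather than looping over features, since the per-feature list comprehension is a likely bottleneck in the code above.

```python
import numpy as np
from scipy import sparse


def _row_entropy(counts):
    """Entropy (natural log) of each row of a non-negative count matrix."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Rows with a zero total keep probability 0 everywhere, so entropy 0.
    probs = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    # Treat 0 * log(0) as 0 by only taking the log of positive entries.
    logs = np.log(probs, out=np.zeros_like(probs), where=probs > 0)
    return -(probs * logs).sum(axis=1)


def information_gain_vectorized(X, y):
    """Information gain of each binary (present/absent) feature w.r.t. y.

    Assumption: X is (n_samples, n_features), dense or scipy sparse;
    y is a 1-D array of class labels.
    """
    X = sparse.csr_matrix(X)
    y = np.asarray(y).ravel()
    n_samples = X.shape[0]

    # Binary indicator: 1.0 where the feature is nonzero in the document.
    X_bin = (X != 0).astype(np.float64)

    # One-hot encode the labels as a sparse (n_samples, n_classes) matrix.
    classes, y_idx = np.unique(y, return_inverse=True)
    n_classes = len(classes)
    Y = sparse.csr_matrix(
        (np.ones(n_samples), (np.arange(n_samples), y_idx)),
        shape=(n_samples, n_classes),
    )

    # counts_set[f, c]: documents of class c in which feature f is set.
    counts_set = (X_bin.T @ Y).toarray()
    class_counts = np.bincount(y_idx, minlength=n_classes).astype(float)
    # counts_not_set[f, c]: documents of class c in which feature f is not set.
    counts_not_set = class_counts - counts_set

    n_set = counts_set.sum(axis=1)
    n_not_set = n_samples - n_set

    entropy_before = _row_entropy(class_counts[np.newaxis, :])[0]
    # Weighted average of the entropies on both sides of the split.
    entropy_after = (
        (n_set / n_samples) * _row_entropy(counts_set)
        + (n_not_set / n_samples) * _row_entropy(counts_not_set)
    )
    return entropy_before - entropy_after
```

The intermediate arrays have shape (n_features, n_classes), so for roughly 100k features and 20 classes they should fit comfortably in memory. How much faster this runs on the full dataset has not been measured here.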

python performance machine-learning feature-selection scikit-learn

p_b*_*cia

2014-08-24

9
votes

1
answer

10k
views