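I need to compute information gain scores for >100k features in >10k documents for text classification. The code below works correctly, but it is very slow on the full dataset: it takes more than an hour on a laptop. The dataset is 20newsgroup and I am using scikit-learn; the chi2 function that scikit-learn provides runs very fast.
Any idea how to compute information gain faster for a dataset like this?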

    import numpy as np

    def information_gain(x, y):

        def _entropy(values):
            # Shannon entropy (natural log) of the class labels in `values`.
            counts = np.bincount(values)
            probs = counts[np.nonzero(counts)] / float(len(values))
            return - np.sum(probs * np.log(probs))

        def _information_gain(feature, y):
            # Documents in which the feature is present / absent.
            feature_set_indices = np.nonzero(feature)[1]
            feature_not_set_indices = [i for i in feature_range if i not in feature_set_indices]
            entropy_x_set = _entropy(y[feature_set_indices])
            entropy_x_not_set = _entropy(y[feature_not_set_indices])

            return entropy_before - (((len(feature_set_indices) / float(feature_size)) * entropy_x_set)
                                     + ((len(feature_not_set_indices) / float(feature_size)) * entropy_x_not_set))

        feature_size = x.shape[0]
        feature_range = range(0, feature_size)
        entropy_before = _entropy(y)
        information_gain_scores …
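
For comparison, here is a minimal sketch of one possible way to get the same scores without looping over documents per feature. It is not taken from the code above: the names `information_gain_vectorized` and `_entropy_from_counts` are made up for illustration, and it assumes `x` is a dense `(n_documents, n_features)` NumPy array where a nonzero entry means "feature present" and `y` is a 1-D array of class labels. The idea is to count, for every feature at once, how many documents of each class contain it (one matrix product), and to get the "feature absent" counts by subtracting from the class totals. For a dataset of the size described, a dense array would probably not fit in memory, so the same counting would presumably have to be done with a sparse matrix product instead.

```python
import numpy as np


def _entropy_from_counts(counts):
    """Row-wise entropy (natural log) of a (n_rows, n_classes) count table."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Rows with no documents get all-zero probabilities (entropy 0).
    probs = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    # Treat 0 * log(0) as 0 by leaving those log entries at 0.
    logs = np.log(probs, out=np.zeros_like(probs), where=probs > 0)
    return -np.sum(probs * logs, axis=1)


def information_gain_vectorized(x, y):
    """Information gain of every feature (present vs. absent) w.r.t. y.

    Hypothetical helper: assumes dense x of shape (n_docs, n_features).
    """
    x = np.asarray(x)
    y = np.asarray(y).ravel()
    n_docs = x.shape[0]

    # One-hot encode the labels: (n_docs, n_classes).
    classes, y_idx = np.unique(y, return_inverse=True)
    onehot = np.zeros((n_docs, classes.size))
    onehot[np.arange(n_docs), y_idx] = 1.0

    # set_counts[f, c] = number of class-c documents that contain feature f.
    present = (x != 0).astype(float)
    set_counts = present.T @ onehot

    # Documents without the feature = class totals minus the counts above.
    class_totals = onehot.sum(axis=0)
    not_set_counts = class_totals - set_counts

    n_set = set_counts.sum(axis=1)
    n_not_set = n_docs - n_set

    entropy_before = _entropy_from_counts(class_totals[np.newaxis, :])[0]
    entropy_after = (n_set * _entropy_from_counts(set_counts)
                     + n_not_set * _entropy_from_counts(not_set_counts)) / n_docs
    return entropy_before - entropy_after
```

The result is one score per column of `x`, e.g. `scores = information_gain_vectorized(x, y)`. The cost is dominated by a single matrix product rather than a Python-level membership test for every document and every feature.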