According to /sf/answers/3428728411/, this is an implementation of the weighted Gini coefficient in Python:
import numpy as np

def gini(x, weights=None):
    if weights is None:
        weights = np.ones_like(x)
    # Calculate mean absolute deviation in two steps, for weights.
    count = np.multiply.outer(weights, weights)
    mad = np.abs(np.subtract.outer(x, x) * count).sum() / count.sum()
    rmad = mad / np.average(x, weights=weights)
    # Gini equals half the relative mean absolute deviation.
    return 0.5 * rmad
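This is clean and works well for medium-sized arrays, but as warned in its initial suggestion (/sf/answers/2765965961/), it is O(n^2). On my computer, that means it breaks after about 20k rows: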
n = 20000 # Works, 30000 fails.
gini(np.random.rand(n), np.random.rand(n))
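The failure is probably a memory limit rather than a time limit, although that is an assumption since the error message is not shown. Each `outer` call builds an n x n array, and the expression holds several of them at once. A rough sketch of the size of one such array, assuming 8-byte floats (the helper name `outer_array_gib` is made up for illustration):

```python
def outer_array_gib(n, itemsize=8):
    """Approximate size in GiB of one n x n array with `itemsize` bytes per element."""
    return n * n * itemsize / 2**30

# Compare the sizes mentioned in the question.
for n in (20_000, 30_000, 150_000):
    print(n, round(outer_array_gib(n), 1), "GiB per n x n array")
```

At 150k rows a single such array is far beyond typical RAM, so any fix has to avoid materializing the pairwise matrix at all.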
Can this be adapted to work for larger datasets? Mine has ~150k rows.
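One possible approach, offered as a sketch rather than a definitive answer: sort by `x` and replace the pairwise matrix with cumulative sums. With the data sorted, every |x_i - x_j| is a plain difference, so the double sum collapses to an expression in the running totals of `w` and `x * w`. This assumes non-negative weights and a non-zero weighted mean; the name `gini_sorted` is made up here.

```python
import numpy as np

def gini_sorted(x, weights=None):
    """Weighted Gini coefficient in O(n log n) time and O(n) memory.

    Assumes non-negative weights and a non-zero weighted mean of x.
    """
    x = np.asarray(x, dtype=float)
    if weights is None:
        weights = np.ones_like(x)
    weights = np.asarray(weights, dtype=float)

    # Sort values (and their weights) in ascending order of x.
    order = np.argsort(x)
    x_sorted = x[order]
    w_sorted = weights[order]

    # Running totals of the weights and of the weighted values.
    cum_w = np.cumsum(w_sorted)
    cum_xw = np.cumsum(x_sorted * w_sorted)

    # Each term equals w_i * (x_i * sum_{j<i} w_j - sum_{j<i} w_j * x_j),
    # i.e. the weighted sum of differences between x_i and all smaller values.
    numerator = np.sum(cum_xw[1:] * cum_w[:-1] - cum_xw[:-1] * cum_w[1:])
    return numerator / (cum_xw[-1] * cum_w[-1])

n = 150_000
print(gini_sorted(np.random.rand(n), np.random.rand(n)))
```

On small inputs this should agree with the O(n^2) `gini` above up to floating-point error, which is an easy way to check it before trusting it on the full data.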
I want to find, in the sklearn package, the Gini value of each feature along the path for one class, for example in the iris data. E.g. for Iris-virginica: petal length gini: 0.4, petal width gini: 0.4.