MemoryError from sklearn.metrics.silhouette_samples

Kei*_*ith 1 python numpy cluster-analysis out-of-memory scikit-learn

I get a MemoryError when calling sklearn.metrics.silhouette_samples. My use case is the same as this tutorial. I am using scikit-learn 0.18.1 with Python 3.5.

For the related function, silhouette_score, this post suggests using the sample_size parameter, which reduces the sample size before the silhouette is computed. I am not sure that downsampling would still produce reliable results, so I am hesitant to do that.
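For reference, sampling is built into silhouette_score itself via sample_size (and random_state for reproducibility). A minimal sketch of how that would look, using toy data from make_blobs as a stand-in for the real 107545 x 12 frame:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy stand-in for the real data (an assumption for illustration).
X, _ = make_blobs(n_samples=2000, n_features=12, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Full score vs. a score on a 500-point random subsample.
score_full = silhouette_score(X, labels)
score_sub = silhouette_score(X, labels, sample_size=500, random_state=0)
print(score_full, score_sub)
```

On well-separated clusters the subsampled score tends to stay close to the full one, but that is exactly the reliability question the post raises for messier data.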

My input X is a [107545 rows x 12 columns] DataFrame. I only have 8 GB of RAM, but I would not really call that large:

sklearn.metrics.silhouette_samples(X, labels, metric='euclidean')
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-39-7285690e9ce8> in <module>()
----> 1 silhouette_samples(df_scaled, df['Cluster_Label'])

C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\cluster\unsupervised.py in silhouette_samples(X, labels, metric, **kwds)
    167     check_number_of_labels(len(le.classes_), X.shape[0])
    168 
--> 169     distances = pairwise_distances(X, metric=metric, **kwds)
    170     unique_labels = le.classes_
    171     n_samples_per_label = np.bincount(labels, minlength=len(unique_labels))
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
   1245         func = partial(distance.cdist, metric=metric, **kwds)
   1246 
-> 1247     return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
   1248 
   1249 
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
   1088     if n_jobs == 1:
   1089         # Special case to avoid picklability checks in delayed
-> 1090         return func(X, Y, **kwds)
   1091 
   1092     # TODO: in some cases, backend='threading' may be appropriate
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\pairwise.py in euclidean_distances(X, Y, Y_norm_squared, squared, X_norm_squared)
    244         YY = row_norms(Y, squared=True)[np.newaxis, :]
    245 
--> 246     distances = safe_sparse_dot(X, Y.T, dense_output=True)
    247     distances *= -2
    248     distances += XX
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\utils\extmath.py in safe_sparse_dot(a, b, dense_output)
    138         return ret
    139     else:
--> 140         return np.dot(a, b)
    141 
    142 
MemoryError: 

The computation appears to crash in euclidean_distances, on the call to np.dot. I am not dealing with sparse data here, so maybe there is no workaround. When computing distances I normally use numpy.linalg.norm(A - B). Does that handle memory better?

小智 7

Update: PR 11135 should fix this within scikit-learn, making the rest of this post obsolete.
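If I recall correctly, newer scikit-learn releases (0.20+) also expose pairwise_distances_chunked, which yields the distance matrix in row blocks so a reduction can be applied per chunk instead of materializing the full N x N matrix. A rough sketch:

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

rng = np.random.RandomState(0)
X = rng.rand(1000, 12)

# Accumulate per-row mean distances chunk by chunk; only one row block
# of the distance matrix is in memory at a time (working_memory is in MiB).
row_means = np.concatenate([
    chunk.mean(axis=1)
    for chunk in pairwise_distances_chunked(X, working_memory=1)
])
print(row_means.shape)
```

The same pattern (a reduce_func per chunk) is what makes the silhouette computation feasible at 1e5 samples.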


You have about 100000 = 1e5 samples, which are points in 12-dimensional space. The pairwise_distances method is trying to compute all pairwise distances between them. That is (1e5)**2 = 1e10 distances. Each one is a floating-point number; the float64 format takes 8 bytes of memory. So the distance matrix has size 8e10 bytes, which is 74.5 GB.
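The arithmetic above in two lines of Python (using the rounded 1e5 sample count from the estimate, not the exact 107545):

```python
# Size of the full pairwise distance matrix for ~1e5 float64 samples.
n_samples = 100_000                    # question has 107545 rows; 1e5 keeps it round
matrix_gib = n_samples ** 2 * 8 / 2**30  # 8 bytes per float64, 2**30 bytes per GiB
print(round(matrix_gib, 1))  # 74.5
```

Well beyond the 8 GB of RAM available, before NumPy even starts the np.dot.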

This has been reported on GitHub occasionally: #4701, #4197, and the answers are roughly: it is a NumPy issue, and np.dot cannot handle matrices of that size. There was one comment, though:

it might be possible to break it into sub-matrices and do the computation more memory-efficiently.

Indeed, if instead of forming one giant distance matrix at the start, the method computed the relevant blocks of it inside the loop over labels, it would need much less memory.

It is not hard to modify the method so that instead of computing the distances first and then applying a binary mask, it masks first. This is what I do below. Instead of N**2 memory, where N is the number of samples, it needs n**2, where n is the maximal cluster size.

If this looks practical, I imagine it could be added to scikit-learn behind some flag... but note that this version does not support metric='precomputed'.

import numpy as np
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.utils import check_X_y
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics.cluster.unsupervised import check_number_of_labels

def silhouette_samples_memory_saving(X, labels, metric='euclidean', **kwds):
    X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])
    le = LabelEncoder()
    labels = le.fit_transform(labels)
    check_number_of_labels(len(le.classes_), X.shape[0])

    unique_labels = le.classes_
    n_samples_per_label = np.bincount(labels, minlength=len(unique_labels))

    # For sample i, store the mean distance of the cluster to which
    # it belongs in intra_clust_dists[i]
    intra_clust_dists = np.zeros(X.shape[0], dtype=X.dtype)

    # For sample i, store the mean distance of the second closest
    # cluster in inter_clust_dists[i]
    inter_clust_dists = np.inf + intra_clust_dists

    for curr_label in range(len(unique_labels)):

        # Find inter_clust_dist for all samples belonging to the same
        # label.
        mask = labels == curr_label

        # Leave out current sample.
        n_samples_curr_lab = n_samples_per_label[curr_label] - 1
        if n_samples_curr_lab != 0:
            intra_distances = pairwise_distances(X[mask, :], metric=metric, **kwds)
            intra_clust_dists[mask] = np.sum(intra_distances, axis=1) / n_samples_curr_lab

        # Now iterate over all other labels, finding the mean
        # cluster distance that is closest to every sample.
        for other_label in range(len(unique_labels)):
            if other_label != curr_label:
                other_mask = labels == other_label
                inter_distances = pairwise_distances(X[mask, :], X[other_mask, :], metric=metric, **kwds)
                other_distances = np.mean(inter_distances, axis=1)
                inter_clust_dists[mask] = np.minimum(inter_clust_dists[mask], other_distances)

    sil_samples = inter_clust_dists - intra_clust_dists
    sil_samples /= np.maximum(intra_clust_dists, inter_clust_dists)
    # score 0 for clusters of size 1, according to the paper
    sil_samples[n_samples_per_label.take(labels) == 1] = 0
    return sil_samples
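On a small dataset the masking idea is easy to sanity-check against the reference implementation. The sketch below (toy data from make_blobs, an assumption for illustration; singleton-cluster handling is omitted since the blobs have none) re-derives per-sample silhouettes from per-cluster distance blocks and compares them with sklearn.metrics.silhouette_samples:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, pairwise_distances

X, _ = make_blobs(n_samples=300, n_features=12, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Block-wise silhouette: only n_i x n_j distance blocks are ever formed.
intra = np.zeros(len(X))
inter = np.full(len(X), np.inf)
for k in np.unique(labels):
    mask = labels == k
    d = pairwise_distances(X[mask])
    # Mean intra-cluster distance, excluding each sample's zero self-distance.
    intra[mask] = d.sum(axis=1) / max(mask.sum() - 1, 1)
    for j in np.unique(labels):
        if j != k:
            d_kj = pairwise_distances(X[mask], X[labels == j])
            inter[mask] = np.minimum(inter[mask], d_kj.mean(axis=1))
sil = (inter - intra) / np.maximum(intra, inter)

print(np.allclose(sil, silhouette_samples(X, labels)))
```

The peak memory here is one cluster-pair block at a time, which is the same trade the function above makes.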