How can I use KNN or Random Forest models in PyTorch?

Prv*_*dav 1 python-3.x scikit-learn pytorch

This may look like an XY problem, but the situation is that I have a large amount of data and cannot train on it with the resources I have (RAM issues). So I thought I could use PyTorch's batching facilities. However, I want to use methods other than deep learning, such as KNN, random forests, and clustering. Is that possible, or can I use the scikit library from within PyTorch?

Szy*_*zke 6

Update

Currently, there are some sklearn alternatives utilizing the GPU, the most prominent being cuML (link here), provided by rapidsai.
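For illustration, a rough sketch of what that route could look like, assuming cuML's scikit-learn-style estimator API (the class and parameter names below follow that assumption; check the cuML docs for your installed version):

import cupy as cp
from cuml.neighbors import KNeighborsClassifier   # assumed sklearn-like GPU estimators
from cuml.ensemble import RandomForestClassifier

# Synthetic data already living on the GPU; cuML's RandomForest expects float32/int32
X = cp.random.rand(10000, 20).astype(cp.float32)
y = cp.random.randint(0, 2, size=10000).astype(cp.int32)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
rf = RandomForestClassifier(n_estimators=100).fit(X, y)
preds = rf.predict(X)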

Previous answer

I would advise against using PyTorch solely for the purpose of using batches.

Argumentation goes as follows:

  1. scikit-learn has docs about scaling where you can find MiniBatchKMeans, and there are other options such as the partial_fit method or the warm_start argument (as is the case with RandomForest, check this approach); see the sketch after this list.
  2. KNN cannot easily be used without a hand-made implementation with disk caching, as it stores the whole dataset in memory (and you lack RAM). That approach would be horribly inefficient either way; do not try it.
  3. You most probably will not be able to create algorithms on par with those from scikit (at least not solo and not without a considerable amount of work). Your best bet is to go with the quite battle-tested solutions (even though it's still 0.2x currently). It should be possible to get some speed improvements through numba, but that is beyond the scope of this question. Maybe you could utilize CUDA for different algorithms, but that is an even more non-trivial task.
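For point 1, a minimal sketch of the incremental options, where load_chunks() is a hypothetical generator standing in for reading your dataset from disk batch by batch:

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.ensemble import RandomForestClassifier

def load_chunks(n_chunks=5, chunk_size=1000, n_features=20):
    # Hypothetical stand-in for streaming the real dataset from disk
    rng = np.random.default_rng(0)
    for _ in range(n_chunks):
        yield rng.random((chunk_size, n_features)), rng.integers(0, 2, chunk_size)

# partial_fit: true incremental learning, one batch at a time
kmeans = MiniBatchKMeans(n_clusters=8)
for X_chunk, _ in load_chunks():
    kmeans.partial_fit(X_chunk)

# warm_start: grow a RandomForest by fitting extra trees on successive chunks
rf = RandomForestClassifier(n_estimators=10, warm_start=True)
for X_chunk, y_chunk in load_chunks():
    rf.fit(X_chunk, y_chunk)       # only the newly added trees are fitted
    rf.n_estimators += 10          # schedule 10 more trees for the next chunk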

All in all, PyTorch is designed for deep learning computations that make heavy use of CUDA. If you need neural networks, it is one of the best frameworks out there; otherwise go with something like sklearn or another framework that allows incremental training. You can always bridge the two easily with numpy() and a few other calls in PyTorch.
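As a small illustration of that bridging (the tensor below is just a stand-in for the output of some PyTorch pipeline):

import torch
from sklearn.cluster import KMeans

features = torch.randn(1000, 16)                 # placeholder PyTorch tensor
# call .detach().cpu() first if the tensor lives on the GPU or requires grad
labels = KMeans(n_clusters=4).fit_predict(features.numpy())
labels_t = torch.from_numpy(labels)              # back to a tensor if needed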

EDIT: I have found a KNN implementation that might fit your requirements in this github repository


mod*_*itt 2

Yes, it is possible, but you would have to implement them yourself. PyTorch has the primitives for these methods because it implements its own kind of tensors and so on; however, the library only provides an abstraction layer for deep learning methods. For example, a very naive KNN implementation (on a matrix built from the vector distances to the current point) would be:

import torch

def KNN(mat, k):
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.b
    mat = mat.float()
    mat_square = torch.mm(mat, mat.t())
    diag = torch.diagonal(mat_square)
    diag = diag.expand_as(mat_square)
    dist_mat = diag + diag.t() - 2 * mat_square
    # Distances from the query point (last row) to every other point
    dist_col = dist_mat[-1, :-1]
    val, index = dist_col.topk(k, largest=False, sorted=True)
    return val, index
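A quick usage sketch of the snippet above, assuming the convention (implied by dist_mat[-1, :-1]) that the query point is the last row of the input matrix:

points = torch.randn(100, 3)          # reference points
query = torch.randn(1, 3)             # point whose neighbours we want
mat = torch.cat([points, query])      # query appended as the last row
vals, idx = KNN(mat, k=5)             # distances and row indices of the 5 nearest points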

You should use scikit-learn if you want simple, out-of-the-box solutions.