python-3.x · scikit-learn · pytorch
This might seem like an XY problem, but originally I had a large amount of data and could not train it within the given resources (a RAM issue). So I thought I could use PyTorch's batching facilities. However, I want to use methods other than deep learning, such as KNN, random forest, and clustering. Is that possible, or can I use the scikit-learn library from within PyTorch?
Currently there are some sklearn alternatives utilizing the GPU, the most prominent being cuML (link here) provided by rapidsai.

I would advise against using PyTorch solely for the purpose of using batches. The argument goes as follows: scikit-learn has documentation about scaling, where one can find MiniBatchKMeans, and there are other options like the partial_fit method or warm_start arguments (as is the case with RandomForest; check this approach). Note also that PyTorch on CPU tends to be slow for this kind of workload (around 0.2x currently). It should be possible to get some speed improvements through numba, but that's beyond the scope of this question. Maybe you could utilize CUDA for different algorithms, but that's an even more non-trivial task.

All in all, PyTorch is designed for deep learning computations that make heavy use of CUDA. If you need neural networks, this framework is one of the best out there; otherwise, go with something like sklearn or other frameworks that allow incremental training. You can always bridge the two easily via numpy() and a few other calls in PyTorch.
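To make the incremental-training point concrete, here is a minimal sketch of batched training with scikit-learn's partial_fit. The data here is hypothetical toy data standing in for a dataset too large for RAM; in practice each batch would be loaded from disk.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy stand-in data: 1000 samples, 20 features, label depends on feature 0.
rng = np.random.RandomState(0)
X = rng.randn(1000, 20)
y = (X[:, 0] > 0).astype(int)

clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all classes must be passed on the first call

# Feed the data in batches of 100 rows instead of all at once.
for start in range(0, len(X), 100):
    batch_X = X[start:start + 100]
    batch_y = y[start:start + 100]
    clf.partial_fit(batch_X, batch_y, classes=classes)
```

MiniBatchKMeans exposes the same partial_fit interface for clustering, so the loop above carries over unchanged.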
EDIT: I found a KNN implementation that might fit your requirements in this GitHub repository.
Yes, it is possible, but you would have to implement them yourself. PyTorch has the primitives for these methods because it implements its own kind of tensors and so on; however, the library only provides an abstraction layer for deep learning methods. For example, a very simple KNN implementation (using a matrix of distances from the current point's vector) could look like this:
import torch

def KNN(X, k):
    # Pairwise squared Euclidean distances via the identity
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.b
    X = X.float()
    mat_square = torch.mm(X, X.t())
    diag = torch.diagonal(mat_square)
    diag = diag.expand_as(mat_square)
    dist_mat = diag + diag.t() - 2 * mat_square
    # Squared distances from the last point to all the others
    dist_col = dist_mat[-1, :-1]
    val, index = dist_col.topk(k, largest=False, sorted=True)
    return val, index
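As a quick sanity check, here is a usage sketch on a few made-up 2D points (the function is repeated so the snippet runs standalone; the data values are my own invention, not from the answer):

```python
import torch

def KNN(X, k):
    # Same distance-matrix KNN as above.
    X = X.float()
    mat_square = torch.mm(X, X.t())
    diag = torch.diagonal(mat_square)
    diag = diag.expand_as(mat_square)
    dist_mat = diag + diag.t() - 2 * mat_square
    dist_col = dist_mat[-1, :-1]
    val, index = dist_col.topk(k, largest=False, sorted=True)
    return val, index

# The last row is the query point; its 2 nearest neighbours among the
# first three rows should be points 0 and 1.
X = torch.tensor([[0.0, 0.0],
                  [1.0, 0.0],
                  [10.0, 10.0],
                  [0.0, 0.5]])
val, index = KNN(X, k=2)
# index -> tensor([0, 1]); val holds the squared distances 0.25 and 1.25
```

Note that the function returns squared distances, which is fine for ranking neighbours but needs a sqrt if you want actual Euclidean distances.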
You should use scikit-learn if you want a simple, out-of-the-box solution.
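For contrast with the hand-rolled PyTorch version, the out-of-the-box scikit-learn KNN classifier takes only a few lines (using the bundled iris dataset here as stand-in data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small built-in dataset, just to show the API.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
```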