Recommendation system with matrix factorization on huge data gives MemoryError

Question

Vla*_*lad 5 python recommendation-engine scipy pandas matrix-factorization

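I have three database models (from Django) that can be used as input for building a recommendation system:

Users list - with userId, username, email, etc.
Movies list - with movieId, movieTitle, Topics, etc.
Saves list - with userId, movieId and timestamp (this recommendation system will be a bit simpler than the usual approaches found online, because there is no rating score, only the fact that a user has saved a certain movie, and this model contains all the movie saves)

I should still be able to use matrix factorization (MF) to build a recommendation system, even though the rating of an item only takes the values 1 and 0 (saved or not saved).

To use any of the MF algorithms found in either scipy or surprise, I have to create a pandas DataFrame and pivot it so that all userIds become the rows (index) and all movieIds become the columns.

The code snippet I use for this is: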

# usersSet and moviesSet contain only ids of users or movies

zeros = numpy.zeros(shape=(len(usersSet), len(moviesSet)), dtype=numpy.int8)

saves_df = pandas.DataFrame(zeros, index=list(usersSet), columns=list(moviesSet))

for save in savesFromDb.iterator(chunk_size=50000):
    userId = save['user__id']
    movieId = save['movie__id']

    saves_df.at[userId, movieId] = 1


Problems so far:

using DataFrame.loc from pandas to assign values to multiple columns, instead of DataFrame.at, gives a MemoryError. This is why I went with the latter method.
using svds from scipy for MF requires floats or doubles as the values of the DataFrame, and as soon as I do DataFrame.asfptype() I get a MemoryError

Questions:

Given that there are ~100k users, ~120k movies and ~450k saves, what's the best approach to model this in order to use recommendation algorithms but still not get MemoryError?
I also tried using DataFrame.pivot(), but is there a way to build it from 3 different DataFrames? i.e. indexes will be from list(usersSet), columns from list(moviesList) and values by iterating over savesFromDb and seeing where there is a userId -> movieId relationship and adding 1 in the pivot.
Aside from surprise's rating_scale parameter where you can define the rating (in my case would be (0, 1)), is there any other way in terms of algorithm approach or data model structure to leverage the fact that the rating in my case is only 1 or 0 (saved or not saved)?

Answer 1

Moh*_*san 2

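If you have the option of using sparse matrices together with algorithms that accept them, I would strongly recommend sparse matrices to get rid of the memory issues. scipy.sparse.linalg.svds works with scipy sparse matrices.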
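As a rough sketch of what that can look like, the snippet below runs a truncated SVD directly on a sparse matrix and scores unseen movies for one user. The matrix here is random stand-in data, and the sizes, the number of factors `k` and the variable names are all assumptions made for illustration, not values from the question.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Hypothetical stand-in data: random (user, movie) save pairs.
n_users, n_movies, n_saves = 1000, 1200, 5000
rng = np.random.default_rng(0)
rows = rng.integers(0, n_users, size=n_saves)
cols = rng.integers(0, n_movies, size=n_saves)

# svds needs a floating-point matrix, so the 1s are stored as float64.
A = csr_matrix((np.ones(n_saves), (rows, cols)), shape=(n_users, n_movies))
A.data[:] = 1.0  # duplicate pairs are summed on construction; reset them to 1

# Truncated SVD with k latent factors (k is an arbitrary choice here).
k = 10
U, s, Vt = svds(A, k=k)

# Predicted scores of every movie for user 0.
scores = (U[0] * s) @ Vt

# Do not recommend movies the user has already saved.
seen = A[0].indices
scores[seen] = -np.inf
top_movies = np.argsort(scores)[::-1][:5]
print(top_movies)
```

The dense user x movie matrix is never materialized; only the factors `U`, `s` and `Vt` are dense, and their size grows with `k` rather than with the number of users times the number of movies.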

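Here is how you can create a sparse matrix for your case:

Let's say we have 3 users ('a', 'b', 'c') and 3 movies ('aa', 'bb', 'cc'). The save history is as follows:

a saves aa
b saves bb
c saves cc
a saves bb

We need to create a csr_matrix A_sparse such that users represent the rows, movies the columns, and A[i, j] = 1 if user i has saved movie j.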

import numpy as np
from scipy.sparse import csr_matrix

# all known users and movies from the example above
# (in practice these would be the ids in usersSet and moviesSet)
users = ["a", "b", "c"]
movies = ["aa", "bb", "cc"]

# index users and movies by integers
user2int = {u: i for i, u in enumerate(np.unique(users))}
movies2int = {m: i for i, m in enumerate(np.unique(movies))}

# get saved user list and corresponding movie list
saved_users = ["a", "b", "c", "a"]
saved_movies = ["aa", "bb", "cc", "bb"]

# get row and column indices where we need to populate 1's
usersidx = [user2int[u] for u in saved_users]
moviesidx = [movies2int[m] for m in saved_movies]

# Here, we only use a binary flag for data: 1 for all saved instances.
# Instead, one could also use something like the count of saves etc.
data = np.ones(len(saved_users))

# create csr matrix; passing shape keeps users/movies without saves as empty rows/columns
A_sparse = csr_matrix((data, (usersidx, moviesidx)),
                      shape=(len(user2int), len(movies2int)))

print("Sparse array", repr(A_sparse))
# a 3x3 float64 matrix with 4 stored elements in Compressed Sparse Row format

print(A_sparse.data.nbytes)
# 32

print("Dense array", A_sparse.toarray())
# [[1. 1. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]

print(A_sparse.toarray().nbytes)
# 72
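You can see that, since roughly half of our data points are zeros, the values stored in the sparse matrix take up about half the size of the numpy ndarray. Note that A_sparse.data.nbytes counts only the stored values; a CSR matrix also keeps two integer index arrays (indices and indptr), so on a tiny example like this there is no real saving. The memory savings grow with the percentage of zeros in the matrix, and your matrix is almost entirely zeros.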
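To put numbers on this at the scale mentioned in the question (~100k users, ~120k movies, ~450k saves), here is a small sketch that builds a CSR matrix of that shape from random stand-in pairs and compares its footprint with the size a dense array would need. The random data and the choice of float32 are assumptions for illustration; the dense sizes are computed arithmetically and never allocated.

```python
import numpy as np
from scipy.sparse import csr_matrix

n_users, n_movies, n_saves = 100_000, 120_000, 450_000

# Hypothetical stand-in for the real (userId, movieId) save pairs.
rng = np.random.default_rng(0)
rows = rng.integers(0, n_users, size=n_saves)
cols = rng.integers(0, n_movies, size=n_saves)
data = np.ones(n_saves, dtype=np.float32)

A = csr_matrix((data, (rows, cols)), shape=(n_users, n_movies))
A.data[:] = 1  # duplicate pairs are summed on construction; reset them to 1

# Full CSR footprint: stored values plus both index arrays.
sparse_bytes = A.data.nbytes + A.indices.nbytes + A.indptr.nbytes

# Size of the equivalent dense arrays, computed without allocating them.
dense_int8_bytes = n_users * n_movies * 1
dense_float64_bytes = n_users * n_movies * 8

print(f"sparse CSR:    {sparse_bytes / 1e6:.1f} MB")
print(f"dense int8:    {dense_int8_bytes / 1e9:.1f} GB")
print(f"dense float64: {dense_float64_bytes / 1e9:.1f} GB")
```

The dense int8 array alone is 12 GB and converting it to float64 needs 96 GB, which is consistent with the MemoryError on the float conversion, while the sparse version stays in the range of a few megabytes.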
