Sparse matrix implementation for user, item pairs

Question


Key*_*i0r 2 numpy matrix scipy sparse-matrix python-2.7

I have a dataset with millions of records, covering 100,000 users who bought a subset of 20,000 items, in the following format:

<user1, item1>
<user1, item12>
...
<user100,000, item>


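I need to keep track of a matrix of size (items x users) = (20,000 x 100,000) that holds 1 if the user bought the item and 0 otherwise. I am currently using a conventional numpy array, but processing it in later steps takes a long time. Can anyone recommend an efficient way to do this with one of the SciPy sparse matrices that still allows index-based lookups into the matrix?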

Answer 1

3kt*_*3kt 5

Here is a solution that builds a dense pivot table of 0s and 1s, then creates the equivalent sparse matrix. I chose lil_matrix, but other options exist.

import numpy as np
from scipy import sparse

ar = np.array([['user1', 'product1'], ['user2', 'product2'], ['user3', 'product3'], ['user3', 'product1']])

rows, r_pos = np.unique(ar[:,0], return_inverse=True)
cols, c_pos = np.unique(ar[:,1], return_inverse=True)

pivot_table = np.zeros((len(rows), len(cols)))
pivot_table[r_pos, c_pos] = 1

print(pivot_table)

# Convert the dense pivot table to a sparse matrix
s = sparse.lil_matrix(pivot_table)

# The indices of the non-zero entries are available through nonzero()
print(s.nonzero())


This gives:

[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 1.  0.  1.]]
(array([0, 1, 2, 2], dtype=int32), array([0, 1, 0, 2], dtype=int32))
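Since the question asks for index-based lookups, here is a minimal sketch of querying the sparse matrix by user and product label. The dictionaries `user_index` and `product_index` and the helper `has_bought` are hypothetical names introduced for illustration; they are not part of the original answer.

```python
import numpy as np
from scipy import sparse

ar = np.array([['user1', 'product1'], ['user2', 'product2'],
               ['user3', 'product3'], ['user3', 'product1']])

# Same construction as above: unique labels plus their integer positions
rows, r_pos = np.unique(ar[:, 0], return_inverse=True)
cols, c_pos = np.unique(ar[:, 1], return_inverse=True)

pivot_table = np.zeros((len(rows), len(cols)))
pivot_table[r_pos, c_pos] = 1
s = sparse.lil_matrix(pivot_table)

# Hypothetical helper mappings: label -> integer position in the matrix
user_index = {name: i for i, name in enumerate(rows)}
product_index = {name: j for j, name in enumerate(cols)}

def has_bought(user, product):
    """Return True if the (user, product) entry is stored as non-zero."""
    return bool(s[user_index[user], product_index[product]] != 0)

bought_31 = has_bought('user3', 'product1')
bought_12 = has_bought('user1', 'product2')
print(bought_31, bought_12)
```

The mappings rely on `np.unique` returning the labels in sorted order, so position `i` in `rows` corresponds to row `i` of the matrix.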


Addendum

In case it is relevant, here is another solution that uses pandas instead of scipy:

In [34]: import pandas as pd

In [35]: df = pd.DataFrame([['user1', 'product1'], ['user2', 'product2'], ['user3', 'product3'], ['user3', 'product1']], columns = ['user', 'product'])

In [36]: df
Out[36]: 
    user   product
0  user1  product1
1  user2  product2
2  user3  product3
3  user3  product1

In [37]: df.groupby(['user', 'product']).size().unstack(fill_value=0)
Out[37]: 
product  product1  product2  product3
user                                 
user1           1         0         0
user2           0         1         0
user3           1         0         1


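Also, note that this counts how many times each customer bought each product, rather than storing only a 0/1 flag (which may be interesting, depending on your use case and dataset).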

You can still search your data with this library.
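As a rough illustration of such a search, the sketch below rebuilds the same table and looks entries up by label with `.loc`. The variable names (`table`, `count`, `user3_products`, `buyers`) are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame([['user1', 'product1'], ['user2', 'product2'],
                   ['user3', 'product3'], ['user3', 'product1']],
                  columns=['user', 'product'])

# Same pivot as above: one row per user, one column per product
table = df.groupby(['user', 'product']).size().unstack(fill_value=0)

# Look up a single cell by its labels
count = table.loc['user3', 'product1']

# All products bought by one user
user3_row = table.loc['user3']
user3_products = user3_row[user3_row > 0].index.tolist()

# All users who bought a given product
buyers = table.index[table['product1'] > 0].tolist()

print(count, user3_products, buyers)
```

Note that this table is dense, so for the full 100,000 x 20,000 case it will likely use far more memory than a sparse matrix.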

You can make the sparse matrix directly with `s = sparse.coo_matrix((np.ones(r_pos.shape,int), (r_pos, c_pos)))`. `s.nonzero()` (essentially) returns the `row` and `col` attributes of the `coo` matrix. (3 upvotes)
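A minimal sketch of that suggestion, skipping the dense pivot table entirely. Converting to CSR and resetting the stored values to 1 are additions made here, on the assumption that only a 0/1 flag is wanted and that row-based lookups are needed; the comment itself only proposes the `coo_matrix` call.

```python
import numpy as np
from scipy import sparse

ar = np.array([['user1', 'product1'], ['user2', 'product2'],
               ['user3', 'product3'], ['user3', 'product1']])

rows, r_pos = np.unique(ar[:, 0], return_inverse=True)
cols, c_pos = np.unique(ar[:, 1], return_inverse=True)

# Build the sparse matrix straight from the (row, col) positions
data = np.ones(r_pos.shape, dtype=int)
s = sparse.coo_matrix((data, (r_pos, c_pos)), shape=(len(rows), len(cols)))

# COO does not support s[i, j]; CSR does, and duplicates are summed on conversion
csr = s.tocsr()
csr.data[:] = 1  # assumption: keep a 0/1 flag even if a pair appears several times

dense = csr.toarray()        # only sensible for small examples like this one
user3_cols = csr[2].indices  # column positions of the products bought by row 2 (user3)

print(dense)
print(cols[user3_cols])
```

For the sizes in the question, only the non-zero entries are stored, so memory grows with the number of distinct purchases rather than with 20,000 x 100,000 cells.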

Archived:	9 years, 3 months ago
Views:	1,823
Last activity:	9 years, 3 months ago