How to solve a slow groupby on a sparse matrix?

Question

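I have a large matrix (about 200 million rows) that lists the actions occurring each day (there are about 10,000 possible actions). My end goal is to build a co-occurrence matrix showing which actions happen on the same day.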
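
To make the goal concrete, here is a minimal sketch of a co-occurrence matrix on a toy table. The values and the names `toy`, `indicator`, and `cooc_toy` are illustrative assumptions, not part of the original question, and this dense approach is only meant for tiny inputs.

```python
import pandas as pd

# Toy data (illustrative values only): which action happened on which day.
toy = pd.DataFrame({
    "date":   ["01", "01", "02", "02", "03"],
    "action": [100, 101, 100, 101, 777],
})

# Day-by-action indicator table: 1 if the action occurred that day, else 0.
indicator = (pd.crosstab(toy["date"], toy["action"]) > 0).astype(int)

# Entry (a, b) counts the days on which both a and b occurred;
# the diagonal counts the days on which each action occurred at all.
cooc_toy = indicator.T.dot(indicator)
print(cooc_toy)
```

Actions 100 and 101 share two days here, so their off-diagonal entry is 2, while action 777 never shares a day with the others.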

Here is a sample dataset:

import pandas as pd

data = {'date':   ['01', '01', '01', '02','02','03'],
        'action': [100, 101, 989855552, 100, 989855552, 777]}
df = pd.DataFrame(data, columns = ['date','action'])

I tried building a sparse matrix with pd.get_dummies, but unravelling the matrix and running groupby on it is extremely slow: 5,000 rows alone take 6 minutes.

# Create a sparse matrix of dummies
dum = pd.get_dummies(df['action'], sparse = True)
df = df.drop(['action'], axis = 1)
df = pd.concat([df, dum], axis = 1)

# Use groupby to get a single row for each date, showing whether each action occurred.
# The groupby command here is the bottleneck.
cols = list(df.columns)
del cols[0]
df = df.groupby('date')[cols].max()

# Create a co-occurrence matrix by using dot-product of sparse matrices
cooc = df.T.dot(df)

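I also tried:

1. Getting the dummies in a non-sparse format;
2. Using groupby for the aggregation;
3. Converting to a sparse format before the matrix multiplication.

But I fail at step 1, because there is not enough RAM to create such a large matrix.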
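
A rough back-of-the-envelope calculation shows why step 1 cannot fit in memory. The per-cell and per-entry byte sizes below are assumptions chosen for illustration; the real footprint depends on the dtype and the storage format actually used.

```python
# Approximate size of a dense dummy matrix for the stated problem size.
# Assumption (not from the question): one byte per cell, e.g. bool or uint8.
n_rows = 200_000_000      # ~200 million rows, as stated in the question
n_actions = 10_000        # ~10,000 possible actions
bytes_per_cell = 1

dense_bytes = n_rows * n_actions * bytes_per_cell
print(f"dense dummies: about {dense_bytes / 1e12:.0f} TB")

# A sparse layout stores only the non-zero cells: one per input row here.
# Assumed cost per stored entry: 1-byte value + 4-byte row index
# + 4-byte column index (a COO-like layout).
sparse_bytes = n_rows * (1 + 4 + 4)
print(f"sparse dummies: about {sparse_bytes / 1e9:.1f} GB")
```

Under these assumptions the dense matrix needs on the order of terabytes, while a sparse representation of the same information stays in the low gigabytes.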

I would appreciate your help.

Answer 1

Dud*_*ein 3

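Based on this post, I arrived at an answer that uses only sparse matrices. The code is fast: about 10 seconds for 10 million rows (my previous code took 6 minutes for 5,000 rows and did not scale).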

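The time and memory savings come from staying with sparse matrices until the very last step, where the (already small) co-occurrence matrix has to be converted to dense form before exporting.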

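import pandas as pd
from pandas.api.types import CategoricalDtype
from scipy.sparse import csr_matrix

## Get unique values for date and action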
date_c = CategoricalDtype(sorted(df.date.unique()), ordered=True)
action_c = CategoricalDtype(sorted(df.action.unique()), ordered=True)

## Add an auxiliary variable
df['count'] = 1

## Define a sparse matrix
row = df.date.astype(date_c).cat.codes
col = df.action.astype(action_c).cat.codes
sparse_matrix = csr_matrix((df['count'], (row, col)),
                shape=(date_c.categories.size, action_c.categories.size))

## Compute dot product with sparse matrix
cooc_sparse = sparse_matrix.T.dot(sparse_matrix)

## Unravel co-occurrence matrix into dense shape
cooc = pd.DataFrame(cooc_sparse.todense(), 
       index = action_c.categories, columns = action_c.categories)
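
One caveat worth checking: csr_matrix sums duplicate (row, column) entries, so if the same action is logged more than once on a given day, the product above is weighted by those counts instead of matching the groupby(...).max() indicator from the question. The following is a minimal sketch of one way to clip the counts to 0/1 first. The names `df_demo`, `counts`, `indicator`, and `cooc_demo`, the extra repeated row, and the use of pd.factorize in place of CategoricalDtype are assumptions made for illustration, not the answer author's code.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Sample data from the question, plus one repeated (date, action) pair
# appended here for illustration.
df_demo = pd.DataFrame({
    "date":   ["01", "01", "01", "02", "02", "03", "01"],
    "action": [100, 101, 989855552, 100, 989855552, 777, 100],
})

# pd.factorize maps each distinct value to an integer code.
row, dates = pd.factorize(df_demo["date"], sort=True)
col, actions = pd.factorize(df_demo["action"], sort=True)

# Duplicate (row, col) pairs are summed, so the repeated pair is stored as 2.
counts = csr_matrix((np.ones(len(df_demo), dtype=np.int64), (row, col)),
                    shape=(len(dates), len(actions)))

# Clip every stored value to 1 so each cell means "occurred that day".
indicator = counts.copy()
indicator.data[:] = 1

# Sparse product, converted to dense only at the end.
cooc_demo = pd.DataFrame((indicator.T @ indicator).toarray(),
                         index=actions, columns=actions)
print(cooc_demo)
```

With the clipping step, each diagonal entry is the number of days on which an action occurred, and each off-diagonal entry is the number of days two actions shared. If every (date, action) pair is already unique in the input, the clipping changes nothing.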
