How to build a numpy array from multiple vectors with data aligned by id

Question

Tin*_*tin 2 python numpy machine-learning scikit-learn

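I am using Python, numpy and scikit-learn. I have data stored in an SQL table as keys and values. I retrieve it as a list of tuples: [(id, value), ...]. Each id appears only once in a list, and the tuples are sorted by ascending id. This process is repeated several times, so I end up with multiple lists of key: value pairs, like this: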

dataset = []
for sample in samples:
    listOfTuplePairs = getDataFromSQL(sample)    # get a [(id, value),...] list
    dataset.append(listOfTuplePairs)

Keys may repeat across samples, and the rows may differ in length. An example dataset might be:

dataset = [[(1, 0.13), (2, 2.05)],
           [(2, 0.23), (4, 7.35), (5, 5.60)],
           [(2, 0.61), (3, 4.45)]]

As you can see, each row is a sample, and some ids (2 in this case) appear in more than one sample.

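Problem: I want to build a single (possibly sparse) numpy array suitable for processing with scikit-learn. The values associated with a particular key (id) should line up in the same "column" (if that is the right term) across samples, so that the matrix for the example above looks like this: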

    ids =     1    2     3      4    5
          ------------------------------
dataset = [(0.13, 2.05, null, null, null),
           (null, 0.23, null, 7.35, 5.60),
           (null, 0.61, 4.45, null, null)]

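As you can see, I also want to strip the ids out of the matrix (although I need to keep a list of them so that I know what the values in the matrix relate to). Each initial list of key: value pairs may contain several thousand rows, and there may be several thousand samples, so the resulting matrix could be very large. Please provide answers that take into account speed (within the limits of Python), memory efficiency and code clarity.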

Many, many thanks in advance for any help.

Answer 1

Div*_*kar 5

Here is a NumPy-based approach that builds a sparse coo_matrix, with memory efficiency in mind -

import numpy as np
from scipy.sparse import coo_matrix

# Construct row IDs
lens = np.array([len(item) for item in dataset])
shifts_arr = np.zeros(lens.sum(),dtype=int)
shifts_arr[lens[:-1].cumsum()] = 1
row = shifts_arr.cumsum()

# Extract values from dataset into a NumPy array
arr = np.concatenate(dataset)

# Get the unique column IDs to be used for col-indexing into output array
col = np.unique(arr[:,0],return_inverse=True)[1]

# Determine the output shape
out_shp = (row.max()+1,col.max()+1)

# Finally create a sparse matrix with the row,col indices and the second column of arr
sp_out = coo_matrix((arr[:,1],(row,col)), shape=out_shp)
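
The question also asks to keep the list of ids behind the columns, which the snippet above discards by taking only element [1] of the np.unique result. A minimal sketch of one way to keep both, assuming a hypothetical helper name build_sparse, integer ids, and at least one (id, value) pair overall. It swaps the cumsum-based row construction for np.repeat, which is an alternative to the method above rather than part of it, so that samples with no pairs still get their own (empty) row:

```python
import numpy as np
from scipy.sparse import coo_matrix


def build_sparse(dataset):
    """Return (matrix, ids): a COO matrix plus the id behind each column.

    build_sparse is a hypothetical name used for illustration only.
    """
    lens = np.array([len(sample) for sample in dataset])
    # Row index of every pair: sample i is repeated len(sample i) times,
    # so a sample with no pairs simply contributes no entries.
    row = np.repeat(np.arange(len(dataset)), lens)
    # Stack all (id, value) pairs into one (n_pairs, 2) float array,
    # skipping empty samples, which np.concatenate cannot stack.
    arr = np.concatenate([sample for sample in dataset if len(sample)])
    # ids is sorted and unique; col maps each pair to its position in ids.
    ids, col = np.unique(arr[:, 0], return_inverse=True)
    shape = (len(dataset), len(ids))
    matrix = coo_matrix((arr[:, 1], (row, col)), shape=shape)
    # Assumption: ids are integers, so the float column is cast back.
    return matrix, ids.astype(int)


dataset = [[(1, 0.13), (2, 2.05)],
           [(2, 0.23), (4, 7.35), (5, 5.60)],
           [(2, 0.61), (3, 4.45)]]
matrix, ids = build_sparse(dataset)
```

Here ids[j] is the id stored in column j of matrix. If a CSR matrix is more convenient downstream, matrix.tocsr() converts it.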

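Note that if the IDs themselves are meant to be the column numbers in the output array, you can replace the np.unique call that gives us those unique IDs with something like this -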

col = (arr[:,0]-1).astype(int)

This should give a noticeable performance boost, since it skips the sorting that np.unique performs.
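
The shortcut relies on the ids being positive integers starting at 1, and any id that never occurs still gets a column of its own. A small sketch of that effect, assuming a made-up two-sample dataset (gappy is a hypothetical name) and using np.repeat for the row indices:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical data: ids 2 and 3 never occur.
gappy = [[(1, 0.5)],
         [(4, 2.0)]]

lens = [len(sample) for sample in gappy]
row = np.repeat(np.arange(len(gappy)), lens)
arr = np.concatenate(gappy)

# Direct mapping: id k lands in column k - 1, with no np.unique call.
col = (arr[:, 0] - 1).astype(int)
shape = (len(gappy), col.max() + 1)
sp_direct = coo_matrix((arr[:, 1], (row, col)), shape=shape)

# The id behind column j is j + 1, so the id list needs no lookup table.
direct_ids = np.arange(1, shape[1] + 1)
```

The matrix has four columns although only two ids occur. In a sparse format the empty columns cost very little memory, but they do change the number of features a downstream model sees.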

Sample run -

In [264]: dataset = [[(1, 0.13), (2, 2.05)],
     ...:            [(2, 0.23), (4, 7.35), (5, 5.60)],
     ...:            [(2, 0.61), (3, 4.45)]]

In [265]: sp_out.todense() # Using .todense() to show output
Out[265]: 
matrix([[ 0.13,  2.05,  0.  ,  0.  ,  0.  ],
        [ 0.  ,  0.23,  0.  ,  7.35,  5.6 ],
        [ 0.  ,  0.61,  4.45,  0.  ,  0.  ]])
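
The zeros in this output are the sparse matrix's implicit entries, so once densified a missing id looks the same as a stored value of 0.0, whereas the question's target layout shows null. If that distinction matters and the data is small enough to hold densely, one option is a NaN-filled array. A sketch under those assumptions, with the row and column indices rebuilt here (via np.repeat) so the block stands alone:

```python
import numpy as np

dataset = [[(1, 0.13), (2, 2.05)],
           [(2, 0.23), (4, 7.35), (5, 5.60)],
           [(2, 0.61), (3, 4.45)]]

lens = [len(sample) for sample in dataset]
row = np.repeat(np.arange(len(dataset)), lens)
arr = np.concatenate(dataset)
ids, col = np.unique(arr[:, 0], return_inverse=True)

# Start from an all-NaN array and write the known values into place.
dense = np.full((len(dataset), len(ids)), np.nan)
dense[row, col] = arr[:, 1]
```

This gives up the memory savings of the sparse format, so it only suits cases where the full rows-by-ids array fits in memory.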
