How to apply an operation to a vector with offsets

Question

Consider the following pd.DataFrame:

import numpy as np
import pandas as pd

start_end = pd.DataFrame([[(0, 3), (4, 5), (6, 12)], [(7, 10), (11, 90), (91, 99)]])
values = np.random.rand(1, 99)


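start_end is a pd.DataFrame of shape (X, Y), where each value is a tuple (start_location, end_location) indexing into the values vector. Put another way, the value in a given cell is a vector of varying length.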

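Question

If I want to find, for example, the mean of the vector values for each cell in the pd.DataFrame, how can I do this in a cost-effective way?

I managed to achieve this with an .apply function, but it is quite slow.

I suppose I need to find some way to represent it as a numpy array and then map it back to the 2D DataFrame, but I can't figure out how.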
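
For reference, the slow baseline described above might look roughly like the following. This is only a sketch of one possible `.apply`-based version, not necessarily the asker's actual code; the names `flat` and `cell_means` are made up for illustration.

```python
import numpy as np
import pandas as pd

start_end = pd.DataFrame([[(0, 3), (4, 5), (6, 12)], [(7, 10), (11, 90), (91, 99)]])
values = np.random.rand(1, 99)

flat = values.ravel()  # flatten the (1, 99) array to 1D

# Baseline: one Python-level slice and mean per cell (simple, but slow on large frames).
cell_means = start_end.apply(lambda col: col.map(lambda se: flat[se[0]:se[1]].mean()))
print(cell_means.shape)  # same (2, 3) layout as start_end
```

Every cell triggers a separate Python function call here, which is why this does not scale well.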

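Notes

The distance between start and end can vary, and there may be outliers.
The start/end of a cell never overlaps with that of other cells (it would be interesting to see whether this precondition affects the speed of the solution).

Generalisation of the problem

More generally, this is a recurring problem for me: how to build a 3D array in which one dimension has unequal lengths, and then reduce it to a 2D matrix through some transformation function (mean, min, etc.).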

Answer 1

Div*_*kar 5

Prospective approach

Looking at your sample data:

In [64]: start_end
Out[64]: 
         0         1         2
0   (1, 6)    (4, 5)   (6, 12)
1  (7, 10)  (11, 12)  (13, 19)


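The intervals within each row are indeed non-overlapping, but this does not hold across the whole dataset.

Now, we have np.ufunc.reduceat, which gives us a ufunc reduction for each slice: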

ufunc.reduce(ar[indices[i]: indices[i + 1]])


as long as indices[i] < indices[i+1]. Otherwise, the i-th result is simply ar[indices[i]].

So, with ufunc.reduceat(ar, indices), we would get:

[ufunc.reduce(ar[indices[0]: indices[1]]), ufunc.reduce(ar[indices[1]: indices[2]]), ..]

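
To make the reduceat semantics concrete, here is a tiny sketch on made-up data (the array `ar` and the index lists are purely illustrative):

```python
import numpy as np

ar = np.arange(10)            # [0, 1, ..., 9]
indices = np.array([0, 3, 7])

# One reduction per index: ar[0:3], ar[3:7], and ar[7:] for the last index.
sums = np.add.reduceat(ar, indices)
expected = np.array([ar[0:3].sum(), ar[3:7].sum(), ar[7:].sum()])
print(sums, expected)

# When indices[i] >= indices[i + 1], the i-th result is just ar[indices[i]].
print(np.add.reduceat(ar, [5, 2]))  # first entry is ar[5]; second is ar[2:].sum()
```

Note that the last index always reduces through to the end of the array.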

In our case, for each tuple (x, y), we know that x < y. In the stacked version, we have:

[(x1,y1), (x2,y2), (x3,y3), ...]


If we flatten it, it becomes:

[x1,y1,x2,y2,x3,y3, ...]


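So we might not have y1 < x2, but that's fine, because we don't need the ufunc reduction for that pair, nor for similar pairs such as (y2, x3). Those results can simply be skipped with a step-sized slicing of the final output.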
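
As an illustration of that skipping, the sketch below (on made-up data, with one deliberately overlapping pair) shows the full reduceat output and which half of it is kept:

```python
import numpy as np

a = np.arange(20, dtype=float)
start_end = np.array([[1, 3], [4, 5], [6, 12], [7, 10]])  # note 12 > 7: pairs overlap

flat_idx = start_end.ravel()               # [1, 3, 4, 5, 6, 12, 7, 10]
all_sums = np.add.reduceat(a, flat_idx)    # one result per flattened index

wanted = all_sums[::2]    # reductions over a[x_k:y_k]
unused = all_sums[1::2]   # reductions starting at y_k; discarded

loop_sums = np.array([a[i:j].sum() for i, j in start_end])
print(np.allclose(wanted, loop_sums))
```

The even positions hold the sums we care about; the odd positions hold whatever reduceat produced starting at each y, which is ignored.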

Hence, we would have:

# Inputs : a (1D array), start_end (2D array of shape (N,2))
lens = start_end[:,1]-start_end[:,0]
out = np.add.reduceat(a, start_end.ravel())[::2]/lens


The np.add.reduceat() part gives us the sliced summations. We then need to divide by lens to get the mean.

Sample run -

In [47]: a
Out[47]: 
array([0.49264042, 0.00506412, 0.61419663, 0.77596769, 0.50721381,
       0.76943416, 0.83570173, 0.2085408 , 0.38992344, 0.64348176,
       0.3168665 , 0.78276451, 0.03779647, 0.33456905, 0.93971763,
       0.49663649, 0.4060438 , 0.8711461 , 0.27630025, 0.17129342])

In [48]: start_end
Out[48]: 
array([[ 1,  3],
       [ 4,  5],
       [ 6, 12],
       [ 7, 10],
       [11, 12],
       [13, 19]])

In [49]: [np.mean(a[i:j]) for (i,j) in start_end]
Out[49]: 
[0.30963037472653104,
 0.5072138121177008,
 0.5295464559328862,
 0.41398199978967815,
 0.7827645134019902,
 0.5540688880441684]

In [50]: lens = start_end[:,1]-start_end[:,0]
    ...: out = np.add.reduceat(a, start_end.ravel())[::2]/lens

In [51]: out
Out[51]: 
array([0.30963037, 0.50721381, 0.52954646, 0.413982  , 0.78276451,
       0.55406889])

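
One caveat worth hedging on: as far as I know, reduceat raises an IndexError when any index equals len(a), and the question's sample has the pair (91, 99) against 99 values. A possible alternative for sums and means only (it does not carry over to min or max) is to use prefix sums instead. This is a different technique from the reduceat approach above, and `segment_means_cumsum` is a hypothetical helper name:

```python
import numpy as np

def segment_means_cumsum(a, start_end):
    """Mean of a[s:e] for each (s, e) row, via prefix sums (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    start_end = np.asarray(start_end)
    starts, ends = start_end[:, 0], start_end[:, 1]
    # csum[k] holds a[:k].sum(), so a[s:e].sum() == csum[e] - csum[s].
    csum = np.concatenate(([0.0], np.cumsum(a)))
    return (csum[ends] - csum[starts]) / (ends - starts)

rng = np.random.default_rng(0)
a = rng.random(99)
start_end = np.array([[0, 3], [4, 5], [6, 12], [7, 10], [11, 90], [91, 99]])
out = segment_means_cumsum(a, start_end)
print(np.allclose(out, [a[i:j].mean() for i, j in start_end]))
```

Differences of large running sums can lose some floating-point precision on very long arrays, so treat this as a sketch rather than a drop-in replacement.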

For completeness, referring back to the given sample, the conversion steps are:

# Given start_end as df and values as a 2D array
start_end = np.vstack(np.concatenate(start_end.values)) 
a = values.ravel()

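
Putting the pieces together on the question's data, an end-to-end sketch might look like this. The one-element zero pad and the reshape back into a DataFrame are my own additions, not part of the answer; the names `se`, `a_pad` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

start_end_df = pd.DataFrame([[(0, 3), (4, 5), (6, 12)], [(7, 10), (11, 90), (91, 99)]])
values = np.random.rand(1, 99)

# Conversion steps from the answer: (N, 2) index array and a 1D value array.
se = np.vstack(np.concatenate(start_end_df.values))
a = values.ravel()

# Assumption (not in the answer): pad one zero so an end index equal to len(a) is valid.
a_pad = np.append(a, 0.0)

lens = se[:, 1] - se[:, 0]
means = np.add.reduceat(a_pad, se.ravel())[::2] / lens

# Map the flat result back onto the original 2D layout.
result = pd.DataFrame(means.reshape(start_end_df.shape),
                      index=start_end_df.index, columns=start_end_df.columns)
print(result)
```

Padding with zero is harmless for a sum because the padded element can only land in a discarded or trailing reduction; a different ufunc would need a different padding value.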

For other ufuncs that have a reduceat method, we simply replace np.add.reduceat.
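
For instance, per-slice maxima and minima could be sketched as follows (made-up data; no division by the lengths is needed here):

```python
import numpy as np

a = np.array([5.0, 1.0, 4.0, 2.0, 8.0, 3.0, 7.0, 6.0])
start_end = np.array([[0, 3], [3, 5], [5, 7]])  # every end index is < len(a)
idx = start_end.ravel()

seg_max = np.maximum.reduceat(a, idx)[::2]   # max of a[s:e] per row
seg_min = np.minimum.reduceat(a, idx)[::2]   # min of a[s:e] per row
print(seg_max)
print(seg_min)
```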

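@Newskooler Well, you have `values = np.random.rand(1, 99)`, so it is not a vector but a 2D array. We need a 1D array for the processing, hence the flattening with `.ravel()`. If you already have a vector (1D), you can skip the ravel. (2 upvotes)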
