Take the difference between every element in a series and the preceding ones in Python pandas

Mr_*_*s_D 5 python vectorization pandas

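I have a dataframe whose sorted values are labeled by id, and I want to take the difference between the value of the first element of an id and the value of the last element of every preceding id. The code below does what I want: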

import pandas as pd

a = 'a'; b = 'b'; c = 'c'
df = pd.DataFrame(data=[*zip([a, a, a, b, b, c, a], [1, 2, 3, 5, 6, 7, 8])],
                  columns=['id', 'value'])
print(df)
# # take the last value for a particular id
# last_value_for_id = df.loc[df.id.shift(-1) != df.id, :]
# print(last_value_for_id)
current_id = ''; prev_values = {};diffs = {}
for t in df.itertuples(index=False):
    prev_values[t.id] = t.value
    if current_id != t.id:
        current_id = t.id
    else: continue
    for k, v in prev_values.items():
        if k == current_id: continue
        diffs[(k, current_id)] = t.value - v
print(pd.DataFrame(data=diffs.values(), columns=['diff'], index=diffs.keys()))

Prints:

  id  value
0  a      1
1  a      2
2  a      3
3  b      5
4  b      6
5  c      7
6  a      8
     diff
a b     2
  c     4
b c     1
  a     2
c a     1

I would like to do this in a vectorized way. I found a way to get the last element of each run of ids, as follows:

# take the last value for a particular id
last_value_for_id = df.loc[df.id.shift(-1) != df.id, :]
print(last_value_for_id)

Which gives me:

  id  value
2  a      3
4  b      6
5  c      7

But I cannot find a way to use this to get the differences in a vectorized manner.
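
For reference, the `shift` comparison above marks the end of each run of equal ids, and the same comparison with the shift reversed marks the start of each run. A minimal sketch, assuming the six-row frame that the three-row output above corresponds to (no trailing `a` row); the names `is_last`, `is_first`, `last_rows` and `first_rows` are illustrative only:

```python
import pandas as pd

# Assumed data: the six-row frame matching the three-row output above
df = pd.DataFrame({'id': list('aaabbc'), 'value': [1, 2, 3, 5, 6, 7]})

# True where the next row has a different id (or there is no next row)
is_last = df['id'].shift(-1) != df['id']
# True where the previous row has a different id (or there is no previous row)
is_first = df['id'].shift() != df['id']

last_rows = df.loc[is_last]
first_rows = df.loc[is_first]
print(last_rows)
print(first_rows)
```

The first and last values of each run are exactly the two inputs the pairwise differences need.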

Qua*_*ang 4

Depending on how many ids you have, this works for a few thousand of them:

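import numpy as np

# enumerate ids; careful, the order must match the (sorted) groupby index below
ids = [a, b, c]
num_ids = len(ids)

# compute first and last value for each id
f = df.groupby('id').value.agg(['first', 'last'])

# mask for the lower triangle and diagonal (pairs where id i is not before id j)
mask = np.array([[i >= j for j in range(num_ids)] for i in range(num_ids)])

# diff[i, j] = first value of id j minus last value of id i, then mask
first, last = f['first'].to_numpy(), f['last'].to_numpy()
diff = np.where(mask, None, first[None, :] - last[:, None])
diff = pd.DataFrame(diff,
                    index=ids,
                    columns=ids)
# stack into a Series indexed by (i, j); dropna() discards the masked cells
diff.stack().dropna()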

Output:

a  b    2
   c    4
b  c    1
dtype: object
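
The broadcasting step can be seen in isolation in the sketch below. It assumes the original six-row data (without the trailing `a` row), which is what the output above reflects, and it swaps the `np.where(..., None, ...)` masking for `np.triu` plus integer indexing so the result keeps an integer dtype. The level names `from_id` and `to_id` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Assumed data: the original six-row frame, without the trailing 'a' row
df = pd.DataFrame({'id': list('aaabbc'), 'value': [1, 2, 3, 5, 6, 7]})

# first and last value per id, kept in order of appearance
f = df.groupby('id', sort=False)['value'].agg(['first', 'last'])
first = f['first'].to_numpy()
last = f['last'].to_numpy()

# pairwise[i, j] = first value of id j minus last value of id i
pairwise = first[None, :] - last[:, None]

# keep only the pairs where id i comes strictly before id j
keep = np.triu(np.ones(pairwise.shape, dtype=bool), k=1)
rows, cols = np.nonzero(keep)

result = pd.Series(
    pairwise[rows, cols],
    index=pd.MultiIndex.from_arrays([f.index[rows], f.index[cols]],
                                    names=['from_id', 'to_id']),  # hypothetical names
    name='diff',
)
print(result)
```

The intermediate `pairwise` array has one cell per pair of ids, so memory grows quadratically with the number of ids.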

Edit for the updated data:

For the updated data, the approach is similar once we can build the table f:

# create blocks of consecutive id
blocks = df['id'].ne(df['id'].shift()).cumsum()

# groupby
groups = df.groupby(blocks)

# create first and last values
df['fv'] = groups.value.transform('first')
df['lv'] = groups.value.transform('last')

# rebuild f and ids as in the first snippet
# note the column names are now 'fv' and 'lv' instead of 'first' and 'last'
f = df[['id','fv', 'lv']].drop_duplicates()
ids = f['id'].values
num_ids = len(ids)

Output:

a   b     2
    c     4
    a     5
b   c     1
    a     2
c   a     1
dtype: object
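
The snippet for the updated data stops once f and ids are built. The remaining steps might look like the sketch below, which reproduces the pairs listed above. It is not the answerer's exact code: it aggregates once per block instead of using `transform` plus `drop_duplicates`, so two separate blocks that happen to share the same id, first value and last value are not merged. The names `block`, `from_id` and `to_id` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Updated data from the question, including the trailing 'a' row
df = pd.DataFrame({'id': list('aaabbca'), 'value': [1, 2, 3, 5, 6, 7, 8]})

# label each run of consecutive equal ids: a a a b b c a -> 1 1 1 2 2 3 4
blocks = df['id'].ne(df['id'].shift()).cumsum().rename('block')

# one row per block: its id, first value and last value
f = df.groupby(blocks).agg(id=('id', 'first'),
                           fv=('value', 'first'),
                           lv=('value', 'last'))
ids = f['id'].to_numpy()

# pairwise[i, j] = first value of block j minus last value of block i
pairwise = f['fv'].to_numpy()[None, :] - f['lv'].to_numpy()[:, None]

# keep only the pairs where block i comes strictly before block j
rows, cols = np.nonzero(np.triu(np.ones(pairwise.shape, dtype=bool), k=1))

diff = pd.Series(
    pairwise[rows, cols],
    index=pd.MultiIndex.from_arrays([ids[rows], ids[cols]],
                                    names=['from_id', 'to_id']),  # hypothetical names
)
print(diff)
```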

If you want to go one step further and drop the index (a, a), well, I'm lazy :D.
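
One possible way to drop the same-id pairs is to compare the two index levels and keep only the rows where they differ. In this sketch the stacked result is rebuilt by hand from the output listed above, and `diff_no_self` is a made-up name.

```python
import pandas as pd

# The stacked result from the output above, rebuilt by hand for illustration
pairs = [('a', 'b'), ('a', 'c'), ('a', 'a'), ('b', 'c'), ('b', 'a'), ('c', 'a')]
diff = pd.Series([2, 4, 5, 1, 2, 1], index=pd.MultiIndex.from_tuples(pairs))

# compare the two index levels element by element
left = diff.index.get_level_values(0)
right = diff.index.get_level_values(1)

# keep only the pairs whose two ids differ
diff_no_self = diff[left != right]
print(diff_no_self)
```

The five remaining pairs are the ones the loop in the question produces.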