Mr_*_*s_D 5 python vectorization pandas
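I have a dataframe of sorted values labeled by id, and I want the difference between the value of the first element of an id and the value of the last element of every previous id. The code below does what I want: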
import pandas as pd

a = 'a'; b = 'b'; c = 'c'
df = pd.DataFrame(data=[*zip([a, a, a, b, b, c, a], [1, 2, 3, 5, 6, 7, 8])],
                  columns=['id', 'value'])
print(df)
# # take the last value for a particular id
# last_value_for_id = df.loc[df.id.shift(-1) != df.id, :]
# print(last_value_for_id)

current_id = ''; prev_values = {}; diffs = {}
for t in df.itertuples(index=False):
    prev_values[t.id] = t.value
    if current_id != t.id:
        current_id = t.id
    else:
        continue
    for k, v in prev_values.items():
        if k == current_id:
            continue
        diffs[(k, current_id)] = t.value - v
print(pd.DataFrame(data=diffs.values(), columns=['diff'], index=diffs.keys()))
Prints:
  id  value
0  a      1
1  a      2
2  a      3
3  b      5
4  b      6
5  c      7
6  a      8
     diff
a b     2
  c     4
b c     1
  a     2
c a     1
I would like to do this in a vectorized way. I found a way to get the last element of each run of ids, as follows:
# take the last value for a particular id
last_value_for_id = df.loc[df.id.shift(-1) != df.id, :]
print(last_value_for_id)
This gives me:
  id  value
2  a      3
4  b      6
5  c      7
But I cannot find a way to use this to get the differences in a vectorized way.
Depending on how many ids you have, this works with thousands of them:
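import numpy as np
# enumerate ids; be careful to use the same (sorted) order that groupby produces
ids = [a, b, c]
num_ids = len(ids)
# compute first and last value per id
f = df.groupby('id').value.agg(['first', 'last'])
# lower triangle mask (diagonal included): the entries to discard
mask = np.array([[i >= j for j in range(num_ids)] for i in range(num_ids)])
# diff[i, j] = first value of ids[j] - last value of ids[i], then mask
diff = np.where(mask, None,
                f['first'].to_numpy()[None, :] - f['last'].to_numpy()[:, None])
diff = pd.DataFrame(diff,
                    index=ids,
                    columns=ids)
# stack into a Series indexed by (row id, column id), dropping the masked entries
diff.stack().dropna()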
Output:
a  b    2
   c    4
b  c    1
dtype: object
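To make the broadcasting step easier to follow, here is a minimal sketch on plain NumPy arrays. It assumes the first and last values per id that match the output above (`a`: 1 and 3, `b`: 5 and 6, `c`: 7 and 7), and it uses `np.triu` to build the mask, which is a swapped-in alternative to the list comprehension and not part of the code above:

```python
import numpy as np

ids = ['a', 'b', 'c']
first = np.array([1, 5, 7])  # assumed first value of a, b, c
last = np.array([3, 6, 7])   # assumed last value of a, b, c

# pairwise[i, j] = first[j] - last[i], computed for every (i, j) at once
pairwise = first[None, :] - last[:, None]

# keep only the strict upper triangle: column id comes after row id
upper = np.triu(np.ones((len(ids), len(ids)), dtype=bool), k=1)
result = {(ids[i], ids[j]): int(pairwise[i, j])
          for i, j in zip(*np.nonzero(upper))}
print(result)
```

The strict upper triangle is the complement of the `i >= j` mask, so both select the same pairs.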
For the updated data, the approach is similar if we can create the table `f`:
# create blocks of consecutive id
blocks = df['id'].ne(df['id'].shift()).cumsum()
# groupby
groups = df.groupby(blocks)
# create first and last values
df['fv'] = groups.value.transform('first')
df['lv'] = groups.value.transform('last')
# the above f and ids
# note the column name change
f = df[['id','fv', 'lv']].drop_duplicates()
ids = f['id'].values
num_ids = len(ids)
Output, after rerunning the mask and diff steps above with `fv` and `lv` in place of `first` and `last`:
a  b    2
   c    4
   a    5
b  c    1
   a    2
c  a    1
dtype: object
If you want to go further and drop the (a, a) index, well, I'm lazy :D.
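For anyone who wants that last step, here is one possible sketch, not necessarily what the answer had in mind. It rebuilds the per-block table with named aggregation, pairs blocks with `np.triu_indices` instead of a mask, and then filters out pairs whose two ids are equal. The names `block`, `prev_id`, `pairs` and `result` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': list('aaabbca'), 'value': [1, 2, 3, 5, 6, 7, 8]})

# label each run of consecutive equal ids with its own block number
blocks = df['id'].ne(df['id'].shift()).cumsum().rename('block')

# one row per block: its id, first value and last value
f = df.groupby(blocks).agg(id=('id', 'first'),
                           fv=('value', 'first'),
                           lv=('value', 'last'))
ids, fv, lv = f['id'].to_numpy(), f['fv'].to_numpy(), f['lv'].to_numpy()

# all block pairs (i, j) with i strictly before j
i, j = np.triu_indices(len(f), k=1)
pairs = pd.DataFrame({'prev_id': ids[i], 'id': ids[j], 'diff': fv[j] - lv[i]})

# drop pairs such as (a, a) where both blocks carry the same id
pairs = pairs[pairs['prev_id'] != pairs['id']]
result = pairs.set_index(['prev_id', 'id'])['diff']
print(result)
```

On this sample the result has the same five pairs as the loop in the question. If an id recurs more often, the same pair can show up several times, so a further deduplication step may be needed depending on what you want.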