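I have a pandas DataFrame with value columns A, B, C and D, and I want to determine the first and last non-zero column for each row. However, the order of the elements is not the same for all rows: it is determined by the columns item_0, item_1 and item_2.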
While I can easily do this by applying a function to each row, this is too slow for my DataFrame. Is there an elegant, more Pythonic / pandasy way to do this?
Input:
   A  B  C  D item_0 item_1 item_2
0  1  2  0  0      A      B      C
1  0  1  1  0      A      B      C
2  1  0  1  0      A      B      C
3  0  2  0  0      D      A      B
4  1  1  0  1      D      A      B
5  0  0  0  1      D      A      B
Expected output:
   A  B  C  D item_0 item_1 item_2 first last
0  1  2  0  0      A      B      C     A    B
1  0  1  1  0      A      B      C     B    C
2  1  0  1  0      A      B      C     A    C
3  0  2  0  0      D      A      B     B    B
4  1  1  0  1      D      A      B     D    B
5  0  0  0  1      D      A      B     D    D
Update: here is the current code using apply:
import pandas as pd


def first_and_last_for_row(row):
    reference_list = row[["item_0", "item_1", "item_2"]].tolist()
    list_to_sort = (
        row[["A", "B", "C", "D"]].index[row[["A", "B", "C", "D"]] > 0].tolist()
    )
    ordered_list = [l for l in reference_list if l in list_to_sort]
    if len(ordered_list) == 0:
        return None, None
    else:
        return ordered_list[0], ordered_list[-1]


df = pd.DataFrame(
    {
        "A": [1, 0, 1, 0, 1, 0],
        "B": [2, 1, 0, 2, 1, 0],
        "C": [0, 1, 1, 0, 0, 0],
        "D": [0, 0, 0, 0, 1, 1],
        "item_0": ["A", "A", "A", "D", "D", "D"],
        "item_1": ["B", "B", "B", "A", "A", "A"],
        "item_2": ["C", "C", "C", "B", "B", "B"],
    }
)

df[["first", "last"]] = df.apply(first_and_last_for_row, axis=1, result_type="expand")
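Here is a fully vectorized numpy approach. It is not very complex, but it has quite a few steps, so I also provide a commented version of the code:
import numpy as np

cols = ['A', 'B', 'C', 'D']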
a = df[cols].to_numpy()
idx = df.filter(like='item_').replace({k:v for v,k in enumerate(cols)}).to_numpy()
b = a[np.arange(len(a))[:,None], idx] != 0
first = b.argmax(1)
last = b.shape[1]-np.fliplr(b).argmax(1)-1
c = df.filter(like='item_').to_numpy()
df[['first', 'last']] = c[np.arange(len(c))[:,None],
                          np.vstack((first, last)).T]
mask = b[np.arange(len(b)), first]
df[['first', 'last']] = df[['first', 'last']].where(pd.Series(mask, index=df.index))
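The step that does most of the work is the row-wise reordering `a[np.arange(len(a))[:,None], idx]`: the column vector of row numbers broadcasts against `idx`, so every row picks its own columns in its own order. Below is a minimal sketch on a toy array (the names `vals`, `order`, `rows`, `picked` and `same` are illustrative and not part of the code above). It also shows that `np.take_along_axis` should give the same result, if you find that spelling easier to read.

```python
import numpy as np

# Toy values: 2 rows, 4 columns
vals = np.array([[10, 20, 30, 40],
                 [50, 60, 70, 80]])

# Per-row column positions to pick, in the desired order
order = np.array([[0, 1, 2],
                  [3, 0, 1]])

# Row numbers as a column vector, shape (2, 1), broadcast against order (2, 3)
rows = np.arange(len(vals))[:, None]
picked = vals[rows, order]

# Equivalent spelling with take_along_axis
same = np.take_along_axis(vals, order, axis=1)

print(picked)
```

The second row comes out as 80, 50, 60, i.e. columns 3, 0 and 1 of that row.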
Commented code:
cols = ['A', 'B', 'C', 'D']

# convert to numpy array
a = df[cols].to_numpy()
# array([[1, 2, 0, 0],
#        [0, 1, 1, 0],
#        [1, 0, 1, 0],
#        [0, 2, 0, 0],
#        [1, 1, 0, 1],
#        [0, 0, 0, 1]])

# get indexer as numpy array
idx = df.filter(like='item_').replace({k:v for v,k in enumerate(cols)}).to_numpy()
# array([[0, 1, 2],
#        [0, 1, 2],
#        [0, 1, 2],
#        [3, 0, 1],
#        [3, 0, 1],
#        [3, 0, 1]])

# reorder columns and get non-zero
b = a[np.arange(len(a))[:,None], idx] != 0
# array([[ True,  True, False],
#        [False,  True,  True],
#        [ True, False,  True],
#        [False, False,  True],
#        [ True,  True,  True],
#        [ True, False, False]])

# first non-zero
first = b.argmax(1)
# array([0, 1, 0, 2, 0, 0])

# last non-zero
last = b.shape[1]-np.fliplr(b).argmax(1)-1
# array([1, 2, 2, 2, 2, 0])

# get back column names from position
c = df.filter(like='item_').to_numpy()
df[['first', 'last']] = c[np.arange(len(c))[:,None],
                          np.vstack((first, last)).T]

# optional
# define a mask in case argmax selected a zero (row without any non-zero value)
mask = b[np.arange(len(b)), first]
# array([ True,  True,  True,  True,  True,  True])

# blank out the rows where argmax pointed at a zero
df[['first', 'last']] = df[['first', 'last']].where(pd.Series(mask, index=df.index))
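To see why the optional mask matters: `argmax` on a boolean row returns the position of the first `True`, but it also returns 0 when the row has no `True` at all. The small sketch below uses a made-up boolean array (not the data from the question) whose last row has no non-zero value; `has_any` is an illustrative name for what the code above calls `mask`.

```python
import numpy as np

b = np.array([[False, True,  True],
              [True,  False, False],
              [False, False, False]])  # last row: nothing is non-zero

# Position of the first True per row (0 if there is none)
first = b.argmax(1)

# Position of the last True per row: argmax on the left-right flipped array
last = b.shape[1] - np.fliplr(b).argmax(1) - 1

# Check whether the position found by argmax really holds a True
has_any = b[np.arange(len(b)), first]

print(first, last, has_any)
```

For the last row, `first` is 0 and `last` is 2, neither of which is meaningful, and `has_any` is `False` there. As far as I can tell, `b.any(1)` would give the same mask.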
Output:
   A  B  C  D item_0 item_1 item_2 first last
0  1  2  0  0      A      B      C     A    B
1  0  1  1  0      A      B      C     B    C
2  1  0  1  0      A      B      C     A    C
3  0  2  0  0      D      A      B     B    B
4  1  1  0  1      D      A      B     D    B
5  0  0  0  1      D      A      B     D    D
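If you want this as a reusable piece, here is one possible way to wrap the same idea in a function. This is only a sketch: `first_last_nonzero` and its parameter names are hypothetical, and it swaps two details of the code above, using `pd.Index.get_indexer` instead of `replace` to turn names into positions and `np.take_along_axis` instead of fancy indexing for the reordering. It assumes every name in the item columns is one of the value columns.

```python
import numpy as np
import pandas as pd


def first_last_nonzero(df, value_cols, item_cols):
    """Hypothetical helper: first/last non-zero column per row, in item order."""
    a = df[value_cols].to_numpy()
    items = df[item_cols].to_numpy()

    # Map each column name to its position in value_cols
    # (assumption: all names are present, otherwise get_indexer yields -1)
    idx = pd.Index(value_cols).get_indexer(items.ravel()).reshape(items.shape)

    # Reorder the values per row and flag the non-zero ones
    b = np.take_along_axis(a, idx, axis=1) != 0

    first = b.argmax(1)
    last = b.shape[1] - 1 - b[:, ::-1].argmax(1)

    rows = np.arange(len(df))
    any_nonzero = b.any(1)

    # Rows without any non-zero value get None instead of a bogus name
    return pd.DataFrame(
        {
            "first": np.where(any_nonzero, items[rows, first], None),
            "last": np.where(any_nonzero, items[rows, last], None),
        },
        index=df.index,
    )


df = pd.DataFrame(
    {
        "A": [1, 0, 1, 0, 1, 0],
        "B": [2, 1, 0, 2, 1, 0],
        "C": [0, 1, 1, 0, 0, 0],
        "D": [0, 0, 0, 0, 1, 1],
        "item_0": ["A", "A", "A", "D", "D", "D"],
        "item_1": ["B", "B", "B", "A", "A", "A"],
        "item_2": ["C", "C", "C", "B", "B", "B"],
    }
)

res = first_last_nonzero(df, ["A", "B", "C", "D"], ["item_0", "item_1", "item_2"])
print(df.join(res))
```

On the sample data this should reproduce the `first` and `last` columns shown in the output above.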