Lar*_*Cai 8 python numpy dataframe pandas
I have a pandas DataFrame, and for each row I want to keep only the top N (N=3) values and set all other values to NaN.
import pandas as pd
import numpy as np
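# Note: a NumPy array that mixes strings and numbers stores every element as a string
data = np.array([['', 'day1', 'day2', 'day3', 'day4', 'day5'],
                 ['larry', 1, 4, 4, 3, 5],
                 ['gunnar', 2, -1, 3, 4, 4],
                 ['tin', -2, 5, 5, 6, 7]])
df = pd.DataFrame(data=data[1:, 1:],
                  index=data[1:, 0],
                  columns=data[0, 1:])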
print(df)
The output is:
        day1  day2  day3  day4  day5
larry      1     4     4     3     5
gunnar     2    -1     3     4     4
tin       -2     5     5     6     7
I want to get:
        day1  day2  day3  day4  day5
larry    NaN     4     4   NaN     5
gunnar   NaN   NaN     3     4     4
tin      NaN     5   NaN     6     7
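This is similar to "pandas: Keep only top n values and set others to 0", but I need to keep exactly the N highest available values per row; otherwise the mean comes out wrong.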
In the result above, where the value 5 is tied, I want to keep only the first 5.
You can use np.unique to sort the values and find the 5th largest, then apply where:
uniques = np.unique(df)
# what happens if len(uniques) < 5?
thresh = uniques[-5]
df.where(df >= thresh)
Output:
        day1  day2  day3  day4  day5
larry    NaN   4.0     4     3     5
gunnar   NaN   NaN     3     4     4
tin      NaN   5.0     5     6     7
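The threshold idea above can be sketched as a self-contained snippet. It assumes the frame holds numeric values (the question's frame is built from a string array, so a numeric frame is constructed directly here). The helper name `keep_top_unique` and the fallback for fewer than `k` distinct values are illustrative assumptions, not part of the answer.

```python
import numpy as np
import pandas as pd

def keep_top_unique(frame, k):
    """Keep cells whose value is among the k largest distinct values of the whole frame."""
    uniques = np.unique(frame.to_numpy())  # sorted ascending, duplicates removed
    # Assumption: with fewer than k distinct values there is nothing to cut, so keep everything.
    if len(uniques) < k:
        return frame.copy()
    thresh = uniques[-k]  # the k-th largest distinct value
    return frame.where(frame >= thresh)

df = pd.DataFrame(
    [[1, 4, 4, 3, 5], [2, -1, 3, 4, 4], [-2, 5, 5, 6, 7]],
    index=["larry", "gunnar", "tin"],
    columns=["day1", "day2", "day3", "day4", "day5"],
).astype(float)

result = keep_top_unique(df, 5)
print(result)
```

Note that this threshold is computed over the whole frame, so it does not guarantee a fixed number of kept values per row.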
Update: On second look, I think you can do this:
# df holds strings here, so cast to float before calling nlargest
df.astype(float).apply(pd.Series.nlargest, n=3, axis=1).reindex(df.columns, axis=1)
Output:
        day1  day2  day3  day4  day5
larry    NaN   4.0   4.0   NaN   5.0
gunnar   NaN   NaN   3.0   4.0   4.0
tin      NaN   5.0   NaN   6.0   7.0
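For reference, here is a minimal runnable sketch of the row-wise nlargest approach, assuming a numeric frame built directly from integers rather than from the question's string array. The variable names `N` and `top` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 4, 4, 3, 5], [2, -1, 3, 4, 4], [-2, 5, 5, 6, 7]],
    index=["larry", "gunnar", "tin"],
    columns=["day1", "day2", "day3", "day4", "day5"],
)

N = 3
# Series.nlargest keeps the first occurrence on ties (keep="first" is its default).
top = df.apply(lambda row: row.nlargest(N), axis=1)
# apply only returns columns that survived in at least one row,
# so restore the full column set and the original column order.
top = top.reindex(df.columns, axis=1)
print(top)
```

The reindex step matters here because day1 is never among the top three in any row, so without it that column would be missing from the result.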
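Here is another approach using df.rank on axis=1. We reverse the columns before computing the rank, because on duplicates you want to keep the first value.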
df[df.astype(float).iloc[:, ::-1].rank(axis=1, method='first').ge(3)]
        day1  day2  day3  day4  day5
larry    NaN     4     4   NaN     5
gunnar   NaN   NaN     3     4     4
tin      NaN     5   NaN     6     7
However, as @Allolz correctly pointed out, for the general case the rank threshold should be derived from the shape of df:
N = 3
# smallest ascending rank that still belongs to the top N of a row
n = df.shape[1] - N + 1
df[df.astype(float).iloc[:, ::-1].rank(axis=1, method='first').ge(n)]
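The rank-based approach can be wrapped in a small helper, sketched below. The function name `keep_top_n_per_row` is hypothetical, and the sketch assumes rows contain no missing values, since the cutoff is derived from the column count.

```python
import pandas as pd

def keep_top_n_per_row(frame, n_keep):
    """Keep the n_keep largest values in each row and set the rest to NaN.

    Ties are resolved in favour of the leftmost column.
    """
    numeric = frame.astype(float)
    # Reverse the columns: with method="first", equal values are ranked in
    # order of appearance, so the leftmost original column gets the higher rank.
    ranks = numeric.iloc[:, ::-1].rank(axis=1, method="first")
    cutoff = frame.shape[1] - n_keep + 1
    # Flip the mask back to the original column order before applying it.
    mask = ranks.ge(cutoff).iloc[:, ::-1]
    return numeric.where(mask)

df = pd.DataFrame(
    [[1, 4, 4, 3, 5], [2, -1, 3, 4, 4], [-2, 5, 5, 6, 7]],
    index=["larry", "gunnar", "tin"],
    columns=["day1", "day2", "day3", "day4", "day5"],
)

result = keep_top_n_per_row(df, 3)
print(result)
```

With exactly three values kept per row, `result.mean(axis=1)` averages over the top three values only, which is the use case mentioned in the question.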