Using .iterrows() and Series.nlargest() to get the highest number in each row of a DataFrame

Dee*_*k M 4 python iterator dataframe pandas

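I am trying to create a function that uses df.iterrows() and Series.nlargest. I want to iterate over each row, find the largest number, and mark it with a 1. This is the DataFrame: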

A   B    C
9   6    5
3   7    2

This is the output I am hoping for:

A    B   C
1    0   0
0    1   0

Here is the function I would like to use:

def get_top_n(df, top_n):
    """


    Parameters
    ----------
    df : DataFrame

    top_n : int
        The top number to get
    Returns
    -------
    top_numbers : DataFrame
    Returns the top number marked with a 1

    """
    # Implement Function
    for row in df.iterrows():
        top_numbers = row.nlargest(top_n).sum()

    return top_numbers

I get the following error: AttributeError: 'tuple' object has no attribute 'nlargest'

Any help rewriting my function in a neater way that actually works would be appreciated. Thanks in advance!

jez*_*ael 6

Add an i variable, because iterrows returns an (index, Series) tuple for each row:

for i, row in df.iterrows():
    top_numbers = row.nlargest(top_n).sum()
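
To see where the AttributeError comes from, it may help to look at what iterrows actually yields. This is only a small sketch; the DataFrame below is rebuilt from the sample data in the question, and the names first, index_label and row are just for illustration:

```python
import pandas as pd

# Sample data rebuilt from the question (assumed default integer index)
df = pd.DataFrame({"A": [9, 3], "B": [6, 7], "C": [5, 2]})

# Each item yielded by iterrows is a tuple: (index label, row as a Series)
first = next(df.iterrows())
print(type(first))  # <class 'tuple'>

# Unpacking the tuple gives access to the Series and its nlargest method
index_label, row = first
largest = row.nlargest(1)
print(index_label, largest.index.tolist())
```

Calling nlargest on the tuple itself fails, while calling it on the unpacked Series works.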

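A general solution uses numpy.argsort. Applying it twice to the negated values gives each value's rank within its row in descending order (a single argsort returns sort positions, not ranks, and only matches the ranks for some inputs). Then compare the ranks with top_n and convert the boolean array to integers:

import pandas as pd

def get_top_n(df, top_n):
    # Check the type first, so the comparison below never sees a non-integer
    if not isinstance(top_n, int):
        raise ValueError("Value is not integer")
    elif top_n > len(df.columns):
        raise ValueError("Value is higher than the number of columns")
    else:
        # argsort of argsort gives the rank of each value per row (0 = largest)
        ranks = (-df.values).argsort(axis=1).argsort(axis=1)
        arr = (ranks < top_n).astype(int)
        df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
        return df1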

df1 = get_top_n(df, 2)
print (df1)
   A  B  C
0  1  1  0
1  1  1  0

df1 = get_top_n(df, 1)
print (df1)
   A  B  C
0  1  0  0
1  0  1  0
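
If you would rather stay in pandas, DataFrame.rank can do the same ranking per row. This is a sketch of an alternative, not part of the solution above; the function name mark_top_n is made up for illustration, and method="first" is an assumption about how ties should be broken (earlier columns win):

```python
import pandas as pd

def mark_top_n(df, top_n):
    # Hypothetical helper: rank values within each row, largest value gets rank 1
    ranks = df.rank(axis=1, ascending=False, method="first")
    # Ranks 1..top_n are the top values; convert the boolean mask to 0/1
    return ranks.le(top_n).astype(int)

# Sample data rebuilt from the question
df = pd.DataFrame({"A": [9, 3], "B": [6, 7], "C": [5, 2]})
top1 = mark_top_n(df, 1)
top2 = mark_top_n(df, 2)
print(top1)
print(top2)
```

The original df is left unchanged, and the index and column labels carry over automatically.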

Edit:

A solution with iterrows is possible, but not recommended because it is slow:

top_n = 2
for i, row in df.iterrows():
    top = row.nlargest(top_n).index
    df.loc[i] = 0
    df.loc[i, top] = 1

print (df)
   A  B  C
0  1  1  0
1  1  1  0
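
Note that the loop above overwrites df in place. If the original values are still needed, the same idea can be wrapped in a function that fills a separate frame. This is a rough sketch; the name get_top_n_iterrows is made up here, and it assumes the index labels are unique:

```python
import pandas as pd

def get_top_n_iterrows(df, top_n):
    # Hypothetical helper: start from an all-zero frame with the same labels
    result = pd.DataFrame(0, index=df.index, columns=df.columns)
    for i, row in df.iterrows():
        # Column labels of the top_n largest values in this row
        top = row.nlargest(top_n).index
        result.loc[i, top] = 1
    return result

# Sample data rebuilt from the question
df = pd.DataFrame({"A": [9, 3], "B": [6, 7], "C": [5, 2]})
marked = get_top_n_iterrows(df, 1)
print(marked)
```

It is still slow on large frames for the same reason as any iterrows loop, so a vectorized approach is usually the better choice.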