Daw*_*wei 5 python sorting dataframe pandas
Update: a simpler method has been added at the end of the question.
I have a user-user correlation matrix, corr_of_user, that looks like this:
userId 316 320 359 370 910
userId
316 1.000000 0.202133 0.208618 0.176050 0.174035
320 0.202133 1.000000 0.242837 0.019035 0.031737
359 0.208618 0.242837 1.000000 0.357620 0.175914
370 0.176050 0.019035 0.357620 1.000000 0.317371
910 0.174035 0.031737 0.175914 0.317371 1.000000
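For each user, I want to keep only the 2 other users most similar to them (the highest correlation values in each row, excluding the diagonal element), like this: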
Out[40]:
userId 316 320 359 370 910
corr_user
316 NaN 0.202133 0.208618 NaN NaN
320 0.202133 NaN 0.242837 NaN NaN
359 NaN 0.242837 NaN 0.357620 NaN
370 NaN NaN 0.357620 NaN 0.317371
910 NaN NaN 0.175914 0.317371 NaN
I know how to do this, but the approach I came up with is too complicated. Can anyone suggest a better idea?
First I melt the matrix:
melted_corr = corr_of_user.reset_index().melt(id_vars ="userId",var_name="corr_user")
melted_corr.head()
Out[23]:
userId corr_user value
0 316 316 1.000000
1 320 316 0.202133
2 359 316 0.208618
3 370 316 0.176050
4 910 316 0.174035
Then I filter it row by row (one group per user):
get_secend_third = lambda x : x.sort_values(ascending =False).iloc[1:3]
filted= melted_corr.set_index("userId").groupby("corr_user")["value"].apply(get_secend_third)
filted
Out[39]:
corr_user userId
316 359 0.208618
320 0.202133
320 359 0.242837
316 0.202133
359 370 0.357620
320 0.242837
370 359 0.357620
910 0.317371
910 370 0.317371
359 0.175914
Finally, I reshape it:
filted.reset_index().pivot_table("value","corr_user","userId")
Out[40]:
userId 316 320 359 370 910
corr_user
316 NaN 0.202133 0.208618 NaN NaN
320 0.202133 NaN 0.242837 NaN NaN
359 NaN 0.242837 NaN 0.357620 NaN
370 NaN NaN 0.357620 NaN 0.317371
910 NaN NaN 0.175914 0.317371 NaN
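For anyone who wants to run the three steps above end to end, here is a minimal sketch. The sample matrix is typed in from the question; the function name top2_via_melt is only illustrative. Note that iloc[1:3] silently assumes the self-correlation (1.0) sorts first in every group, so it would pick the wrong entries if the diagonal were not the maximum.

```python
import pandas as pd

users = [316, 320, 359, 370, 910]
corr_of_user = pd.DataFrame(
    [[1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
     [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
     [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
     [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
     [0.174035, 0.031737, 0.175914, 0.317371, 1.000000]],
    index=pd.Index(users, name="userId"),
    columns=pd.Index(users, name="userId"),
)

def top2_via_melt(corr):
    """Illustrative wrapper around the melt / groupby / pivot steps."""
    # Wide -> long: one row per (userId, corr_user) pair.
    melted = corr.reset_index().melt(id_vars="userId", var_name="corr_user")
    # Sort each group descending and skip position 0, which is assumed
    # to be the self-correlation of 1.0 (the diagonal).
    second_third = lambda s: s.sort_values(ascending=False).iloc[1:3]
    kept = (melted.set_index("userId")
                  .groupby("corr_user")["value"]
                  .apply(second_third))
    # Long -> wide again; pairs that were dropped come back as NaN.
    return kept.reset_index().pivot_table(
        values="value", index="corr_user", columns="userId")

result = top2_via_melt(corr_of_user)
print(result)
```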
After reading @John Zwinck's answer, I came up with a simpler method.
Suppose there is a new matrix, df, with some duplicate values and NaN:
userId 316 320 359 370 910
userId
316 1.0 0.500000 0.500000 0.500000 NaN
320 0.5 1.000000 0.242837 0.019035 0.031737
359 0.5 0.242837 1.000000 0.357620 0.175914
370 0.5 0.019035 0.357620 1.000000 0.317371
910 NaN 0.031737 0.175914 0.317371 1.000000
First, I rank the values within each row.
rank = df.rank(axis=1, ascending=False, method="first")
Then I use isin() on the ranks to get the mask I want.
mask = rank.isin(list(range(2,4)))
Finally:
df.where(mask)
This gives me what I want:
userId 316 320 359 370 910
userId
316 NaN 0.5 0.500000 NaN NaN
320 0.5 NaN 0.242837 NaN NaN
359 0.5 NaN NaN 0.357620 NaN
370 0.5 NaN 0.357620 NaN NaN
910 NaN NaN 0.175914 0.317371 NaN
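The rank-based steps above can be put together as a small runnable sketch. The helper name keep_top_k and its k parameter are my own additions for illustration, not part of the original post. It assumes the diagonal is the largest value in each row, so it always takes rank 1; ties are broken by column order because of method="first", and NaN cells get a NaN rank, so they are never selected.

```python
import numpy as np
import pandas as pd

users = [316, 320, 359, 370, 910]
df = pd.DataFrame(
    [[1.0, 0.5, 0.5, 0.5, np.nan],
     [0.5, 1.0, 0.242837, 0.019035, 0.031737],
     [0.5, 0.242837, 1.0, 0.357620, 0.175914],
     [0.5, 0.019035, 0.357620, 1.0, 0.317371],
     [np.nan, 0.031737, 0.175914, 0.317371, 1.0]],
    index=pd.Index(users, name="userId"),
    columns=pd.Index(users, name="userId"),
)

def keep_top_k(frame, k=2):
    """Hypothetical helper: keep the k largest off-diagonal values per row."""
    # Rank within each row, largest value first; ties go to the leftmost column.
    ranks = frame.rank(axis=1, ascending=False, method="first")
    # Rank 1 is assumed to be the diagonal, so keep ranks 2 .. k+1.
    mask = ranks.isin(list(range(2, k + 2)))
    return frame.where(mask)

result = keep_top_k(df)
print(result)
```

Passing k=3 would keep ranks 2 through 4 instead, which is the only thing the extra parameter changes.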
First, use np.argsort() to find which positions hold the highest values:
import numpy as np
sort = np.argsort(df)
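This gives a DataFrame whose column names are meaningless, but the second and third columns from the right contain the indices you want for each row: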
316 320 359 370 910
userId
316 4 3 1 2 0
320 3 4 0 2 1
359 4 0 1 3 2
370 1 0 4 2 3
910 1 0 2 3 4
Next, construct a boolean mask that is set to True at those positions:
mask = np.zeros(df.shape, bool)
rows = np.arange(len(df))
mask[rows, sort.iloc[:,-2]] = True
mask[rows, sort.iloc[:,-3]] = True
Now you have the mask you need:
array([[False, True, True, False, False],
[ True, False, True, False, False],
[False, True, False, True, False],
[False, False, True, False, True],
[False, False, True, True, False]], dtype=bool)
Finally, df.where(mask):
316 320 359 370 910
userId
316 NaN 0.202133 0.208618 NaN NaN
320 0.202133 NaN 0.242837 NaN NaN
359 NaN 0.242837 NaN 0.357620 NaN
370 NaN NaN 0.357620 NaN 0.317371
910 NaN NaN 0.175914 0.317371 NaN
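Here is a self-contained sketch of the same argsort idea, written so that the number of neighbours is a variable (k is my own addition) and both mask assignments collapse into one fancy-indexing step. It assumes there are no NaN values and that the diagonal is the largest entry in each row. np.argsort places NaN last, so with missing data the rightmost position would no longer be the diagonal.

```python
import numpy as np
import pandas as pd

users = [316, 320, 359, 370, 910]
df = pd.DataFrame(
    [[1.000000, 0.202133, 0.208618, 0.176050, 0.174035],
     [0.202133, 1.000000, 0.242837, 0.019035, 0.031737],
     [0.208618, 0.242837, 1.000000, 0.357620, 0.175914],
     [0.176050, 0.019035, 0.357620, 1.000000, 0.317371],
     [0.174035, 0.031737, 0.175914, 0.317371, 1.000000]],
    index=pd.Index(users, name="userId"),
    columns=users,
)

k = 2  # number of most-similar users to keep per row (illustrative parameter)

# Column positions of each row's values, sorted ascending.
order = np.argsort(df.to_numpy(), axis=1)
# The last position is assumed to be the diagonal; take the k positions before it.
top = order[:, -(k + 1):-1]

mask = np.zeros(df.shape, dtype=bool)
rows = np.arange(len(df))[:, None]  # column vector, broadcasts against top
mask[rows, top] = True

result = df.where(mask)
print(result)
```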