Hon*_*ong 5 python indexing dataframe pandas
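I have two DataFrames, df and df2, that correspond to each other (same shape and labels). Based on the first DataFrame, df, I want to find the 3 smallest values in each row and return the names of the corresponding columns (in this case 'X', 'Y', 'Z' or 'T'). That gives me a new DataFrame, df3.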
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'X': [21, 2, 43, 44, 56, 67, 7, 38, 29, 130],
    'Y': [101, 220, 330, 140, 250, 10, 207, 320, 420, 50],
    'Z': [20, 128, 136, 144, 312, 10, 82, 63, 42, 12],
    'T': [2, 32, 4, 424, 256, 167, 27, 38, 229, 30]
}, index=list('ABCDEFGHIJ'))
df2 = pd.DataFrame({
    'X': [0.5, 0.12, 0.43, 0.424, 0.65, 0.867, 0.17, 0.938, 0.229, 0.113],
    'Y': [0.1, 2.201, 0.33, 0.140, 0.525, 0.31, 0.20, 0.32, 0.420, 0.650],
    'Z': [0.20, 0.128, 0.136, 0.2144, 0.5312, 0.61, 0.82, 0.363, 0.542, 0.512],
    'T': [0.52, 0.232, 0.34, 0.6424, 0.6256, 0.3167, 0.527, 0.38, 0.4229, 0.73]
}, index=list('ABCDEFGHIJ'))
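In addition, I want another DataFrame, df4, that holds the values from df2 at the positions given by df3. For example, in row 'A' of df the 3 smallest values are (2, 20, 21), so in row 'A' of df4 I want (0.52, 0.2, 0.5) from df2.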
If both DataFrames have the same column names and their index values are in the same order, you can use argsort:
# positions of the 3 smallest values in each row
# (output below is from a pandas version that sorted dict keys,
# so the column order here is T, X, Y, Z)
arr = df.values.argsort(1)[:,:3]
print (arr)
[[0 3 1]
 [1 0 3]
 [0 1 3]
 [1 2 3]
 [1 2 0]
 [2 3 1]
 [1 0 3]
 [0 1 3]
 [1 3 0]
 [3 0 2]]

#get values by indices in arr
b = df2.values[np.arange(len(arr))[:,None], arr]
print (b)
[[ 0.52    0.2     0.5   ]
 [ 0.12    0.232   0.128 ]
 [ 0.34    0.43    0.136 ]
 [ 0.424   0.14    0.2144]
 [ 0.65    0.525   0.6256]
 [ 0.31    0.61    0.867 ]
 [ 0.17    0.527   0.82  ]
 [ 0.38    0.938   0.363 ]
 [ 0.229   0.542   0.4229]
 [ 0.512   0.73    0.65  ]]
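If df2 might not share df's row and column order, one option is to align it first and only then index by position. The following is a minimal sketch under that assumption, using a 3-row subset of the question's data; the shuffled df2 and the name df2_aligned are made up for illustration, and np.take_along_axis (NumPy 1.15+) is swapped in for the np.arange fancy indexing shown above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"X": [21, 2, 43], "Y": [101, 220, 330], "Z": [20, 128, 136], "T": [2, 32, 4]},
    index=list("ABC"),
)
# Hypothetical df2 whose rows and columns are in a different order than df
df2 = pd.DataFrame(
    {"T": [0.34, 0.52, 0.232], "Z": [0.136, 0.20, 0.128],
     "Y": [0.33, 0.1, 2.201], "X": [0.43, 0.5, 0.12]},
    index=list("CAB"),
)

# Align df2 to df's labels so that position (i, j) means the same cell in both
df2_aligned = df2.reindex(index=df.index, columns=df.columns)

# Column positions of the 3 smallest values in each row of df
arr = np.argsort(df.to_numpy(), axis=1)[:, :3]

# Pick the values at those positions from the aligned df2, row by row
b = np.take_along_axis(df2_aligned.to_numpy(), arr, axis=1)
print(b)
```

Without the reindex step, the positions in arr would point at the wrong cells of df2, because argsort only knows about positions, not labels.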

Finally, use the DataFrame constructor:
# convert the columns to a NumPy array first: indexing an Index with a
# 2-D array (df.columns[arr]) is not supported in pandas 2.0+
df3 = pd.DataFrame(df.columns.to_numpy()[arr])
df3.columns = ['Col{}'.format(x+1) for x in df3.columns]
print (df3)
  Col1 Col2 Col3
0    T    Z    X
1    X    T    Z
2    T    X    Z
3    X    Y    Z
4    X    Y    T
5    Y    Z    X
6    X    T    Z
7    T    X    Z
8    X    Z    T
9    Z    T    Y

df4 = pd.DataFrame(b)
df4.columns = ['Col{}'.format(x+1) for x in df4.columns]
print (df4)
    Col1   Col2    Col3
0  0.520  0.200  0.5000
1  0.120  0.232  0.1280
2  0.340  0.430  0.1360
3  0.424  0.140  0.2144
4  0.650  0.525  0.6256
5  0.310  0.610  0.8670
6  0.170  0.527  0.8200
7  0.380  0.938  0.3630
8  0.229  0.542  0.4229
9  0.512  0.730  0.6500
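The results above get a default 0..9 index, while the question refers to rows by label (row 'A'). A possible variation, sketched here on a 3-row subset of the question's data, passes df.index to the constructor so df3 and df4 keep the original labels. The variable name names is only an illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"X": [21, 2, 43], "Y": [101, 220, 330], "Z": [20, 128, 136], "T": [2, 32, 4]},
    index=list("ABC"),
)
df2 = pd.DataFrame(
    {"X": [0.5, 0.12, 0.43], "Y": [0.1, 2.201, 0.33],
     "Z": [0.20, 0.128, 0.136], "T": [0.52, 0.232, 0.34]},
    index=list("ABC"),
)

# Column positions of the 3 smallest values per row
arr = np.argsort(df.to_numpy(), axis=1)[:, :3]
names = ["Col{}".format(i + 1) for i in range(arr.shape[1])]

# Column names of the 3 smallest values, keeping df's row labels
df3 = pd.DataFrame(df.columns.to_numpy()[arr], index=df.index, columns=names)

# Matching values from df2, also keeping the row labels
rows = np.arange(len(arr))[:, None]
df4 = pd.DataFrame(df2.to_numpy()[rows, arr], index=df.index, columns=names)

print(df3.loc["A"].tolist())  # ['T', 'Z', 'X']
print(df4.loc["A"].tolist())  # [0.52, 0.2, 0.5]
```

With the labels kept, df3.loc['A'] and df4.loc['A'] can be looked up directly, which matches how the question describes the expected result.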

The answers are similar, so I ran some timings:

np.random.seed(14)
N = 1000000
df1 = pd.DataFrame(np.random.randint(100, size=(N, 4)), columns=['X','Y','Z','T'])
#print (df1)

df1 = pd.DataFrame(np.random.rand(N, 4), columns=['X','Y','Z','T'])
#print (df1)

# note: the functions below use df and df2 from above; df1 is created but not used
# code is shown as it was timed; df.columns[arr] and df2.lookup
# are not available in pandas 2.0+

def jez():
    arr = df.values.argsort(1)[:,:3]
    b = df2.values[np.arange(len(arr))[:,None], arr]
    df3 = pd.DataFrame(df.columns[arr])
    df3.columns = ['Col{}'.format(x+1) for x in df3.columns]
    df4 = pd.DataFrame(b)
    df4.columns = ['Col{}'.format(x+1) for x in df4.columns]


def pir():
    v = df.values
    a = v.argpartition(3, 1)[:, :3]
    c = df.columns.values[a]
    pd.DataFrame(c, df.index)
    d = df2.values[np.arange(len(df))[:, None], a]
    pd.DataFrame(d, df.index, [1, 2, 3]).add_prefix('Col')

def cᴏʟᴅsᴘᴇᴇᴅ():
    #another solution is wrong
    df3 = df.apply(lambda x: df.columns[np.argsort(x)], 1).iloc[:, :3]
    pd.DataFrame({'Col{}'.format(i + 1) : df2.lookup(df3.index, df3.iloc[:, i]) for i in range(df3.shape[1])}, index=df.index)


print (jez())
print (pir())
print (cᴏʟᴅsᴘᴇᴇᴅ())

In [176]: %timeit (jez())
1000 loops, best of 3: 412 µs per loop

In [177]: %timeit (pir())
1000 loops, best of 3: 425 µs per loop

In [178]: %timeit (cᴏʟᴅsᴘᴇᴇᴅ())
100 loops, best of 3: 3.99 ms per loop
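
One thing to keep in mind when comparing the argsort and argpartition versions: NumPy documents that argpartition leaves the order of elements within each partition undefined, so the k positions it returns are the k smallest but not necessarily in ascending order. If the order matters, the k candidates can be sorted afterwards. This is a rough sketch with made-up data (v, k, part and topk are illustrative names), not the code that was timed above.

```python
import numpy as np

# Made-up data: 2 rows, 6 columns, no ties among the smallest values
v = np.array([[21, 101, 20, 2, 57, 80],
              [2, 220, 128, 32, 9, 64]])
k = 3

# Positions of the k smallest values per row, in no guaranteed order
part = np.argpartition(v, k - 1, axis=1)[:, :k]

# Sort only those k candidates by their values
order = np.argsort(np.take_along_axis(v, part, axis=1), axis=1)
topk = np.take_along_axis(part, order, axis=1)

print(topk)
# [[3 2 0]
#  [0 4 3]]
```

On a frame with only 4 columns the extra sort is unlikely to pay off, but with many columns and a small k it may be cheaper than a full argsort of every row.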