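I often need a new column that holds the best value I can get from several other columns, and I have a specific priority order for those columns. I want to take the first non-null value.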
import pandas as pd

def coalesce(values):
    # Return the first value that is not None, or None if all are None.
    not_none = (el for el in values if el is not None)
    return next(not_none, None)

# Fix the column order explicitly, because combo1 depends on it.
df = pd.DataFrame([{'third':'B','first':'A','second':'C'},
                   {'third':'B','first':None,'second':'C'},
                   {'third':'B','first':None,'second':None},
                   {'third':None,'first':None,'second':None},
                   {'third':'B','first':'A','second':None}],
                  columns=['first', 'second', 'third'])
df['combo1'] = df.apply(coalesce, axis=1)
df['combo2'] = df[['second','third','first']].apply(coalesce, axis=1)
print(df)
Result:
  first second third combo1 combo2
0     A      C     B      A      C
1  None      C     B      C      C
2  None   None     B      B      B
3  None   None  None   None   None
4     A   None     B      A      B
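This code works (the result is what I want), but it is not very fast.
I can choose my priorities by selecting the columns in order, e.g. [['second','third','first']], if I need to.
Coalesce is somewhat like the function of the same name in T-SQL.
I suspect I may have overlooked an easy way to do this with good performance on a large DataFrame (400,000+ rows).
I know there are many ways to fill in missing data, and I often use them along axis=0. That is what makes me think I may have missed a simple option for axis=1.
Can you suggest something nicer or faster, or confirm that this is as good as it gets?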
Bra*_*des 29
The pandas equivalent of COALESCE is the fillna() method:
result = column_a.fillna(column_b)
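The result is a column in which each value is taken from column_a if that column provides a non-null value, and from column_b otherwise. So your combo1 can be produced with: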
df['first'].fillna(df['second']).fillna(df['third'])
Giving:
0 A
1 C
2 B
3 None
4 A
Your combo2 can be produced with:
(df['second']).fillna(df['third']).fillna(df['first'])
Which returns the new column:
0 C
1 C
2 B
3 None
4 B
If you want an efficient operation called coalesce, it can simply combine the columns with fillna() from left to right and then return the result:
def coalesce(df, column_names):
    # Start from the highest-priority column, then fill the remaining
    # nulls from each following column in turn.
    i = iter(column_names)
    column_name = next(i)
    answer = df[column_name]
    for column_name in i:
        answer = answer.fillna(df[column_name])
    return answer

print(coalesce(df, ['first', 'second', 'third']))
print(coalesce(df, ['second', 'third', 'first']))
This gives:
0 A
1 C
2 B
3 None
4 A
0 C
1 C
2 B
3 None
4 B
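
For anyone who wants to try the fillna() chain in isolation, here is a minimal self-contained sketch. The name coalesce_columns and the use of functools.reduce are my own choices for illustration, not part of the answer above. It assumes the columns share the same index and that the column list is not empty.

```python
from functools import reduce

import pandas as pd


def coalesce_columns(df, column_names):
    """Return the first non-null value per row, scanning column_names left to right."""
    columns = [df[name] for name in column_names]
    # Each fillna step only touches rows that are still null, so earlier
    # columns in the list take priority over later ones.
    return reduce(lambda filled, nxt: filled.fillna(nxt), columns)


df = pd.DataFrame(
    [
        {'first': 'A', 'second': 'C', 'third': 'B'},
        {'first': None, 'second': 'C', 'third': 'B'},
        {'first': None, 'second': None, 'third': 'B'},
        {'first': None, 'second': None, 'third': None},
        {'first': 'A', 'second': None, 'third': 'B'},
    ],
    columns=['first', 'second', 'third'],
)

combo1 = coalesce_columns(df, ['first', 'second', 'third'])
combo2 = coalesce_columns(df, ['second', 'third', 'first'])
print(combo1)
print(combo2)
```

The work is done column by column rather than row by row, so the number of pandas calls grows with the number of columns, not with the number of rows.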
You can use pd.isnull to find the null values (in this case None):
In [169]: pd.isnull(df)
Out[169]:
   first  second  third
0  False   False  False
1   True   False  False
2   True    True  False
3   True    True   True
4  False    True  False

Then use np.argmin to find the index of the first non-null value in each row. If all values in a row are null, np.argmin returns 0:
In [186]: np.argmin(pd.isnull(df).values, axis=1)
Out[186]: array([0, 1, 2, 0, 0])

You can then select the desired values from df using NumPy integer indexing:
In [193]: df.values[np.arange(len(df)), np.argmin(pd.isnull(df).values, axis=1)]
Out[193]: array(['A', 'C', 'B', None, 'A'], dtype=object)

For example,
import numpy as np
import pandas as pd

# Fix the column order explicitly, because the integer indexing below depends on it.
df = pd.DataFrame([{'third':'B','first':'A','second':'C'},
                   {'third':'B','first':None,'second':'C'},
                   {'third':'B','first':None,'second':None},
                   {'third':None,'first':None,'second':None},
                   {'third':'B','first':'A','second':None}],
                  columns=['first', 'second', 'third'])

mask = pd.isnull(df).values
df['combo1'] = df.values[np.arange(len(df)), np.argmin(mask, axis=1)]
# Priority order second, third, first, expressed as column positions.
order = np.array([1,2,0])
mask = mask[:, order]
df['combo2'] = df.values[np.arange(len(df)), order[np.argmin(mask, axis=1)]]

yields
  first second third combo1 combo2
0     A      C     B      A      C
1  None      C     B      C      C
2  None   None     B      B      B
3  None   None  None   None   None
4     A   None     B      A      B

Using argmin instead of df2.apply(coalesce, ...) is significantly faster if the DataFrame has a lot of rows:
df2 = pd.concat([df]*1000)

In [230]: %timeit mask = pd.isnull(df2).values; df2.values[np.arange(len(df2)), np.argmin(mask, axis=1)]
1000 loops, best of 3: 617 µs per loop

In [231]: %timeit df2.apply(coalesce, axis=1)
10 loops, best of 3: 84.1 ms per loop
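
The argmin steps above can be wrapped into a reusable helper. This is only a sketch: the function name coalesce_argmin is hypothetical, and it selects the priority columns by name up front instead of reordering the mask with an integer array as in the example above.

```python
import numpy as np
import pandas as pd


def coalesce_argmin(df, column_names):
    """First non-null value per row, scanning column_names in priority order."""
    values = df[list(column_names)].to_numpy()
    mask = pd.isnull(values)
    # argmin gives the position of the first False (non-null) entry per row.
    # For an all-null row it gives 0, which picks a null value anyway.
    first_valid = np.argmin(mask, axis=1)
    picked = values[np.arange(len(values)), first_valid]
    return pd.Series(picked, index=df.index)


df = pd.DataFrame(
    [
        {'first': 'A', 'second': 'C', 'third': 'B'},
        {'first': None, 'second': 'C', 'third': 'B'},
        {'first': None, 'second': None, 'third': 'B'},
        {'first': None, 'second': None, 'third': None},
        {'first': 'A', 'second': None, 'third': 'B'},
    ],
    columns=['first', 'second', 'third'],
)

combo1 = coalesce_argmin(df, ['first', 'second', 'third'])
combo2 = coalesce_argmin(df, ['second', 'third', 'first'])
print(combo1)
print(combo2)
```

Selecting the columns by name means the result does not depend on the order in which the columns happen to be stored in the DataFrame.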