Efficient way to apply multiple filters to a pandas DataFrame or Series

Question

dur*_*2.0 121 python algorithm pandas

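I have a scenario where a user wants to apply several filters to a pandas DataFrame or Series object. Essentially, I want to efficiently chain together a bunch of filtering (comparison) operations that the user specifies at run time.

The filters should be additive (that is, each one applied should narrow the results).

I'm currently using reindex(), but this creates a new object each time and copies the underlying data (if I understand the documentation correctly). So this could be really inefficient when filtering a big Series or DataFrame.

I'm thinking that using apply(), map(), or something similar might be better. I'm pretty new to pandas, though, so I'm still trying to wrap my head around everything.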

TL;DR

I want to take a dictionary of the following form, apply each operation to a given Series object, and return a "filtered" Series object.

relops = {'>=': [1], '<=': [1]}

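Long Example

I'll start with an example of what I have currently, filtering just a single Series object. Below is the function I'm currently using: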

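def apply_relops(series, relops):
    """
    Pass dictionary of relational operators to perform on given series object
    """
    # `ops` maps operator strings to functions; it is defined in the session below
    for op, vals in relops.items():  # Python 3: items() replaces iteritems()
        op_func = ops[op]
        for val in vals:
            filtered = op_func(series, val)  # boolean mask for this comparison
            series = series.reindex(series[filtered])
    return series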

The user provides a dictionary with the operations they want to perform:

>>> df = pandas.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})
>>> print(df)
   col1  col2
0     0    10
1     1    11
2     2    12

>>> from operator import le, ge
>>> ops = {'>=': ge, '<=': le}
>>> apply_relops(df['col1'], {'>=': [1]})
col1
1       1
2       2
Name: col1
>>> apply_relops(df['col1'], relops = {'>=': [1], '<=': [1]})
col1
1       1
Name: col1

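Again, the "problem" with my approach above is that I think there is a lot of possibly unnecessary copying of the data for the in-between steps.

Also, I would like to expand this so that the dictionary passed in can include the columns to operate on, and filter an entire DataFrame based on the input dictionary. However, I'm assuming that whatever works for the Series can be easily expanded to a DataFrame.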

Answer 1

And*_*den 205

pandas (and numpy) allow for boolean indexing, which will be much more efficient:

In [11]: df.loc[df['col1'] >= 1, 'col1']
Out[11]: 
1    1
2    2
Name: col1

In [12]: df[df['col1'] >= 1]
Out[12]: 
   col1  col2
1     1    11
2     2    12

In [13]: df[(df['col1'] >= 1) & (df['col1'] <= 1)]
Out[13]: 
   col1  col2
1     1    11

If you want to write helper functions for this, consider something along these lines:

In [14]: def b(x, col, op, n): 
             return op(x[col],n)

In [15]: def f(x, *b):
             # assumes `import numpy as np`; np.logical_and combines exactly two masks
             return x[(np.logical_and(*b))]

In [16]: b1 = b(df, 'col1', ge, 1)

In [17]: b2 = b(df, 'col1', le, 1)

In [18]: f(df, b1, b2)
Out[18]: 
   col1  col2
1     1    11

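Update: pandas 0.13 has a query method for this kind of use case. Assuming column names are valid identifiers, the following works (and can be more efficient for large frames, as it uses numexpr behind the scenes):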

In [21]: df.query('col1 <= 1 & 1 <= col1')
Out[21]:
   col1  col2
1     1    11

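Since the question is about filters chosen at run time, one way to use the query method is to assemble the expression string from user-supplied pieces. The sketch below is only an illustration: `build_query`, `ALLOWED_OPS`, and the (column, operator, value) triple format are hypothetical names and conventions, not part of pandas, and the validation shown is a minimal guard rather than a complete defence against untrusted input.

```python
import pandas as pd

# Hypothetical whitelist of comparison operators accepted from the user.
ALLOWED_OPS = {'>=', '<=', '>', '<', '==', '!='}


def build_query(filters):
    """Join (column, operator, value) triples into one query expression."""
    parts = []
    for col, op, val in filters:
        if op not in ALLOWED_OPS:
            raise ValueError(f'unsupported operator: {op!r}')
        if not col.isidentifier():
            raise ValueError(f'column name is not a valid identifier: {col!r}')
        # repr() keeps string values quoted inside the expression.
        parts.append(f'{col} {op} {val!r}')
    return ' and '.join(parts)


df = pd.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})
filters = [('col1', '>=', 1), ('col2', '<', 12)]

expr = build_query(filters)
result = df.query(expr)
print(expr)
print(result)
```

Every filter is folded into a single expression, so the frame is indexed only once, at the end.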

@dwanderson, you can pass a list of conditions to np.logical_and.reduce to combine multiple conditions. Example: np.logical_and.reduce([df['a'] == 3, df['b'] > 10, df['c'].isin([1, 3, 5])]) (2 upvotes)
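As a rough sketch of how np.logical_and.reduce could be applied to the dictionary format from the question, the function below builds one boolean mask per comparison and indexes the Series a single time. The names `OPS` and `apply_relops_mask` are made up for this illustration, and the operator table is an assumption about which comparisons a user might request.

```python
import operator

import numpy as np
import pandas as pd

# Assumed mapping from operator strings to comparison functions.
OPS = {
    '>=': operator.ge, '<=': operator.le,
    '>': operator.gt, '<': operator.lt,
    '==': operator.eq, '!=': operator.ne,
}


def apply_relops_mask(series, relops):
    """Keep the elements of `series` that satisfy every comparison in `relops`."""
    masks = [OPS[op](series, val)
             for op, vals in relops.items()
             for val in vals]
    if not masks:
        return series  # nothing to filter on
    combined = np.logical_and.reduce(masks)  # element-wise AND over all masks
    return series[combined]


df = pd.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})
result = apply_relops_mask(df['col1'], {'>=': [1], '<=': [1]})
print(result)
```

Only the boolean masks are created as intermediates; the data itself is selected once, which avoids the repeated reindex() copies described in the question.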

Answer 2

Gec*_*cko 24

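Chaining conditions creates long lines, which PEP 8 discourages. Using the .query method forces you to use strings, which is powerful but unpythonic and not very dynamic.

Once each of the filters is in place, one approach is: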

import numpy as np
import functools
def conjunction(*conditions):
    return functools.reduce(np.logical_and, conditions)

c_1 = data.col1 == True
c_2 = data.col2 < 64
c_3 = data.col3 != 4

data_filtered = data[conjunction(c_1, c_2, c_3)]

np.logical_and is fast, but it takes no more than two input arrays; functools.reduce handles that by folding it over all the conditions.

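Note that this still has some redundancies: a) short-circuiting does not happen at a global level, and b) each individual condition runs on the whole initial data. Still, I expect this to be efficient enough for many applications, and it is very readable.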
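For comparison, the second redundancy can be avoided by evaluating each condition only on the rows that survived the previous ones. The following is a sketch of that alternative, not the approach shown in this answer: `filter_sequentially` and the predicate-as-callable convention are hypothetical, and the sample data is invented. Each step materializes an intermediate frame, so whether it is actually faster depends on the data and would need measuring.

```python
import pandas as pd


def filter_sequentially(data, predicates):
    """Apply each predicate to the rows left over from the previous ones."""
    for predicate in predicates:
        if data.empty:
            break  # nothing left to test, so skip the remaining predicates
        data = data[predicate(data)]
    return data


# Invented sample data with the column names used in the answer.
data = pd.DataFrame({
    'col1': [True, True, False, True],
    'col2': [10, 70, 20, 30],
    'col3': [4, 1, 2, 3],
})

# Each predicate receives the current (already narrowed) frame.
predicates = [
    lambda d: d['col1'],       # col1 is already boolean
    lambda d: d['col2'] < 64,
    lambda d: d['col3'] != 4,
]

data_filtered = filter_sequentially(data, predicates)
print(data_filtered)
```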

I used `df[f_2 & f_3 & f_4 & f_5]` with `f_2 = df["a"] >= 0` and so on. No need for that function... (nice use of a higher-order function, though...) (2 upvotes)

Answer 3

Gil*_*gio 13

Simplest of all solutions:

Use:

filtered_df = df[(df['col1'] >= 1) & (df['col1'] <= 5)]

As another example, to filter the DataFrame for values belonging to Feb-2018, use the code below:

filtered_df = df[(df['year'] == 2018) & (df['month'] == 2)]

Answer 4

YOL*_*OLO 7

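Since the pandas 0.22 update, comparison options are available, such as:

gt (greater than)
lt (less than)
eq (equal to)
ne (not equal to)
ge (greater than or equal to)

and many more. These functions return a boolean array. Let's see how we can use them: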

import pandas as pd

# sample data
df = pd.DataFrame({'col1': [0, 1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14, 15]})

# get values from col1 greater than or equal to 1
df.loc[df['col1'].ge(1), 'col1']

1    1
2    2
3    3
4    4
5    5

# rows where col1 is between 0 and 2 (inclusive)
df.loc[df['col1'].between(0, 2)]

   col1  col2
0     0    10
1     1    11
2     2    12

# rows where col1 > 1
df.loc[df['col1'].gt(1)]

   col1  col2
2     2    12
3     3    13
4     4    14
5     5    15

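Because these comparisons are ordinary Series methods, a user-supplied method name can be looked up with getattr, which ties this answer back to the run-time filters in the question. This is a minimal sketch under assumptions: `filter_by_methods`, `ALLOWED_METHODS`, and the {method name: value} spec format are hypothetical, introduced only for illustration.

```python
import pandas as pd

# Hypothetical whitelist, so arbitrary attribute names are never looked up.
ALLOWED_METHODS = {'gt', 'lt', 'eq', 'ne', 'ge', 'le'}


def filter_by_methods(frame, column, spec):
    """Keep rows of `frame` whose `column` passes every {method: value} test."""
    mask = pd.Series(True, index=frame.index)
    for name, value in spec.items():
        if name not in ALLOWED_METHODS:
            raise ValueError(f'unsupported comparison: {name!r}')
        # e.g. name='ge', value=1 calls frame[column].ge(1)
        mask &= getattr(frame[column], name)(value)
    return frame.loc[mask]


df = pd.DataFrame({'col1': [0, 1, 2, 3, 4, 5],
                   'col2': [10, 11, 12, 13, 14, 15]})

result = filter_by_methods(df, 'col1', {'ge': 1, 'lt': 4})
print(result)
```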

Answer 5

Obo*_*bol 5

Why not do this?

def filt_spec(df, col, val, op):
    import operator
    ops = {'eq': operator.eq, 'neq': operator.ne, 'gt': operator.gt, 'ge': operator.ge, 'lt': operator.lt, 'le': operator.le}
    return df[ops[op](df[col], val)]
pandas.DataFrame.filt_spec = filt_spec

Demo:

df = pd.DataFrame({'a': [1,2,3,4,5], 'b':[5,4,3,2,1]})
df.filt_spec('a', 2, 'ge')

Result:

   a  b
1  2  4
2  3  3
3  4  2
4  5  1

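You can see that column 'a' has been filtered to the rows where a >= 2.

This is slightly faster (in typing time, not performance) than operator chaining. You could of course put the import at the top of the file.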
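If attaching a new attribute to pandas.DataFrame is undesirable, the same helper can be chained through the existing DataFrame.pipe method instead. The sketch below assumes the same argument order as the filt_spec function in this answer; the module-level `OPS` table is a name chosen here for illustration.

```python
import operator

import pandas as pd

# Same operator table as in the answer, hoisted to module level.
OPS = {'eq': operator.eq, 'neq': operator.ne, 'gt': operator.gt,
       'ge': operator.ge, 'lt': operator.lt, 'le': operator.le}


def filt_spec(df, col, val, op):
    """Return the rows of `df` where `df[col] <op> val` holds."""
    return df[OPS[op](df[col], val)]


df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [5, 4, 3, 2, 1]})

# pipe passes the frame as the first argument, so calls chain without patching.
result = (df.pipe(filt_spec, 'a', 2, 'ge')
            .pipe(filt_spec, 'b', 2, 'gt'))
print(result)
```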

Archived:	13 years, 3 months ago
Viewed:	196116 times
Last activity:	7 years ago