使用pd.eval()在pandas中进行动态表达式评估

cs9*_*s95 45 python eval dataframe pandas

目标和动机

pd.eval并且eval是pandas API套件中功能强大但被低估的功能,它们的使用远未完全记录或理解.小心适量,eval并且engine可以极大地简化代码,提高性能,并成为创建动态工作流的强大工具.

这个规范QnA的目的是让用户更好地理解这些功能,讨论一些鲜为人知的功能,如何使用它们,以及如何最好地使用它们,以及清晰易懂的示例.这篇文章将讨论的两个主要议题是

  1. 了解parser,df2x在争论pd.eval,以及它们如何被用来计算表达式
  2. 了解之间的差eval,eval并且engine,当每个功能是合适的用于动态执行.

这篇文章不能替代文档(答案中的链接),所以请完成它!


我将以这样的方式构建一个问题,以便开启对所支持的各种功能的讨论parser.

给出两个DataFrame

np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))

df1
   A  B  C  D
0  5  0  3  3
1  7  9  3  5
2  2  4  7  6
3  8  8  1  6
4  7  7  8  1

df2
   A  B  C  D
0  5  9  8  9
1  4  3  0  3
2  5  0  2  3
3  8  1  3  3
4  3  7  0  1
Run Code Online (Sandbox Code Playgroud)

我想用一个或多个列执行算术运算df2.具体来说,我想移植以下代码:

x = 5
df2['D'] = df1['A'] + (df1['B'] * x) 
Run Code Online (Sandbox Code Playgroud)

...使用代码x.使用的原因pd.eval是我想自动化许多工作流程,因此动态创建它们对我很有用.

我想更好地理解evaleval决定如何最好地解决我的问题.我已经阅读了文档,但差异并不明确.

  1. 应该使用什么参数来确保我的代码以最高性能运行?
  2. 有没有办法将表达式的结果分配回去engine
  3. 另外,为了使事情变得更复杂,我如何parser在字符串表达式中作为参数传递?

cs9*_*s95 61

这个答案潜入通过所提供的各种特性和功能pd.eval,df.query以及df.eval.

设置
示例将涉及这些DataFrame(除非另有说明).

np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df3 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df4 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
Run Code Online (Sandbox Code Playgroud)

pandas.eval - "失踪手册"

注意
正在讨论的三个功能中,pd.eval最重要的是.df.eval并在引擎盖下df.query打电话 pd.eval.行为和使用在三个功能中或多或少是一致的,稍后会突出显示一些轻微的语义变化.本节将介绍所有三个函数共有的功能 - 包括(但不限于)允许的语法,优先级规则关键字参数.

pd.eval可以评估可以包含变量和/或文字的算术表达式.这些表达式必须作为字符串传递.所以,要回答上述问题,你可以做到

x = 5
pd.eval("df1.A + (df1.B * x)")  
Run Code Online (Sandbox Code Playgroud)

有些事情需要注意:

  1. 整个表达式是一个字符串
  2. df1,df2并且x引用全局命名空间中的变量,这些变量在eval解析表达式时被拾取
  3. 使用属性访问器索引访问特定列.您也可以使用"df1['A'] + (df1['B'] * x)"相同的效果.

我将在解释target=...下面属性的部分中讨论重新分配的具体问题.但是现在,这里有更简单的有效操作示例pd.eval:

pd.eval("df1.A + df2.A")   # Valid, returns a pd.Series object
pd.eval("abs(df1) ** .5")  # Valid, returns a pd.DataFrame object
Run Code Online (Sandbox Code Playgroud)

...等等.条件表达式也以相同的方式受支持.以下语句都是有效的表达式,将由引擎进行评估.

pd.eval("df1 > df2")        
pd.eval("df1 > 5")    
pd.eval("df1 < df2 and df3 < df4")      
pd.eval("df1 in [1, 2, 3]")
pd.eval("1 < 2 < 3")
Run Code Online (Sandbox Code Playgroud)

可以在文档中找到详细说明所有支持的功能和语法的列表.综上所述,

  • 除left shift(<<)和right shift(>>)运算符之外的算术运算,例如df + 2 * pi / s ** 4 % 42- the_golden_ratio
  • 比较操作,包括链式比较,例如, 2 < df < df2
  • 布尔操作,例如,df < df2 and df3 < df4或者not df_bool listtuple文字,例如,[1, 2](1, 2)
  • 属性访问,例如, df.a
  • 下标表达式,例如, df[0]
  • 简单的变量评估,例如,pd.eval('df')(这不是很有用)
  • 数学函数:sin,cos,exp,log,expm1,log1p,sqrt,sinh,cosh,tanh,arcsin,arccos,arctan,arccosh,arcsinh,arctanh,abs和arctan2.

本文档的这一部分还指定了不受支持的语法规则,包括set/ dictliterals,if-else语句,循环和理解以及生成器表达式.

从列表中可以看出,您还可以传递涉及索引的表达式,例如

pd.eval('df1.A * (df1.index > 1)')
Run Code Online (Sandbox Code Playgroud)

解析器选择:parser=...论证

pd.eval在解析表达式字符串以生成语法树时支持两种不同的解析器选项:pandaspython.两者之间的主要区别在于略微不同的优先规则.

使用默认解析器pandas,重载的按位运算符&|使用pandas对象实现向量化的AND和OR运算将具有and与`或' 相同的运算符优先级.所以,

pd.eval("(df1 > df2) & (df3 < df4)")
Run Code Online (Sandbox Code Playgroud)

将是一样的

pd.eval("df1 > df2 & df3 < df4")
# pd.eval("df1 > df2 & df3 < df4", parser='pandas')
Run Code Online (Sandbox Code Playgroud)

而且也一样

pd.eval("df1 > df2 and df3 < df4")
Run Code Online (Sandbox Code Playgroud)

在这里,括号是必要的.为了做到这一点,传统上,parens将被要求覆盖按位运算符的更高优先级:

(df1 > df2) & (df3 < df4)
Run Code Online (Sandbox Code Playgroud)

没有它,我们最终会

df1 > df2 & df3 < df4

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Run Code Online (Sandbox Code Playgroud)

or如果要在评估字符串时保持与python的实际运算符优先级规则的一致性,请使用此选项.

pd.eval("(df1 > df2) & (df3 < df4)", parser='python')
Run Code Online (Sandbox Code Playgroud)

The other difference between the two types of parsers are the semantics of the parser='python' and == operators with list and tuple nodes, which have the similar semantics as != and in respectively, when using the not in parser. For example,

pd.eval("df1 == [1, 2, 3]")
Run Code Online (Sandbox Code Playgroud)

Is valid, and will run with the same semantics as

pd.eval("df1 in [1, 2, 3]")
Run Code Online (Sandbox Code Playgroud)

OTOH, 'pandas' will throw a pd.eval("df1 == [1, 2, 3]", parser='python') error.

Backend Selection: The NotImplementedError argument

There are two options - engine=... (the default) and numexpr. The python option uses the numexpr backend which is optimized for performance.

With numexpr backend, your expression is evaluated similar to just passing the expression to python's 'python' function. You have the flexibility of doing more inside expressions, such as string operations, for instance.

df = pd.DataFrame({'A': ['abc', 'def', 'abacus']})
pd.eval('df.A.str.contains("ab")', engine='python')

0     True
1    False
2     True
Name: A, dtype: bool
Run Code Online (Sandbox Code Playgroud)

Unfortunately, this method offers no performance benefits over the eval engine, and there are very few security measures to ensure that dangerous expressions are not evaluated, so USE AT YOUR OWN RISK! It is generally not recommended to change this option to numexpr unless you know what you're doing.

'python' and local_dict arguments

Sometimes, it is useful to supply values for variables used inside expressions, but not currently defined in your namespace. You can pass a dictionary to global_dict

For example,

pd.eval("df1 > thresh")

UndefinedVariableError: name 'thresh' is not defined
Run Code Online (Sandbox Code Playgroud)

This fails because local_dict is not defined. However, this works:

pd.eval("df1 > thresh", local_dict={'thresh': 10})
Run Code Online (Sandbox Code Playgroud)

This is useful when you have variables to supply from a dictionary. Alternatively, with the thresh engine, you could simply do this:

mydict = {'thresh': 5}
# Dictionary values with *string* keys cannot be accessed without 
# using the 'python' engine.
pd.eval('df1 > mydict["thresh"]', engine='python')
Run Code Online (Sandbox Code Playgroud)

But this is going to possibly be much slower than using the 'python' engine and passing a dictionary to 'numexpr' or local_dict. Hopefully, this should make a convincing argument for the use of these parameters.

The global_dict (+ target) argument, and Assignment Expressions

This is not often a requirement because there are usually simpler ways of doing this, but you can assign the result of inplace to an object that implements pd.eval such as __getitem__s, and (you guessed it) DataFrames.

Consider the example in the question

x = 5
df2['D'] = df1['A'] + (df1['B'] * x)
Run Code Online (Sandbox Code Playgroud)

To assign a column "D" to dict, we do

pd.eval('D = df1.A + (df1.B * x)', target=df2)

   A  B  C   D
0  5  9  8   5
1  4  3  0  52
2  5  0  2  22
3  8  1  3  48
4  3  7  0  42
Run Code Online (Sandbox Code Playgroud)

This is not an in-place modification of df2 (but it can be... read on). Consider another example:

pd.eval('df1.A + df2.A')

0    10
1    11
2     7
3    16
4    10
dtype: int32
Run Code Online (Sandbox Code Playgroud)

If you wanted to (for example) assign this back to a DataFrame, you could use the df2 argument as follows:

df = pd.DataFrame(columns=list('FBGH'), index=df1.index)
df
     F    B    G    H
0  NaN  NaN  NaN  NaN
1  NaN  NaN  NaN  NaN
2  NaN  NaN  NaN  NaN
3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN

df = pd.eval('B = df1.A + df2.A', target=df)
# Similar to 
# df = df.assign(B=pd.eval('df1.A + df2.A'))

df
     F   B    G    H
0  NaN  10  NaN  NaN
1  NaN  11  NaN  NaN
2  NaN   7  NaN  NaN
3  NaN  16  NaN  NaN
4  NaN  10  NaN  NaN
Run Code Online (Sandbox Code Playgroud)

If you wanted to perform an in-place mutation on target, set df.

pd.eval('B = df1.A + df2.A', target=df, inplace=True)
# Similar to 
# df['B'] = pd.eval('df1.A + df2.A')

df
     F   B    G    H
0  NaN  10  NaN  NaN
1  NaN  11  NaN  NaN
2  NaN   7  NaN  NaN
3  NaN  16  NaN  NaN
4  NaN  10  NaN  NaN
Run Code Online (Sandbox Code Playgroud)

If inplace=True is set without a target, a inplace is raised.

While the ValueError argument is fun to play around with, you will seldom need to use it.

If you wanted to do this with target, you would use an expression involving an assignment:

df = df.eval("B = @df1.A + @df2.A")
# df.eval("B = @df1.A + @df2.A", inplace=True)
df

     F   B    G    H
0  NaN  10  NaN  NaN
1  NaN  11  NaN  NaN
2  NaN   7  NaN  NaN
3  NaN  16  NaN  NaN
4  NaN  10  NaN  NaN
Run Code Online (Sandbox Code Playgroud)

Note
One of df.eval's unintended uses is parsing literal strings in a manner very similar to pd.eval:

pd.eval("[1, 2, 3]")
array([1, 2, 3], dtype=object)
Run Code Online (Sandbox Code Playgroud)

It can also parse nested lists with the ast.literal_eval engine:

pd.eval("[[1, 2, 3], [4, 5], [10]]", engine='python')
[[1, 2, 3], [4, 5], [10]]
Run Code Online (Sandbox Code Playgroud)

And lists of strings:

pd.eval(["[1, 2, 3]", "[4, 5]", "[10]"], engine='python')
[[1, 2, 3], [4, 5], [10]]
Run Code Online (Sandbox Code Playgroud)

The problem, however, is for lists with length larger than 10:

pd.eval(["[1]"] * 100, engine='python') # Works
pd.eval(["[1]"] * 101, engine='python') 

AttributeError: 'PandasExprVisitor' object has no attribute 'visit_Ellipsis'
Run Code Online (Sandbox Code Playgroud)

More information can this error, causes, fixes, and workarounds can be found here.


'python' - A Juxtaposition with DataFrame.eval

As mentioned above, pandas.eval calls df.eval under the hood. The v0.23 source code shows this:

def eval(self, expr, inplace=False, **kwargs):

    from pandas.core.computation.eval import eval as _eval

    inplace = validate_bool_kwarg(inplace, 'inplace')
    resolvers = kwargs.pop('resolvers', None)
    kwargs['level'] = kwargs.pop('level', 0) + 1
    if resolvers is None:
        index_resolvers = self._get_index_resolvers()
        resolvers = dict(self.iteritems()), index_resolvers
    if 'target' not in kwargs:
        kwargs['target'] = self
    kwargs['resolvers'] = kwargs.get('resolvers', ()) + tuple(resolvers)
    return _eval(expr, inplace=inplace, **kwargs)
Run Code Online (Sandbox Code Playgroud)

pd.eval creates arguments, does a little validation, and passes the arguments on to eval.

For more, you can read on: when to use DataFrame.eval() versus pandas.eval() or python eval()

Usage Differences

Expressions with DataFrames v/s Series Expressions

For dynamic queries associated with entire DataFrames, you should prefer pd.eval. For example, there is no simple way to specify the equivalent of pd.eval when you call pd.eval("df1 + df2") or df1.eval.

Specifying Column Names

Another other major difference is how columns are accessed. For example, to add two columns "A" and "B" in df2.eval, you would call df1 with the following expression:

pd.eval("df1.A + df1.B")
Run Code Online (Sandbox Code Playgroud)

With df.eval, you need only supply the column names:

df1.eval("A + B")
Run Code Online (Sandbox Code Playgroud)

Since, within the context of pd.eval, it is clear that "A" and "B" refer to column names.

You can also refer to the index and columns using df1 (unless the index is named, in which case you would use the name).

df1.eval("A + index")
Run Code Online (Sandbox Code Playgroud)

Or, more generally, for any DataFrame with an index having 1 or more levels, you can refer to the kth level of the index in an expression using the variable "ilevel_k" which stands for "index at level k". IOW, the expression above can be written as index.

These rules also apply to df1.eval("A + ilevel_0").

Accessing Variables in Local/Global Namespace

Variables supplied inside expressions must be preceeded by the "@" symbol, to avoid confusion with column names.

A = 5
df1.eval("A > @A") 
Run Code Online (Sandbox Code Playgroud)

The same goes for query/

It goes without saying that your column names must follow the rules for valid identifier naming in python to be accessible inside query. See here for a list of rules on naming identifiers.

Multiline Queries and Assignment

A little known fact is that eval support multiline expressions that deal with assignment. For example, to create two new columns "E" and "F" in df1 based on some arithmetic operations on some columns, and a third column "G" based on the previously created "E" and "F", we can do

df1.eval("""
E = A + B
F = @df2.A + @df2.B
G = E >= F
""")

   A  B  C  D   E   F      G
0  5  0  3  3   5  14  False
1  7  9  3  5  16   7   True
2  2  4  7  6   6   5   True
3  8  8  1  6  16   9   True
4  7  7  8  1  14  10   True
Run Code Online (Sandbox Code Playgroud)

...Nifty! However, note that this is not supported by eval.


query v/s eval - Final Word

It helps to think of query as a function that uses df.query as a subroutine.

Typically, pd.eval (as the name suggests) is used to evaluate conditional expressions (i.e., expressions that result in True/False values) and return the rows corresponding to the query result. The result of the expression is then passed to True (in most cases) to return the rows that satisfy the expression. According to the documentation,

The result of the evaluation of this expression is first passed to loc and if that fails because of a multidimensional key (e.g., a DataFrame) then the result will be passed to DataFrame.loc.

This method uses the top-level DataFrame.__getitem__() function to evaluate the passed query.

In terms of similarity, pandas.eval() and query are both alike in how they access column names and variables.

This key difference between the two, as mentioned above is how they handle the expression result. This becomes obvious when you actually run an expression through these two functions. For example, consider

df1.A

0    5
1    7
2    2
3    8
4    7
Name: A, dtype: int32

df1.B

0    9
1    3
2    0
3    1
4    7
Name: B, dtype: int32
Run Code Online (Sandbox Code Playgroud)

To get all rows where "A" >= "B" in df.eval, we would use df1 like this:

m = df1.eval("A >= B")
m
0     True
1    False
2    False
3     True
4     True
dtype: bool
Run Code Online (Sandbox Code Playgroud)

eval represents the intermediate result generated by evaluating the expression "A >= B". We then use the mask to filter m:

df1[m]
# df1.loc[m]

   A  B  C  D
0  5  0  3  3
3  8  8  1  6
4  7  7  8  1
Run Code Online (Sandbox Code Playgroud)

However, with df1, the intermediate result "m" is directly passed to query, so with loc, you would simply need to do

df1.query("A >= B")

   A  B  C  D
0  5  0  3  3
3  8  8  1  6
4  7  7  8  1
Run Code Online (Sandbox Code Playgroud)

Performance wise, it is exactly the same.

df1_big = pd.concat([df1] * 100000, ignore_index=True)

%timeit df1_big[df1_big.eval("A >= B")]
%timeit df1_big.query("A >= B")

14.7 ms ± 33.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
14.7 ms ± 24.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Run Code Online (Sandbox Code Playgroud)

But the latter is more concise, and expresses the same operation in a single step.

Note that you can also do weird stuff with query like this (to, say, return all rows indexed by df1.index)

df1.query("index")
# Same as df1.loc[df1.index] # Pointless,... I know

   A  B  C  D
0  5  0  3  3
1  7  9  3  5
2  2  4  7  6
3  8  8  1  6
4  7  7  8  1
Run Code Online (Sandbox Code Playgroud)

But don't.

Bottom line: Please use query when querying or filtering rows based on a conditional expression.


Pou*_*del 6

已经很棒的教程,但请记住,在eval/query被其更简单的语法吸引到疯狂使用之前,如果您的数据集少于 15,000 行,它会出现严重的性能问题。

在这种情况下,只需使用df.loc[mask1, mask2].

参考:https : //pandas.pydata.org/pandas-docs/version/0.22/enhancingperf.html#enhancingperf-eval

在此处输入图片说明