cs95 · 45 · tags: python, eval, dataframe, pandas
pd.eval, df.query, and df.eval are powerful but under-appreciated features of the pandas API suite, and their usage is far from fully documented or understood. Used sparingly and with care, eval and query can greatly simplify code, improve performance, and become a powerful tool for creating dynamic workflows.

The purpose of this canonical Q&A is to give users a better understanding of these functions, to discuss some of their lesser-known features, how to use them and how best to use them, with clear and accessible examples. The two main topics this post will cover are:

- the engine and parser arguments to pd.eval, and how they are used to evaluate expressions
- eval and query, and when each function is appropriate for dynamic execution

This post is not a replacement for the documentation (links are in the answer), so please go through that as well!

I will frame the question in a way that opens up discussion of the various features supported.

Given two DataFrames
np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df1
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
3 8 8 1 6
4 7 7 8 1
df2
A B C D
0 5 9 8 9
1 4 3 0 3
2 5 0 2 3
3 8 1 3 3
4 3 7 0 1
I would like to perform arithmetic on one or more columns using pd.eval. Specifically, I would like to port the following code:
x = 5
df2['D'] = df1['A'] + (df1['B'] * x)
...to code that uses pd.eval. The reason for using pd.eval is that I would like to automate many workflows, so being able to create them dynamically will be useful to me.
I want to better understand the engine and parser arguments in order to decide how best to solve my problem. I have read the documentation, but the differences are not clear to me.

- What should be passed to engine, and what should be passed to parser?
- How are variables and values referenced as arguments inside the string expression?

cs95 · 61
This answer will dive into the various features and functionality offered by pd.eval, df.query, and df.eval.

Setup

Examples will involve these DataFrames (unless otherwise specified).
np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df3 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df4 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
pandas.eval - the "Missing Manual"

Note: of the three functions under discussion, pd.eval is the most important. df.eval and df.query call pd.eval under the hood. Behaviour and usage are more or less consistent across the three functions, with some minor semantic variations which will be highlighted later. This section introduces functionality that is common to all three functions - this includes (but is not limited to) the allowed syntax, the precedence rules, and the keyword arguments.

pd.eval can evaluate arithmetic expressions which can consist of variables and/or literals. These expressions must be passed as strings. So, to answer the question as stated, you can do
x = 5
pd.eval("df1.A + (df1.B * x)")
Some things to note here:

- df1, df2, and x refer to variables in the global namespace; these are picked up by eval when the expression is parsed.
- Specific columns are accessed using the attribute accessor syntax. "df1['A'] + (df1['B'] * x)" works to the same effect.

I will address the specific issue of reassignment in the section explaining the target=... argument below. But for now, here are simpler examples of valid operations with pd.eval:
pd.eval("df1.A + df2.A") # Valid, returns a pd.Series object
pd.eval("abs(df1) ** .5") # Valid, returns a pd.DataFrame object
...and so on. Conditional expressions are supported in exactly the same way. The statements below are all valid expressions and will be evaluated by the engine.
pd.eval("df1 > df2")
pd.eval("df1 > 5")
pd.eval("df1 < df2 and df3 < df4")
pd.eval("df1 in [1, 2, 3]")
pd.eval("1 < 2 < 3")
A list detailing all the supported features and syntax can be found in the documentation. In summary,

- Arithmetic operations, except for the left shift (<<) and right shift (>>) operators, e.g., df + 2 * pi / s ** 4 % 42 - the_golden_ratio
- Comparison operations, including chained comparisons, e.g., 2 < df < df2
- Boolean operations, e.g., df < df2 and df3 < df4 or not df_bool
- list and tuple literals, e.g., [1, 2] or (1, 2)
- Attribute access, e.g., df.a
- Subscript expressions, e.g., df[0]
- Simple variable evaluation, e.g., pd.eval('df') (this is not very useful)
- Math functions: sin, cos, exp, log, expm1, log1p, sqrt, sinh, cosh, tanh, arcsin, arccos, arctan, arccosh, arcsinh, arctanh, abs and arctan2.

This section of the documentation also specifies syntax rules that are not supported, including set/dict literals, if-else statements, loops and comprehensions, and generator expressions.
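As a quick illustration of the list above (a minimal sketch; the exact exception raised for unsupported syntax may vary across pandas versions), math functions can be called directly inside the expression string, while a construct such as a dict literal is rejected by the parser:

pd.eval("sqrt(df1) + abs(df2)")   # math functions from the list are allowed

try:
    pd.eval("{'A': 1}")           # dict literals are unsupported syntax
except Exception as e:
    print(type(e).__name__)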
It should be evident from the list above that you can also pass expressions involving the index, such as
pd.eval('df1.A * (df1.index > 1)')
The parser=... argument

pd.eval supports two different parser options when parsing the expression string to generate the syntax tree: pandas and python. The main difference between the two is highlighted by slightly differing precedence rules.

Using the default parser, pandas, the overloaded bitwise operators & and | (which implement vectorized AND and OR operations with pandas objects) will have the same operator precedence as and and or. So,
pd.eval("(df1 > df2) & (df3 < df4)")
will be the same as
pd.eval("df1 > df2 & df3 < df4")
# pd.eval("df1 > df2 & df3 < df4", parser='pandas')
and also the same as
pd.eval("df1 > df2 and df3 < df4")
Here, the parentheses are necessary. To do the same thing conventionally, the parentheses would be required to override the higher precedence of the bitwise operators:
(df1 > df2) & (df3 < df4)
Without them, we end up with
df1 > df2 & df3 < df4
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
parser='python'

Use this option if you want to maintain consistency with python's actual operator precedence rules when evaluating the string.
pd.eval("(df1 > df2) & (df3 < df4)", parser='python')
The other difference between the two types of parsers is the semantics of the == and != operators with list and tuple nodes, which have semantics similar to in and not in respectively when using the 'pandas' parser. For example,
pd.eval("df1 == [1, 2, 3]")
Is valid, and will run with the same semantics as
pd.eval("df1 in [1, 2, 3]")
OTOH, pd.eval("df1 == [1, 2, 3]", parser='python') will throw a NotImplementedError.

The engine=... argument

There are two options - 'numexpr' (the default) and 'python'. The 'numexpr' option uses the numexpr backend, which is optimized for performance.
With the 'python' backend, your expression is evaluated in much the same way as if it were passed to python's eval function. You have the flexibility of doing more inside expressions, such as string operations, for instance.
df = pd.DataFrame({'A': ['abc', 'def', 'abacus']})
pd.eval('df.A.str.contains("ab")', engine='python')
0 True
1 False
2 True
Name: A, dtype: bool
Unfortunately, this method offers no performance benefits over the numexpr engine, and there are very few security measures to ensure that dangerous expressions are not evaluated, so USE AT YOUR OWN RISK! It is generally not recommended to change this option to 'python' unless you know what you're doing.

The local_dict and global_dict arguments

Sometimes, it is useful to supply values for variables used inside expressions that are not currently defined in your namespace. You can pass a dictionary to local_dict. For example,
For example,
pd.eval("df1 > thresh")
UndefinedVariableError: name 'thresh' is not defined
This fails because thresh is not defined. However, this works:
pd.eval("df1 > thresh", local_dict={'thresh': 10})
This is useful when you have variables to supply from a dictionary. Alternatively, with the 'python' engine, you could simply do this:
mydict = {'thresh': 5}
# Dictionary values with *string* keys cannot be accessed without
# using the 'python' engine.
pd.eval('df1 > mydict["thresh"]', engine='python')
But this is possibly going to be much slower than using the 'numexpr' engine and passing a dictionary to local_dict or global_dict. Hopefully, this makes a convincing argument for the use of these parameters.
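For completeness, global_dict works the same way (a small sketch; as I understand it, names in local_dict take precedence if the same key appears in both):

pd.eval("df1 > lo", local_dict={'lo': 3})
pd.eval("df1 > hi", global_dict={'hi': 8})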
The target (+ inplace) argument, and Assignment Expressions

This is not often a requirement because there are usually simpler ways of doing this, but you can assign the result of pd.eval to an object that implements __getitem__, such as dicts, and (you guessed it) DataFrames.

Consider the example in the question:

x = 5
df2['D'] = df1['A'] + (df1['B'] * x)

To assign a column "D" to df2, we do
pd.eval('D = df1.A + (df1.B * x)', target=df2)
A B C D
0 5 9 8 5
1 4 3 0 52
2 5 0 2 22
3 8 1 3 48
4 3 7 0 42
This is not an in-place modification of df2
(but it can be... read on). Consider another example:
pd.eval('df1.A + df2.A')
0 10
1 11
2 7
3 16
4 10
dtype: int32
If you wanted to (for example) assign this back to a DataFrame, you could use the target argument as follows:
df = pd.DataFrame(columns=list('FBGH'), index=df1.index)
df
F B G H
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
df = pd.eval('B = df1.A + df2.A', target=df)
# Similar to
# df = df.assign(B=pd.eval('df1.A + df2.A'))
df
F B G H
0 NaN 10 NaN NaN
1 NaN 11 NaN NaN
2 NaN 7 NaN NaN
3 NaN 16 NaN NaN
4 NaN 10 NaN NaN
If you wanted to perform an in-place mutation on df, set inplace=True.
pd.eval('B = df1.A + df2.A', target=df, inplace=True)
# Similar to
# df['B'] = pd.eval('df1.A + df2.A')
df
F B G H
0 NaN 10 NaN NaN
1 NaN 11 NaN NaN
2 NaN 7 NaN NaN
3 NaN 16 NaN NaN
4 NaN 10 NaN NaN
If inplace=True is set without a target, a ValueError is raised.
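A minimal sketch of that failure mode (the exact error message may differ between versions):

try:
    pd.eval('B = df1.A + df2.A', inplace=True)   # assignment, but no target
except ValueError as e:
    print(e)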
While the target argument is fun to play around with, you will seldom need to use it.

If you wanted to do this with df.eval, you would use an expression involving an assignment:
df = df.eval("B = @df1.A + @df2.A")
# df.eval("B = @df1.A + @df2.A", inplace=True)
df
F B G H
0 NaN 10 NaN NaN
1 NaN 11 NaN NaN
2 NaN 7 NaN NaN
3 NaN 16 NaN NaN
4 NaN 10 NaN NaN
Note

One of pd.eval's unintended uses is parsing literal strings in a manner very similar to ast.literal_eval:
pd.eval("[1, 2, 3]")
array([1, 2, 3], dtype=object)
It can also parse nested lists with the 'python' engine:
pd.eval("[[1, 2, 3], [4, 5], [10]]", engine='python')
[[1, 2, 3], [4, 5], [10]]
And lists of strings:
pd.eval(["[1, 2, 3]", "[4, 5]", "[10]"], engine='python')
[[1, 2, 3], [4, 5], [10]]
The problem, however, is with lists of length larger than 100:
pd.eval(["[1]"] * 100, engine='python') # Works
pd.eval(["[1]"] * 101, engine='python')
AttributeError: 'PandasExprVisitor' object has no attribute 'visit_Ellipsis'
More information on this error, its causes, fixes, and workarounds can be found here.
DataFrame.eval - A Juxtaposition with pandas.eval

As mentioned above, df.eval calls pd.eval under the hood. The v0.23 source code shows this:
def eval(self, expr, inplace=False, **kwargs):
from pandas.core.computation.eval import eval as _eval
inplace = validate_bool_kwarg(inplace, 'inplace')
resolvers = kwargs.pop('resolvers', None)
kwargs['level'] = kwargs.pop('level', 0) + 1
if resolvers is None:
index_resolvers = self._get_index_resolvers()
resolvers = dict(self.iteritems()), index_resolvers
if 'target' not in kwargs:
kwargs['target'] = self
kwargs['resolvers'] = kwargs.get('resolvers', ()) + tuple(resolvers)
return _eval(expr, inplace=inplace, **kwargs)
df.eval creates arguments, does a little validation, and passes the arguments on to pd.eval.
For more, you can read on: when to use DataFrame.eval() versus pandas.eval() or python eval()
For dynamic queries associated with entire DataFrames, you should prefer pd.eval. For example, there is no simple way to specify the equivalent of pd.eval("df1 + df2") when you call df1.eval or df2.eval.

Another major difference is how columns are accessed. For example, to add the two columns "A" and "B" in df1, you would call pd.eval with the following expression:
pd.eval("df1.A + df1.B")
With df.eval, you need only supply the column names:
df1.eval("A + B")
Since, within the context of df1, it is clear that "A" and "B" refer to column names.

You can also refer to the index and columns using index (unless the index is named, in which case you would use the name).
df1.eval("A + index")
Or, more generally, for any DataFrame with an index having 1 or more levels, you can refer to the kth level of the index in an expression using the variable "ilevel_k", which stands for "index at level k". IOW, the expression above can be written as df1.eval("A + ilevel_0").

These rules also apply to query.
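For instance, a small sketch of query referring to the (unnamed) index directly:

df1.query("index > 1")    # rows whose index label is greater than 1
# equivalent to df1[df1.index > 1]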
Variables supplied inside expressions must be preceded by the "@" symbol, to avoid confusion with column names.
A = 5
df1.eval("A > @A")
The same goes for query:
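A minimal sketch mirroring the eval call above:

df1.query("A > @A")    # "A" is the column, "@A" is the python variable A defined above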
It goes without saying that your column names must follow the rules for valid identifier naming in python to be accessible inside eval. See here for a list of rules on naming identifiers.
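As an aside, newer pandas versions (0.25+) let you work around this restriction by quoting awkward column names with backticks in query/eval; a small sketch with an assumed DataFrame df5:

df5 = pd.DataFrame({'col a': [1, 2, 3], 'B': [3, 2, 1]})
df5.query("`col a` > B")    # backticks allow a column name containing a space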
A little known fact is that eval supports multi-line expressions that deal with assignment. For example, to create two new columns "E" and "F" in df1 based on some arithmetic operations on some columns, and a third column "G" based on the previously created "E" and "F", we can do
df1.eval("""
E = A + B
F = @df2.A + @df2.B
G = E >= F
""")
A B C D E F G
0 5 0 3 3 5 14 False
1 7 9 3 5 16 7 True
2 2 4 7 6 6 5 True
3 8 8 1 6 16 9 True
4 7 7 8 1 14 10 True
...Nifty! However, note that this is not supported by query.
query v/s eval - Final Word

It helps to think of df.query as a function that uses pd.eval as a subroutine.
Typically, query (as the name suggests) is used to evaluate conditional expressions (i.e., expressions that result in True/False values) and return the rows corresponding to the True result. The result of the expression is then passed to loc (in most cases) to return the rows that satisfy the expression. According to the documentation,
The result of the evaluation of this expression is first passed to DataFrame.loc and if that fails because of a multidimensional key (e.g., a DataFrame) then the result will be passed to DataFrame.__getitem__().

This method uses the top-level pandas.eval() function to evaluate the passed query.
In terms of similarity, query and eval are alike in how they access column names and variables.

The key difference between the two, as mentioned above, is how they handle the expression result. This becomes obvious when you actually run an expression through these two functions. For example, consider
df1.A
0 5
1 7
2 2
3 8
4 7
Name: A, dtype: int32
df1.B
0 9
1 3
2 0
3 1
4 7
Name: B, dtype: int32
To get all rows where "A" >= "B" in df1, we would use eval like this:
m = df1.eval("A >= B")
m
0 True
1 False
2 False
3 True
4 True
dtype: bool
m represents the intermediate result generated by evaluating the expression "A >= B". We then use the mask to filter df1:
df1[m]
# df1.loc[m]
A B C D
0 5 0 3 3
3 8 8 1 6
4 7 7 8 1
However, with query, the intermediate result "m" is directly passed to loc, so with query, you would simply need to do
df1.query("A >= B")
A B C D
0 5 0 3 3
3 8 8 1 6
4 7 7 8 1
Performance-wise, it is exactly the same.
df1_big = pd.concat([df1] * 100000, ignore_index=True)
%timeit df1_big[df1_big.eval("A >= B")]
%timeit df1_big.query("A >= B")
14.7 ms ± 33.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
14.7 ms ± 24.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But the latter is more concise, and expresses the same operation in a single step.
Note that you can also do weird stuff with query
like this (to, say, return all rows indexed by df1.index)
df1.query("index")
# Same as df1.loc[df1.index] # Pointless,... I know
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
3 8 8 1 6
4 7 7 8 1
But don't.
Bottom line: Please use query
when querying or filtering rows based on a conditional expression.
Already a great tutorial, but keep in mind that before rushing into eval/query, drawn in by their simpler syntax, they have serious performance issues if your dataset has fewer than ~15,000 rows.

In that case, simply use df.loc[mask1, mask2].

Reference: https://pandas.pydata.org/pandas-docs/version/0.22/enhancingperf.html#enhancingperf-eval
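In that situation, a plain boolean mask does the same job without the expression-parsing overhead (a minimal sketch, equivalent to the query example above):

mask = df1.A >= df1.B    # ordinary vectorized comparison
df1.loc[mask]            # same rows as df1.query("A >= B")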