Can I use pandas-like string expressions to filter a DataFrame?

Question

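I am looking at replacing my use of pandas with polars in a tool that lets users enter predicate expressions to filter/subset rows of data. The tool currently accepts any expression that the pandas.DataFrame.query method can parse, such as "x > 1", to take a very simple example.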

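However, I cannot seem to find a way to use the same kind of string expressions with polars.DataFrame.filter, so that I could swap pandas for polars without requiring users to change their predicate expressions.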

The only thing I have found that comes close to my question is this post: String as a condition in a filter

Unfortunately, that is not quite what I need, because it still requires a string expression such as "pl.col('x') > 1" rather than simply "x > 1".

Is there a way to use the simpler ("agnostic") syntax with polars?

Using the example from the polars.DataFrame.filter documentation:

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )

When calling df.filter, I am forced to use expressions like the following:

pl.col("foo") < 3
(pl.col("foo") < 3) & (pl.col("ham") == "a")

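However, I would like to be able to use the following string expressions, respectively, so that users of the tool (who currently use pandas) do not have to know polars-specific syntax, which would let me swap libraries without affecting them: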

"foo < 3"
"foo < 3 & ham == 'a'"

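When I try this, the following happens. This is puzzling, because str is one of the types the predicate argument supports, yet the documentation shows no example of a str predicate, so it is unclear what syntax a str predicate accepts: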

>>> df.filter("foo < 3")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Caskroom/miniconda/base/envs/gedi_subset/lib/python3.10/site-packages/polars/internals/dataframe/frame.py", line 2565, in filter
    self.lazy()
  File "/usr/local/Caskroom/miniconda/base/envs/gedi_subset/lib/python3.10/site-packages/polars/utils.py", line 391, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/Caskroom/miniconda/base/envs/gedi_subset/lib/python3.10/site-packages/polars/internals/lazyframe/frame.py", line 1165, in collect
    return pli.wrap_df(ldf.collect())
exceptions.NotFoundError: foo < 3

What I expected was the same return value as df.filter(pl.col("foo") < 3).

Answer 1

rit*_*e46 5

You can try to use the SQLContext for this.

import polars as pl

ctxt = pl.SQLContext()

df = pl.DataFrame(
    {
        "foo": [1, 2, 3],
        "bar": [6, 7, 8],
        "ham": ["a", "b", "c"],
    }
)

ctxt.register("df", df.lazy())

string_expr = "foo < 3 and ham = 'a'"

(ctxt.query(f"""
SELECT * FROM df
WHERE {string_expr}
"""))

shape: (1, 1)
┌─────┐
│ x   │
│ --- │
│ i64 │
╞═════╡
│ 3   │
└─────┘

Note that the SQL language does not use the bitwise & or the equality == the way pandas does, so you may need to replace & with and, and == with =.
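
If users should keep typing pandas-style predicates, that replacement could be done for them before the string reaches SQL. The following is a minimal sketch only: `pandas_to_sql_predicate` is a hypothetical helper, not part of pandas or Polars, and it handles just `==`, `&` and `|` outside quoted strings, so anything more elaborate that `pandas.DataFrame.query` accepts is not covered.

```python
import re

# Matches single- or double-quoted string literals so they are left untouched.
_STRING_LITERAL = re.compile(r"""('(?:[^'\\]|\\.)*'|"(?:[^"\\]|\\.)*")""")


def pandas_to_sql_predicate(expr: str) -> str:
    """Naively rewrite pandas-style operators as SQL ones (hypothetical helper)."""
    # Splitting on a pattern with one capture group alternates between
    # text outside literals (even indices) and the literals (odd indices).
    parts = _STRING_LITERAL.split(expr)
    for i in range(0, len(parts), 2):
        chunk = parts[i]
        chunk = chunk.replace("==", "=")            # equality
        chunk = re.sub(r"\s*&\s*", " AND ", chunk)  # bitwise and -> AND
        chunk = re.sub(r"\s*\|\s*", " OR ", chunk)  # bitwise or -> OR
        parts[i] = chunk
    return "".join(parts)


print(pandas_to_sql_predicate("foo < 3"))
print(pandas_to_sql_predicate("foo < 3 & ham == 'a'"))
```

The second call yields `foo < 3 AND ham = 'a'`, which can then be placed in the WHERE clause. A regex rewrite like this is fragile by design; a real tool would more likely parse the expression properly and validate it before running it.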
