PyArrow Table: filtering rows

Question

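I have a RecordBatch from a Plasma DataStore, which I can read into either a pyarrow.RecordBatch or a pyarrow.Table. I am now trying to filter out rows before converting it to pandas (to_pandas).

Is there a way to use the filter method from the new Dataset API (the one that can be used on a ParquetDataset) on a pyarrow.Table? That would let me use a filter like this: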

[[('date', '=', '2020-01-01')]]

Looking at the source code, both pyarrow.Table and pyarrow.RecordBatch appear to have a filter function, but at least RecordBatch requires a boolean mask.

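Is this possible? The reason is that the dataset contains a lot of strings (and/or categoricals), which are not zero-copy, so running to_pandas introduces significant latency, while I only ever look up about 20% of the dataset at a time.

Regards,
Niklas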

Answer 1

Art*_*hur 9

This is now possible:

import pyarrow as pa
import pyarrow.compute as pc

my_table = pa.Table.from_arrays(
    [pa.array(['foo', 'bar', 'foo'], pa.string())],
    names=['col1']
)

# Using the high-level API with expressions
# (needs a PyArrow version whose Table.filter accepts expressions):
filtered_table = my_table.filter(pc.field("col1") == "foo")

# Using the lower-level API: build a boolean mask, then filter with it:
filtered_table = my_table.filter(pc.equal(my_table['col1'], 'foo'))
