Polars slower than numpy?

Question

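I am considering using polars in place of numpy for a parsing problem in which I turn a structured text file into a character table and operate on different columns. However, polars seems to be about 5 times slower than numpy for most of the operations I perform. I would like to know why that is the case, and whether I am doing something wrong, given that polars is supposed to be faster.

Example: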

import requests
import numpy as np
import polars as pl

# Download the text file
text = requests.get("https://files.rcsb.org/download/3w32.pdb").text

# Turn it into a 2D array of characters
char_tab_np = np.array(text.splitlines()).view(dtype=(str,1)).reshape(-1, 80)

# Create a polars DataFrame from the numpy array
char_tab_pl = pl.DataFrame(char_tab_np)

# Sort by first column with numpy
char_tab_np[np.argsort(char_tab_np[:,0])]

# Sort by first column with polars
char_tab_pl.sort(by="column_0")


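Using %%timeit in Jupyter, the numpy sort takes about 320 microseconds, whereas the polars sort takes about 1.3 milliseconds, i.e. about five times slower.

I also tried char_tab_pl.lazy().sort(by="column_0").collect(), but it had no effect on the duration.

Another example (take all rows where the first column equals "A"):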

%%timeit
# with numpy
char_tab_np[char_tab_np[:, 0] == "A"]


%%timeit
# with polars
char_tab_pl.filter(pl.col("column_0") == "A")


Again, numpy takes 226 microseconds, whereas polars takes 673 microseconds, about three times slower.
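For anyone who wants to repeat these measurements outside Jupyter, here is a minimal sketch using the standard-library timeit module. It assumes synthetic 80-character records instead of the downloaded PDB file, and the helper names sort_np and filter_np are made up for illustration. The polars calls from the question could be wrapped and timed in the same way.

```python
import timeit
import numpy as np

# Synthetic fixed-width records standing in for the downloaded PDB file
# (assumption: every line is exactly 80 characters, as in the question).
rng = np.random.default_rng(0)
first_chars = rng.choice(list("AHRS"), size=5000)
lines = [(str(c) + "TOM").ljust(80) for c in first_chars]

# Same construction as in the question: one character per cell.
char_tab_np = np.array(lines).view(dtype=(str, 1)).reshape(-1, 80)

def sort_np():
    # Sort rows by the first column.
    return char_tab_np[np.argsort(char_tab_np[:, 0])]

def filter_np():
    # Keep rows whose first column equals "A".
    return char_tab_np[char_tab_np[:, 0] == "A"]

# Best of 5 repeats with 100 calls each, reported per call in microseconds.
for name, fn in [("sort", sort_np), ("filter", filter_np)]:
    best = min(timeit.repeat(fn, number=100, repeat=5)) / 100
    print(f"{name}: {best * 1e6:.1f} us per call")
```

Taking the minimum over several repeats reduces noise from other processes, which matters for operations in the microsecond range.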

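Update

Based on the comments, I tried two more things:

1. Making the file 1000 times larger, to see whether polars performs better on larger data.

Result: numpy was still about 2 times faster (1.3 ms vs. 2.1 ms). In addition, creating the character array took numpy about 2 seconds, whereas creating the dataframe took polars about 2 minutes, i.e. 60 times slower.

To reproduce, just add text *= 1000 before creating the numpy array in the code above.

2. Casting to integers.

For the original (smaller) file, casting to int sped up the process for both numpy and polars. Filtering was still about 5 times faster with numpy (30 microseconds vs. 120), whereas the sorting times became more similar (150 microseconds for numpy vs. 200 for polars).

For the large file, however, polars was marginally faster than numpy, but the huge instantiation time makes it worthwhile only if the dataframe is going to be queried thousands of times.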
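The post does not show how the cast to integers was done. One possible approach, sketched here on synthetic records and not necessarily the one used above, is to reinterpret each one-character cell as its 32-bit code point with ndarray.view, which avoids copying the data:

```python
import numpy as np

# Synthetic stand-in for the PDB text: fixed-width, 80-character records.
lines = [
    "HEADER    EXAMPLE".ljust(80),
    "ATOM      1  N   MET A   1".ljust(80),
    "ATOM      2  CA  MET A   1".ljust(80),
    "HETATM    3  O   HOH A 101".ljust(80),
]

# Same construction as in the question: one character per cell.
char_tab_np = np.array(lines).view(dtype=(str, 1)).reshape(-1, 80)

# numpy stores each unicode character in 4 bytes, so the table can be
# reinterpreted as unsigned 32-bit code points without a copy.
int_tab_np = char_tab_np.view(np.uint32)

# Filter rows whose first column is "A", comparing integers instead of strings.
rows_a = int_tab_np[int_tab_np[:, 0] == ord("A")]

# Sort rows by the first column (stable, so ties keep their original order).
sorted_tab = int_tab_np[np.argsort(int_tab_np[:, 0], kind="stable")]
```

The integer table could then be handed to pl.DataFrame in the same way as the character table in the question.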

Answer 1

rit*_*e46 4

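Polars does extra work when filtering string data, which is not worth it in this case. Polars uses Arrow large-utf8 buffers for its string data. This makes filtering more expensive than filtering Python strings/chars (e.g. pointers or u8 bytes).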
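To illustrate the difference in layout, here is a simplified emulation in plain numpy. It is not polars' actual implementation; it only mimics the general shape of an Arrow large-utf8 column, where all string bytes sit in one buffer and a second buffer of 64-bit offsets marks where each string starts and ends.

```python
import numpy as np

# Four single-character strings, as they would appear in one column.
values = ["H", "A", "A", "H"]

# numpy: one fixed-width cell per string; filtering is a mask plus a gather.
col_np = np.array(values, dtype="U1")
mask = col_np == "A"
filtered_np = col_np[mask]

# Arrow-style layout (simplified emulation, not polars internals):
# all bytes concatenated in one buffer, plus 64-bit offsets marking boundaries.
data = np.frombuffer("".join(values).encode("utf-8"), dtype=np.uint8)
offsets = np.cumsum([0] + [len(v.encode("utf-8")) for v in values]).astype(np.int64)

# Filtering has to copy each kept slice and rebuild the offsets buffer.
kept = [data[offsets[i]:offsets[i + 1]] for i in np.flatnonzero(mask)]
new_data = np.concatenate(kept)
new_offsets = np.cumsum([0] + [len(k) for k in kept]).astype(np.int64)
```

For single-character strings the offsets take more space than the characters themselves, which gives a rough intuition for why the fixed-width numpy table does less work per row here.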

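Sometimes it is worth it, sometimes it is not. If you have homogeneous data, numpy is a better fit than polars. If you have heterogeneous data, polars will likely be faster, especially if you consider your whole query instead of these micro-benchmarks.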
