Error converting a pandas DataFrame to a Polars DataFrame (pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object)

Question

Rah*_*hil 7 dataframe pandas python-polars

I am converting a pandas DataFrame to a Polars DataFrame, but pyarrow throws an error.

My code:

import polars as pl
import pandas as pd

if __name__ == "__main__":
    # Open the workbook once and reuse it for every sheet.
    excelfile = pd.ExcelFile(r"test.xlsx")
    sheetnames = excelfile.sheet_names

    # Read every sheet and stack the results row-wise.
    df = pd.concat(
        [pd.read_excel(excelfile, sheet_name=x, header=0) for x in sheetnames],
        axis=0,
    )

    # This is the call that raises the ArrowTypeError.
    df_pl = pl.from_pandas(df)


Error:

File "pyarrow\array.pxi", line 312, in pyarrow.lib.array

File "pyarrow\array.pxi", line 83, in pyarrow.lib._ndarray_to_array

File "pyarrow\error.pxi", line 122, in pyarrow.lib.check_status

pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object

I tried changing the pandas DataFrame dtype to str, which resolved the problem, but I don't want to change the dtypes. Is this a bug in pyarrow, or am I missing something?

Answer 1

小智 4

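Edit: Polars `0.13.42` and later

Polars now has a read_excel function that handles this situation correctly. read_excel is now the preferred way to read Excel files into Polars.

Note: to use read_excel, you need to install xlsx2csv (which can be installed with pip).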
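A minimal sketch of that route, assuming a Polars version that provides read_excel and an installed Excel engine (xlsx2csv per the note above; other Polars releases may need a different engine package). The helper name load_first_sheet and the path test.xlsx are illustrative only:

```python
def load_first_sheet(path):
    """Read an Excel sheet straight into Polars, without going through pandas."""
    # Imported lazily so the helper can be defined even where Polars is absent.
    import polars as pl

    # With no sheet arguments, read_excel loads a single sheet (the first one).
    return pl.read_excel(path)


# Usage (assumes test.xlsx exists in the working directory):
# df_pl = load_first_sheet("test.xlsx")
```

Reading several sheets at once depends on the Polars version, so check the read_excel documentation for the release you have installed before relying on it.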

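Polars: before `0.13.42`

I can reproduce this result. It happens because a column in the original Excel file contains both text and numbers.

For example, create a new Excel file with a single column, type both numbers and text into it, save it, and then run your code on that file. I get the following traceback: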
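To find out which columns are affected before calling pl.from_pandas, something like the following may help. It is only a sketch: the helper find_mixed_type_columns and the sample data are hypothetical, and an in-memory DataFrame stands in for the Excel sheet.

```python
import pandas as pd


def find_mixed_type_columns(df):
    """Return {column: sorted type names} for object columns holding more than one Python type."""
    mixed = {}
    for name in df.select_dtypes(include=["object"]).columns:
        # Ignore missing values so that NaN/None alone does not flag a column.
        type_names = {type(value).__name__ for value in df[name].dropna()}
        if len(type_names) > 1:
            mixed[name] = sorted(type_names)
    return mixed


# Stand-in for a sheet whose "code" column contains both numbers and text.
sample = pd.DataFrame(
    {
        "code": pd.Series([1, "A2", 3], dtype=object),
        "amount": [10.0, 20.5, 30.0],
    }
)
print(find_mixed_type_columns(sample))
```

Any column reported here is a candidate for the "Expected bytes, got a 'int' object" error, because the conversion treats the object column as text and then meets a number.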

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/convert.py", line 299, in from_pandas
    return DataFrame._from_pandas(df, rechunk=rechunk, nan_to_none=nan_to_none)
  File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/frame.py", line 454, in _from_pandas
    pandas_to_pydf(
  File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 485, in pandas_to_pydf
    arrow_dict = {
  File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 486, in <dictcomp>
    str(col): _pandas_series_to_arrow(
  File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 237, in _pandas_series_to_arrow
    return pa.array(values, pa.large_utf8(), from_pandas=nan_to_none)
  File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object


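There are some lengthy discussions of this issue, for example:

This particular comment may be relevant, since you are concatenating the results of parsing multiple sheets from an Excel file. That can lead to conflicting dtypes for a column: https://github.com/pandas-dev/pandas/issues/21228#issuecomment-419175116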
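The effect of concatenating sheets can be illustrated without Excel. In this sketch the two DataFrames are hypothetical stand-ins for sheets that share a column name but were parsed with different dtypes:

```python
import pandas as pd

# Hypothetical stand-ins for two sheets that share a column name.
sheet_a = pd.DataFrame({"id": [1, 2, 3]})      # parsed as integers
sheet_b = pd.DataFrame({"id": ["x1", "x2"]})   # parsed as text

combined = pd.concat([sheet_a, sheet_b], axis=0)

print(sheet_a["id"].dtype)   # an integer dtype
print(combined["id"].dtype)  # object: the column now mixes numbers and text
```

Each sheet is clean on its own; the mixed column only appears after pd.concat, which is why checking dtypes per sheet may not reveal the problem.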

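How to resolve this depends on your data and how it is used, so I can't recommend a blanket solution (i.e., fixing the source Excel file, or changing the dtype to str).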
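If casting to text is acceptable for your data, one option is to convert only the columns that actually mix types and leave the rest untouched. This is a sketch under that assumption, not a general fix; stringify_mixed_columns and to_polars are hypothetical helper names, and to_polars needs polars and pyarrow installed.

```python
import pandas as pd


def stringify_mixed_columns(df):
    """Return a copy of df where object columns with more than one value type are cast to str."""
    out = df.copy()
    for name in out.select_dtypes(include=["object"]).columns:
        present = out[name].dropna()
        if present.map(type).nunique() > 1:
            # Convert only non-missing values so that NaN/None stay missing.
            out[name] = out[name].map(lambda v: v if pd.isna(v) else str(v))
    return out


def to_polars(df):
    """Convert to Polars after normalising mixed-type columns (requires polars and pyarrow)."""
    import polars as pl  # lazy import: only needed for the final conversion

    return pl.from_pandas(stringify_mixed_columns(df))


mixed = pd.DataFrame({"code": pd.Series([1, "A2", None], dtype=object), "n": [1, 2, 3]})
cleaned = stringify_mixed_columns(mixed)
print(cleaned["code"].tolist())
```

Columns that already have a single type, such as "n" here, keep their original dtype, which limits the dtype changes to the columns that would otherwise fail.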

