Handling UUID values in Arrow with Parquet files

Question

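I am new to Python and Pandas, so please be gentle!

I am using SqlAlchemy and pymssql to run a SQL query against a SQL Server database and then converting the result set into a DataFrame. I then try to write this DataFrame to a Parquet file: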

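  import pandas as pd
  import sqlalchemy as sal

  # connectionString, query and outputFile are defined elsewhere
  engine = sal.create_engine(connectionString)
  conn = engine.connect()
  df = pd.read_sql(query, con=conn)
  df.to_parquet(outputFile)  # this is the line that raises the error below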

The data I retrieve in the SQL query includes a uniqueidentifier (i.e. UUID) column named rowguid. Because of this, I get the following error on the last line above:

pyarrow.lib.ArrowInvalid: ("Could not convert UUID('92c4279f-1207-48a3-8448-4636514eb7e2') with type UUID: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column rowguid with type object')

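Is there any way to force all UUIDs to be converted to strings at some point in the chain of events above?

A few extra notes:

The goal of this part of the code is to receive the SQL query text as a parameter and act as a generic SQL-to-Parquet function.
I realise I could do something like df['rowguid'] = df['rowguid'].astype(str), but that relies on me knowing which columns have the uniqueidentifier type. By the time the data is a DataFrame, everything is an object, and every query will be different.
I also know I could cast the column to char(36) in the SQL query itself, but I was hoping for something more "automatic", so that whoever writes the queries does not keep tripping over this by accident and does not have to remember to always convert the data type.

Any ideas?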

Answer 1

kle*_*aum 0

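Try DuckDB

Despite the "DB" in its name, DuckDB is not a replacement for a full database server. It is an in-process analytical engine, available as a Python package, designed for data analysis tasks. It handles data type conversions well, without the custom extensions you may need with libraries such as PyArrow.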

import sqlalchemy as sal
import pandas as pd
import duckdb

# Define your SQL variables
connection_string = "your_connection_string_here"
query = "your_query_here"

# Connect to your SQL database using SQLAlchemy
engine = sal.create_engine(connection_string)
conn = engine.connect()

# Run your query and load the results into a DataFrame
df = pd.read_sql(query, con=conn)

# Close the SQL database connection
conn.close()


# My Solution
# With duckdb installed and imported implement the code below
output_file_path = "your_output_file_path_here"

# Connect to DuckDB in-memory
duck_conn = duckdb.connect(':memory:')

# Write the DataFrame (with complex types like UUID) to a 
# snappy-compressed Parquet file with DuckDB
duck_conn.query(f"COPY df TO '{output_file_path}' (FORMAT PARQUET)")

# Close the DuckDB connection
duck_conn.close()

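Additional notes for consideration:

It is worth mentioning that there are ways to extend PyArrow with a custom UUIDType, as detailed in the Arrow documentation. In my experience, however, these can lead to further DataType issues, particularly with complex Postgres schemas, and that creates a significant maintenance burden. I have therefore found that using DuckDB's native functionality for these conversion tasks avoids such complications, and I recommend it as the more direct and reliable solution.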
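
If adding DuckDB is not an option, the astype(str) idea from the question can be made generic by inspecting the values rather than the column names. The following is only a minimal sketch under stated assumptions: stringify_uuid_columns is a hypothetical helper name (it is not part of pandas or PyArrow), and it assumes the driver returns uniqueidentifier values as Python uuid.UUID objects in object-dtype columns, as the error message in the question suggests.

```python
import uuid

import pandas as pd


def _uuid_to_str(value):
    """Return str(value) for uuid.UUID objects and leave anything else untouched."""
    return str(value) if isinstance(value, uuid.UUID) else value


def stringify_uuid_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy of df with UUID cells converted to strings.

    Only object-dtype columns are inspected, because that is where pandas
    keeps arbitrary Python objects such as uuid.UUID. Missing values
    (None/NaN) are passed through unchanged.
    """
    out = df.copy()
    for name in out.columns:
        if out[name].dtype == object:
            out[name] = out[name].map(_uuid_to_str)
    return out


# Toy data standing in for the result of pd.read_sql(...)
df = pd.DataFrame(
    {
        "rowguid": [uuid.UUID("92c4279f-1207-48a3-8448-4636514eb7e2"), None],
        "qty": [1, 2],
    }
)
clean = stringify_uuid_columns(df)
print(clean["rowguid"].iloc[0])
```

In the generic SQL-to-Parquet function, the call would sit between pd.read_sql and to_parquet, for example stringify_uuid_columns(df).to_parquet(outputFile). The trade-off is that every object column is scanned value by value, which may be slow on very large result sets, and the UUIDs end up stored as plain strings rather than as a dedicated UUID type.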
