Get the schema of a Parquet file in Python

Question

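Is there a Python library that can be used to get the schema of a Parquet file?

Currently we load the Parquet file into a Spark dataframe and take the schema from the dataframe to display it in an application UI. However, initializing the Spark context, loading the dataframe, and extracting the schema from it is time-consuming, so we are looking for an alternative way to get the schema.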

Answer 1

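This function returns the schema of the Parquet file at a local URI. The schema is returned as a Pandas dataframe. The function does not read the whole file, only the schema.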

import pandas as pd
import pyarrow.parquet


def read_parquet_schema_df(uri: str) -> pd.DataFrame:
    """Return a Pandas dataframe corresponding to the schema of a local URI of a parquet file.

    The returned dataframe has the columns: column, pa_dtype
    """
    # Ref: https://stackoverflow.com/a/64288036/
    schema = pyarrow.parquet.read_schema(uri, memory_map=True)
    schema = pd.DataFrame(({"column": name, "pa_dtype": str(pa_dtype)} for name, pa_dtype in zip(schema.names, schema.types)))
    schema = schema.reindex(columns=["column", "pa_dtype"], fill_value=pd.NA)  # Ensures columns in case the parquet file has an empty dataframe.
    return schema

It was tested with the following versions of the third-party packages used:

$ pip list | egrep 'pandas|pyarrow'
pandas             1.1.3
pyarrow            1.0.1

Answer 2

Gal*_*ses 6

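In addition to @mehdio's answer, if your Parquet data is a directory (for example, Parquet output produced by Spark), you can read the schema and column names like this. Note that `read_table` loads the data into memory: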

import pyarrow.parquet as pq
pfile = pq.read_table("file.parquet")
print("Column names: {}".format(pfile.column_names))
print("Schema: {}".format(pfile.schema))

Comment: What if the file is too large to be read into memory? (5 upvotes)

Answer 3

Uwe*_*orn 5

This is supported by pyarrow ( https://github.com/apache/arrow/ ).

from pyarrow.parquet import ParquetFile
# Source is either the filename or an Arrow file handle (which could be on HDFS)
ParquetFile(source).metadata

Note: We merged this code only yesterday, so you will need to build it from source; see https://github.com/apache/arrow/commit/f44b6a3b91a15461804dd7877840a557caa52e4e

Comment: This works, but can't the response be returned as a dictionary or array instead of plain text? (2 upvotes)

Archived:	9 years, 1 month ago
Views:	17507
Last active:	5 years, 2 months ago