How do I convert JSON results to Parquet in Python?

Question

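I am using the script below to convert a JSON file to Parquet format, with the pandas library doing the conversion. However, the following error occurs: AttributeError: 'DataFrame' object has no attribute 'schema'. I am still new to Python.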

This is the original JSON file I am using: [ { "a": "01", "b": "teste01" }, { "a": "02", "b": "teste02" } ]

What exactly am I doing wrong?

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_json('C:/python/json_teste')

pq = pa.parquet.write_table(df, 'C:/python/parquet_teste')

Error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-23-1b4ced833098> in <module>
----> 1 pq = pa.parquet.write_table(df, 'C:/python/parquet_teste')

C:\Anaconda\lib\site-packages\pyarrow\parquet.py in write_table(table, where, row_group_size, version, use_dictionary, compression, write_statistics, use_deprecated_int96_timestamps, coerce_timestamps, allow_truncated_timestamps, data_page_size, flavor, filesystem, **kwargs)
   1256     try:
   1257         with ParquetWriter(
-> 1258                 where, table.schema,
   1259                 filesystem=filesystem,
   1260                 version=version,

C:\Anaconda\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5065             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5066                 return self[name]
-> 5067             return object.__getattribute__(self, name)
   5068 
   5069     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'schema'

Printing the loaded file:

#print 
print(df)
   a        b
0  1  teste01
1  2  teste02

#following columns
df.columns
Index(['a', 'b'], dtype='object')

#following types
df.dtypes
a     int64
b    object
dtype: object

Answer 1

小智 8

You can also read the JSON file directly with pyarrow, as in the following example:

from pyarrow import json
import pyarrow.parquet as pq

table = json.read_json('C:/python/json_teste') 
pq.write_table(table, 'C:/python/result.parquet')  # save json/table as parquet

Reference: reading and writing with pyarrow.parquet
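One caveat: according to the pyarrow documentation, its JSON reader handles line-delimited JSON (one object per line), whereas the file in the question is a single JSON array. A minimal sketch of rewriting the array form as line-delimited JSON with the standard library follows; the helper name `array_to_json_lines` and the file names are hypothetical, chosen only for illustration:

```python
import json
import os
import tempfile


def array_to_json_lines(src_path, dst_path):
    """Rewrite a JSON array file as line-delimited JSON; return the record count."""
    with open(src_path, encoding="utf-8") as src:
        records = json.load(src)  # expects a top-level JSON array of objects
    with open(dst_path, "w", encoding="utf-8") as dst:
        for record in records:
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)


# Demo using the sample data from the question, written to a temporary directory.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "json_teste.json")   # hypothetical file name
dst_path = os.path.join(workdir, "json_teste.jsonl")  # hypothetical file name
with open(src_path, "w", encoding="utf-8") as f:
    f.write('[ { "a": "01", "b": "teste01" }, { "a": "02", "b": "teste02" } ]')

count = array_to_json_lines(src_path, dst_path)
with open(dst_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```

The resulting line-delimited file can then be passed to pyarrow's read_json as in the answer above. Note that `from pyarrow import json` shadows the standard-library `json` module if both are imported under the same name in one script.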

Answer 2

Fel*_*ose 6

You can achieve what you want with pyspark, as follows:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("JsonToParquetPysparkExample") \
    .getOrCreate()

# multiLine=True lets Spark parse a JSON document that spans multiple lines
json_df = spark.read.json("C://python/test.json", multiLine=True)
json_df.printSchema()
json_df.write.parquet("C://python/output.parquet")

Answer 3

Des*_*ngh 1

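Welcome to Stack Overflow. The examples for the library you are using show that you need to specify the column names of the DataFrame. Try using your DataFrame's column names and it should work.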

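# Given PyArrow schema
import pyarrow as pa
from json2parquet import convert_json

# Type factories must be called: pa.string(), not pa.string
schema = pa.schema([
    pa.field('my_column', pa.string()),
    pa.field('my_int', pa.int64()),
])
# input_filename and output_filename are placeholders for your own paths
convert_json(input_filename, output_filename, schema)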

Reference: json2parquet
