Following the script below, I am converting a JSON file to Parquet format. I am using the pandas library to do the conversion, but the following error occurred: AttributeError: 'DataFrame' object has no attribute 'schema'. I am still new to Python.
Here is the original JSON file I am using:
[ { "a": "01", "b": "teste01" }, { "a": "02", "b": "teste02" } ]
What exactly am I doing wrong?
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
df = pd.read_json('C:/python/json_teste')
pq = pa.parquet.write_table(df, 'C:/python/parquet_teste')
Error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-23-1b4ced833098> in <module>
----> 1 pq = pa.parquet.write_table(df, 'C:/python/parquet_teste')
C:\Anaconda\lib\site-packages\pyarrow\parquet.py in write_table(table, where, row_group_size, version, use_dictionary, compression, write_statistics, use_deprecated_int96_timestamps, coerce_timestamps, allow_truncated_timestamps, data_page_size, flavor, filesystem, **kwargs)
1256 try:
1257 with ParquetWriter(
-> 1258 where, table.schema,
1259 filesystem=filesystem,
1260 version=version,
C:\Anaconda\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5065 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5066 return self[name]
-> 5067 return object.__getattribute__(self, name)
5068
5069 def __setattr__(self, name, value):
AttributeError: 'DataFrame' object has no attribute 'schema'
Printing the file:
#print
print(df)
   a        b
0  1  teste01
1  2  teste02

#following columns
df.columns
Index(['a', 'b'], dtype='object')

#following types
df.dtypes
a     int64
b    object
dtype: object
You can also read the JSON file directly with pyarrow, as in the following example:
from pyarrow import json
import pyarrow.parquet as pq
table = json.read_json('C:/python/json_teste')
pq.write_table(table, 'C:/python/result.parquet') # save json/table as parquet
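If you prefer to keep the pandas-based approach from the question, note that write_table expects a pyarrow Table, not a pandas DataFrame, which is what the AttributeError is complaining about. A minimal sketch of that fix, assuming an output name of 'parquet_teste.parquet' is acceptable, is to convert the DataFrame with pa.Table.from_pandas first:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Read the JSON into a pandas DataFrame, as in the question
df = pd.read_json('C:/python/json_teste')

# write_table needs a pyarrow Table, so convert the DataFrame first
table = pa.Table.from_pandas(df)
pq.write_table(table, 'C:/python/parquet_teste.parquet')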
You can achieve what you want with pyspark, as follows:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("JsonToParquetPysparkExample") \
    .getOrCreate()

json_df = spark.read.json("C://python/test.json", multiLine=True)
json_df.printSchema()
json_df.write.parquet("C://python/output.parquet")
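If it helps, the written output can be read back with the same session to verify the round trip; this is just an illustrative check, and note that Spark writes the Parquet output as a directory of part files rather than a single file:
# Read the Parquet output back to confirm the conversion worked
result_df = spark.read.parquet("C://python/output.parquet")
result_df.show()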
Welcome to Stack Overflow. The library you are using shows in its example that you need to write the column names of your data frame into a schema. Try it with your data frame's column names and it will work.
# Given PyArrow schema
import pyarrow as pa
from json2parquet import convert_json

schema = pa.schema([
    pa.field('my_column', pa.string()),
    pa.field('my_int', pa.int64()),
])

convert_json(input_filename, output_filename, schema)
Reference: json2parquet
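Adapted to the JSON in the question (columns a and b, both strings), the schema might look like the sketch below. This assumes the json2parquet package is installed and that the input file is in a format convert_json accepts; the paths are placeholders, not tested values.
import pyarrow as pa
from json2parquet import convert_json

# Both fields in the question's data are strings ("01", "teste01", ...)
schema = pa.schema([
    pa.field('a', pa.string()),
    pa.field('b', pa.string()),
])

# Placeholder paths; adjust to the actual input and output locations
convert_json('C:/python/json_teste', 'C:/python/result.parquet', schema)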