Both are columnar (on-disk) storage formats for use in data analysis systems. Both are integrated in Apache Arrow (the pyarrow package for Python) and are designed to correspond with Arrow as a columnar in-memory analytics layer.
How do the two formats differ?
Should you always prefer Feather when possible?
Appendix
I found some hints here: https://github.com/wesm/feather/issues/188, but given the age of that project they may be a bit outdated.
Not a serious speed test, since I am just dumping and loading a whole DataFrame, but it gives you some impression if you have never heard of these formats before:
# IPython
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq
import fastparquet as fp
df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                   'two': ['foo', 'bar', 'baz'],
                   'three': [True, False, True]})
print("pandas df to disk ####################################################")
print('example_feather:')
%timeit feather.write_feather(df, 'example_feather')
# 2.62 ms ± 35.8 µs per loop …
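The truncated benchmark above follows a simple round-trip pattern; a minimal sketch of the remaining steps, under the assumption that the parquet counterpart is written to a file named example_parquet, could look like this (in IPython each call would be wrapped in %timeit as above):

# Feather round trip: write the DataFrame, then read it back into pandas
feather.write_feather(df, 'example_feather')
df_feather = feather.read_feather('example_feather')

# Parquet round trip via pyarrow: convert to an Arrow table, write, read back
pq.write_table(pa.Table.from_pandas(df), 'example_parquet')
df_parquet = pq.read_table('example_parquet').to_pandas()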
I have a hacky way of reading a parquet dataset from S3 into a pandas dataframe using boto3 (1.4.4), pyarrow (0.4.1) and pandas (0.20.3). First, I can read a single parquet file locally like this:
import pyarrow.parquet as pq
path = 'parquet/part-r-00000-1e638be4-e31f-498a-a359-47d017a0059c.gz.parquet'
table = pq.read_table(path)
df = table.to_pandas()
I can also read a directory of parquet files locally like this:
import pyarrow.parquet as pq
dataset = pq.ParquetDataset('parquet/')
table = dataset.read()
df = table.to_pandas()
Both work like a charm. Now I want to achieve the same remotely with files stored in an S3 bucket. I was hoping that something like this would work:
dataset = pq.ParquetDataset('s3n://dsn/to/my/bucket')
But it does not:
OSError: Passed non-file path: s3n://dsn/to/my/bucket
After reading through pyarrow's documentation, this does not seem to be possible at the moment, so I came up with the following solution.
Reading a single file from S3 and getting a pandas dataframe:
import io
import boto3
import pyarrow.parquet as pq
buffer = io.BytesIO()
s3 = boto3.resource('s3')
s3_object = s3.Object('bucket-name', 'key/to/parquet/file.gz.parquet')
s3_object.download_fileobj(buffer)
table = pq.read_table(buffer)
df = table.to_pandas()
And here is my hacky, not-so-optimized solution for creating a pandas dataframe from an S3 folder path:
import io
import …
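For the folder case there is also a more direct route, assuming the s3fs package is installed: pyarrow's ParquetDataset accepts a filesystem object, so an S3 "directory" can be read without downloading the files by hand. A sketch (bucket-name/parquet is a placeholder path):

import pyarrow.parquet as pq
import s3fs

# s3fs exposes S3 as a filesystem object that pyarrow can use to list and open keys
fs = s3fs.S3FileSystem()
dataset = pq.ParquetDataset('bucket-name/parquet', filesystem=fs)
df = dataset.read().to_pandas()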
How do I append to / update a parquet file with pyarrow?

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# pandas DataFrames, converted to Arrow tables before writing
table2 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
table3 = pd.DataFrame({'six': [-1, np.nan, 2.5], 'nine': ['foo', 'bar', 'baz'], 'ten': [True, False, True]})

pq.write_table(pa.Table.from_pandas(table2), './dataNew/pqTest2.parquet')
#append pqTest2 here?
I cannot find anything in the documentation about appending to parquet files. And can you use pyarrow with multiprocessing to insert/update the data?
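For reference, as far as I can tell pyarrow cannot append to an already-closed parquet file in place; the closest pattern I know of is to keep a ParquetWriter open and write several tables into the same file as separate row groups (they must share a schema), or to write additional files into a common dataset directory. A sketch of the first option, with made-up sample frames:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df1 = pd.DataFrame({'one': [-1.0, 2.5], 'two': ['foo', 'bar']})
df2 = pd.DataFrame({'one': [3.5, 4.0], 'two': ['baz', 'qux']})
t1 = pa.Table.from_pandas(df1)
t2 = pa.Table.from_pandas(df2)

# every write_table call appends a new row group to the open file
writer = pq.ParquetWriter('pqTest2.parquet', t1.schema)
writer.write_table(t1)
writer.write_table(t2)
writer.close()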
I am trying to enable Apache Arrow for the conversion to Pandas. I am using:
pyspark 2.4.4, pyarrow 0.15.0, pandas 0.25.1, numpy 1.17.2
Here is the sample code:
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
x = pd.Series([1, 2, 3])
df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))
I get this warning message:
c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages\pyspark\sql\session.py:714: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true; however, failed by the reason below:
An error occurred while calling z:org.apache.spark.sql.api.python.PythonSQLUtils.readArrowStreamFromFile.
: java.lang.IllegalArgumentException
at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$3.readNextBatch(ArrowConverters.scala:243)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$3.<init>(ArrowConverters.scala:229)
at org.apache.spark.sql.execution.arrow.ArrowConverters$.getBatchesFromStream(ArrowConverters.scala:228)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$readArrowStreamFromFile$2.apply(ArrowConverters.scala:216)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$readArrowStreamFromFile$2.apply(ArrowConverters.scala:214)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543)
at org.apache.spark.sql.execution.arrow.ArrowConverters$.readArrowStreamFromFile(ArrowConverters.scala:214)
at org.apache.spark.sql.api.python.PythonSQLUtils$.readArrowStreamFromFile(PythonSQLUtils.scala:46)
at org.apache.spark.sql.api.python.PythonSQLUtils.readArrowStreamFromFile(PythonSQLUtils.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at …
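For context: pyarrow 0.15.0 switched to a new Arrow IPC format that Spark 2.4 does not understand, which is a common cause of this exact IllegalArgumentException. A frequently cited workaround, which I have not verified here, is to ask pyarrow for the legacy format via the ARROW_PRE_0_15_IPC_FORMAT environment variable (or to pin pyarrow to 0.14.x). A sketch:

import os

# must be set in every Python process that serializes Arrow data for Spark;
# on a real cluster this usually goes into conf/spark-env.sh or
# spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT rather than the driver script
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

spark.conf.set("spark.sql.execution.arrow.enabled", "true")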
Is it possible to use Pandas' DataFrame.to_parquet functionality to split the write into multiple files of approximately a desired size?

I have a very large DataFrame (100M x 100) and am using df.to_parquet('data.snappy', engine='pyarrow', compression='snappy') to write it to a single file, but this produces a file of about 4GB. Instead, I would like it split into many files of roughly 100MB each.
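As far as I know, DataFrame.to_parquet always writes a single file, so one workaround is to slice the frame yourself and write one file per slice; the helper and the chunk count below are only an illustration (with ~4GB total, something on the order of 40 pieces should land near the 100MB target, depending on compression):

import numpy as np
import pandas as pd

def write_in_chunks(df: pd.DataFrame, n_files: int, prefix: str) -> None:
    # np.array_split returns n_files roughly equal row-wise pieces of the frame
    for i, part in enumerate(np.array_split(df, n_files)):
        part.to_parquet(f'{prefix}-{i:05d}.snappy.parquet',
                        engine='pyarrow', compression='snappy')

# write_in_chunks(df, 40, 'data')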
I am trying to install pyarrow on the master instance of my EMR cluster, but I keep getting this error:
[hadoop@ip-XXX-XXX-XXX-XXX ~]$ sudo /usr/bin/pip-3.4 install pyarrow
Collecting pyarrow
Downloading https://files.pythonhosted.org/packages/c0/a0/f7e9dfd8988d94f4952f9b50eb04e14a80fbe39218520725aab53daab57c/pyarrow-0.10.0.tar.gz (2.1MB)
100% |████████████████████████████████| 2.2MB 643kB/s
Requirement already satisfied: numpy>=1.10 in /usr/local/lib64/python3.4/site-packages (from pyarrow)
Requirement already satisfied: six>=1.0.0 in /usr/local/lib/python3.4/site-packages (from pyarrow)
Installing collected packages: pyarrow
Running setup.py install for pyarrow ... error
Complete output from command /usr/bin/python3.4 -u -c "import setuptools, tokenize;__file__='/mnt/tmp/pip-build-pr3y5_mu/pyarrow/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-vmywdpeg-record/install-record.txt --single-version-externally-managed --compile:
/usr/lib64/python3.4/distutils/dist.py:260: UserWarning: Unknown distribution option: 'long_description_content_type'
warnings.warn(msg)
/mnt/tmp/pip-build-pr3y5_mu/pyarrow/.eggs/setuptools_scm-3.1.0-py3.4.egg/setuptools_scm/utils.py:118: UserWarning: 'git' was not found
running …
With the following code I use pyarrow to convert a pandas.DataFrame containing Player objects to a pyarrow.Table:

import pandas as pd
import pyarrow as pa
class Player:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender

    def __repr__(self):
        return f'<{self.name} ({self.age})>'

data = [
    Player('Jack', 21, 'm'),
    Player('Ryan', 18, 'm'),
    Player('Jane', 35, 'f'),
]
df = pd.DataFrame(data, columns=['player'])
print(pa.Table.from_pandas(df))
We get the error:
pyarrow.lib.ArrowInvalid: ('Could not convert <Jack (21)> with type Player: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed …
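Arrow has no way to infer a data type for an arbitrary Python class, so as far as I can tell the Player objects have to be flattened into plain values first. A sketch that reuses the Player class and the data list from the snippet above and expands each object into ordinary columns:

import pandas as pd
import pyarrow as pa

# assumes the Player class and the data list defined above;
# vars(p) returns the instance __dict__, i.e. {'name': ..., 'age': ..., 'gender': ...}
df_flat = pd.DataFrame([vars(p) for p in data])
table = pa.Table.from_pandas(df_flat)
print(table)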
"toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to disable this"
May I know why this happens?
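That message usually wraps an underlying incompatibility (often a pyarrow version the Spark release does not support, as in the traceback further above). Two things worth trying, sketched below with Spark 2.3/2.4 configuration keys and a placeholder DataFrame df, are letting Spark fall back to the non-Arrow path, or disabling Arrow entirely to surface the plain error:

# allow a silent fallback to the regular (non-Arrow) conversion if the optimization fails
spark.conf.set("spark.sql.execution.arrow.fallback.enabled", "true")

# or turn Arrow off altogether and re-run the conversion on the slower code path
spark.conf.set("spark.sql.execution.arrow.enabled", "false")

pdf = df.toPandas()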
I want to save a Pandas DataFrame to parquet, but it contains some unsupported types (e.g. bson ObjectIds).
Throughout the example we use:
import pandas as pd
import pyarrow as pa
Here is a minimal example showing the situation:
from bson import ObjectId  # ObjectId comes from the bson package shipped with pymongo

df = pd.DataFrame(
    [
        {'name': 'alice', 'oid': ObjectId('5e9992543bfddb58073803e7')},
        {'name': 'bob', 'oid': ObjectId('5e9992543bfddb58073803e8')},
    ]
)
df.to_parquet('some_path')
We get:
ArrowInvalid: ('Could not convert 5e9992543bfddb58073803e7 with type ObjectId: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column oid with type object')
I tried to follow this reference: https://arrow.apache.org/docs/python/extending_types.html
So I wrote the following type extension:
class ObjectIdType(pa.ExtensionType):

    def __init__(self):
        pa.ExtensionType.__init__(self, pa.binary(12), "my_package.objectid")

    def __arrow_ext_serialize__(self):
        # since we don't have …
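The extension-type route is the principled one, but for completeness, a much simpler workaround sketch (it loses the ObjectId type on disk and assumes the same DataFrame as above) is to cast the column to plain strings before writing:

import pandas as pd
from bson import ObjectId

df = pd.DataFrame(
    [
        {'name': 'alice', 'oid': ObjectId('5e9992543bfddb58073803e7')},
        {'name': 'bob', 'oid': ObjectId('5e9992543bfddb58073803e8')},
    ]
)

# str(ObjectId) gives the 24-character hex string; ObjectId(s) restores it after reading back
df['oid'] = df['oid'].astype(str)
df.to_parquet('some_path')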