Read an HDF5 file created with h5py using pandas

Question

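I have a bunch of hdf5 files, and I want to convert some of the data in them to parquet files. However, I'm struggling to read them into pandas/pyarrow. I think this is related to the way the files were originally created.

If I open the file using h5py, the data looks exactly as I would expect.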

import h5py

file_path = "/data/some_file.hdf5"
hdf = h5py.File(file_path, "r")
print(list(hdf.keys()))

gives me

>>> ['foo', 'bar', 'baz']

In this case I'm interested in the group "bar", which has 3 items in it.

If I try to read the data using HDFStore, I cannot access any of the groups.

import pandas as pd

file_path = "/data/some_file.hdf5"
store = pd.HDFStore(file_path, "r")

The HDFStore object then has no keys or groups.

assert not store.groups()
assert not store.keys()

And if I try to access the data, I get the following error:

bar = store.get("/bar")

TypeError: cannot create a storer if the object is not existing nor a value are passed

Similarly, if I try using pd.read_hdf, it looks as if the file is empty:

ValueError: Dataset(s) incompatible with Pandas data types, not table, or no datasets found in HDF5 file.

and

TypeError: cannot create a storer if the object is not existing nor a value are passed

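Based on this answer, I'm assuming the problem is that pandas expects a very particular hierarchical structure, which differs from the structure the actual hdf5 file has.

What is the straightforward way to read an arbitrary hdf5 file into pandas or pytables? I can load the data with h5py if needed. But the files are large enough that I'd like to avoid loading them into memory if I can. So ideally I'd like to work in pandas and pyarrow as much as possible.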

Answer 1

NeS*_*ack 5

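I had a similar problem with not being able to read hdf5 into a pandas df. Following this post, I made a script that converts the hdf5 file into a dictionary and then the dictionary into a pandas df, like this: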

import h5py
import pandas as pd

filename = "/data/some_file.hdf5"  # placeholder path; point this at your own file

dictionary = {}
with h5py.File(filename, "r") as f:
    for key in f.keys():
        print(key)

        ds_arr = f[key][()]  # reads the whole dataset into memory as a numpy array
        dictionary[key] = ds_arr  # stores the array in the dict under its key

df = pd.DataFrame.from_dict(dictionary)

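This works as long as each hdf5 key (f.keys()) is simply the name of a column you want in the pandas df, and not a group name. Groups belong to a more elaborate hierarchy that can exist in hdf5 but not in pandas. If a group sits above the keys in the hierarchy, e.g. one named data_group, what worked for me as an alternative was to replace f.keys() with f['data_group'].keys() and f[key] with f['data_group'][key].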
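
For the group case, the replacement described above can be wrapped in a small function. This is only a sketch under some assumptions: group_to_dataframe is a made-up helper name (not part of h5py or pandas), every dataset directly under the group is assumed to be 1-D and of equal length, and the optional start/stop row range is an addition of mine so that only part of a large file has to be read at once.

```python
import h5py
import pandas as pd


def group_to_dataframe(path, group_name, start=None, stop=None):
    """Read the datasets directly under `group_name` into a DataFrame.

    Hypothetical helper for illustration. Assumes each dataset is 1-D and
    that all of them have the same length. `start`/`stop` pick a row range.
    """
    columns = {}
    with h5py.File(path, "r") as f:
        group = f[group_name]
        for key, item in group.items():
            # Nested groups hold no array data themselves, so skip them.
            if not isinstance(item, h5py.Dataset):
                continue
            # Slicing an h5py dataset reads only the requested rows from disk;
            # with start=None and stop=None the full dataset is read.
            columns[key] = item[start:stop]
    return pd.DataFrame(columns)
```

With the file from the question this would be called as group_to_dataframe("/data/some_file.hdf5", "bar"), or with a row range such as (0, 100000) to process the file piece by piece. Each resulting frame could then be written out with df.to_parquet, provided a parquet engine such as pyarrow is installed.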

Archived:	3 years, 11 months ago
Views:	4453
Last activity:	2 years, 8 months ago