Accessing data range with h5py

Question

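I have an h5 file that contains 62 different attributes. I would like to access the data range of each one of them.

To explain more, here is what I'm doing: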

import h5py

the_file = h5py.File("myfile.h5", "r")
data = the_file["data"]
att = data.keys()

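The previous code gives me a list of attributes: "U", "T", "H", ... etc.

Let's say I want to know the minimum and maximum value of "U". How can I do that?

This is the output of running "h5dump -H":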

HDF5 "myfile.h5" {
GROUP "/" {
   GROUP "data" {
      ATTRIBUTE "datafield_names" {
         DATATYPE  H5T_STRING {
               STRSIZE 8;
               STRPAD H5T_STR_SPACEPAD;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
         DATASPACE  SIMPLE { ( 62 ) / ( 62 ) }
      }
      ATTRIBUTE "dimensions" {
         DATATYPE  H5T_STD_I32BE
         DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
      }
      ATTRIBUTE "time_variables" {
         DATATYPE  H5T_IEEE_F64BE
         DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
      }
      DATASET "Temperature" {
         DATATYPE  H5T_IEEE_F64BE
         DATASPACE  SIMPLE { ( 256, 512, 1024 ) / ( 256, 512, 1024 ) }
      }

Answer 1

djh*_*ese 9

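It might be a difference in terminology, but HDF5 attributes are accessed via the attrs property of a Dataset object. I would call what you have variables or datasets. Anyway...

I'm guessing from your description that the attributes are just arrays, so you should be able to do the following to get the data for each attribute and then calculate the min and max like for any numpy array: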

attr_data = data["U"][:]  # gets a copy of the array
u_min = attr_data.min()   # avoid naming variables min/max, which would shadow the built-ins
u_max = attr_data.max()
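
For a large dataset (the dump in the question shows a 256 x 512 x 1024 array of 64-bit floats, roughly 1 GiB), `[:]` copies everything into memory at once. A rough sketch of a slab-by-slab alternative follows. The names `chunked_min_max` and `step` are made up for illustration, NaN values are not handled, and a small NumPy array stands in for the h5py dataset, which supports the same slicing along its first axis.

```python
import numpy as np

def chunked_min_max(dset, step=64):
    """Return (min, max) of an array-like, reading `step` rows of axis 0 at a time."""
    lo, hi = np.inf, -np.inf
    for start in range(0, dset.shape[0], step):
        block = dset[start:start + step]   # only this slab is held in memory
        lo = min(lo, block.min())
        hi = max(hi, block.max())
    return float(lo), float(hi)

# Stand-in data; with h5py you would pass data["U"] itself (without [:]).
demo = np.arange(24, dtype="f8").reshape(4, 3, 2)
demo_range = chunked_min_max(demo, step=3)
print(demo_range)
```

A larger `step` means fewer reads but more memory per read, so the right value depends on the array shape and the machine.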

So if you wanted the min/max of each attribute, you can just do a for loop over the attribute names, or you could use

for attr_name, attr_value in data.items():
    values = attr_value[:]
    print(attr_name, values.min(), values.max())
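
A self-contained version of that loop might look like the sketch below. It builds a tiny in-memory file so it runs without myfile.h5; the dataset names come from the question, but the values are invented, and the `isinstance` check is an added precaution in case the group also holds sub-groups.

```python
import h5py
import numpy as np

# Build a tiny in-memory file so the example runs without myfile.h5.
# The values stored here are made up for illustration.
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    grp = f.create_group("data")
    grp.create_dataset("U", data=np.array([[1.0, -2.0], [3.5, 0.0]]))
    grp.create_dataset("T", data=np.array([280.0, 295.5, 301.25]))

    ranges = {}
    for name, obj in grp.items():
        if isinstance(obj, h5py.Dataset):   # skip sub-groups, which have no values
            values = obj[...]               # read the whole dataset as a NumPy array
            ranges[name] = (float(values.min()), float(values.max()))

for name, (lo, hi) in ranges.items():
    print(f"{name}: min={lo}, max={hi}")
```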

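Edit to answer your first comment:

h5py's objects can be used like Python dictionaries. So when you use 'keys()' you are not actually getting data, you are getting the names (or keys) of that data. For example, if you run the_file.keys() you will get a list of every HDF5 object (group or dataset) in the root path of that HDF5 file. If you continue along a path you will end up with the dataset that holds the actual binary data. So for example, you might start with (in an interpreter at first):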

import h5py

the_file = h5py.File("myfile.h5", "r")
print(list(the_file.keys()))
# this will result in a list of keys maybe ["raw_data","meta_data"] or something
print(list(the_file["raw_data"].keys()))
# this will result in another list of keys maybe ["temperature","humidity"]
# eventually you'll get to the dataset that actually has the data or attributes you are looking for
# think of this process as going through a directory structure or a path to get to a file (or a dataset/variable in this case)
the_data_var = the_file["raw_data"]["temperature"]
the_data_array = the_data_var[:]

print(list(the_data_var.attrs.keys()))
# this will result in a list of attribute names/keys
# attrs[...] already returns the value, so no [:] is needed (and it would fail on a scalar attribute)
an_attr_of_the_data = the_data_var.attrs["measurement_time"]

# So now you have "the_data_array" which is a numpy array and "an_attr_of_the_data" which is whatever it happened to be
# you can get the min/max of the data by doing like before
print(the_data_array.min())
print(the_data_array.max())
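
Instead of calling keys() level by level, the whole hierarchy can be walked in one pass with `visititems`, which calls a function for every object below a group. The sketch below records each dataset's path, shape and dtype. The file is an in-memory stand-in and the group and dataset names are hypothetical, mirroring the example names above.

```python
import h5py
import numpy as np

found = []

def record_dataset(name, obj):
    """Callback for visititems: remember every dataset's path, shape and dtype."""
    if isinstance(obj, h5py.Dataset):
        found.append((name, obj.shape, str(obj.dtype)))

# In-memory stand-in for a real file; intermediate groups are created automatically.
with h5py.File("walk_demo.h5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("raw_data/temperature", data=np.zeros((2, 3)))
    f.create_dataset("raw_data/humidity", data=np.zeros(4, dtype="i4"))
    f.visititems(record_dataset)

for path, shape, dtype in found:
    print(path, shape, dtype)
```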

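Edit 2 - Why do people format their HDF files this way? It defeats the purpose.

I think you may have to talk to the person who made this file if possible. If you made it, then you'll be able to answer my questions for yourself. First, are you sure that in your original example data.keys() returned "U", "T", etc.? Unless h5py is doing something magical, or you didn't provide all of the h5dump output, that could not have been your output. I'll explain what the h5dump is telling me, but please try to understand what I'm doing and not just copy and paste it into your terminal.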

# Get a handle to the "data" Group
data = the_file["data"]
# As you can see from the dump this data group has 3 attributes and 1 dataset
# The names of the attributes are "datafield_names", "dimensions", "time_variables"
# This should result in a list of those names:
print(list(data.attrs.keys()))

# The name of the dataset is "Temperature" and should be the only item in the list returned by:
print(list(data.keys()))
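
The dump declares datafield_names as 8-character, space-padded ASCII strings (STRSIZE 8, H5T_STR_SPACEPAD). h5py typically hands such an attribute back as a NumPy array of byte strings, padding included, so the names usually need decoding and stripping before they compare equal to "U" or "T". A minimal sketch, using a hand-built array as a stand-in for `data.attrs["datafield_names"]`; whether a name's position says anything about the data layout is an open question that only the file's author can settle.

```python
import numpy as np

# Stand-in for data.attrs["datafield_names"]: fixed-width, space-padded byte strings.
# The three names are examples taken from the question; the real attribute holds 62.
raw_names = np.array([b"U       ", b"T       ", b"H       "], dtype="S8")

# Decode the ASCII bytes and drop the padding.
field_names = [raw.decode("ascii").strip() for raw in raw_names]

# Position of each name within the attribute (its meaning depends on the file's layout).
field_index = {name: i for i, name in enumerate(field_names)}

print(field_names)
print(field_index)
```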

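From the h5dump you can see that there are 62 datafield_names (strings), 4 dimensions (32-bit integers, I think), and 2 time_variables (64-bit floats). It also tells me that Temperature is a 3-dimensional array, 256 x 512 x 1024 (64-bit floats). Do you see where I'm getting this information? Now comes the hard part: you need to determine how the datafield_names match up with the Temperature array. That was decided by the person who made the file, so you'll have to figure out what each row/column in the Temperature array means. My first guess would be that each row in the Temperature array is one of the datafield_names, maybe with 2 more for each time? But this doesn't work, since there are too many rows in the array. Maybe the dimensions fit in there somehow? Lastly, here is how you get at each of those pieces of information (continuing from before):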

# Get the temperature array ([:, :, :] and a single [:] both read the whole array)
temp_array = data["Temperature"][:, :, :]
# Get all of the datafield_names (array of 62 strings)
datafields = data.attrs["datafield_names"][:]
# Get all of the dimensions (array of 4 integers)
dims = data.attrs["dimensions"][:]
# Get all of the time variables (array of 2 floats)
time_variables = data.attrs["time_variables"]

# If you want the min/max of the entire temperature array this should work:
print(temp_array.min())
print(temp_array.max())
# If you knew that row 0 of the array had the temperatures you wanted to analyze
# then this would work, but it all depends on how the creator organized the data/file:
print(temp_array[0].min())
print(temp_array[0].max())
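
If it turns out that each index along the first axis is a separate quantity, the min and max of every slice can be computed in one call by reducing over the other two axes. This is only a sketch under that assumption about the layout, and a small made-up array replaces the real 256 x 512 x 1024 one.

```python
import numpy as np

# Small stand-in for the 256 x 512 x 1024 Temperature array.
temp_demo = np.arange(24, dtype="f8").reshape(2, 3, 4)

# Reduce over the last two axes: one min and one max per index of the first axis.
per_slice_min = temp_demo.min(axis=(1, 2))
per_slice_max = temp_demo.max(axis=(1, 2))

for i, (lo, hi) in enumerate(zip(per_slice_min, per_slice_max)):
    print(f"slice {i}: min={lo}, max={hi}")
```

Changing the `axis` tuple covers the other possible layouts, for example `axis=(0, 1)` if the quantities run along the last axis instead.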

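Sorry I can't help more, but without actually having the file and knowing what each field means, this is about all I can do. Try to understand how I used h5py to read the information. Try to understand how I translated the header information (the h5dump output) into information that I could actually use with h5py. If you know how the data is organized in the array, you should be able to do what you want. Good luck, and I'll help more if I can.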