Storing numerical data in binary files and data structures

And*_*hia 2 c binaryfiles data-structures

This is a big topic and I'm new to it, so I'm looking for some direction, because the possibilities here seem endless.

I'm running numerical simulations that generate a lot of data, and I want to stop storing it as plain text (at one point I tried saving everything that was created and ended up with a 4 TB txt file).

My simulation involves 4 fields on an interval (each field represented by an array of doubles, typically 4000 to 16000 elements), and they evolve for about a million cycles each run, so we're talking about billions of doubles being generated.

Of course I don't save everything every time; instead I use 3 types of files (the files shown are mock-ups for brevity; my actual files are all written with the %g format, so each value takes those 7 characters plus a tab):

  1. Files that save the contents of the fields at a specific point for all time steps, e.g.:

    t     Phi    Pi    Delta    A
    0     1.3    0.4   0.3      0.99
    ...
    
  2. Files that save all the fields over the whole interval at a specific time step:

    x     Phi   Pi    Delta    A
    0     0.0   0.4   0.0      1.0
    ...
    
  3. Files that save both time and space every n steps:

    t    x    Phi    Pi    Delta    A
    0.0  0.0  0.0    1.3   0.0      1.0
    0.0  0.1  0.01   1.2   0.02     0.98
    ...
    0.2  0.0  0.0    1.3   0.0      1.0
    0.2  0.1  0.03   1.5   0.01     0.95
    

I then use these files for various purposes, such as plotting graphs, taking their Fourier transforms, and using them to resume the simulation.

I will eventually need to run this on a cluster, so I am limited to C, and at the moment I do not know whether they have any database/big-data systems.

My questions are:

  1. What is the best format for storing this data? I assume it is just saving the doubles as a raw binary file and then writing a program to retrieve them later, but I am open to suggestions.
  2. What is the best way to organize this data? I have looked around, and perhaps I could write a tree whose leaves are the arrays.
  3. What about compression?

Nom*_*mal 5


What is the best format for storing this data?


That depends on the precision you need and on the structure of the values.


If 7 significant decimal digits of precision suffice, and the values are within the range 2^-126 to 2^127 (1.17549×10^-38 to 1.70141×10^38), you can use the IEEE-754 binary32 format. On all machines and clusters used for high-performance computing, the float type corresponds to this.


If you need 15 significant decimal digits of precision, and/or a range from 2^-1023 to 2^1023 (1.11254×10^-308 to 8.98847×10^307), use the IEEE-754 binary64 format. Again, on all machines and clusters used for high-performance computing, the double type corresponds to this.
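For raw binary output that needs no conversion at all, here is a minimal sketch (the helper name save_field_raw and the bare file layout are illustrative assumptions, not a fixed format):

    #include <stdio.h>

    /* Hypothetical helper: dump one snapshot of a field as raw binary64
       values in native byte order. Returns 0 on success, -1 on error. */
    static int save_field_raw(const char *path, const double *field, size_t n)
    {
        FILE *out = fopen(path, "wb");
        if (!out)
            return -1;

        /* fwrite() stores the in-memory representation directly:
           8 bytes per double, no text conversion, no precision loss. */
        if (fwrite(field, sizeof field[0], n, out) != n) {
            fclose(out);
            return -1;
        }

        return (fclose(out) == 0) ? 0 : -1;
    }

Note that at 8 bytes per double this is about the same size on disk as your 7-characters-plus-tab %g text form, but it keeps full precision and reads back with a single fread() instead of parsing.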


The remaining problems are byte order and field identification.


Assuming you do not wish to expend any HPC resources on data conversion during the computation, it is best to store the data in native byte order. However, include a header in the file that contains a known "prototype" value for each value type, so that a reader can check them to verify whether byte-order compensation is needed to interpret the fields correctly, plus a descriptor for each of the fields.


(As an example, I've implemented this in a way that allows the files to be easily read from C and native Fortran 95 with minimal compiler extensions, while also allowing each compute node to save its results to a local file, with readers automatically obtaining the data from multiple files in parallel. I typically only support u8, s8, u16, s16, u32, s32, u64, s64 for unsigned and signed integers of various bit sizes, and r32 and r64 for single- and double-precision reals, i.e. binary32 and binary64 respectively. I have not needed complex number formats yet.)
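A minimal sketch of such a header follows (the magic string, prototype constants, and member names are illustrative assumptions; a real writer would emit the members one by one rather than fwrite the struct, to avoid compiler padding differences):

    #include <stdint.h>

    /* Hypothetical header: fixed magic, plus "prototype" values written
       in the producer's native byte order. A reader compares them to the
       known constants; a mismatch means the fields of that type must be
       byte-swapped on input. */
    struct file_header {
        char     magic[8];     /* e.g. "SIMDATA\0" identifies the format */
        uint32_t proto_u32;    /* always written as 0x01234567 */
        double   proto_r64;    /* always written as 1.0 */
        uint32_t field_count;  /* number of field descriptors that follow */
    };

    /* Reader-side check (sketch): 1 = same byte order as the reader,
       0 = fully reversed (swap needed), -1 = unrecognized/corrupt. */
    static int check_u32_order(uint32_t proto)
    {
        if (proto == UINT32_C(0x01234567))
            return 1;
        if (proto == UINT32_C(0x67452301))
            return 0;
        return -1;
    }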


Most people prefer to use e.g. NetCDF for this. My approach differs in that writers produce the data in native format rather than a normalized format; my intent is to minimize overhead at data creation/simulation time and push all the overhead onto the readers.


If you find the small overhead at file-generation time (during the simulation) acceptable, and you do not have experience writing binary file format routines, I do recommend using NetCDF.
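For comparison, here is a minimal NetCDF sketch (the dimension and variable names follow the question's fields but are otherwise arbitrary; most error handling is elided):

    #include <netcdf.h>

    /* Store one field ("Phi") on an nx-point grid for nt saved
       time steps, as a 2-D NetCDF variable. */
    static int save_with_netcdf(const char *path, const double *phi,
                                size_t nt, size_t nx)
    {
        int ncid, dim_t, dim_x, var_phi, dims[2], err;

        if ((err = nc_create(path, NC_CLOBBER, &ncid)) != NC_NOERR)
            return err;

        nc_def_dim(ncid, "t", nt, &dim_t);
        nc_def_dim(ncid, "x", nx, &dim_x);
        dims[0] = dim_t;
        dims[1] = dim_x;
        nc_def_var(ncid, "Phi", NC_DOUBLE, 2, dims, &var_phi);
        nc_enddef(ncid);                       /* leave define mode */

        nc_put_var_double(ncid, var_phi, phi); /* phi holds nt*nx doubles */
        return nc_close(ncid);
    }

The payoff is that the file is self-describing: any NetCDF-aware tool can list the dimensions and variables without you having to write a custom reader.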


Do note that if the HPC cluster operators find that your simulation/computation wastes resources (for example, the average per-core CPU load is low, or it does not scale well to multiple cores), you may not be allowed to run your simulation on the cluster. Obviously, this depends on local politics and policies too.


What is the best way to organize this data?


Because of the very large amount of data, parallel files may be your best option. (Some clusters have fast local storage, in which case having each node write directly to a local file, and collecting those files after the run, may be preferable. This varies, so ask your cluster administrators.)


In other words, one file per related array of data.
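A sketch of what that looks like in practice (the rank/step naming scheme is an illustrative assumption):

    #include <stdio.h>

    /* Each compute node (or MPI rank) opens its own output file, so no
       locking or coordination is needed while the simulation runs. */
    static FILE *open_node_output(const char *dir, int rank, int step)
    {
        char path[4096];
        snprintf(path, sizeof path, "%s/field-rank%04d-step%08d.bin",
                 dir, rank, step);
        return fopen(path, "wb");
    }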


It is not difficult to write a library that can read from multiple files in parallel, but correctly parsing and managing structured files is much harder.


Furthermore, splitting the data into separate files often makes data transport easier. If you have a data file 16 TiB in size, you are basically limited to network transport, and may even be limited as to which filesystems you can use. However, if you have say 128 files where each is around 128 GiB in size, you have many more options, and can probably keep some of them in offline storage, while working on others. In particular, many HPC cluster operators will let you transfer the files to local media storage devices (USB3 disks or memory sticks) directly, to reduce network transfer congestion.


What about compression?


You can compress the data if needed, but I personally would do it at the point where the data is collected/combined/processed on your own workstation, not at the point where it is generated. HPC computation is expensive; it is much cheaper to munge the data as you first process it.


Binary data does not compress as well as text does, but text files are much larger at the same data resolution. That means it is important to choose the correct value type for storing each parameter anyway. And you want to keep that type across the entire set, not change it from one record to another, to keep processing simple.


As to the compression/decompression algorithms, I'd choose between zlib and xz. See e.g. here for a quick look at the speed/compression ratio curves. Simply put, zlib is fast but provides modest compression ratios, whereas xz is slower but provides much better compression ratios.
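As an example of the zlib route, here is a sketch that compresses an already-written raw data file during post-processing (gzip_file is a hypothetical helper; the 6 in the mode string selects a middle ground between speed and ratio):

    #include <stdio.h>
    #include <zlib.h>

    /* Read a raw data file in chunks and write it back out through
       zlib's gzip layer. Returns 0 on success, -1 on error. */
    static int gzip_file(const char *src_path, const char *dst_path)
    {
        char buf[1 << 16];
        size_t n;
        FILE *src = fopen(src_path, "rb");
        gzFile dst = gzopen(dst_path, "wb6");  /* "6" = compression level */

        if (!src || !dst) {
            if (src) fclose(src);
            if (dst) gzclose(dst);
            return -1;
        }

        while ((n = fread(buf, 1, sizeof buf, src)) > 0) {
            if (gzwrite(dst, buf, (unsigned)n) != (int)n) {
                fclose(src);
                gzclose(dst);
                return -1;
            }
        }

        fclose(src);
        return (gzclose(dst) == Z_OK) ? 0 : -1;
    }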
