Writing a formatted binary file from a Pandas DataFrame

Question

jbs*_*ssm 5 python numpy binaryfiles pandas

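I have seen some ways to read a formatted binary file in Python into Pandas. Namely, I am using this code, which uses NumPy to read from a file whose record structure is given by a dtype.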

import numpy as np
import pandas as pd

input_file_name = 'test.hst'

input_file = open(input_file_name, 'rb')
header = input_file.read(96)

dt_header = np.dtype([('version', 'i4'),
                      ('copyright', 'S64'),
                      ('symbol', 'S12'),
                      ('period', 'i4'),
                      ('digits', 'i4'),
                      ('timesign', 'i4'),
                      ('last_sync', 'i4')])

# np.frombuffer replaces np.fromstring, which is deprecated for binary data
header = np.frombuffer(header, dtype=dt_header)

dt_records = np.dtype([('ctm', 'i4'),
                       ('open', 'f8'),
                       ('low', 'f8'),
                       ('high', 'f8'),
                       ('close', 'f8'),
                       ('volume', 'f8')])
records = np.fromfile(input_file, dt_records)

input_file.close()

df_records = pd.DataFrame(records)
# Now, do some changes in the individual values of df_records
# and then write it back to a binary file

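Now, my question is how to write this back to a new file. I can't find any function in NumPy (nor in Pandas) that lets me specify exactly which bytes are written for each field.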

Answer 1

eba*_*arr 7

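I'm not clear on whether the DataFrame is a view or a copy, but assuming it is a copy, you can use the DataFrame's to_records method.

This returns a record array, which you can then write to disk with tofile.

For example: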

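df_records = pd.DataFrame(records)
# do some stuff
# index=False keeps the DataFrame index from being written as an extra field
new_recarray = df_records.to_records(index=False)
# tofile writes raw bytes only (no .npy header), despite the file extension
new_recarray.tofile("myfile.npy")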

The data will reside in memory as packed bytes, in the layout described by the recarray dtype.
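To tie this back to the question, here is a minimal sketch of the full write path, assuming the 96-byte header and the dt_records layout from the question. The sample values, the zero-filled placeholder header, and the output file name out.hst are made up for illustration. Casting with astype(dt_records) is an extra safeguard, not part of the answer above: it pins the field types and packed layout in case the DataFrame edits changed a column's dtype.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Record layout taken from the question.
dt_records = np.dtype([('ctm', 'i4'), ('open', 'f8'), ('low', 'f8'),
                       ('high', 'f8'), ('close', 'f8'), ('volume', 'f8')])

# Hypothetical sample data standing in for records read from a real file.
records = np.zeros(3, dtype=dt_records)
records['ctm'] = [1, 2, 3]
records['close'] = [1.5, 2.5, 3.5]

df_records = pd.DataFrame(records)
df_records.loc[0, 'close'] = 9.0  # change an individual value

# Drop the index and cast back to the exact on-disk record layout.
out = df_records.to_records(index=False).astype(dt_records)

header = bytes(96)  # placeholder: in practice, reuse the header bytes read earlier
path = os.path.join(tempfile.mkdtemp(), 'out.hst')
with open(path, 'wb') as f:
    f.write(header)   # raw header bytes first
    out.tofile(f)     # then the packed records

# Read the file back the same way the question does, to check the round trip.
with open(path, 'rb') as f:
    f.seek(96)
    roundtrip = np.fromfile(f, dtype=dt_records)
```

Each record occupies dt_records.itemsize bytes (4 + 5 × 8 = 44), so the file size should equal 96 plus 44 times the number of rows.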

Answer 2

Jos*_*der 6

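Pandas now offers a wide variety of formats that are more stable than tofile(). tofile() is best suited to quick file storage where you do not expect the file to be used on a different machine, where the data may have a different byte order (big-endian vs. little-endian).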

Format Type  Data Description      Reader          Writer
text         CSV                   read_csv        to_csv
text         JSON                  read_json       to_json
text         HTML                  read_html       to_html
text         Local clipboard       read_clipboard  to_clipboard
binary       MS Excel              read_excel      to_excel
binary       HDF5 Format           read_hdf        to_hdf
binary       Feather Format        read_feather    to_feather
binary       Parquet Format        read_parquet    to_parquet
binary       Msgpack               read_msgpack    to_msgpack
binary       Stata                 read_stata      to_stata
binary       SAS                   read_sas
binary       Python Pickle Format  read_pickle     to_pickle
SQL          SQL                   read_sql        to_sql
SQL          Google Big Query      read_gbq        to_gbq
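On the byte-order caveat about tofile(): one way to reduce that risk, if you do stay with raw binary, is to spell out the byte order in the dtype instead of relying on the machine's native order. The following is only a small sketch with made-up values; '<' marks little-endian and '>' big-endian in NumPy dtype strings.

```python
import numpy as np

native = np.array([1, 256], dtype='i4')  # native byte order of this machine

# Convert to an explicit byte order before serializing.
little = native.astype('<i4').tobytes()
big = native.astype('>i4').tobytes()

# Reading back with the matching dtype recovers the values on any machine.
decoded = np.frombuffer(big, dtype='>i4')
```

The same prefixes work inside a structured dtype, for example ('ctm', '<i4'), so a record layout can be fixed to one byte order regardless of where the file is written.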

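For small to medium-sized files I prefer CSV, because properly formatted CSV can store arbitrary string data, is human-readable, and is about as simple as any format can be while achieving the first two goals.

At one time I used HDF5, but if I were on Amazon I would consider using Parquet.

Example using to_hdf: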
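As a small illustration of the point about string data, the sketch below round-trips a frame whose strings contain a comma and double quotes. The column names and values are invented for the example, and an in-memory buffer stands in for a file.

```python
import io

import pandas as pd

df = pd.DataFrame({'symbol': ['EUR,USD', 'say "hi"'],
                   'close': [1.5, 2.5]})

buf = io.StringIO()
df.to_csv(buf, index=False)   # fields with commas or quotes are quoted automatically
buf.seek(0)
restored = pd.read_csv(buf)
```

Note that CSV does not record dtypes, so columns such as 32-bit integers come back with whatever type read_csv infers unless you pass dtype explicitly.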

# Requires the optional PyTables dependency; df is an existing DataFrame
df.to_hdf('tmp.hdf', key='df', mode='w')
df2 = pd.read_hdf('tmp.hdf', key='df')

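I no longer favor the HDF5 format. Because it is quite complex, it poses serious risks for long-term archival. It has a 150-page specification and only one implementation, written in 300,000 lines of C.

By contrast, as long as you work only in Python, the pickle format claims long-term stability:

The pickle serialization format is guaranteed to be backwards compatible across Python releases provided a compatible pickle protocol is chosen and pickling and unpickling code deals with Python 2 to Python 3 type differences if your data is crossing that unique breaking change language boundary.

However, pickles allow arbitrary code execution, so pickles of unknown origin should be handled with care.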
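For completeness, a minimal sketch of the pickle route with an explicitly chosen protocol. The file name and sample data are hypothetical, and protocol 4 is just one reasonable choice, not a recommendation from the answer.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'ctm': [1, 2], 'close': [1.5, 2.5]})

path = os.path.join(tempfile.mkdtemp(), 'records.pkl')
df.to_pickle(path, protocol=4)   # pin the protocol instead of using the default

# Only unpickle files you created yourself or otherwise trust.
restored = pd.read_pickle(path)
```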
