Posts by Ben*_*rey

How do I append large amounts of data to a Pandas HDFStore and get a natural unique index?

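I'm importing a large volume of HTTP logs (80GB+) into a Pandas HDFStore for statistical processing. Even within a single import file, I need to process the content in batches as I load it. My approach so far has been to read the parsed lines into a DataFrame and then store that DataFrame in the HDFStore. My goal is for index values to be unique within a single key of the HDFStore, but each DataFrame restarts its own index values from the beginning. I expected HDFStore.append() to have some mechanism for ignoring the DataFrame's index values and simply continuing from the existing index values under my HDFStore key, but I can't find one. How do I import DataFrames so that the index values they contain are ignored and the HDFStore keeps incrementing its existing index values? The sample code below batches every 10 lines. The real thing would, of course, use larger batches.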

if hd_file_name:
        """
        HDF5 output file specified.
        """

        hdf_output = pd.HDFStore(hd_file_name, complib='blosc')
        print(hdf_output)

        columns = ['source', 'ip', 'unknown', 'user', 'timestamp', 'http_verb', 'path', 'protocol', 'http_result', 
                   'response_size', 'referrer', 'user_agent', 'response_time']

        source_name = str(log_file.name.rsplit('/')[-1])   # HDF5 Tables don't play nice with unicode so explicit str(). :(

        batch = []

        for count, line in enumerate(log_file,1):
            data = parse_line(line, rejected_output = reject_output)

            # Add our source file name to the beginning.
            data.insert(0, source_name )    
            batch.append(data)

            if not (count % 10):
                df …

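One possible way to get the behavior asked about, offered as a sketch rather than a confirmed solution: look up how many rows are already stored under the key, then renumber each batch so its index continues from that count before appending. The helper names `with_offset_index` and `append_batches` are hypothetical and do not come from the question. The sketch assumes the key is stored in table format (which is what `HDFStore.append()` writes) so that `get_storer(key).nrows` reports the current row count.

```python
import pandas as pd


def with_offset_index(df, start):
    """Return a copy of df whose index is start, start + 1, ... (hypothetical helper)."""
    out = df.copy()
    out.index = pd.RangeIndex(start, start + len(out))
    return out


def append_batches(store, key, batches):
    """Append each DataFrame in batches under key, continuing the row numbering.

    store is expected to behave like an open pd.HDFStore. Returns the total
    number of rows stored under key after appending.
    """
    # Rows already stored under this key; 0 if the key does not exist yet.
    nrows = store.get_storer(key).nrows if key in store else 0
    for df in batches:
        # Discard the batch's own index and continue from the stored row count.
        store.append(key, with_offset_index(df, nrows))
        nrows += len(df)
    return nrows
```

With the `hdf_output` store from the question, each batch DataFrame could be passed as `append_batches(hdf_output, 'logs', [df])`, where `'logs'` is a placeholder key name. Because the offset is read from the store on every call, the numbering should also carry on across separate import files and separate runs.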

python indexing dataframe pandas hdfstore

Ben*_*rey

2014-03-19

13
votes

1
answer

10k
views