Posts by war*_*ckh

Pyarrow: apply a schema when using Pandas to_parquet()

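I have a very wide DataFrame (20,000 columns) in Pandas, made up mostly of float64 columns. I want to cast these columns to float32 and write them out in Parquet format. I am doing this because the downstream users of these files are small containers with limited memory.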

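I currently cast in Pandas and then write out to Parquet, but the cast is very slow on a wide data set. Is it possible to cast the types as part of the to_parquet write itself? A dummy example is shown below.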

import numpy as np
import pandas as pd
import pyarrow  # Parquet engine used by to_parquet

df = pd.DataFrame(np.random.randn(3000, 15000))  # make dummy data set
df.columns = [str(x) for x in df.columns]  # make column names strings for parquet
float_cols = df.select_dtypes(include="float64").columns  # the float64 columns
df[float_cols] = df[float_cols].astype("float32")  # cast the data
df.to_parquet("myfile.parquet")  # write out the df


python pandas pyarrow

war*_*ckh

lucky-day

5
votes

2
answers

1467
views

Speeding up rolling windows in Pandas

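I have this code, which works fine and gives me the results I am looking for. It loops through a list of window sizes and creates rolling aggregations for each metric in sum_metrics_list, min_metrics_list, and max_metrics_list.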

# create the rolling aggregations for each window
for window in constants.AGGREGATION_WINDOW:
    # get the sum and count sums
    sum_metrics_names_list = [x[6:] + "_1_" + str(window) for x in sum_metrics_list]
    adt_df[sum_metrics_names_list] = adt_df.groupby('athlete_id')[sum_metrics_list].apply(lambda x : x.rolling(center = False, window = window, min_periods = 1).sum())

    # get the min of mins
    min_metrics_names_list = [x[6:] + "_1_" + str(window) for x in min_metrics_list]
    adt_df[min_metrics_names_list] = adt_df.groupby('athlete_id')[min_metrics_list].apply(lambda x : x.rolling(center = False, window = window, min_periods = 1).min()) …

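One direction that may be worth benchmarking is to call `rolling` directly on the groupby object instead of going through `apply` with a lambda, which avoids running a Python function per group. The sketch below is not the poster's code: the data, the column names, and the `windows` list (standing in for `constants.AGGREGATION_WINDOW`) are all hypothetical, and it assumes the frame has a unique index so the results can be aligned back to the original rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; rows for each athlete are interleaved.
rng = np.random.default_rng(0)
adt_df = pd.DataFrame({
    "athlete_id": np.tile([1, 2, 3], 5),
    "daily_sum_load": rng.random(15),
    "daily_sum_dist": rng.random(15),
})
sum_metrics_list = ["daily_sum_load", "daily_sum_dist"]
windows = [3, 7]  # stands in for constants.AGGREGATION_WINDOW

for window in windows:
    # Same naming rule as the question: drop the first six characters.
    names = [x[6:] + "_1_" + str(window) for x in sum_metrics_list]

    # Rolling sum per athlete, computed without a per-group lambda.
    rolled = (
        adt_df.groupby("athlete_id")[sum_metrics_list]
        .rolling(window=window, min_periods=1)
        .sum()
        .reset_index(level=0, drop=True)  # drop the group level, keep the row index
    )

    # Assigning a Series aligns on the index, so each value returns to its row.
    for src, dst in zip(sum_metrics_list, names):
        adt_df[dst] = rolled[src]
```

The same pattern should carry over to `.min()` and `.max()`; any speed-up will depend on the number of groups and columns.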

python pandas pandas-groupby

war*_*ckh

lucky-day

2
votes

1
answer

3484
views