Related questions and solutions (0)

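Adding a dask.array column to a dask.dataframe

I have a dask dataframe and a dask array with the same number of rows in the same logical order. The dataframe rows are indexed by strings. I am trying to add one of the array's columns to the dataframe. I have tried several approaches, and each fails in its own way.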

df['col'] = da.col
# TypeError: Column assignment doesn't support type Array

df['col'] = da.to_frame(columns='col')
# TypeError: '<' not supported between instances of 'str' and 'int'

df['col'] = da.to_frame(columns=['col']).set_index(df.col).col
# TypeError: '<' not supported between instances of 'str' and 'int'

df = df.reset_index()
df['col'] = da.to_frame(columns='col')
# ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.

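I also tried a few other variants.

What is the right way to add a dask array column to a dask dataframe when the two structures are logically compatible?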

python dataframe dask

Dan*_*ler

2020-08-09

8 votes

1 answer

2624 views

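Shuffling data in dask

This is a follow-up to the question Subsetting Dask DataFrames. I would like to shuffle the data from a dask dataframe before sending it in batches to an ML algorithm.

The answer to that question was to do the following: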

for part in df.repartition(npartitions=100).to_delayed():
    batch = part.compute()

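However, even if I shuffle the contents of each batch, I am a bit worried that this is less than ideal. The data is a time series, so the data points within each partition are highly correlated.

Ideally, what I would like is something like this: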

rand_idx = np.random.choice(len(df), batch_size, replace=False)
batch = df.iloc[rand_idx, :]

This works with pandas, but not with dask. Any ideas?

Edit 1: Potential solution

I tried

train_len = int(len_df*0.8)
idx = np.random.permutation(len_df)
train_idx = idx[:train_len]
test_idx = idx[train_len:]
train_df = df.loc[train_idx]
test_df = df.loc[test_idx]

However, if I do this, train_df.loc[:5,:].compute() returns a dataframe with 124451 rows. So I am clearly using dask wrong.

python dask

sac*_*ruk

2017-10-20

7 votes

1 answer

1136 views