Python equivalent of R's tidyr::complete that allows specifying additional values

Pre*_*sto 5 python r pandas dplyr pandas-groupby

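I'm trying to re-create an R script, but I'm stuck on how to reproduce this pipeline in Python. I'm analyzing cumulative production across different factories and need to normalize their cumulative production hours so they can be compared.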

The pipeline looks like this:

Norm_hrs <- Cum_df %>%
  group_by(Name) %>%
  complete(Cum_Hrs = seq(0, max(Cum_Hrs), 730.5))

It takes this:

Name        Cum_Hrs A   B           C
Factory 1   1       0   1.887861    3.775722
Factory 1   251     0   2104.335728 21932.57871
Factory 1   611     0   2324.586178 37498.99722
Factory 1   1208    0   4361.588197 65235.05541
Factory 2   48      0   1517.840244 6604.770432
Factory 2   163     0   3370.461172 17252.70972
Factory 2   822     0   13284.87786 71918.78308
Factory 2   1541    0   21476.93602 134569.0388
Factory 2   2285    0   32053.99192 225895.1477
Factory 2   3028    0   42299.41357 340798.6151
Factory 2   3699    0   50125.85599 462145.5438
Factory 2   4436    0   56715.74945 584474.9989

and turns it into this:

Name        Cum_Hrs A   B           C
Factory 1   1       0   1.887861    3.775722
Factory 1   251     0   2104.335728 21932.57871
Factory 1   611     0   2324.586178 37498.99722
Factory 1   730.5   NA  NA          NA
Factory 1   1208    0   4361.588197 65235.05541
Factory 2   48      0   1517.840244 6604.770432
Factory 2   163     0   3370.461172 17252.70972
Factory 2   730.5   NA  NA          NA
Factory 2   822     0   13284.87786 71918.78308
Factory 2   1461    NA  NA          NA
Factory 2   1541    0   21476.93602 134569.0388
Factory 2   2191.5  NA  NA          NA
Factory 2   2285    0   32053.99192 225895.1477
Factory 2   2922    NA  NA          NA
Factory 2   3028    0   42299.41357 340798.6151

This in turn lets me interpolate values for the NAs in the DataFrame at the standardized time steps.

Par*_*ait 2

Simply build a sequence data frame of incremental Cum_Hrs values for every unique Name and concatenate it with the original data:

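import numpy as np
import pandas as pd

# One regular 730.5-hour grid per factory, from 0 up to (but excluding) its max Cum_Hrs.
# Grouping by the plain string 'Name' keeps the group key a scalar rather than a tuple.
seq_df = pd.concat([pd.DataFrame({'Name': name, 'Cum_Hrs': np.arange(0, g['Cum_Hrs'].max(), 730.5)})
                     for name, g in df.groupby('Name')])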

final_df = (pd.concat([df, seq_df], sort=True)
              .sort_values(['Name', 'Cum_Hrs'])
              .reset_index(drop=True)
              .reindex(columns=df.columns)
            )

print(final_df)
#          Name  Cum_Hrs    A             B              C
# 0   Factory 1      0.0  NaN           NaN            NaN
# 1   Factory 1      1.0  0.0      1.887861       3.775722
# 2   Factory 1    251.0  0.0   2104.335728   21932.578710
# 3   Factory 1    611.0  0.0   2324.586178   37498.997220
# 4   Factory 1    730.5  NaN           NaN            NaN
# 5   Factory 1   1208.0  0.0   4361.588197   65235.055410
# 6   Factory 2      0.0  NaN           NaN            NaN
# 7   Factory 2     48.0  0.0   1517.840244    6604.770432
# 8   Factory 2    163.0  0.0   3370.461172   17252.709720
# 9   Factory 2    730.5  NaN           NaN            NaN
# 10  Factory 2    822.0  0.0  13284.877860   71918.783080
# 11  Factory 2   1461.0  NaN           NaN            NaN
# 12  Factory 2   1541.0  0.0  21476.936020  134569.038800
# 13  Factory 2   2191.5  NaN           NaN            NaN
# 14  Factory 2   2285.0  0.0  32053.991920  225895.147700
# 15  Factory 2   2922.0  NaN           NaN            NaN
# 16  Factory 2   3028.0  0.0  42299.413570  340798.615100
# 17  Factory 2   3652.5  NaN           NaN            NaN
# 18  Factory 2   3699.0  0.0  50125.855990  462145.543800
# 19  Factory 2   4383.0  NaN           NaN            NaN
# 20  Factory 2   4436.0  0.0  56715.749450  584474.998900
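The NaN rows on the grid can then be filled, which is the step the question is ultimately after. The following is only a sketch under assumptions: the toy `final_df`, the single value column `B`, and the helper name `interpolate_group` are made up for illustration, and linear interpolation in hours is assumed to be what is wanted.

```python
import numpy as np
import pandas as pd

# Toy stand-in for final_df: observed rows plus NaN rows on the 730.5-hour grid.
final_df = pd.DataFrame({
    "Name": ["Factory 1"] * 4 + ["Factory 2"] * 3,
    "Cum_Hrs": [0.0, 500.0, 730.5, 1000.0, 0.0, 730.5, 1461.0],
    "B": [0.0, 100.0, np.nan, 200.0, 10.0, np.nan, 30.0],
})

def interpolate_group(g, value_cols=("B",)):
    """Hypothetical helper: fill NaNs in one factory's rows, linear in Cum_Hrs."""
    cols = list(value_cols)
    out = g.set_index("Cum_Hrs")
    # method="index" uses the Cum_Hrs index as the x-axis instead of row position.
    out[cols] = out[cols].interpolate(method="index")
    # Restore the original column order after moving Cum_Hrs back to a column.
    return out.reset_index()[list(g.columns)]

# Interpolate each factory separately so values never leak across groups.
filled = pd.concat(
    [interpolate_group(g) for _, g in final_df.groupby("Name")],
    ignore_index=True,
)
print(filled)
```

With the default settings, NaNs that precede a group's first observed value (such as the Cum_Hrs = 0 rows in the output above) are left as NaN, so they would need a separate rule if a value is required there.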

A similar process can be done in base R. Translating base R (non-tidyverse) code to Pandas is often easier:

  • seq ==> np.arange
  • by ==> pd.DataFrame.groupby
  • data.frame ==> pd.DataFrame
  • do.call + rbind ==> pd.concat
  • order ==> pd.DataFrame.sort_values
  • row.names=NULL ==> pd.DataFrame.reset_index()
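One more pair in the same spirit is base R's `merge(..., all=TRUE)`, which corresponds to `pd.merge(..., how='outer')`. The sketch below uses it as a variant of the concatenation approach, on the assumption that a grid value may coincide with an observed Cum_Hrs and should then not produce a second row (which is how `complete` behaves). The helper name `complete_grid` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def complete_grid(df, step=730.5):
    """Hypothetical helper: add missing grid rows per Name without duplicating existing ones."""
    # seq + by + do.call(rbind, ...): one regular grid per Name, excluding the group max.
    grid = pd.concat(
        [pd.DataFrame({"Name": name, "Cum_Hrs": np.arange(0, g["Cum_Hrs"].max(), step)})
         for name, g in df.groupby("Name")],
        ignore_index=True,
    )
    # merge(all=TRUE): keep every observed row, add only grid rows not already present.
    return (df.merge(grid, on=["Name", "Cum_Hrs"], how="outer")
              .sort_values(["Name", "Cum_Hrs"])   # order
              .reset_index(drop=True))            # row.names = NULL

# Toy data in which the grid point 0.0 already exists as an observation.
df = pd.DataFrame({
    "Name": ["Factory 1"] * 3,
    "Cum_Hrs": [0.0, 611.0, 1208.0],
    "B": [1.0, 2.0, 3.0],
})
out = complete_grid(df)
print(out)
```

The merge matches keys by exact equality, so this only deduplicates when the observed hours equal a grid value exactly as floats.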

Rextester demo