Expand a DataFrame based on start and end columns (speed)

Eri*_*c B 8 python numpy pandas

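I have a pandas.DataFrame containing start and end columns, plus a couple of additional columns. I would like to expand this dataframe into a time series that starts at the start value and ends at the end value, while copying the other columns. So far I have come up with the following: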

import pandas as pd
import datetime as dt

df = pd.DataFrame()
df['start'] = [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5), dt.datetime(2017, 4, 10)]
df['end'] = [dt.datetime(2017, 4, 10), dt.datetime(2017, 4, 12), dt.datetime(2017, 4, 17)]
df['country'] = ['US', 'EU', 'UK']
df['letter'] = ['a', 'b', 'c']

data_series = list()
for row in df.itertuples():
    time_range = pd.bdate_range(row.start, row.end)
    s = len(time_range)
    data_series += (zip(time_range, [row.start]*s, [row.end]*s, [row.country]*s, [row.letter]*s))

columns_names = ['date', 'start', 'end', 'country', 'letter']
df = pd.DataFrame(data_series, columns=columns_names)

Starting DataFrame:

       start        end country letter
0 2017-04-03 2017-04-10      US      a
1 2017-04-05 2017-04-12      EU      b
2 2017-04-10 2017-04-17      UK      c

Desired output:

         date      start        end country letter
0  2017-04-03 2017-04-03 2017-04-10      US      a
1  2017-04-04 2017-04-03 2017-04-10      US      a
2  2017-04-05 2017-04-03 2017-04-10      US      a
3  2017-04-06 2017-04-03 2017-04-10      US      a
4  2017-04-07 2017-04-03 2017-04-10      US      a
5  2017-04-10 2017-04-03 2017-04-10      US      a
6  2017-04-05 2017-04-05 2017-04-12      EU      b
7  2017-04-06 2017-04-05 2017-04-12      EU      b
8  2017-04-07 2017-04-05 2017-04-12      EU      b
9  2017-04-10 2017-04-05 2017-04-12      EU      b
10 2017-04-11 2017-04-05 2017-04-12      EU      b
11 2017-04-12 2017-04-05 2017-04-12      EU      b
12 2017-04-10 2017-04-10 2017-04-17      UK      c
13 2017-04-11 2017-04-10 2017-04-17      UK      c
14 2017-04-12 2017-04-10 2017-04-17      UK      c
15 2017-04-13 2017-04-10 2017-04-17      UK      c
16 2017-04-14 2017-04-10 2017-04-17      UK      c
17 2017-04-17 2017-04-10 2017-04-17      UK      c

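The problem with my solution is that when I apply it to a much bigger dataframe (mostly in terms of rows), it does not produce the result fast enough. Does anybody have ideas on how I could improve it? I am also open to solutions in numpy.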

Ste*_*uch 8

First, we can build the dates you need, while keeping track of the number of days in each row in the list deltas:

dates = [pd.Series(pd.bdate_range(row[1].start, row[1].end))
         for row in df[['start', 'end']].iterrows()]
deltas = [len(x) for x in dates]
dates = pd.Series(pd.concat(dates).values, name='date')

Then use np.repeat to build a data matrix with the appropriate segment lengths:

df2 = pd.DataFrame(np.repeat(df.values, deltas, axis=0), columns=df.columns)
df2 = df2.astype(dtype={"start": "datetime64[ns]", "end": "datetime64[ns]"})
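
To see what np.repeat does with a list of per-row counts, here is a minimal sketch on hypothetical toy data (the names rows and counts are illustrative only and not part of the answer):

```python
import numpy as np

# Hypothetical toy data: three rows, to be repeated 2, 1 and 3 times.
rows = np.array([["US", "a"], ["EU", "b"], ["UK", "c"]], dtype=object)
counts = [2, 1, 3]

# axis=0 repeats whole rows; counts[i] is the number of copies of row i.
expanded = np.repeat(rows, counts, axis=0)
print(expanded.shape)           # (6, 2)
print(expanded[:, 0].tolist())  # ['US', 'US', 'EU', 'UK', 'UK', 'UK']
```

Because df.values on a mixed-type frame is an object array, the repeated start and end columns come back as object, which is why the astype call follows the np.repeat step.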

Then insert the dates at the front of the dataframe:

df2 = pd.concat([dates, df2], axis=1)

Test code:

import pandas as pd
import numpy as np
import datetime as dt

df = pd.DataFrame()
df['start'] = [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5),
               dt.datetime(2017, 4, 10)]
df['end'] = [dt.datetime(2017, 4, 10), dt.datetime(2017, 4, 12),
             dt.datetime(2017, 4, 17)]
df['country'] = ['US', 'EU', 'UK']
df['letter'] = ['a', 'b', 'c']

dates = [pd.Series(pd.bdate_range(row[1].start, row[1].end))
         for row in df[['start', 'end']].iterrows()]
deltas = [len(x) for x in dates]
dates = pd.Series(pd.concat(dates).values, name='date')

df2 = pd.DataFrame(np.repeat(df.values, deltas, axis=0), columns=df.columns)
df2 = df2.astype(dtype={"start": "datetime64[ns]", "end": "datetime64[ns]"})
df2 = pd.concat([dates, df2], axis=1)
print(df2)

Results:

         date      start        end country letter
0  2017-04-03 2017-04-03 2017-04-10      US      a
1  2017-04-04 2017-04-03 2017-04-10      US      a
2  2017-04-05 2017-04-03 2017-04-10      US      a
3  2017-04-06 2017-04-03 2017-04-10      US      a
4  2017-04-07 2017-04-03 2017-04-10      US      a
5  2017-04-10 2017-04-03 2017-04-10      US      a
6  2017-04-05 2017-04-05 2017-04-12      EU      b
7  2017-04-06 2017-04-05 2017-04-12      EU      b
8  2017-04-07 2017-04-05 2017-04-12      EU      b
9  2017-04-10 2017-04-05 2017-04-12      EU      b
10 2017-04-11 2017-04-05 2017-04-12      EU      b
11 2017-04-12 2017-04-05 2017-04-12      EU      b
12 2017-04-10 2017-04-10 2017-04-17      UK      c
13 2017-04-11 2017-04-10 2017-04-17      UK      c
14 2017-04-12 2017-04-10 2017-04-17      UK      c
15 2017-04-13 2017-04-10 2017-04-17      UK      c
16 2017-04-14 2017-04-10 2017-04-17      UK      c
17 2017-04-17 2017-04-10 2017-04-17      UK      c


Max*_*axU 7

Inspired by @StephenRauch's solution, I would like to post mine (which is pretty similar):

dates = [pd.bdate_range(r[0],r[1]).to_series() for r in df[['start','end']].values]
lens = [len(x) for x in dates]

r = pd.DataFrame(
        {col:np.repeat(df[col].values, lens) for col in df.columns}
    ).assign(date=np.concatenate(dates))

Result:

In [259]: r
Out[259]:
   country        end letter      start       date
0       US 2017-04-10      a 2017-04-03 2017-04-03
1       US 2017-04-10      a 2017-04-03 2017-04-04
2       US 2017-04-10      a 2017-04-03 2017-04-05
3       US 2017-04-10      a 2017-04-03 2017-04-06
4       US 2017-04-10      a 2017-04-03 2017-04-07
5       US 2017-04-10      a 2017-04-03 2017-04-10
6       EU 2017-04-12      b 2017-04-05 2017-04-05
7       EU 2017-04-12      b 2017-04-05 2017-04-06
8       EU 2017-04-12      b 2017-04-05 2017-04-07
9       EU 2017-04-12      b 2017-04-05 2017-04-10
10      EU 2017-04-12      b 2017-04-05 2017-04-11
11      EU 2017-04-12      b 2017-04-05 2017-04-12
12      UK 2017-04-17      c 2017-04-10 2017-04-10
13      UK 2017-04-17      c 2017-04-10 2017-04-11
14      UK 2017-04-17      c 2017-04-10 2017-04-12
15      UK 2017-04-17      c 2017-04-10 2017-04-13
16      UK 2017-04-17      c 2017-04-10 2017-04-14
17      UK 2017-04-17      c 2017-04-10 2017-04-17
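
As a sanity check, the sketch below compares the per-column np.repeat approach against the question's original loop on the sample data. The helper names expand_loop and expand_repeat are hypothetical (they do not appear in the answers), and the date column is inserted at the front so both frames share the same column order:

```python
import datetime as dt

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "start": [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5), dt.datetime(2017, 4, 10)],
    "end": [dt.datetime(2017, 4, 10), dt.datetime(2017, 4, 12), dt.datetime(2017, 4, 17)],
    "country": ["US", "EU", "UK"],
    "letter": ["a", "b", "c"],
})


def expand_loop(frame):
    """Row-by-row expansion, equivalent in spirit to the question's loop."""
    rows = []
    for row in frame.itertuples():
        for day in pd.bdate_range(row.start, row.end):
            rows.append((day, row.start, row.end, row.country, row.letter))
    return pd.DataFrame(rows, columns=["date", "start", "end", "country", "letter"])


def expand_repeat(frame):
    """Repeat each column by the number of business days in its row."""
    dates = [pd.bdate_range(s, e) for s, e in zip(frame["start"], frame["end"])]
    lens = [len(d) for d in dates]
    out = pd.DataFrame({col: np.repeat(frame[col].values, lens) for col in frame.columns})
    out.insert(0, "date", np.concatenate([d.values for d in dates]))
    return out


a = expand_loop(df)
b = expand_repeat(df)
same = all(a[col].tolist() == b[col].tolist() for col in a.columns)
print(len(a), len(b), same)
```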


jez*_*ael 6

Timings, plus 3 other solutions:

#original solution 
In [163]: %%timeit
     ...: data_series = list()
     ...: for row in df.itertuples():
     ...:     time_range = pd.bdate_range(row.start, row.end)
     ...:     s = len(time_range)
     ...:     data_series += (zip(time_range, [row.start]*s, [row.end]*s, [row.country]*s, [row.letter]*s))
     ...: 
     ...: columns_names = ['date', 'start', 'end', 'country', 'letter']
     ...: df3 = pd.DataFrame(data_series, columns=columns_names)
     ...: 
1 loop, best of 3: 634 ms per loop
#Stephen Rauch solution, slightly changed to avoid warnings
In [164]: %%timeit
     ...: dates = [pd.Series(pd.bdate_range(row[1].start, row[1].end))
     ...:          for row in df[['start', 'end']].iterrows()]
     ...: deltas = [len(x) for x in dates]
     ...: dates = pd.Series(pd.concat(dates).values, name='date')
     ...: df2 = pd.DataFrame(np.repeat(df.values, deltas, axis=0), columns=df.columns)
     ...: df2['start'] = pd.to_datetime(df2['start'])
     ...: df2['end'] = pd.to_datetime(df2['end'])
     ...: df2 = pd.concat([dates, df2], axis=1)
     ...: 
1 loop, best of 3: 669 ms per loop

#maxu solution
In [165]: %%timeit
     ...: dates = [pd.bdate_range(r[0],r[1]).to_series() for r in df[['start','end']].values]
     ...: lens = [len(x) for x in dates]
     ...: r = pd.DataFrame(
     ...:         {col:np.repeat(df[col].values, lens) for col in df.columns}
     ...:     ).assign(date=np.concatenate(dates))
     ...: 
1 loop, best of 3: 609 ms per loop
#jezrael solution1
In [166]: %%timeit
     ...: df1 = pd.concat([pd.Series(r.Index, 
     ...:                            pd.bdate_range(r.start, r.end)) 
     ...:                            for r in df.itertuples()]).reset_index()
     ...: df1.columns = ['date','idx']
     ...: df2 = df1.set_index('idx').join(df).reset_index(drop=True)
     ...: 
1 loop, best of 3: 632 ms per loop

#jezrael solution2 (improved maxu solution)
In [167]: %%timeit
     ...: dates = [pd.bdate_range(r[0],r[1]) for r in df[['start','end']].values]
     ...: lens = [len(x) for x in dates]
     ...: 
     ...: df4 = pd.DataFrame(
     ...:         {col:np.repeat(df[col].values, lens) for col in df.columns}
     ...:     )
     ...: df4.insert(0, 'date', np.concatenate(dates))
     ...: 
1 loop, best of 3: 583 ms per loop
#jezrael solution 3
In [207]: %%timeit
     ...: dates = [pd.bdate_range(r[0],r[1]) for r in df[['start','end']].values]
     ...: lens = [len(x) for x in dates]
     ...: r = np.repeat(df.index.values, lens)
     ...: df2 = pd.DataFrame(df.values[r], df.index[r], df.columns).reset_index(drop=True)
     ...: df2['start'] = pd.to_datetime(df2['start'])
     ...: df2['end'] = pd.to_datetime(df2['end'])
     ...: df2.insert(0, 'date', np.concatenate(dates))
     ...: 
1 loop, best of 3: 609 ms per loop
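
Solution 1 above is posted without commentary. As I read it, it builds one Series per row that maps each business date to the original row label, concatenates them, and then joins the original columns back on that label. A sketch on the question's small sample frame (the intermediate name pieces is mine, added for readability):

```python
import datetime as dt

import pandas as pd

df = pd.DataFrame({
    "start": [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5), dt.datetime(2017, 4, 10)],
    "end": [dt.datetime(2017, 4, 10), dt.datetime(2017, 4, 12), dt.datetime(2017, 4, 17)],
    "country": ["US", "EU", "UK"],
    "letter": ["a", "b", "c"],
})

# One Series per row: index = business dates, value = that row's label.
pieces = [pd.Series(r.Index, index=pd.bdate_range(r.start, r.end))
          for r in df.itertuples()]
df1 = pd.concat(pieces).reset_index()
df1.columns = ["date", "idx"]

# Join the original columns back on the row label, then discard the label.
out = df1.set_index("idx").join(df).reset_index(drop=True)
print(out.shape)  # (18, 5)
```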

Code for timings:

import datetime as dt
import numpy as np
import pandas as pd

df = pd.DataFrame()
N = 100
#N = 1
df['start'] = [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5), dt.datetime(2017, 4, 10)]*N
df['end'] = [dt.datetime(2017, 8, 10), dt.datetime(2017, 5, 12), dt.datetime(2017, 5, 17)]*N
df['country'] = ['US', 'EU', 'UK']*N
df['letter'] = ['a', 'b', 'c']*N
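
The timings above rely on IPython's %%timeit magic. Outside IPython, a rough equivalent with the standard timeit module might look like the sketch below. The function name repeat_solution and the smaller N = 10 are my assumptions to keep the run short, so the absolute numbers will not match those quoted above:

```python
import datetime as dt
import timeit

import numpy as np
import pandas as pd

N = 10  # assumption: smaller than the N = 100 used above
df = pd.DataFrame({
    "start": [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5), dt.datetime(2017, 4, 10)] * N,
    "end": [dt.datetime(2017, 8, 10), dt.datetime(2017, 5, 12), dt.datetime(2017, 5, 17)] * N,
    "country": ["US", "EU", "UK"] * N,
    "letter": ["a", "b", "c"] * N,
})


def repeat_solution():
    """Per-column np.repeat expansion, along the lines of solution 2."""
    dates = [pd.bdate_range(s, e) for s, e in zip(df["start"], df["end"])]
    lens = [len(d) for d in dates]
    out = pd.DataFrame({col: np.repeat(df[col].values, lens) for col in df.columns})
    out.insert(0, "date", np.concatenate([d.values for d in dates]))
    return out


# Three independent runs of one call each; report the fastest.
times = timeit.repeat(repeat_solution, number=1, repeat=3)
print(f"best of 3: {min(times) * 1000:.1f} ms")
```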