Eri*_*c B 8 python numpy pandas
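I have a pandas.DataFrame that contains a start column and an end column, plus a few additional columns. I would like to expand this dataframe into a time series that begins at the start value and ends at the end value, while copying the other columns. So far I have come up with the following: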
import pandas as pd
import datetime as dt
df = pd.DataFrame()
df['start'] = [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5), dt.datetime(2017, 4, 10)]
df['end'] = [dt.datetime(2017, 4, 10), dt.datetime(2017, 4, 12), dt.datetime(2017, 4, 17)]
df['country'] = ['US', 'EU', 'UK']
df['letter'] = ['a', 'b', 'c']
data_series = list()
for row in df.itertuples():
    time_range = pd.bdate_range(row.start, row.end)
    s = len(time_range)
    data_series += (zip(time_range, [row.start]*s, [row.end]*s, [row.country]*s, [row.letter]*s))
columns_names = ['date', 'start', 'end', 'country', 'letter']
df = pd.DataFrame(data_series, columns=columns_names)
Starting DataFrame:
start end country letter
0 2017-04-03 2017-04-10 US a
1 2017-04-05 2017-04-12 EU b
2 2017-04-10 2017-04-17 UK c
Desired output:
date start end country letter
0 2017-04-03 2017-04-03 2017-04-10 US a
1 2017-04-04 2017-04-03 2017-04-10 US a
2 2017-04-05 2017-04-03 2017-04-10 US a
3 2017-04-06 2017-04-03 2017-04-10 US a
4 2017-04-07 2017-04-03 2017-04-10 US a
5 2017-04-10 2017-04-03 2017-04-10 US a
6 2017-04-05 2017-04-05 2017-04-12 EU b
7 2017-04-06 2017-04-05 2017-04-12 EU b
8 2017-04-07 2017-04-05 2017-04-12 EU b
9 2017-04-10 2017-04-05 2017-04-12 EU b
10 2017-04-11 2017-04-05 2017-04-12 EU b
11 2017-04-12 2017-04-05 2017-04-12 EU b
12 2017-04-10 2017-04-10 2017-04-17 UK c
13 2017-04-11 2017-04-10 2017-04-17 UK c
14 2017-04-12 2017-04-10 2017-04-17 UK c
15 2017-04-13 2017-04-10 2017-04-17 UK c
16 2017-04-14 2017-04-10 2017-04-17 UK c
17 2017-04-17 2017-04-10 2017-04-17 UK c
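The problem with my solution is that when I apply it to a much larger dataframe (mostly in terms of rows), it does not produce the result fast enough. Does anybody have ideas on how I could improve it? I am also open to solutions in numpy.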
First, we can build the dates you need, while keeping track of the number of days in each row in the list deltas:
dates = [pd.Series(pd.bdate_range(row[1].start, row[1].end))
         for row in df[['start', 'end']].iterrows()]
deltas = [len(x) for x in dates]
dates = pd.Series(pd.concat(dates).values, name='date')
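As a side note, the per-row lengths can also be computed without building each range first. The following is only a sketch, not part of the answer above: it assumes a plain Monday-to-Friday week with no holidays (the default of pd.bdate_range), timestamps at midnight, and start <= end in every row. np.busday_count excludes its end date, so one day is added to make the end inclusive.

```python
import datetime as dt

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "start": [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5), dt.datetime(2017, 4, 10)],
    "end": [dt.datetime(2017, 4, 10), dt.datetime(2017, 4, 12), dt.datetime(2017, 4, 17)],
})

# np.busday_count works on day-resolution dates.
starts = df["start"].values.astype("datetime64[D]")
ends = df["end"].values.astype("datetime64[D]")

# The end date is exclusive in busday_count, so shift it by one day.
deltas = np.busday_count(starts, ends + np.timedelta64(1, "D"))
print(deltas.tolist())
```

For the sample data this gives six business days for every row, which matches the 18 rows of the desired output.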
Then use np.repeat to build a data matrix with the appropriate segment lengths:
df2 = pd.DataFrame(np.repeat(df.values, deltas, axis=0), columns=df.columns)
df2 = df2.astype(dtype={"start": "datetime64[ns]", "end": "datetime64[ns]"})
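To make the np.repeat step easier to follow, here is a minimal illustration on a small made-up array (the values and counts are hypothetical, chosen only for the demonstration). With axis=0 and a list of counts, row i is repeated counts[i] times, in order.

```python
import numpy as np

# Three rows of "other columns"; object dtype mirrors what df.values returns
# for a frame with mixed column types.
values = np.array([["US", "a"], ["EU", "b"], ["UK", "c"]], dtype=object)

# Hypothetical per-row lengths: row 0 twice, row 1 once, row 2 three times.
counts = [2, 1, 3]

expanded = np.repeat(values, counts, axis=0)
print(expanded)
```

Because df.values on a mixed-type frame is an object array, the datetime columns lose their dtype in the repeated matrix, which is why the answer casts start and end back afterwards.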
Then insert the dates at the front of the dataframe:
df2 = pd.concat([dates, df2], axis=1)
Test code:
import pandas as pd
import numpy as np
import datetime as dt
df = pd.DataFrame()
df['start'] = [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5),
               dt.datetime(2017, 4, 10)]
df['end'] = [dt.datetime(2017, 4, 10), dt.datetime(2017, 4, 12),
             dt.datetime(2017, 4, 17)]
df['country'] = ['US', 'EU', 'UK']
df['letter'] = ['a', 'b', 'c']
dates = [pd.Series(pd.bdate_range(row[1].start, row[1].end))
         for row in df[['start', 'end']].iterrows()]
deltas = [len(x) for x in dates]
dates = pd.Series(pd.concat(dates).values, name='date')
df2 = pd.DataFrame(np.repeat(df.values, deltas, axis=0), columns=df.columns)
df2 = df2.astype(dtype={"start": "datetime64[ns]", "end": "datetime64[ns]"})
df2 = pd.concat([dates, df2], axis=1)
print(df2)
Result:
date start end country letter
0 2017-04-03 2017-04-03 2017-04-10 US a
1 2017-04-04 2017-04-03 2017-04-10 US a
2 2017-04-05 2017-04-03 2017-04-10 US a
3 2017-04-06 2017-04-03 2017-04-10 US a
4 2017-04-07 2017-04-03 2017-04-10 US a
5 2017-04-10 2017-04-03 2017-04-10 US a
6 2017-04-05 2017-04-05 2017-04-12 EU b
7 2017-04-06 2017-04-05 2017-04-12 EU b
8 2017-04-07 2017-04-05 2017-04-12 EU b
9 2017-04-10 2017-04-05 2017-04-12 EU b
10 2017-04-11 2017-04-05 2017-04-12 EU b
11 2017-04-12 2017-04-05 2017-04-12 EU b
12 2017-04-10 2017-04-10 2017-04-17 UK c
13 2017-04-11 2017-04-10 2017-04-17 UK c
14 2017-04-12 2017-04-10 2017-04-17 UK c
15 2017-04-13 2017-04-10 2017-04-17 UK c
16 2017-04-14 2017-04-10 2017-04-17 UK c
17 2017-04-17 2017-04-10 2017-04-17 UK c
Inspired by @StephenRauch's solution, I would like to post mine (very similar):
dates = [pd.bdate_range(r[0],r[1]).to_series() for r in df[['start','end']].values]
lens = [len(x) for x in dates]
r = pd.DataFrame(
        {col:np.repeat(df[col].values, lens) for col in df.columns}
    ).assign(date=np.concatenate(dates))
Result:
In [259]: r
Out[259]:
country end letter start date
0 US 2017-04-10 a 2017-04-03 2017-04-03
1 US 2017-04-10 a 2017-04-03 2017-04-04
2 US 2017-04-10 a 2017-04-03 2017-04-05
3 US 2017-04-10 a 2017-04-03 2017-04-06
4 US 2017-04-10 a 2017-04-03 2017-04-07
5 US 2017-04-10 a 2017-04-03 2017-04-10
6 EU 2017-04-12 b 2017-04-05 2017-04-05
7 EU 2017-04-12 b 2017-04-05 2017-04-06
8 EU 2017-04-12 b 2017-04-05 2017-04-07
9 EU 2017-04-12 b 2017-04-05 2017-04-10
10 EU 2017-04-12 b 2017-04-05 2017-04-11
11 EU 2017-04-12 b 2017-04-05 2017-04-12
12 UK 2017-04-17 c 2017-04-10 2017-04-10
13 UK 2017-04-17 c 2017-04-10 2017-04-11
14 UK 2017-04-17 c 2017-04-10 2017-04-12
15 UK 2017-04-17 c 2017-04-10 2017-04-13
16 UK 2017-04-17 c 2017-04-10 2017-04-14
17 UK 2017-04-17 c 2017-04-10 2017-04-17
Timings + 3 other solutions:
#original solution
In [163]: %%timeit
...: data_series = list()
...: for row in df.itertuples():
...:     time_range = pd.bdate_range(row.start, row.end)
...:     s = len(time_range)
...:     data_series += (zip(time_range, [row.start]*s, [row.end]*s, [row.country]*s, [row.letter]*s))
...:
...: columns_names = ['date', 'start', 'end', 'country', 'letter']
...: df3 = pd.DataFrame(data_series, columns=columns_names)
...:
1 loop, best of 3: 634 ms per loop
#Stephen Rauch solution, slightly changed because of warnings
In [164]: %%timeit
...: dates = [pd.Series(pd.bdate_range(row[1].start, row[1].end))
...:          for row in df[['start', 'end']].iterrows()]
...: deltas = [len(x) for x in dates]
...: dates = pd.Series(pd.concat(dates).values, name='date')
...: df2 = pd.DataFrame(np.repeat(df.values, deltas, axis=0), columns=df.columns)
...: df2['start'] = pd.to_datetime(df2['start'])
...: df2['end'] = pd.to_datetime(df2['end'])
...: df2 = pd.concat([dates, df2], axis=1)
...:
1 loop, best of 3: 669 ms per loop
#maxu solution
In [165]: %%timeit
...: dates = [pd.bdate_range(r[0],r[1]).to_series() for r in df[['start','end']].values]
...: lens = [len(x) for x in dates]
...: r = pd.DataFrame(
...:         {col:np.repeat(df[col].values, lens) for col in df.columns}
...:     ).assign(date=np.concatenate(dates))
...:
1 loop, best of 3: 609 ms per loop
#jezrael solution1
In [166]: %%timeit
...: df1 = pd.concat([pd.Series(r.Index,
...:                            pd.bdate_range(r.start, r.end))
...:                  for r in df.itertuples()]).reset_index()
...: df1.columns = ['date','idx']
...: df2 = df1.set_index('idx').join(df).reset_index(drop=True)
...:
1 loop, best of 3: 632 ms per loop
#jezrael solution2 (improved maxu solution)
In [167]: %%timeit
...: dates = [pd.bdate_range(r[0],r[1]) for r in df[['start','end']].values]
...: lens = [len(x) for x in dates]
...:
...: df4 = pd.DataFrame(
...:         {col:np.repeat(df[col].values, lens) for col in df.columns}
...:     )
...: df4.insert(0, 'date', np.concatenate(dates))
...:
1 loop, best of 3: 583 ms per loop
#jezrael solution 3
In [207]: %%timeit
...: dates = [pd.bdate_range(r[0],r[1]) for r in df[['start','end']].values]
...: lens = [len(x) for x in dates]
...: r = np.repeat(df.index.values, lens)
...: df2 = pd.DataFrame(df.values[r], df.index[r], df.columns).reset_index(drop=True)
...: df2['start'] = pd.to_datetime(df2['start'])
...: df2['end'] = pd.to_datetime(df2['end'])
...: df2.insert(0, 'date', np.concatenate(dates))
...:
1 loop, best of 3: 609 ms per loop
Setup code for the timings:
import pandas as pd
import numpy as np
import datetime as dt
df = pd.DataFrame()
N = 100
#N = 1
df['start'] = [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5), dt.datetime(2017, 4, 10)]*N
df['end'] = [dt.datetime(2017, 8, 10), dt.datetime(2017, 5, 12), dt.datetime(2017, 5, 17)]*N
df['country'] = ['US', 'EU', 'UK']*N
df['letter'] = ['a', 'b', 'c']*N
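All of the variants timed above still call pd.bdate_range once per row, which may explain why their timings are so close. Since the question also asks about a numpy solution, here is a sketch that builds every calendar day in one pass and then filters with np.is_busday. It is not one of the timed solutions and has not been benchmarked here. It assumes a Monday-to-Friday week without holidays, timestamps at midnight, no missing values, and start <= end in every row; the variable names are made up for the illustration.

```python
import datetime as dt

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "start": [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5), dt.datetime(2017, 4, 10)],
    "end": [dt.datetime(2017, 4, 10), dt.datetime(2017, 4, 12), dt.datetime(2017, 4, 17)],
    "country": ["US", "EU", "UK"],
    "letter": ["a", "b", "c"],
})

starts = df["start"].values.astype("datetime64[D]")
ends = df["end"].values.astype("datetime64[D]")

# Calendar days per row, with the end date included.
n_days = (ends - starts).astype(np.int64) + 1

# Positional row index, repeated once per calendar day of that row.
row_idx = np.repeat(np.arange(len(df)), n_days)

# Day offset inside each row: 0, 1, ..., n_days[i] - 1.
first_pos = np.cumsum(n_days) - n_days
offsets = np.arange(n_days.sum()) - np.repeat(first_pos, n_days)

# Every calendar day of every row, then keep only Monday to Friday.
all_days = starts[row_idx] + offsets.astype("timedelta64[D]")
keep = np.is_busday(all_days)

out = df.iloc[row_idx[keep]].reset_index(drop=True)
out.insert(0, "date", pd.to_datetime(all_days[keep]))
print(out)
```

The trade-off is that weekend days are generated and then discarded, so roughly 2/7 of the intermediate array is wasted, in exchange for not creating one DatetimeIndex per input row.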