Resampling a DataFrame into a new DataFrame while performing some other operations

Blu*_*ero 6 python time-series pandas

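I am working with a DataFrame in which every entry (row) has a start time, a duration, and other attributes. From it I want to create a new DataFrame in which each original entry is split into 15-minute intervals, with all other attributes kept unchanged. The number of rows in the new DataFrame that come from one original entry depends on that entry's actual duration.

At first I tried pandas' resample, but it did not do what I expected. I then wrote a function using itertuples() that works well, but it took about half an hour on a DataFrame of roughly 3,000 rows. Now I want to do the same for 2 million rows, so I am looking for other options.

Suppose I have the following DataFrame: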

import pandas as pd

testdict = {'start':['2018-01-05 11:48:00', '2018-05-04 09:05:00', '2018-08-09 07:15:00', '2018-09-27 15:00:00'], 'duration':[22,8,35,2], 'Attribute_A':['abc', 'def', 'hij', 'klm'], 'id': [1,2,3,4]}
testdf = pd.DataFrame(testdict)
testdf['start'] = pd.to_datetime(testdf['start'])  # plain assignment so the column gets a datetime dtype
print(testdf)

>>>testdf
                 start  duration Attribute_A  id
0  2018-01-05 11:48:00        22         abc   1
1  2018-05-04 09:05:00         8         def   2
2  2018-08-09 07:15:00        35         hij   3
3  2018-09-27 15:00:00         2         klm   4

I would like my result to look like this:

>>>resultdf
                start  duration Attribute_A  id
0 2018-01-05 11:45:00        12         abc   1
1 2018-01-05 12:00:00        10         abc   1
2 2018-05-04 09:00:00         8         def   2
3 2018-08-09 07:15:00        15         hij   3
4 2018-08-09 07:30:00        15         hij   3
5 2018-08-09 07:45:00         5         hij   3
6 2018-09-27 15:00:00         2         klm   4

This is the function I built with itertuples; it produces the desired result shown above:

# Note: DataFrame.append was removed in pandas 2.0, so this function needs pandas < 2.0.
def min15_divider(df, newdf):
    for row in df.itertuples():
        orig_min = row.start.minute
        remains = orig_min % 15 # Check if it is already a multiple of 15
        if remains == 0:
            new_time = row.start.replace(second=0)
            if row.duration < 15: # if it is shorter than 15 min just use that for the duration
                to_append = {'start': new_time, 'Attribute_A': row.Attribute_A,
                             'duration': row.duration, 'id':row.id}
                newdf = newdf.append(to_append, ignore_index=True)
            else: # if not, divide that in 15 min intervals until duration is exceeded
                cumu_dur = 15
                while cumu_dur < row.duration:
                    to_append = {'start': new_time, 'Attribute_A': row.Attribute_A, 'id':row.id}
                    if cumu_dur < 15:
                        to_append['duration'] = cumu_dur
                    else:
                        to_append['duration'] = 15
                    new_time = new_time + pd.Timedelta('15 minutes')
                    cumu_dur = cumu_dur + 15
                    newdf = newdf.append(to_append, ignore_index=True)

                else: # add the remainder in the last 15 min interval
                    final_dur = row.duration - (cumu_dur - 15)
                    to_append = {'start': new_time, 'Attribute_A': row.Attribute_A,'duration': final_dur, 'id':row.id}
                    newdf = newdf.append(to_append, ignore_index=True)

        else: # When it is not an exact multiple of 15 min
            new_min = orig_min - remains # convert to multiple of 15
            new_time = row.start.replace(minute=new_min)
            new_time = new_time.replace(second=0)
            cumu_dur = 15 - remains # remaining minutes in the initial interval
            while cumu_dur < row.duration: # divide total in 15 min intervals until duration is exceeded
                to_append = {'start': new_time, 'Attribute_A': row.Attribute_A, 'id':row.id}
                if cumu_dur < 15:
                    to_append['duration'] = cumu_dur
                else:
                    to_append['duration'] = 15

                new_time = new_time + pd.Timedelta('15 minutes')
                cumu_dur = cumu_dur + 15
                newdf = newdf.append(to_append, ignore_index=True)

            else: # when we reach the last interval or the starting duration was less than the remaining minutes
                if row.duration < 15:
                    final_dur = row.duration # original duration less than remaining minutes in first interval
                else:
                    final_dur = row.duration - (cumu_dur - 15) # remaining duration in last interval
                to_append = {'start': new_time, 'Attribute_A': row.Attribute_A, 'duration': final_dur, 'id':row.id}
                newdf = newdf.append(to_append, ignore_index=True)
    return newdf

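Is there another way to do this without itertuples that would save time?

Thanks in advance.

PS. I apologize for anything in my post that may look odd; this is my first time asking a question on Stack Overflow.

Edit

Many entries may share the same start time, so a .groupby on 'start' could be problematic. However, each entry has a column with a unique value, simply called 'id'.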

Val*_*ino 0

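Using resample is a good idea, but since you only have the start time of each row, you need to build an end row before you can use it.

The code below assumes that every start time in the 'start' column is unique, so groupby can be used in a somewhat unusual way: each group will contain exactly one row.
I use groupby because it automatically reassembles the DataFrames returned by the function passed to apply.
Note also that the 'duration' column is converted to a timedelta in minutes, which makes the arithmetic later on easier.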

import pandas as pd

testdict = {'start':['2018-01-05 11:48:00', '2018-05-04 09:05:00', '2018-08-09 07:15:00', '2018-09-27 15:00:00'], 'duration':[22,8,35,2], 'Attribute_A':['abc', 'def', 'hij', 'klm']}
testdf = pd.DataFrame(testdict)
testdf['start'] = pd.to_datetime(testdf['start'])
testdf['duration'] = pd.to_timedelta(testdf['duration'], 'min')
print(testdf)

def calcduration(df, starttime):
    # Adjust 'duration' in place so that each 15-minute row holds only its own share.
    # Positional .iloc[row, col] assignment avoids chained assignment on a column.
    col = df.columns.get_loc('duration')
    step = pd.Timedelta(15, 'min')
    if len(df) == 1:
        return
    elif len(df) == 2:
        df.iloc[0, col] = step - (starttime - df.index[0])
        df.iloc[1, col] = df.iloc[1, col] - df.iloc[0, col]
    elif len(df) > 2:
        df.iloc[0, col] = step - (starttime - df.index[0])
        df.iloc[1:-1, col] = step
        df.iloc[-1, col] = df.iloc[-1, col] - df.iloc[:-1, col].sum()

def expandtime(x):
    # x is a one-row DataFrame; frow is a copy whose 'start' is the end time.
    frow = x.copy()
    frow['start'] = frow['start'] + frow['duration']
    gdf = pd.concat([x, frow], axis=0)
    gdf = gdf.set_index('start')
    resdf = gdf.resample('15min').nearest()
    calcduration(resdf, x['start'].iloc[0])
    return resdf

findf = testdf.groupby('start', as_index=False).apply(expandtime)
print(findf)

This code produces:

                      duration Attribute_A
  start                                   
0 2018-01-05 11:45:00 00:12:00         abc
  2018-01-05 12:00:00 00:10:00         abc
1 2018-05-04 09:00:00 00:08:00         def
2 2018-08-09 07:15:00 00:15:00         hij
  2018-08-09 07:30:00 00:15:00         hij
  2018-08-09 07:45:00 00:05:00         hij
3 2018-09-27 15:00:00 00:02:00         klm

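A bit of explanation

expandtime is the first custom function. It takes a one-row DataFrame (because we assume the 'start' values are unique), builds a second row whose 'start' equals the first row's 'start' plus the duration, and then uses resample to sample it at 15-minute intervals. The values of all other columns are duplicated.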
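To make that step concrete, here is a minimal sketch of what happens inside expandtime for a single entry, before calcduration is applied. It reuses the first sample row (11:48, 22 minutes); the variable names simply mirror the ones used in the function.

```python
import pandas as pd

# One-row frame, as expandtime would receive it (values of the first sample row).
x = pd.DataFrame({
    'start': pd.to_datetime(['2018-01-05 11:48:00']),
    'duration': pd.to_timedelta([22], unit='min'),
    'Attribute_A': ['abc'],
})

# Second row whose 'start' is the end time of the entry (11:48 + 22 min).
frow = x.copy()
frow['start'] = frow['start'] + frow['duration']

# Two rows (start and end), indexed by time, upsampled onto the 15-minute grid.
gdf = pd.concat([x, frow], axis=0).set_index('start')
resdf = gdf.resample('15min').nearest()
print(resdf)
```

At this point every 15-minute row still carries the full original duration; correcting those values is the job of calcduration.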

calcduration does some arithmetic on the 'duration' column to compute the correct duration for each row.
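The same arithmetic can also be written without groupby, which may matter for the 2-million-row case and for entries that share a start time. The following is only a sketch of an alternative, not the method used above: split_into_bins and its width parameter are hypothetical names, and it assumes 'duration' holds integer minutes (as in the question) and that width divides 60.

```python
import numpy as np
import pandas as pd


def split_into_bins(df, width=15):
    """Split each row into `width`-minute bins (hypothetical helper; width must divide 60)."""
    # Minutes between the start of the first bin and the actual start.
    offset = (df['start'].dt.minute % width).to_numpy()
    # Minute, counted from the start of the first bin, at which the entry ends.
    total = offset + df['duration'].to_numpy()
    # Number of bins each entry touches: ceil(total / width), at least one.
    n = np.maximum(1, -(-total // width))

    # Repeat every original row once per bin; the other columns are copied as-is.
    out = df.iloc[np.repeat(np.arange(len(df)), n)].reset_index(drop=True)
    # k = 0, 1, 2, ... is the position of each bin within its own entry.
    k = np.arange(n.sum()) - np.repeat(np.cumsum(n) - n, n)

    # Overlap of the entry [offset, total) with bin k, i.e. [k*width, (k+1)*width).
    lo = np.maximum(np.repeat(offset, n), k * width)
    hi = np.minimum(np.repeat(total, n), (k + 1) * width)

    out['start'] = out['start'].dt.floor(f'{width}min') + pd.to_timedelta(k * width, unit='min')
    out['duration'] = hi - lo
    return out


testdf = pd.DataFrame({
    'start': pd.to_datetime(['2018-01-05 11:48:00', '2018-05-04 09:05:00',
                             '2018-08-09 07:15:00', '2018-09-27 15:00:00']),
    'duration': [22, 8, 35, 2],
    'Attribute_A': ['abc', 'def', 'hij', 'klm'],
    'id': [1, 2, 3, 4],
})
result = split_into_bins(testdf)
print(result)
```

Because rows are repeated by position rather than grouped by 'start', duplicate start times are not a problem here, and the 'id' column is carried along unchanged. For the four sample rows the durations come out as 12, 10, 8, 15, 15, 5, 2, matching the expected result in the question.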