Reindexing a pandas time series from object dtype to datetime dtype

Bri*_*gan 30 python datetime python-2.7 pandas

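I have a time series that, although indexed by standard YYYY-MM-DD strings that are all valid dates, is not recognized as a DatetimeIndex. Coercing them into a valid DatetimeIndex seems inelegant enough to make me think I'm doing something wrong.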

I read in (someone else's lazily formatted) data that contains invalid datetime values and remove those invalid observations.

In [1]: df = pd.read_csv('data.csv',index_col=0)
In [2]: print df['2008-02-27':'2008-03-02']
Out[2]: 
             count
2008-02-27  20
2008-02-28   0
2008-02-29  27
2008-02-30   0
2008-02-31   0
2008-03-01   0
2008-03-02  17

In [3]: def clean_timestamps(df):
    # remove invalid dates like '2008-02-30' and '2009-04-31'
    to_drop = list()
    for d in df.index:
        try:
            datetime.date(int(d[0:4]),int(d[5:7]),int(d[8:10]))
        except ValueError:
            to_drop.append(d)
    df2 = df.drop(to_drop,axis=0)
    return df2

In [4]: df2 = clean_timestamps(df)
In [5]: print df2['2008-02-27':'2008-03-02']
Out[5]:
             count
2008-02-27  20
2008-02-28   0
2008-02-29  27
2008-03-01   0
2008-03-02  17

This new index is still recognized only as 'object' dtype rather than as a DatetimeIndex.

In [6]: df2.index
Out[6]: Index([2008-01-01, 2008-01-02, 2008-01-03, ..., 2012-11-27, 2012-11-28,
   2012-11-29], dtype=object)

Reindexing produces NaNs because the two indexes have different dtypes.

In [7]: i = pd.date_range(start=min(df2.index),end=max(df2.index))
In [8]: df3 = df2.reindex(index=i,columns=['count'])
In [9]: df3['2008-02-27':'2008-03-02']
Out[9]: 
            count
2008-02-27 NaN
2008-02-28 NaN
2008-02-29 NaN
2008-03-01 NaN
2008-03-02 NaN
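
The mismatch can be reproduced in isolation. This is a minimal sketch, assuming a recent pandas and using a small hypothetical frame in place of data.csv: string labels never compare equal to Timestamp labels, so `reindex` finds no matching rows until the labels are converted.

```python
import pandas as pd

# Hypothetical sample data standing in for the contents of data.csv
df2 = pd.DataFrame({'count': [20, 0, 27]},
                   index=['2008-02-27', '2008-02-28', '2008-02-29'])

i = pd.date_range(start='2008-02-27', end='2008-02-29')

# The string labels do not match the Timestamp labels, so every row is NaN
mismatched = df2.reindex(i)

# After converting the labels, reindex can align the existing rows
df2.index = pd.to_datetime(df2.index)
aligned = df2.reindex(i)
```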

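I create a new dataframe with the proper index, dump the data into a dictionary, and then populate the new dataframe from the dictionary values (skipping missing values).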

In [10]: df3 = pd.DataFrame(columns=['count'],index=i)
In [11]: values = dict(df2['count'])
In [12]: for d in i:
    try:
        df3.set_value(index=d,col='count',value=values[d.isoformat()[0:10]])
    except KeyError:
        pass
In [13]: print df3['2008-02-27':'2008-03-02']
Out[13]: 

             count
2008-02-27  20
2008-02-28   0
2008-02-29  27
2008-03-01   0
2008-03-02  17

In [14]: df3.index
Out[14]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2008-01-01 00:00:00, ..., 2012-11-29 00:00:00]
Length: 1795, Freq: D, Timezone: None

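The last part, setting values based on lookups in a string-keyed dictionary, looks especially hacky and makes me think I've missed something important.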

And*_*den 45

You can use pd.to_datetime:

In [1]: import pandas as pd

In [2]: pd.to_datetime('2008-02-27')
Out[2]: datetime.datetime(2008, 2, 27, 0, 0)

This lets you "clean" the index (or, similarly, a column) by applying it to the Series:

df.index = pd.to_datetime(df.index)

or

df['date_col'] = df['date_col'].apply(pd.to_datetime)
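
For the data in the question, the whole clean-and-reindex workflow might look roughly like the sketch below. It assumes a recent pandas in which `pd.to_datetime` accepts `errors='coerce'` (invalid dates become `NaT` instead of raising), and it uses a hypothetical in-memory frame in place of data.csv. The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the frame read from data.csv
df = pd.DataFrame(
    {'count': [20, 0, 27, 0, 0, 0, 17]},
    index=['2008-02-27', '2008-02-28', '2008-02-29', '2008-02-30',
           '2008-02-31', '2008-03-01', '2008-03-02'],
)

# Invalid dates such as '2008-02-30' are coerced to NaT rather than raising
parsed = pd.to_datetime(df.index, format='%Y-%m-%d', errors='coerce')

# Keep only the rows whose label parsed, then install the DatetimeIndex
valid = ~parsed.isna()
df2 = df.loc[valid].copy()
df2.index = parsed[valid]

# Reindex onto a complete daily range; days with no observation become NaN
full_range = pd.date_range(df2.index.min(), df2.index.max(), freq='D')
df3 = df2.reindex(full_range)
```

This replaces both the manual `try`/`except` date validation and the string-keyed dictionary lookup, since the labels are real timestamps before `reindex` is called.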

  • @Nck You want to parse that into a Period (or, hopefully, a PeriodIndex if there is a freq). That format is parsed directly by the Period constructor: `pd.Period('1996Q1')`. http://pandas.pydata.org/pandas-docs/stable/timeseries.html#period (2 upvotes)