asa*_*o23 5 python time-series pandas
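I am trying to resample a pandas DataFrame with a timestamp index to hourly frequency. I am interested in getting the most common value of a column that holds string values. However, the built-in time-series resampling functions do not include the mode as one of the default resampling methods (the way 'mean' and 'count' are).
I tried defining my own function and passing it in, but that did not work. I also tried np.bincount, but since I am working with strings it does not work either.
Here is what my data looks like: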
                     station_arrived action     lat1      lon1
date_removed
2012-01-01 13:12:00               56      A  19.4171 -99.16561
2012-01-01 13:12:00               56      A  19.4271 -99.16361
2012-01-01 15:41:00               56      A  19.4171 -99.16561
2012-01-02 08:41:00               56      C  19.4271 -99.16561
2012-01-02 11:36:00               56      C  19.2171 -99.16561
Here is my code so far:
from collections import Counter

def mode1(algo):
    # Collect the single most common item into a one-element list
    common = [ite for ite, it in Counter(algo).most_common(1)]
    return common

hourlycount2 = travels2012.resample('H', how={'station_arrived': 'count',
                                              'action': mode(travels2012['action']),
                                              'lat1': 'count', 'lon1': 'count'})
hourlycount2.head()
I get the following error:
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "C:\Program Files\Anaconda\lib\site-packages\pandas\core\generic.py", line 2836, in resample
    return sampler.resample(self).__finalize__(self)
  File "C:\Program Files\Anaconda\lib\site-packages\pandas\tseries\resample.py", line 83, in resample
    rs = self._resample_timestamps()
  File "C:\Program Files\Anaconda\lib\site-packages\pandas\tseries\resample.py", line 277, in _resample_timestamps
    result = grouped.aggregate(self._agg_method)
  File "C:\Program Files\Anaconda\lib\site-packages\pandas\core\groupby.py", line 2404, in aggregate
    result[col] = colg.aggregate(agg_how)
  File "C:\Program Files\Anaconda\lib\site-packages\pandas\core\groupby.py", line 2076, in aggregate
    ret = self._aggregate_multiple_funcs(func_or_funcs)
  File "C:\Program Files\Anaconda\lib\site-packages\pandas\core\groupby.py", line 2125, in _aggregate_multiple_funcs
    results[name] = self.aggregate(func)
  File "C:\Program Files\Anaconda\lib\site-packages\pandas\core\groupby.py", line 2073, in aggregate
    return getattr(self, func_or_funcs)(*args, **kwargs)
  File "C:\Program Files\Anaconda\lib\site-packages\pandas\core\groupby.py", line 486, in __getattr__
    (type(self).__name__, attr))
AttributeError: 'SeriesGroupBy' object has no attribute 'A '
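The values in the dictionary must be either strings that name a function (e.g. "count"/"sum"/"max") or functions that are passed each group. What you are passing is the result (a value) of mode(travels2012['action']).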
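To make the difference concrete, here is a minimal pure-Python sketch of calling the function eagerly versus handing over the function itself. It assumes the `mode` helper behaves like the Counter-based function in the question, and the `actions` list is just the sample `action` column typed out by hand. The eager call yields a plain list, which is most likely why the traceback ends with pandas looking for an attribute named 'A'.

```python
from collections import Counter


def mode(values):
    """Return the most common item as a one-element list, as in the question."""
    return [item for item, _count in Counter(values).most_common(1)]


# Hand-typed copy of the 'action' column from the sample data (assumption).
actions = ["A", "A", "A", "C", "C"]

# Calling the function right away produces a plain list, not something callable.
eager = mode(actions)

# Passing the function object leaves the call to whoever receives it.
deferred = mode

print(eager)               # ['A']
print(callable(eager))     # False
print(callable(deferred))  # True
```

Only the second form gives the resampler something it can call once per group.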
So you need to turn it into a function that is applied to each group:
In [11]: df.resample('H', how={'station_arrived': 'count',
                               'action': lambda x: mode(df['action']),
                               'lat1': 'count', 'lon1': 'count'})
Out[11]:
                    action  station_arrived  lon1  lat1
date_removed
2012-01-01 13:00:00    [A]                2     2     2
2012-01-01 14:00:00    [A]                0     0     0
2012-01-01 15:00:00    [A]                1     1     1
2012-01-01 16:00:00    [A]                0     0     0
...
I am not sure this is what you want (since it is applied to the whole column); perhaps you want the mode of each group instead:
In [12]: df.resample('H', how={'station_arrived': 'count',
                               'action': mode, 'lat1': 'count', 'lon1': 'count'})
Out[12]:
                    action  station_arrived  lon1  lat1
date_removed
2012-01-01 13:00:00    [A]                2     2     2
2012-01-01 14:00:00     []                0     0     0
2012-01-01 15:00:00    [A]                1     1     1
2012-01-01 16:00:00     []                0     0     0
...
I would prefer to see the actual value (A) rather than the value inside a list, and NaN rather than [].
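I think it is worth mentioning the Series mode method. It comes with a caveat: it always returns a Series (because there may be ties), and in the pandas version used here that Series is empty if no value occurs more than once.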
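A small sketch of that caveat, using two made-up Series (`tied` and `empty` are illustrative names, not part of the original data). It only shows the two cases that hold regardless of version: a tie gives back several values, and an empty input gives back an empty Series.

```python
import pandas as pd

# Two values tie for most common (illustrative data).
tied = pd.Series(["A", "B", "A", "B"])
# An empty bin, as produced by an hour with no rows.
empty = pd.Series([], dtype=object)

tied_modes = tied.mode()    # a Series holding both "A" and "B"
empty_modes = empty.mode()  # an empty Series, so there is no element 0 to read

print(list(tied_modes))
print(len(empty_modes))
```

Because the result is a Series in both cases, picking a single value and handling the empty case are left to the caller.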
You can wrap it as follows (you could wrap your own mode function in a similar way):
import numpy as np

def mode_(s):
    # Take the first mode; fall back to NaN when the result is empty
    try:
        return s.mode()[0]
    except IndexError:
        return np.nan

In [22]: df.resample('H', how={'station_arrived': 'count',
                               'action': mode_, 'lat1': 'count', 'lon1': 'count'})
Out[22]:
                    action  station_arrived  lon1  lat1
date_removed
2012-01-01 13:00:00      A                2     2     2
2012-01-01 14:00:00    NaN                0     0     0
2012-01-01 15:00:00    NaN                1     1     1
2012-01-01 16:00:00    NaN                0     0     0
...
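For readers on a recent pandas, where `resample` no longer accepts `how=`, the same idea can be sketched with `.resample(...).agg(...)`. This is an illustration under assumptions rather than the original answer's code: `first_mode` is a hypothetical helper name, the DataFrame is rebuilt by hand from the sample rows in the question, and the frequency alias is written as 'h' (older releases spell it 'H'). The helper checks for an empty result instead of catching an exception, since the exception type raised by `[0]` on an empty Series may differ between versions.

```python
import numpy as np
import pandas as pd


def first_mode(s):
    """Return the first mode of a Series, or NaN when there is none."""
    modes = s.mode()
    return modes.iloc[0] if not modes.empty else np.nan


# Sample rows from the question, rebuilt by hand (assumption).
index = pd.to_datetime([
    "2012-01-01 13:12:00",
    "2012-01-01 13:12:00",
    "2012-01-01 15:41:00",
    "2012-01-02 08:41:00",
    "2012-01-02 11:36:00",
])
df = pd.DataFrame(
    {
        "station_arrived": [56, 56, 56, 56, 56],
        "action": ["A", "A", "A", "C", "C"],
        "lat1": [19.4171, 19.4271, 19.4171, 19.4271, 19.2171],
        "lon1": [-99.16561, -99.16361, -99.16561, -99.16561, -99.16561],
    },
    index=index,
)
df.index.name = "date_removed"

# Strings name built-in aggregations; the callable is applied to each hourly group.
hourly = df.resample("h").agg({
    "station_arrived": "count",
    "action": first_mode,
    "lat1": "count",
    "lon1": "count",
})

print(hourly.head())
```

Note that recent pandas versions report a value that occurs only once as a mode, so an hour with a single row may show that row's action rather than NaN as in the output above.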