I have a DataFrame that I want to group by ID, for example:
import pandas as pd
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'], 'user_id': [1,2,1,1,3,1,5]})
print(df)
which produces:
item_id user_id
0 a 1
1 a 2
2 b 1
3 b 1
4 b 3
5 c 1
6 d 5
[7 rows x 2 columns]
I can easily group by ID:
grouped = df.groupby("item_id")
But how can I return only the first N groups? For example, I only want the first 3 unique item_ids.
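One possible approach, offered as a minimal sketch and not as the definitive answer: pick the first N distinct item_id values in order of appearance, filter the frame down to those rows, and group the result. The names n, first_ids and subset are hypothetical and only used for illustration.

```python
import pandas as pd

df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                   'user_id': [1, 2, 1, 1, 3, 1, 5]})

n = 3  # hypothetical: how many distinct item_ids to keep

# First n distinct ids, in order of first appearance
first_ids = df['item_id'].drop_duplicates().head(n)

# Keep only the rows that belong to those ids, then group as before
subset = df[df['item_id'].isin(first_ids)]
grouped = subset.groupby('item_id')

for item_id, group in grouped:
    print(item_id, group['user_id'].tolist())
```

Filtering before grouping keeps the result an ordinary GroupBy object, so any later aggregation works unchanged.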
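I'm trying to use the Python logging module, but I'm a bit confused here. Below is a standard script that first creates a logger, then creates a file handler and a console handler and adds them to the logger.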
import logging
logger = logging.getLogger('logging_test')
logger.setLevel(logging.DEBUG)
print(len(logger.handlers)) # output: 0
# create file handler which logs even debug messages
fh = logging.FileHandler('/home/Jian/Downloads/spam.log', mode='w')
fh.setLevel(logging.DEBUG)
# create console handler with a higher log level
ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)
# create formatter and add it to the handlers
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
ch.setFormatter(formatter)
fh.setFormatter(formatter)
# add the handlers to logger
logger.addHandler(ch)
logger.addHandler(fh)
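print(len(logger.handlers)) # …
I would like to find a way to build a custom pandas.tseries.offsets class with a 1-second frequency for trading hours. The main requirement is that the offset object is smart enough to know that the next second after '2015-06-18 16:00:00' is '2015-06-19 09:30:00' or '09:30:01', and that the time delta computed from these two timestamps is exactly 1 second (a custom offset of 1s, similar to BDay(1) for business-day frequency) and not the length of the market-closed period.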
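The reason is that when I plot intraday data from a pd.Series across several trading days (see the mock example below), there are many "step lines" (linear interpolation) between each closing price and the next day's opening price, representing the time the market is closed. Is there a way to get rid of them? I looked through the source of pandas.tseries.offsets and found pd.tseries.offsets.BusinessHour and pd.tseries.offsets.BusinessMixin, which might help, but I don't know how to use them.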
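For reference, here is a minimal sketch of how BusinessHour behaves when it is given trading hours. The 09:30 to 16:00 session and the name trading_hour are assumptions made for illustration. It only shows that the offset skips the closed period when added to a timestamp; it does not by itself make the difference between two timestamps equal to 1 second.

```python
import pandas as pd
from pandas.tseries.offsets import BusinessHour

# Assumed session: 09:30 to 16:00 on business days
trading_hour = BusinessHour(start='09:30', end='16:00')

ts = pd.Timestamp('2015-06-18 15:30:00')  # a Thursday, 30 minutes before the close

# Adding one business hour uses the remaining 30 minutes of Thursday,
# then carries the other 30 minutes over to Friday's open.
next_ts = ts + trading_hour
print(next_ts)
```

The overnight gap is skipped in offset arithmetic, but plain subtraction of the two timestamps still returns the wall-clock difference.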
import datetime as dt
import pandas as pd
import numpy as np
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
# set as a 'constant' object shared by all code in this script
BDAY_US = CustomBusinessDay(calendar=USFederalHolidayCalendar())
sample_freq = '5min'
dates = pd.date_range(start='2015-01-01', end='2015-01-31', freq=BDAY_US).date
# exclude 09:30:00, as it is included in the first time bucket
times = pd.date_range(start='09:30:00', end='16:00:00', freq=sample_freq).time[1:]
time_stamps = [dt.datetime.combine(date, time) for date in dates …
I'm not sure whether this is a bug, but pd.tseries.offsets.MonthOffset() seems to give the wrong result. It increments the day instead of the month.
import pandas as pd
ts = pd.Timestamp('2015-07-15')
print(ts)
2015-07-15 00:00:00
ts1 = ts + pd.tseries.offsets.MonthOffset(1)
print(ts1)
2015-07-16 00:00:00
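As a hedged sketch of an alternative, not an explanation of the MonthOffset behaviour itself: pd.DateOffset(months=1) shifts a timestamp by one calendar month, and MonthEnd / MonthBegin roll to month boundaries. This swaps in different offset classes from the one used in the question; depending on the pandas version, MonthOffset may not be exposed at all.

```python
import pandas as pd

ts = pd.Timestamp('2015-07-15')

# Shift by one calendar month, keeping the day of month
plus_one_month = ts + pd.DateOffset(months=1)

# Roll forward to the next month end / month start
month_end = ts + pd.offsets.MonthEnd(1)
month_begin = ts + pd.offsets.MonthBegin(1)

print(plus_one_month)
print(month_end)
print(month_begin)
```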