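Suppose I have a MultiIndex made up of a date and some categories (a single one, for simplicity, in the example below), and for each category I have a time series holding the values of some process. I only have a value when there was an observation, and I now want to add a "0" for every date on which there was no observation. I found a way that seems very inefficient (stack and unstack, which creates a great many columns when there are millions of categories).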
import datetime as dt
import pandas as pd

days = 4
# List of all dates that should be in the index
all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x)
             for x in range(days)]
df = pd.DataFrame([
    (dt.date(2013, 2, 10), 1, 4),
    (dt.date(2013, 2, 10), 2, 7),
    (dt.date(2013, 2, 11), 2, 7),
    (dt.date(2013, 2, 13), 1, 2),
    (dt.date(2013, 2, 13), 2, 3)],
    columns=['date', 'category', 'value'])
df.set_index(['date', 'category'], inplace=True)
print(df)
# Insert 0 values for missing dates
print(df.unstack().reindex(all_dates).fillna(0).stack())
print(all_dates)
                     value
date       category
2013-02-10 1             4
           2             7
2013-02-11 2             7
2013-02-13 1             2
           2             3

                     value
           category
2013-02-13 1             2
           2             3
2013-02-12 1             0
           2             0
2013-02-11 1             0
           2             7
2013-02-10 1             4
           2             7

[datetime.date(2013, 2, 13), datetime.date(2013, 2, 12),
 datetime.date(2013, 2, 11), datetime.date(2013, 2, 10)]
Does anyone know a smarter way to achieve the same thing?
Edit: I found another possible way to achieve the same result:
import datetime as dt
import pandas as pd

days = 4
# List of all dates that should be in the index
all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x) for x in range(days)]
df = pd.DataFrame([(dt.date(2013, 2, 10), 1, 4, 5),
                   (dt.date(2013, 2, 10), 2, 1, 7),
                   (dt.date(2013, 2, 10), 2, 2, 7),
                   (dt.date(2013, 2, 11), 2, 3, 7),
                   (dt.date(2013, 2, 13), 1, 4, 2),
                   (dt.date(2013, 2, 13), 2, 4, 3)],
                  columns=['date', 'category', 'cat2', 'value'])
date_col = 'date'
other_index = ['category', 'cat2']
index = [date_col] + other_index
df.set_index(index, inplace=True)
# Reindex each (category, cat2) group against the full list of dates
grouped = df.groupby(level=other_index)
df_list = []
for i, group in grouped:
    # Note: fillna(0) also sets category and cat2 to 0 in the inserted rows
    df_list.append(group.reset_index(level=other_index).reindex(all_dates).fillna(0))
print(pd.concat(df_list).set_index(other_index, append=True))
                          value
           category cat2
2013-02-13 1        4         2
2013-02-12 0        0         0
2013-02-11 0        0         0
2013-02-10 1        4         5
2013-02-13 0        0         0
2013-02-12 0        0         0
2013-02-11 0        0         0
2013-02-10 2        1         7
2013-02-13 0        0         0
2013-02-12 0        0         0
2013-02-11 0        0         0
2013-02-10 2        2         7
2013-02-13 0        0         0
2013-02-12 0        0         0
2013-02-11 2        3         7
2013-02-10 0        0         0
2013-02-13 2        4         3
2013-02-12 0        0         0
2013-02-11 0        0         0
2013-02-10 0        0         0
You can build a new MultiIndex from the Cartesian product of the index levels you want, then reindex the DataFrame with that new index.
(date_index, category_index) = df.index.levels
new_index = pd.MultiIndex.from_product([all_dates, category_index])
new_df = df.reindex(new_index)
# Optional: convert missing values to zero, and convert the data back
# to integers. See explanation below.
new_df = new_df.fillna(0).astype(int)
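That's it! The new DataFrame contains every possible combination of index values, and the existing data is aligned to it correctly.
Read on for a more detailed explanation.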
import datetime as dt
import pandas as pd

days = 4
# List of all dates that should be in the index
all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x)
             for x in range(days)]
df = pd.DataFrame([
    (dt.date(2013, 2, 10), 1, 4),
    (dt.date(2013, 2, 10), 2, 7),
    (dt.date(2013, 2, 11), 2, 7),
    (dt.date(2013, 2, 13), 1, 2),
    (dt.date(2013, 2, 13), 2, 3)],
    columns=['date', 'category', 'value'])
df.set_index(['date', 'category'], inplace=True)
Here is what the sample data looks like:
                     value
date       category
2013-02-10 1             4
           2             7
2013-02-11 2             7
2013-02-13 1             2
           2             3
With from_product we can create a new MultiIndex. The new index is the Cartesian product of all the values passed to the function.
(date_index, category_index) = df.index.levels
new_index = pd.MultiIndex.from_product([all_dates, category_index])
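To make the Cartesian product concrete, here is a minimal sketch on a smaller input. The variable names (dates, categories, idx) and the names= argument are illustrative choices and not part of the answer above.

```python
import datetime as dt
import pandas as pd

# Two dates and two categories, chosen only for illustration
dates = [dt.date(2013, 2, 13), dt.date(2013, 2, 12)]
categories = [1, 2]

# from_product pairs every date with every category, in input order
idx = pd.MultiIndex.from_product([dates, categories],
                                 names=["date", "category"])
print(len(idx))  # 4
for entry in idx:
    print(entry)
```

Passing names= keeps the level labels on the new index. The from_product call in the answer omits it, which is why the reindexed output further down has no "date" / "category" header row.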
Reindex the existing DataFrame with the new index.
Every possible combination now exists; the missing values are null (NaN).
new_df = df.reindex(new_index)
The expanded, reindexed DataFrame now looks like this:
              value
2013-02-13 1    2.0
           2    3.0
2013-02-12 1    NaN
           2    NaN
2013-02-11 1    NaN
           2    7.0
2013-02-10 1    4.0
           2    7.0
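Notice that the data in the new DataFrame was converted from integers to floats, because a regular (NumPy-backed) integer column in pandas cannot hold null values. Optionally, we can convert all nulls to 0 and cast the data back to integers.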
new_df = new_df.fillna(0).astype(int)
Result:
              value
2013-02-13 1      2
           2      3
2013-02-12 1      0
           2      0
2013-02-11 1      0
           2      7
2013-02-10 1      4
           2      7
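As a possible variation, reindex accepts a fill_value argument, so the zeros can be written during the reindex itself and the column should not need the detour through float. The sketch below is self-contained and rebuilds the sample data; the names categories, full_index and filled are illustrative, and the fill_value variant is a swapped-in alternative, not what the answer above uses.

```python
import datetime as dt
import pandas as pd

all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x) for x in range(4)]
df = pd.DataFrame(
    [(dt.date(2013, 2, 10), 1, 4),
     (dt.date(2013, 2, 10), 2, 7),
     (dt.date(2013, 2, 11), 2, 7),
     (dt.date(2013, 2, 13), 1, 2),
     (dt.date(2013, 2, 13), 2, 3)],
    columns=["date", "category", "value"],
).set_index(["date", "category"])

# Every date crossed with every category seen in the data
categories = df.index.levels[1]
full_index = pd.MultiIndex.from_product([all_dates, categories],
                                        names=df.index.names)

# fill_value=0 fills the new rows directly instead of inserting NaN first
filled = df.reindex(full_index, fill_value=0)
print(filled)
```

The values come out in the same order as the result shown above, with the index level names carried over from the original frame.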
See this answer: How to fill the Missing record of Pandas dataframe in pythonic way?
You can do something like this:
import datetime
import pandas as pd

# Make an empty DataFrame that carries only the index you want
def get_datetime(x):
    return datetime.date(2013, 2, 13) - datetime.timedelta(days=x)

all_dates = [get_datetime(x) for x in range(4)]
categories = [1, 2, 3, 4]
index = [[date, cat] for cat in categories for date in all_dates]

# This df is just an index: name the two columns, then turn them into a MultiIndex
df = pd.DataFrame(index, columns=['date', 'category'])
df = df.set_index(['date', 'category'])

# If your original frame is called df_orig, reindex it against the full index
# (the original used reindex_axis, which is no longer available in current pandas)
df_orig = df_orig.reindex(df.index)

# Add the zeros; fillna returns a new frame, so assign the result
df_orig = df_orig.fillna(0)
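The question's edit adds a second category level (cat2) and only extends the (category, cat2) pairs that actually occur, where a full Cartesian product would also create pairs that never appear in the data. One way this might be handled is to cross the dates with the observed pairs only. This is a sketch under that assumption; pairs, full_index and filled are hypothetical names.

```python
import datetime as dt
import pandas as pd

all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x) for x in range(4)]
df = pd.DataFrame(
    [(dt.date(2013, 2, 10), 1, 4, 5),
     (dt.date(2013, 2, 10), 2, 1, 7),
     (dt.date(2013, 2, 10), 2, 2, 7),
     (dt.date(2013, 2, 11), 2, 3, 7),
     (dt.date(2013, 2, 13), 1, 4, 2),
     (dt.date(2013, 2, 13), 2, 4, 3)],
    columns=["date", "category", "cat2", "value"],
).set_index(["date", "category", "cat2"])

# Distinct (category, cat2) pairs that actually occur in the data
pairs = df.index.droplevel("date").unique()

# Cross every date with every observed pair, not with every possible pair
full_index = pd.MultiIndex.from_tuples(
    [(day, *pair) for day in all_dates for pair in pairs],
    names=df.index.names,
)
filled = df.reindex(full_index, fill_value=0)
print(filled)
```

Unlike the groupby loop in the question, the inserted rows keep their real category and cat2 labels, because only the value column is filled with 0.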