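The data comes from the Zillow research data site and is mostly at the city level. The structure is 6 columns of city-related information followed by 245 columns of monthly sale prices. I used the code below to display a sample of the data: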
import pandas as pd
from tabulate import tabulate
df = pd.read_csv("City_Zhvi_AllHomes.csv")
c = df.columns.tolist()
cols = c[:7]
cols.append(c[-1])
print (tabulate(df[cols].iloc[23:29], headers = 'keys', tablefmt = 'orgtbl'))
The code above prints a sample like the following:
| | RegionID | RegionName | State | Metro | CountyName | SizeRank | 1996-04 | 2016-08 |
|----+------------+---------------+---------+---------------+--------------+------------+-----------+-----------|
| 23 | 5976 | Milwaukee | WI | Milwaukee | Milwaukee | 24 | 68100 | 99500 |
| 24 | 7481 | Tucson | AZ | Tucson | Pima | 25 | 91500 | 153000 |
| 25 | 13373 | Portland | OR | Portland | Multnomah | 26 | 121100 | 390500 |
| 26 | 33225 | Oklahoma City | OK | Oklahoma City | Oklahoma | 27 | 64900 | 130500 |
| 27 | 40152 | Omaha | NE | Omaha | Douglas | 28 | 88900 | 143800 |
| 28 | 23429 | Albuquerque | NM | Albuquerque | Bernalillo | 29 | 115400 | 172000 |
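Part of df is a time series. The trick here is to separate the time-dependent columns from the rest and then use pandas resample and to_datetime. Suppose we are only interested in summarizing sales for 1998-2000. Converting the column names to datetime lets us select the columns: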
# separate time columns and convert their names to datetime
tdf = df[df.columns[6:]].rename(columns=pd.to_datetime)
# find the columns in the period 1998-2000
cols = tdf.columns
sel_cols = cols[(cols > '1997-12-31') & (cols < '2000')]
# select the columns, resample on columns
# calculate the mean
# rename the columns the way we like
mdf = tdf[sel_cols].resample('6M', axis=1).mean().rename(
    columns=lambda x: '{:}${:}'.format(x.year, [1, 2][x.quarter > 2]))
# reattach non-time columns
mdf[df.columns[:6]] = df[df.columns[:6]]
print(tabulate(mdf[mdf.columns[0:9]].iloc[23:29], headers='keys', tablefmt='orgtbl'))
The code above prints a sample like the following:
| | 1998$1 | 1998$2 | 1999$1 | 1999$2 | 2000$1 | RegionID | RegionName | State | Metro |
|----+----------+----------+----------+----------+----------+------------+---------------+---------+---------------|
| 23 | 71900 | 72483.3 | 72616.7 | 74266.7 | 75920 | 5976 | Milwaukee | WI | Milwaukee |
| 24 | 94200 | 95133.3 | 96533.3 | 99100 | 100600 | 7481 | Tucson | AZ | Tucson |
| 25 | 139000 | 141900 | 145233 | 148900 | 151980 | 13373 | Portland | OR | Portland |
| 26 | 68500 | 69616.7 | 72016.7 | 73616.7 | 74900 | 33225 | Oklahoma City | OK | Oklahoma City |
| 27 | 98200 | 99250 | 103367 | 109083 | 112160 | 40152 | Omaha | NE | Omaha |
| 28 | 121000 | 122050 | 122833 | 123633 | 124420 | 23429 | Albuquerque | NM | Albuquerque |
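The column labels above come from the rename lambda, which derives a half-year tag from each bin's label date rather than from the months averaged into that bin. A minimal stand-alone sketch of that labeling, using only the standard library (the helper name half_year_label and the sample dates are illustrative, not part of the original code):

```python
from datetime import date

def half_year_label(d: date) -> str:
    """Mirror of the rename lambda: year, '$', then 1 for Q1-Q2 or 2 for Q3-Q4."""
    quarter = (d.month - 1) // 3 + 1
    return '{}${}'.format(d.year, [1, 2][quarter > 2])

# Hypothetical bin-label dates, chosen for illustration only
for d in (date(1998, 1, 31), date(1998, 7, 31), date(2000, 1, 31)):
    print(d, '->', half_year_label(d))
```

Any bin whose label date falls in the first half of 2000 is therefore tagged 2000$1, regardless of which months were averaged into it.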
The question is:
The last column of the resampled result has the year 2000, even though the columns were selected with < '2000'. Why?
Edit: just for fun, here is a more "relaxed" way of doing the above, written as a single method chain:
import pandas as pd
housing = (
    pd.read_csv('City_Zhvi_AllHomes.csv', index_col=list(range(6)))
    .filter(regex='199[8-9]-[0-1][0-9]')
    .rename(columns=pd.to_datetime)
    .resample('2Q', closed='left', axis=1)
    .mean()
    .rename(columns=lambda x: str(x.to_period('2Q'))
            .replace('Q', '$').replace('2', '1').replace('4', '2'))
    .reset_index()
)
This gives the desired result. Printing housing.iloc[23:27,4:] shows the following:
| | CountyName | SizeRank | 1998$1 | 1998$2 | 1999$1 | 1999$2 |
|----+--------------+------------+----------+----------+----------+----------|
| 23 | Milwaukee | 24 | 72366.7 | 72583.3 | 73916.7 | 75750 |
| 24 | Pima | 25 | 94883.3 | 96183.3 | 98783.3 | 100450 |
| 25 | Multnomah | 26 | 141167 | 144733 | 148183 | 151767 |
| 26 | Oklahoma | 27 | 69300 | 71550 | 73466.7 | 74766.7 |
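One caveat with the chained replace calls: they rewrite every '2' and '4' in the label, including digits in the year, so they only work here because the years are 1998 and 1999. A small sketch of the problem and a safer variant, assuming the period renders as strings of the form '1998Q2' (both helper names are hypothetical):

```python
def label_by_replace(period_str: str) -> str:
    # Same chain of replacements as in the one-liner above
    return period_str.replace('Q', '$').replace('2', '1').replace('4', '2')

def label_by_split(period_str: str) -> str:
    # Split the year from the quarter, then map Q1-Q2 -> 1 and Q3-Q4 -> 2
    year, quarter = period_str.split('Q')
    return '{}${}'.format(year, (int(quarter) + 1) // 2)

for s in ('1998Q2', '1999Q4', '2002Q2'):
    print(s, label_by_replace(s), label_by_split(s))
```

For '2002Q2' the replace chain also rewrites the year digits, while the split variant leaves the year untouched.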
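Consider using the closed argument of pandas' resample, which decides which side of the bin interval is closed. The code below uses left, so the 6-month cutoffs fall on 6/30 and 12/31 instead of 1/1 and 7/1, the latter being what yields the 2000 value: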
mdf = tdf[sel_cols].T.resample('6M', closed='left').mean().T.rename(
    columns=lambda x: '{:}${:}'.format(x.year, [1, 2][x.quarter > 2]))
mdf[df.columns[:6]] = df[df.columns[:6]]
print(mdf.head())
# 1998$1 1998$2 1999$1 1999$2 RegionID RegionName State Metro CountyName SizeRank
# 0 NaN NaN NaN NaN 6181 New York NY New York Queens 1
# 1 169183.333333 179166.666667 189116.666667 198466.666667 12447 Los Angeles CA Los Angeles-Long Beach-Anaheim Los Angeles 2
# 2 117700.000000 121666.666667 125550.000000 133000.000000 17426 Chicago IL Chicago Cook 3
# 3 50550.000000 50650.000000 51150.000000 51866.666667 13271 Philadelphia PA Philadelphia Philadelphia 4
# 4 97583.333333 101083.333333 104816.666667 108566.666667 40326 Phoenix AZ Phoenix Maricopa 5
print(mdf[mdf['Metro'].isin(['Milwaukee', 'Tucson', 'Portland', 'Oklahoma City', 'Omaha', 'Albuquerque'])].head())
# 1998$1 1998$2 1999$1 1999$2 RegionID RegionName State Metro CountyName SizeRank
# 23 72366.666667 72583.333333 73916.666667 75750.000000 5976 Milwaukee WI Milwaukee Milwaukee 24
# 24 94883.333333 96183.333333 98783.333333 100450.000000 7481 Tucson AZ Tucson Pima 25
# 25 141166.666667 144733.333333 148183.333333 151766.666667 13373 Portland OR Portland Multnomah 26
# 26 98950.000000 102450.000000 108016.666667 112116.666667 40152 Omaha NE Omaha Douglas 27
# 27 121816.666667 122666.666667 123550.000000 124333.333333 23429 Albuquerque NM Albuquerque Bernalillo 28
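To see where the 2000 label comes from without the full dataset, the bin labels can be compared on a small synthetic series. This is a sketch under assumptions: the values are made up, the index mimics the month-start dates of the Zillow columns, and pd.offsets.MonthEnd(6) stands in for the '6M' string, which recent pandas versions deprecate.

```python
import pandas as pd

# Synthetic monthly series, 1998-01-01 .. 1999-12-01 (values are made up)
idx = pd.date_range('1998-01-01', '1999-12-01', freq='MS')
s = pd.Series(range(1, 25), index=idx, dtype='float64')

six_months = pd.offsets.MonthEnd(6)  # offset-object equivalent of '6M'

# Default for month-end frequencies: closed='right', label='right'
default = s.resample(six_months).mean()
# Left-closed bins: each bin includes its left edge and excludes its right edge
left = s.resample(six_months, closed='left').mean()

print(default.index.strftime('%Y-%m-%d').tolist())
print(left.index.strftime('%Y-%m-%d').tolist())
```

With the default, bins are right-closed and labeled by their right edge, so the last months of 1999 land in a bin whose label falls in 2000. With closed='left' the series splits into four bins, all labeled within 1998 and 1999.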
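As an aside, consider reshaping the data with melt into long format, aggregating by half-year, and then using pivot_table to return to wide format. Admittedly, performance is lower here, but it is arguably more readable (the Half_Year string concatenation is the main bottleneck). You do gain a long-format dataset for other aggregations and/or modeling: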
import pandas as pd
import numpy as np
# MELT (WIDE --> LONG)
idcols = ['RegionID', 'RegionName', 'State', 'Metro']
mdf = pd.melt(df, id_vars=idcols + ['CountyName', 'SizeRank'], var_name='Year_Month', value_name='Sale_Amt').reset_index()
# CALCULATE HALF_YEAR STRING
mdf['Year_Month'] = pd.to_datetime(mdf['Year_Month'])
mdf['Half_Year'] = mdf['Year_Month'].dt.year.astype(str) + '$' + np.where(mdf['Year_Month'].dt.month <= 6, 1, 2).astype(str)
# FILTER DATASET BY DATE INTERVAL
mdf = mdf[mdf['Year_Month'].between('1998-01-01', '1999-12-31')]
# GROUP BY AGGREGATION OF HOUSE SALES
mdf = mdf.groupby(idcols + ['Half_Year'])['Sale_Amt'].mean().reset_index()
# PIVOT (LONG --> WIDE)
pvtdf = mdf.pivot_table(index=idcols, columns='Half_Year', values='Sale_Amt', aggfunc='sum').reset_index()
Output:
metros = ['Milwaukee', 'Tucson', 'Portland', 'Oklahoma City', 'Omaha', 'Albuquerque']
print(pvtdf[(pvtdf['RegionName'].isin(metros)) & (pvtdf['Metro'].isin(metros))])
# Half_Year RegionID RegionName State Metro 1998$1 1998$2 1999$1 1999$2
# 430 5976 Milwaukee WI Milwaukee 72366.666667 72583.333333 73916.666667 75750.000000
# 680 7481 Tucson AZ Tucson 94883.333333 96183.333333 98783.333333 100450.000000
# 1584 13373 Portland OR Portland 141166.666667 144733.333333 148183.333333 151766.666667
# 2923 23429 Albuquerque NM Albuquerque 121816.666667 122666.666667 123550.000000 124333.333333
# 5473 40152 Omaha NE Omaha 98950.000000 102450.000000 108016.666667 112116.666667
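If only the wide half-year table is needed, the half-year key can also be computed from the column names and used to group the transposed frame, skipping the melt and pivot steps. This is a different technique from the one above, shown as a sketch on a tiny made-up frame (the values and the names wide, keys, and half are illustrative):

```python
import numpy as np
import pandas as pd

# Tiny made-up wide frame: two regions, monthly columns for 1998 only
months = pd.date_range('1998-01-01', '1998-12-01', freq='MS').strftime('%Y-%m')
wide = pd.DataFrame(
    [np.arange(12, dtype=float), np.arange(12, dtype=float) * 10],
    index=['A', 'B'], columns=months)

# One half-year key per column: months 1-6 -> 1, months 7-12 -> 2
dates = pd.to_datetime(wide.columns)
keys = np.array(['{}${}'.format(d.year, 1 if d.month <= 6 else 2) for d in dates])

# Group the transposed frame by key, average, and transpose back
half = wide.T.groupby(keys).mean().T
print(half)
```

Because the key is built from the calendar month of each column, no bin edges are involved and the labels cannot spill into the following year.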