Tags: python, python-2.7, pandas
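I have a CSV file of customer purchases, in no particular order, that I read into a Pandas DataFrame. I want to add a column that shows, for each purchase and grouped by customer, how much time has passed since that customer's previous purchase. I'm not sure what unit the differences are in, but they are far too large (even if they are in seconds).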
CSV:
Customer Id,Purchase Date
4543,1/1/2015
4543,2/5/2015
4543,3/15/2015
2322,1/1/2015
2322,3/1/2015
2322,2/1/2015
Python:
import pandas as pd
import time
start = time.time()
data = pd.read_csv('data.csv', low_memory=False)
data = data.sort_values(by=['Customer Id', 'Purchase Date'])
data['Purchase Date'] = pd.to_datetime(data['Purchase Date'])
data['Purchase Difference'] = (data.groupby(['Customer Id'])['Purchase Date']
.diff()
.fillna('-')
)
print data
Output:
   Customer Id Purchase Date Purchase Difference
3         2322    2015-01-01                   -
5         2322    2015-02-01    2678400000000000
4         2322    2015-03-01    2419200000000000
0         4543    2015-01-01                   -
1         4543    2015-02-05    3024000000000000
2         4543    2015-03-15     328320000000000
Desired output:
   Customer Id Purchase Date Purchase Difference
3         2322    2015-01-01                   -
5         2322    2015-02-01             31 days
4         2322    2015-03-01             28 days
0         4543    2015-01-01                   -
1         4543    2015-02-05             35 days
2         4543    2015-03-15             38 days
Once the Purchase Date column has been converted to timestamps, you can apply diff to it within each customer group.
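df['Purchase Date'] = pd.to_datetime(df['Purchase Date'])
# Sort after the conversion so dates are ordered chronologically, not as strings.
df.sort_values(['Customer Id', 'Purchase Date'], inplace=True)
# diff() gives NaT for each customer's first purchase; show NaT and same-day gaps as ''.
df['Purchase Difference'] = [
    '' if pd.isnull(n) or n.days < 1
    else str(n.days) + (' day' if n.days == 1 else ' days')
    for n in df.groupby('Customer Id', sort=False)['Purchase Date'].diff()]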
>>> df
   Customer Id Purchase Date Purchase Difference
3         2322    2015-01-01
5         2322    2015-02-01             31 days
4         2322    2015-03-01             28 days
0         4543    2015-01-01
1         4543    2015-02-05             35 days
2         4543    2015-03-15             38 days
6         4543    2015-03-15
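As a sketch of a vectorized alternative (not part of the answer above): if the result of diff() is left as a timedelta column instead of being filled with '-', pandas prints it as "31 days", and .dt.days gives the gap as a plain number. The large integers in the question look like these same gaps expressed in nanoseconds (31 days = 2678400000000000 ns). The inline sample data and the 'Days Since Last' column name are assumptions made for illustration.

```python
import io
import pandas as pd

# Sample data copied from the question; inlined so the sketch is self-contained.
csv_text = u"""Customer Id,Purchase Date
4543,1/1/2015
4543,2/5/2015
4543,3/15/2015
2322,1/1/2015
2322,3/1/2015
2322,2/1/2015
"""

df = pd.read_csv(io.StringIO(csv_text))

# Convert first, then sort, so the ordering is chronological.
df['Purchase Date'] = pd.to_datetime(df['Purchase Date'], format='%m/%d/%Y')
df = df.sort_values(['Customer Id', 'Purchase Date'])

# Per-customer gap to the previous purchase; NaT marks a customer's first purchase.
gap = df.groupby('Customer Id')['Purchase Date'].diff()

df['Purchase Difference'] = gap      # stays a timedelta column, printed as "31 days"
df['Days Since Last'] = gap.dt.days  # hypothetical extra column: gap in days, NaN for first purchases

print(df)
```

Leaving the missing values as NaT/NaN keeps both columns usable for further arithmetic; any '-' placeholder can be applied at display time instead.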