Sit*_*ogz 9 python csv datetime multiple-columns pandas
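I am trying to perform some simple arithmetic on two files.
The columns in file_1.csv below are dynamic: the number of columns grows from time to time, so we cannot hard-code a fixed last_column.
master_ids.csv: before any preprocessing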
Ids,ref0 #the columns increase dynamically
1234,1000
8435,5243
2341,563
7352,345
master_count.csv: before any processing
Ids,Name,lat,lon,ref1
1234,London,40.4,10.1,500
8435,Paris,50.5,20.2,400
2341,NewYork,60.6,30.3,700
7352,Japan,70.7,80.8,500
1234,Prague,40.4,10.1,100
8435,Berlin,50.5,20.2,200
2341,Austria,60.6,30.3,500
7352,China,70.7,80.8,300
master_Ids.csv: after one round of preprocessing
Ids,ref,00:30:00
1234,1000,500
8435,5243,300
2341,563,400
7352,345,500
master_count.csv: expected output (appended/merged)
Ids,Name,lat,lon,ref1,00:30:00
1234,London,40.4,10.1,500,750
8435,Paris,50.5,20.2,400,550
2341,NewYork,60.6,30.3,700,900
7352,Japan,70.7,80.8,500,750
1234,Prague,40.4,10.1,100,350
8435,Berlin,50.5,20.2,200,350
2341,Austria,60.6,30.3,500,700
7352,China,70.7,80.8,300,550
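For example: Ids 1234 occurs 2 times, so the value for Ids 1234 at the current time (00:30:00), which is 500, is divided by the number of occurrences of that Id and then added to the corresponding value in ref1. The result goes into a new column named after the current time.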
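The rule described above can be checked on a small subset of the data. This is only a minimal sketch under assumptions: the frames are built in memory rather than read from the CSV files, only two Ids are included, and the names `counts`, `ids`, `occurrences` and `share` are illustrative.

```python
import pandas as pd

# Subset of master_count.csv: each Ids value appears twice.
counts = pd.DataFrame({
    "Ids": [1234, 8435, 1234, 8435],
    "ref1": [500, 400, 100, 200],
})
# Subset of master_Ids.csv with the first time column.
ids = pd.DataFrame({"Ids": [1234, 8435], "00:30:00": [500, 300]})

# How many rows share each Ids value (2 for both Ids here).
occurrences = counts.groupby("Ids")["ref1"].transform("count")
# Look up the time-column value for each row and split it across the rows.
share = counts["Ids"].map(ids.set_index("Ids")["00:30:00"]) / occurrences
# New column named after the current time: ref1 plus the row's share.
counts["00:30:00"] = counts["ref1"] + share
print(counts)
```

For Ids 1234 this gives 500 + 500/2 = 750 and 100 + 500/2 = 350, which matches the London and Prague rows of the expected output.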
master_Ids.csv: after another round of preprocessing
Ids,ref,00:30:00,00:45:00
1234,1000,500,100
8435,5243,300,200
2341,563,400,400
7352,345,500,600
master_count.csv: expected output after another run (merged/appended)
Ids,Name,lat,lon,ref1,00:30:00,00:45:00
1234,London,40.4,10.1,500,750,550
8435,Paris,50.5,20.2,400,550,500
2341,NewYork,60.6,30.3,700,900,900
7352,Japan,70.7,80.8,500,750,800
1234,Prague,40.4,10.1,100,350,150
8435,Berlin,50.5,20.2,200,350,300
2341,Austria,60.6,30.3,500,700,700
7352,China,70.7,80.8,300,550,600
So here the current time is 00:45:00. We divide the current time value by the count of occurrences of each Ids, add it to the corresponding ref1 value, and create a new column named after the new current time.
Program (by 李健勋):
import pandas as pd
import numpy as np
csv_file1 = '/Data_repository/master_ids.csv'
csv_file2 = '/Data_repository/master_count.csv'
df1 = pd.read_csv(csv_file1).set_index('Ids')
# need to sort index in file 2
df2 = pd.read_csv(csv_file2).set_index('Ids').sort_index()
# df1 and df2 has a duplicated column 00:00:00, use df1 without 1st column
temp = df2.join(df1.iloc[:, 1:])
# do the division by number of occurence of each Ids
# and add column any time series
def my_func(group):
    num_obs = len(group)
    # process with column name after next timeseries (inclusive)
    group.iloc[:,4:] = (group.iloc[:,4:]/num_obs).add(group.iloc[:,3], axis=0)
    return group
result = temp.groupby(level='Ids').apply(my_func)
The program runs without errors but produces no output. I would appreciate suggestions for a fix.
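This program assumes that master_counts.csv and master_ids.csv are both updated over time, and it should be robust to the timing of those updates. That is, it should produce correct results even if it is run several times on the same update or if an update is missed.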
import pandas as pd

# this program updates (and replaces) the original master_counts.csv with data
# in master_ids.csv, so we only want the first 5 columns when we read it in
master_counts = pd.read_csv('master_counts.csv').iloc[:,:5]

# this file is assumed to be periodically updated with the addition of new columns
master_ids = pd.read_csv('master_ids.csv')

# columns 0 and 1 are Ids and ref; every later column is a time-stamped update
for i in range( 2, len(master_ids.columns) ):
    # bring in one time column, matched on Ids
    master_counts = master_counts.merge( master_ids.iloc[:,[0,i]], on='Ids' )
    # number of rows sharing each Ids value
    count = master_counts.groupby('Ids')['ref1'].transform('count')
    # ref1 plus this row's share of the time-column value
    master_counts.iloc[:,-1] = master_counts['ref1'] + master_counts.iloc[:,-1]/count

master_counts.to_csv('master_counts.csv',index=False)
%more master_counts.csv
Ids,Name,lat,lon,ref1,00:30:00,00:45:00
1234,London,40.4,10.1,500,750.0,550.0
1234,Prague,40.4,10.1,100,350.0,150.0
8435,Paris,50.5,20.2,400,550.0,500.0
8435,Berlin,50.5,20.2,200,350.0,300.0
2341,NewYork,60.6,30.3,700,900.0,900.0
2341,Austria,60.6,30.3,500,700.0,700.0
7352,Japan,70.7,80.8,500,750.0,800.0
7352,China,70.7,80.8,300,550.0,600.0
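The robustness claim (running twice on the same update gives the same file) can be illustrated without touching the disk. The sketch below is not the answer's exact code. It assumes the same column layout (five base columns in the counts table, with Ids and ref as the first two columns of the ids table) and swaps the `merge` for a `Series.map` lookup so that row order is preserved. The function name `update_counts` is hypothetical.

```python
import pandas as pd

def update_counts(master_counts: pd.DataFrame, master_ids: pd.DataFrame) -> pd.DataFrame:
    """Rebuild every time column from scratch (hypothetical helper)."""
    # Drop previously computed time columns; keep Ids, Name, lat, lon, ref1.
    base = master_counts.iloc[:, :5].copy()
    # Number of rows sharing each Ids value.
    occurrences = base.groupby("Ids")["ref1"].transform("count")
    lookup = master_ids.set_index("Ids")
    # Columns 0 and 1 are Ids and ref; the rest are time-stamped updates.
    for col in master_ids.columns[2:]:
        base[col] = base["ref1"] + base["Ids"].map(lookup[col]) / occurrences
    return base

counts = pd.DataFrame({
    "Ids": [1234, 8435, 2341, 7352, 1234, 8435, 2341, 7352],
    "Name": ["London", "Paris", "NewYork", "Japan",
             "Prague", "Berlin", "Austria", "China"],
    "lat": [40.4, 50.5, 60.6, 70.7, 40.4, 50.5, 60.6, 70.7],
    "lon": [10.1, 20.2, 30.3, 80.8, 10.1, 20.2, 30.3, 80.8],
    "ref1": [500, 400, 700, 500, 100, 200, 500, 300],
})
ids = pd.DataFrame({
    "Ids": [1234, 8435, 2341, 7352],
    "ref": [1000, 5243, 563, 345],
    "00:30:00": [500, 300, 400, 500],
    "00:45:00": [100, 200, 400, 600],
})

once = update_counts(counts, ids)
twice = update_counts(once, ids)  # a second run on the same update
print(once)
```

Because the function always starts from the first five columns, a repeated run overwrites the time columns instead of stacking duplicates, and a missed update is picked up the next time all columns of the ids table are processed. An Ids value missing from the ids table would yield NaN in this variant.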