Pandas column mathematical operations: no error, no output

Question

Sit*_*ogz 9 python csv datetime multiple-columns pandas

I am trying to perform some simple mathematical operations on the files.

The columns in file_1.csv below are dynamic: the number of columns increases from time to time, so we cannot rely on a fixed last_column.

master_ids.csv: before any pre-processing

Ids,ref0 #the columns increase dynamically
1234,1000
8435,5243
2341,563
7352,345

master_count.csv: before any processing

Ids,Name,lat,lon,ref1
1234,London,40.4,10.1,500
8435,Paris,50.5,20.2,400
2341,NewYork,60.6,30.3,700
7352,Japan,70.7,80.8,500
1234,Prague,40.4,10.1,100
8435,Berlin,50.5,20.2,200
2341,Austria,60.6,30.3,500
7352,China,70.7,80.8,300

master_ids.csv: after one round of pre-processing

Ids,ref,00:30:00
1234,1000,500
8435,5243,300
2341,563,400
7352,345,500

master_count.csv: expected output (append/merge)

Ids,Name,lat,lon,ref1,00:30:00
1234,London,40.4,10.1,500,750
8435,Paris,50.5,20.2,400,550
2341,NewYork,60.6,30.3,700,900
7352,Japan,70.7,80.8,500,750
1234,Prague,40.4,10.1,100,350
8435,Berlin,50.5,20.2,200,350
2341,Austria,60.6,30.3,500,700
7352,China,70.7,80.8,300,550

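For example, Ids 1234 appears 2 times, so its value at the current time (00:30:00), which is 500, is divided by the number of occurrences of that id and then added to the corresponding value from ref1. The result goes into a new column named after the current time.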
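
The rule can be checked by hand for Ids 1234. The following is a minimal sketch, assuming the two files are replaced by small in-memory stand-ins; the `io.StringIO` data and the variable names `occurrences` and `current` are illustrative and not part of the original program.

```python
import io
import pandas as pd

# In-memory stand-ins for the two CSV files (rows for Ids 1234 only).
master_ids = pd.read_csv(io.StringIO("Ids,ref,00:30:00\n1234,1000,500\n"))
master_count = pd.read_csv(io.StringIO(
    "Ids,Name,lat,lon,ref1\n"
    "1234,London,40.4,10.1,500\n"
    "1234,Prague,40.4,10.1,100\n"
))

# Number of rows that share each Ids value (2 for Ids 1234).
occurrences = master_count.groupby("Ids")["ref1"].transform("count")

# Look up the 00:30:00 value for each row's Ids (500 for Ids 1234).
current = master_count["Ids"].map(master_ids.set_index("Ids")["00:30:00"])

# New column = ref1 + current value / number of occurrences.
master_count["00:30:00"] = master_count["ref1"] + current / occurrences
print(master_count[["Name", "ref1", "00:30:00"]])
```

London comes out as 500 + 500/2 = 750 and Prague as 100 + 500/2 = 350.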

master_ids.csv: after another round of pre-processing

Ids,ref,00:30:00,00:45:00
1234,1000,500,100
8435,5243,300,200
2341,563,400,400
7352,345,500,600

master_count.csv: expected output after another execution (merge/append)

Ids,Name,lat,lon,ref1,00:30:00,00:45:00
1234,London,40.4,10.1,500,750,550
8435,Paris,50.5,20.2,400,550,500
2341,NewYork,60.6,30.3,700,900,900
7352,Japan,70.7,80.8,500,750,800
1234,Prague,40.4,10.1,100,350,150
8435,Berlin,50.5,20.2,200,350,300
2341,Austria,60.6,30.3,500,700,700
7352,China,70.7,80.8,300,550,600

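So here the current time is 00:45:00: we divide the current time value by the count of occurrences of each id, then add the result to the corresponding ref1 value to create a new column named after the new current time.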
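
Because the number of time columns keeps growing, the same rule has to be applied to however many columns are present. One possible sketch, assuming the first two columns of master_ids.csv are always `Ids` and a reference column; the helper `add_time_columns` and its `n_fixed` parameter are hypothetical names introduced here for illustration.

```python
import io
import pandas as pd

# In-memory stand-ins for the two CSV files shown above.
master_ids = pd.read_csv(io.StringIO(
    "Ids,ref,00:30:00,00:45:00\n"
    "1234,1000,500,100\n"
    "8435,5243,300,200\n"
    "2341,563,400,400\n"
    "7352,345,500,600\n"
))
master_count = pd.read_csv(io.StringIO(
    "Ids,Name,lat,lon,ref1\n"
    "1234,London,40.4,10.1,500\n"
    "8435,Paris,50.5,20.2,400\n"
    "2341,NewYork,60.6,30.3,700\n"
    "7352,Japan,70.7,80.8,500\n"
    "1234,Prague,40.4,10.1,100\n"
    "8435,Berlin,50.5,20.2,200\n"
    "2341,Austria,60.6,30.3,500\n"
    "7352,China,70.7,80.8,300\n"
))


def add_time_columns(counts, ids, n_fixed=2):
    """Hypothetical helper: add ref1 + value / occurrences for each time column."""
    out = counts.copy()
    # Number of rows sharing each Ids value.
    occurrences = out.groupby("Ids")["ref1"].transform("count")
    lookup = ids.set_index("Ids")
    # Every column after the first n_fixed is treated as a time column.
    for col in ids.columns[n_fixed:]:
        out[col] = out["ref1"] + out["Ids"].map(lookup[col]) / occurrences
    return out


result = add_time_columns(master_count, master_ids)
print(result.to_string(index=False))
```

Selecting the time columns by position (`ids.columns[n_fixed:]`) avoids hard-coding a last column. Under this rule the 7352/China row gives 300 + 500/2 = 550 for 00:30:00 and 300 + 600/2 = 600 for 00:45:00.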

Program (by Jianxun Li):

import pandas as pd
import numpy as np

csv_file1 = '/Data_repository/master_ids.csv'
csv_file2 = '/Data_repository/master_count.csv'

df1 = pd.read_csv(csv_file1).set_index('Ids')

# need to sort index in file 2
df2 = pd.read_csv(csv_file2).set_index('Ids').sort_index()

# df1 and df2 have a duplicated column 00:00:00, so use df1 without its 1st column
temp = df2.join(df1.iloc[:, 1:])

# divide by the number of occurrences of each Ids
# and add the ref1 column to every time-series column
def my_func(group):
    num_obs = len(group)
    # process the time-series columns, from position 4 onward (inclusive)
    group.iloc[:,4:] = (group.iloc[:,4:]/num_obs).add(group.iloc[:,3], axis=0)
    return group

result = temp.groupby(level='Ids').apply(my_func)

The program runs without errors but produces no output. I need some suggestions for fixing it.

Answer 1

Joh*_*hnE 3

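This program assumes that both master_counts.csv and master_ids.csv are updated over time, and it should be robust to the timing of those updates. That is, it should produce correct results if it is run multiple times on the same update or if an update is missed.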

import pandas as pd

# This program updates (and replaces) the original master_counts.csv with data
# from master_ids.csv, so we only want the first 5 columns when we read it in.
master_counts = pd.read_csv('master_counts.csv').iloc[:, :5]

# This file is assumed to be periodically updated with the addition of new columns.
master_ids = pd.read_csv('master_ids.csv')

# Columns 0 and 1 of master_ids are Ids and ref; every later column is a time column.
for col in master_ids.columns[2:]:
    # Bring in the time column, matched on Ids.
    master_counts = master_counts.merge(master_ids[['Ids', col]], on='Ids')
    # Number of rows sharing each Ids value.
    count = master_counts.groupby('Ids')['ref1'].transform('count')
    # New value: ref1 plus the time value split evenly across occurrences.
    master_counts[col] = master_counts['ref1'] + master_counts[col] / count

master_counts.to_csv('master_counts.csv', index=False)

%more master_counts.csv
Ids,Name,lat,lon,ref1,00:30:00,00:45:00
1234,London,40.4,10.1,500,750.0,550.0
1234,Prague,40.4,10.1,100,350.0,150.0
8435,Paris,50.5,20.2,400,550.0,500.0
8435,Berlin,50.5,20.2,200,350.0,300.0
2341,NewYork,60.6,30.3,700,900.0,900.0
2341,Austria,60.6,30.3,500,700.0,700.0
7352,Japan,70.7,80.8,500,750.0,800.0
7352,China,70.7,80.8,300,550.0,600.0
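
The robustness claim can be checked without touching any files. The sketch below follows the same merge-then-divide steps as the program above on in-memory frames; the helper `update_counts` and the shortened sample data are assumptions made for illustration.

```python
import io
import pandas as pd


def update_counts(counts, ids):
    """Hypothetical helper: rebuild every time column from the first 5 columns."""
    # Keep only Ids, Name, lat, lon, ref1, dropping previously computed columns.
    base = counts.iloc[:, :5].copy()
    for col in ids.columns[2:]:
        base = base.merge(ids[["Ids", col]], on="Ids")
        occurrences = base.groupby("Ids")["ref1"].transform("count")
        base[col] = base["ref1"] + base[col] / occurrences
    return base


master_ids = pd.read_csv(io.StringIO(
    "Ids,ref,00:30:00,00:45:00\n"
    "1234,1000,500,100\n"
    "8435,5243,300,200\n"
))
master_counts = pd.read_csv(io.StringIO(
    "Ids,Name,lat,lon,ref1\n"
    "1234,London,40.4,10.1,500\n"
    "8435,Paris,50.5,20.2,400\n"
    "1234,Prague,40.4,10.1,100\n"
    "8435,Berlin,50.5,20.2,200\n"
))

# master_counts never saw the 00:30:00 update on its own (a "missed" update),
# yet both time columns are produced in a single run.
first = update_counts(master_counts, master_ids)

# Running again on the already-updated frame gives the same result.
second = update_counts(first, master_ids)
print(first.equals(second))
```

Trimming to the first five columns on every run is what makes repeated runs safe: the time columns are always recomputed from `ref1` and never from earlier results.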

Asked:	10 years, 8 months ago
Viewed:	331 times
Last active:	10 years, 7 months ago