在python中，将多个带有不同标题的CSV读入一个数据帧

Question

在python中，将多个带有不同标题的CSV读入一个数据帧

Fra*_*vey 3 python csv netcdf dataframe pandas

我有几十个带有类似（但并不总是完全相同）标题的 csv 文件。例如，一个有：

Year Month Day Hour Minute Direct Diffuse D_Global D_IR Zenith Test_Site

Run Code Online (Sandbox Code Playgroud)

一个有：

Year Month Day Hour Minute Direct Diffuse2 D_Global D_IR U_Global U_IR Zenith Test_Site

Run Code Online (Sandbox Code Playgroud)

（注意一个没有“U_Global”和“U_IR”，另一个有“Diffuse2”而不是“Diffuse”）

我知道如何将多个 csv 传递到我的脚本中，但是如何将 csv 的唯一传递值传递给它们当前具有值的列？也许将“Nan”传递给该行中的所有其他列。

理想情况下，我会有类似的东西：

'Year','Month','Day','Hour','Minute','Direct','Diffuse','Diffuse2','D_Global','D_IR','U_Global','U_IR','Zenith','Test_Site'
1992,1,1,0,3,-999.00,-999.00,"Nan",-999.00,-999.00,"Nan","Nan",122.517,"BER"
2013,5,30,15,55,812.84,270.62,"Nan",1078.06,-999.00,"Nan","Nan",11.542,"BER"
2004,9,1,0,1,1.04,79.40,"Nan",78.67,303.58,61.06,310.95,85.142,"ALT"
2014,12,1,0,1,0.00,0.00,"Nan",-999.00,226.95,0.00,230.16,115.410,"ALT"

Run Code Online (Sandbox Code Playgroud)

另一个警告是，需要附加此数据框。它需要保留，因为多个 csv 文件被传递到它。我想我可能会在最后将它写出到它自己的 csv（它最终会转到 NETCDF4）。

Answer 1

Max*_*axU 5

假设您有以下 CSV 文件：

测试1.csv：

year,month,day,Direct 
1992,1,1,11
2013,5,30,11
2004,9,1,11

Run Code Online (Sandbox Code Playgroud)

测试2.csv：

year,month,day,Direct,Direct2
1992,1,1,21,201
2013,5,30,21,202
2004,9,1,21,203

Run Code Online (Sandbox Code Playgroud)

测试3.csv：

year,month,day,File3
1992,1,1,text1
2013,5,30,text2
2004,9,1,text3
2016,1,1,unmatching_date

Run Code Online (Sandbox Code Playgroud)

解决方案：

import glob
import pandas as pd

files = glob.glob(r'd:/temp/test*.csv')

def get_merged(files, **kwargs):
    df = pd.read_csv(files[0], **kwargs)
    for f in files[1:]:
        df = df.merge(pd.read_csv(f, **kwargs), how='outer')
    return df

print(get_merged(files))

Run Code Online (Sandbox Code Playgroud)

输出：

   year  month  day  Direct   Direct  Direct2            File3
0  1992      1    1     11.0    21.0    201.0            text1
1  2013      5   30     11.0    21.0    202.0            text2
2  2004      9    1     11.0    21.0    203.0            text3
3  2016      1    1      NaN     NaN      NaN  unmatching_date

Run Code Online (Sandbox Code Playgroud)

更新：通常的惯用pd.concat(list_of_dfs)解决方案在这里不起作用，因为它是通过索引连接的：

In [192]: pd.concat([pd.read_csv(f) for f in glob.glob(file_mask)], axis=0, ignore_index=True)
Out[192]:
   Direct  Direct   Direct2            File3  day  month  year
0     NaN     11.0      NaN              NaN    1      1  1992
1     NaN     11.0      NaN              NaN   30      5  2013
2     NaN     11.0      NaN              NaN    1      9  2004
3    21.0      NaN    201.0              NaN    1      1  1992
4    21.0      NaN    202.0              NaN   30      5  2013
5    21.0      NaN    203.0              NaN    1      9  2004
6     NaN      NaN      NaN            text1    1      1  1992
7     NaN      NaN      NaN            text2   30      5  2013
8     NaN      NaN      NaN            text3    1      9  2004
9     NaN      NaN      NaN  unmatching_date    1      1  2016

In [193]: pd.concat([pd.read_csv(f) for f in glob.glob(file_mask)], axis=1, ignore_index=True)
Out[193]:
       0    1     2     3       4    5     6     7      8     9   10  11               12
0  1992.0  1.0   1.0  11.0  1992.0  1.0   1.0  21.0  201.0  1992   1   1            text1
1  2013.0  5.0  30.0  11.0  2013.0  5.0  30.0  21.0  202.0  2013   5  30            text2
2  2004.0  9.0   1.0  11.0  2004.0  9.0   1.0  21.0  203.0  2004   9   1            text3
3     NaN  NaN   NaN   NaN     NaN  NaN   NaN   NaN    NaN  2016   1   1  unmatching_date

Run Code Online (Sandbox Code Playgroud)

或index_col=None明确使用：

In [194]: pd.concat([pd.read_csv(f, index_col=None) for f in glob.glob(file_mask)], axis=0, ignore_index=True)
Out[194]:
   Direct  Direct   Direct2            File3  day  month  year
0     NaN     11.0      NaN              NaN    1      1  1992
1     NaN     11.0      NaN              NaN   30      5  2013
2     NaN     11.0      NaN              NaN    1      9  2004
3    21.0      NaN    201.0              NaN    1      1  1992
4    21.0      NaN    202.0              NaN   30      5  2013
5    21.0      NaN    203.0              NaN    1      9  2004
6     NaN      NaN      NaN            text1    1      1  1992
7     NaN      NaN      NaN            text2   30      5  2013
8     NaN      NaN      NaN            text3    1      9  2004
9     NaN      NaN      NaN  unmatching_date    1      1  2016

In [195]: pd.concat([pd.read_csv(f, index_col=None) for f in glob.glob(file_mask)], axis=1, ignore_index=True)
Out[195]:
       0    1     2     3       4    5     6     7      8     9   10  11               12
0  1992.0  1.0   1.0  11.0  1992.0  1.0   1.0  21.0  201.0  1992   1   1            text1
1  2013.0  5.0  30.0  11.0  2013.0  5.0  30.0  21.0  202.0  2013   5  30            text2
2  2004.0  9.0   1.0  11.0  2004.0  9.0   1.0  21.0  203.0  2004   9   1            text3
3     NaN  NaN   NaN   NaN     NaN  NaN   NaN   NaN    NaN  2016   1   1  unmatching_date

Run Code Online (Sandbox Code Playgroud)

以下更惯用的解决方案有效，但它更改了列和行/数据的原始顺序：

In [224]: dfs = [pd.read_csv(f, index_col=None) for f in glob.glob(r'd:/temp/test*.csv')]
     ...:
     ...: common_cols = list(set.intersection(*[set(x.columns.tolist()) for x in dfs]))
     ...:
     ...: pd.concat((df.set_index(common_cols) for df in dfs), axis=1).reset_index()
     ...:
Out[224]:
   month  day  year  Direct   Direct  Direct2            File3
0      1    1  1992     11.0    21.0    201.0            text1
1      1    1  2016      NaN     NaN      NaN  unmatching_date
2      5   30  2013     11.0    21.0    202.0            text2
3      9    1  2004     11.0    21.0    203.0            text3

Run Code Online (Sandbox Code Playgroud)

归档时间：	8 年，10 月前
查看次数：	3268 次
最近记录：	8 年，10 月前