Merging two CSVs with unique columns in Python

Vic*_*ice 2 python csv

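I have two CSV files representing data from two different years. I know how to do a basic merge using csvwriter and dict keys, but here is the problem: while the CSVs mostly share column headers, each may have columns of its own. If a species was caught in one year but not the other, its column only appears in that year's file. How can I merge the new data into the old, creating the new columns and filling them with zeros for the old data?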

File 1: "Date","Time","Species A","Species B", "Species X"

File 2: "Date","Time", "Species A", "Species B", "Species C"

I need the final result to be a single csv with this header: "Date","Time","Species A","Species B", "Species C", "Species X"

DSM*_*DSM 5

Others will probably post a solution using the csv module, so I'll give a pandas solution for comparison purposes:

import pandas as pd

df1 = pd.read_csv("fish1.csv")
df2 = pd.read_csv("fish2.csv")

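df = pd.concat([df1, df2]).fillna(0)
# Date and Time first, then the species columns in alphabetical order.
# Selecting by name works whether or not concat sorted the columns.
species = sorted(c for c in df.columns if c not in ("Date", "Time"))
df = df[["Date", "Time"] + species]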
df.to_csv("merged_fish.csv", index=False)

Explanation:

First, we read in the two files:

>>> df1 = pd.read_csv("fish1.csv")
>>> df2 = pd.read_csv("fish2.csv")
>>> df1
   Date  Time  Species A  Species B  Species X
0     1     2          3          4          5
1     6     7          8          9         10
2    11    12         13         14         15
>>> df2
   Date  Time  Species A  Species B  Species C
0    16    17         18         19         20
1    21    22         23         24         25
2    26    27         28         29         30

Then we simply concatenate them, which automatically fills in the missing data with NaN:

>>> df = pd.concat([df1, df2])
>>> df
   Date  Species A  Species B  Species C  Species X  Time
0     1          3          4        NaN          5     2
1     6          8          9        NaN         10     7
2    11         13         14        NaN         15    12
0    16         18         19         20        NaN    17
1    21         23         24         25        NaN    22
2    26         28         29         30        NaN    27

You wanted them filled with 0, so:

>>> df = pd.concat([df1, df2]).fillna(0)
>>> df
   Date  Species A  Species B  Species C  Species X  Time
0     1          3          4          0          5     2
1     6          8          9          0         10     7
2    11         13         14          0         15    12
0    16         18         19         20          0    17
1    21         23         24         25          0    22
2    26         28         29         30          0    27
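One side effect worth knowing about: NaN is a floating-point value, so every column that had gaps is converted to float, and filling with 0 does not convert it back. That is why the saved file further down contains values such as 0.0 and 5.0. If whole-number counts are wanted, a minimal sketch is to cast after filling. It assumes every species column holds integer counts and that those columns can be recognised by a "Species" prefix; the small in-memory frames are made-up stand-ins for the fish files.

```python
import pandas as pd

# Made-up stand-ins for fish1.csv and fish2.csv.
df1 = pd.DataFrame({"Date": [1], "Time": [2], "Species A": [3], "Species X": [5]})
df2 = pd.DataFrame({"Date": [16], "Time": [17], "Species A": [18], "Species C": [20]})

df = pd.concat([df1, df2]).fillna(0)

# Columns missing from one file became float because of the NaN gaps;
# cast the species columns back to integers (assumes whole-number counts).
species = [c for c in df.columns if c.startswith("Species")]
df = df.astype({c: int for c in species})

csv_text = df.to_csv(index=False)
```

With the cast in place, the written CSV should contain 0 and 5 rather than 0.0 and 5.0.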

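This order isn't quite the one you asked for, though (and it depends on the pandas version: older releases sort the columns when concatenating, newer ones keep them in order of appearance). You wanted Date and Time first, so we select the columns by name, which works either way:

>>> species = sorted(c for c in df.columns if c not in ("Date", "Time"))
>>> df = df[["Date", "Time"] + species]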
>>> df
   Date  Time  Species A  Species B  Species C  Species X
0     1     2          3          4          0          5
1     6     7          8          9          0         10
2    11    12         13         14          0         15
0    16    17         18         19         20          0
1    21    22         23         24         25          0
2    26    27         28         29         30          0

Then we save it as a CSV file:

>>> df.to_csv("merged_fish.csv", index=False)

producing

Date,Time,Species A,Species B,Species C,Species X
1,2,3,4,0.0,5.0
6,7,8,9,0.0,10.0
11,12,13,14,0.0,15.0
16,17,18,19,20.0,0.0
21,22,23,24,25.0,0.0
26,27,28,29,30.0,0.0
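For comparison, here is a rough sketch of the same merge using only the standard csv module, since the question mentions it. It assumes both files have a header row; the helper name merge_csvs and its parameters are made up for illustration and are not part of any library.

```python
import csv

def merge_csvs(paths, out_path, key_columns=("Date", "Time"), fill="0"):
    """Concatenate CSV files row-wise, filling columns a file lacks with `fill`."""
    rows, seen = [], set()
    for path in paths:
        with open(path, newline="") as f:
            # skipinitialspace tolerates headers written as "A", "B"
            reader = csv.DictReader(f, skipinitialspace=True)
            seen.update(reader.fieldnames or [])
            rows.extend(reader)
    # Key columns first, then every other column in alphabetical order.
    header = list(key_columns) + sorted(seen - set(key_columns))
    with open(out_path, "w", newline="") as f:
        # restval supplies the value for keys missing from a row
        writer = csv.DictWriter(f, fieldnames=header, restval=fill)
        writer.writeheader()
        writer.writerows(rows)
    return header
```

Calling merge_csvs(["fish1.csv", "fish2.csv"], "merged_fish.csv") would write the combined file. Because the csv module treats every value as text, the filled cells come out as 0 rather than 0.0.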