Merging columns from multiple CSV files into one file

Sty*_*ize 13 python csv

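I have a bunch of CSV files (only two in the example below). Each CSV file has 6 columns. I want to go into each CSV file, copy the first two columns, and add them as new columns to an existing CSV file.

So far I have: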

import csv

f = open('combined.csv')
data = [item for item in csv.reader(f)]
f.close()

for x in range(1,3): #example has 2 csv files, this will be automated
    n=0
    while n<2:
        f=open(str(x)+".csv")
        new_column=[item[n] for item in csv.reader(f)]
        f.close()
        #print d

        new_data = []

        for i, item in enumerate(data):
            try:
                item.append(new_column[i])
                print i
            except IndexError, e:
                item.append("")
            new_data.append(item)

        f = open('combined.csv', 'w')
        csv.writer(f).writerows(new_data)
        f.close()
        n=n+1

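This works. It isn't pretty, but it does the job. However, I have three minor annoyances:

  1. I open each CSV file twice (once per column), which is hardly elegant.

  2. When I print the combined.csv file, it shows an empty line after every row?

  3. I have to supply a combined.csv file that has at least as many rows as the largest file I might have. Since I don't really know what that number is, this is rather annoying.

As always, any help is much appreciated!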

As requested: 1.csv looks like this (mock data)

1,a
2,b
3,c
4,d

2.csv looks like this

5,e
6,f
7,g
8,h
9,i

The combined.csv file should look like this

1,a,5,e
2,b,6,f
3,c,7,g
4,d,8,h
,,9,i

unu*_*tbu 7

import csv
import itertools as IT

filenames = ['1.csv', '2.csv']
handles = [open(filename, 'rb') for filename in filenames]    
readers = [csv.reader(f, delimiter=',') for f in handles]

with  open('combined.csv', 'wb') as h:
    writer = csv.writer(h, delimiter=',', lineterminator='\n', )
    for rows in IT.izip_longest(*readers, fillvalue=['']*2):
        combined_row = []
        for row in rows:
            row = row[:2] # select the columns you want
            if len(row) == 2:
                combined_row.extend(row)
            else:
                combined_row.extend(['']*2)
        writer.writerow(combined_row)

for f in handles:
    f.close()

The line for rows in IT.izip_longest(*readers, fillvalue=['']*2): can be understood with an example:

In [1]: import itertools as IT

In [2]: readers = [(1,2,3), ('a','b','c','d'), (10,20,30,40)]

In [3]: list(IT.izip_longest(readers[0], readers[1], readers[2]))
Out[3]: [(1, 'a', 10), (2, 'b', 20), (3, 'c', 30), (None, 'd', 40)]

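As you can see, IT.izip_longest behaves much like zip, except that it does not stop until the longest iterable has been consumed. By default, it fills in missing items with None.

Now what happens if there are more than 3 items in readers? We would want to write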
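
As a side note for readers on Python 3, where the same function is named itertools.zip_longest (the answer itself targets Python 2), the difference between zip, the default None padding, and a custom fillvalue could be sketched roughly like this. The variable names are only illustrative:

```python
from itertools import zip_longest  # called izip_longest in Python 2

short = [1, 2, 3]
longer = ['a', 'b', 'c', 'd']

# zip stops at the shortest input, so 'd' is dropped
zipped = list(zip(short, longer))

# zip_longest runs until the longest input is exhausted, padding with None
padded_default = list(zip_longest(short, longer))

# fillvalue replaces None as the padding value
padded_custom = list(zip_longest(short, longer, fillvalue=''))

print(zipped)          # [(1, 'a'), (2, 'b'), (3, 'c')]
print(padded_default)  # [(1, 'a'), (2, 'b'), (3, 'c'), (None, 'd')]
print(padded_custom)   # [(1, 'a'), (2, 'b'), (3, 'c'), ('', 'd')]
```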

list(IT.izip_longest(readers[0], readers[1], readers[2], ...))

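but that is laborious, and if we do not know len(readers) in advance, we could not even replace the ellipsis (...) with something explicit.

Python has a solution for this: the star (aka argument unpacking) syntax: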

In [4]: list(IT.izip_longest(*readers))
Out[4]: [(1, 'a', 10), (2, 'b', 20), (3, 'c', 30), (None, 'd', 40)]

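Note that the result Out[4] is identical to the result Out[3].

*readers tells Python to unpack the items in readers and pass them along as individual arguments to IT.izip_longest. This is how Python allows us to send an arbitrary number of arguments to a function.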
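
The code in this answer is written for Python 2. On Python 3, a rough equivalent might look like the sketch below. It assumes that izip_longest is replaced by itertools.zip_longest and that the files are opened in text mode with newline='', as the csv module documentation recommends. The function name merge_first_columns and its ncols parameter are made up for illustration, and short rows are padded with empty strings rather than replaced wholesale, which differs slightly from the code above:

```python
import csv
from itertools import zip_longest


def merge_first_columns(filenames, out_path, ncols=2):
    """Write the first `ncols` columns of each input file side by side."""
    blank = [''] * ncols
    # newline='' lets the csv module handle line endings itself
    handles = [open(name, newline='') for name in filenames]
    try:
        readers = [csv.reader(f) for f in handles]
        with open(out_path, 'w', newline='') as out:
            writer = csv.writer(out, lineterminator='\n')
            # *readers passes each reader as a separate argument
            for rows in zip_longest(*readers, fillvalue=blank):
                combined_row = []
                for row in rows:
                    # keep the first ncols fields, padding short or missing rows
                    row = row[:ncols]
                    combined_row.extend(row + blank[len(row):])
                writer.writerow(combined_row)
    finally:
        # close every input file even if writing fails
        for f in handles:
            f.close()
```

Calling merge_first_columns(['1.csv', '2.csv'], 'combined.csv') on the two sample files from the question should produce the combined.csv shown there. Each input is opened only once, the output file does not need to exist beforehand, and passing newline='' when opening the output is what avoids the extra blank line after every row on Windows.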


DSM*_*DSM 7

Nowadays it seems almost obligatory for someone to offer a pandas-based solution to any data-processing problem in Python. So here is mine:

import pandas as pd

to_merge = ['{}.csv'.format(i) for i in range(4)]
dfs = []
for filename in to_merge:
    # read the csv, making sure the first two columns are str
    df = pd.read_csv(filename, header=None, converters={0: str, 1: str})
    # throw away all but the first two columns
    df = df.iloc[:, :2]
    # change the column names so they won't collide during concatenation
    df.columns = [filename + str(cname) for cname in df.columns]
    dfs.append(df)

# concatenate them horizontally
merged = pd.concat(dfs,axis=1)
# write it out
merged.to_csv("merged.csv", header=False, index=False)

which, for the files

~/coding/pand/merge$ cat 0.csv 
0,a,6,5,3,7
~/coding/pand/merge$ cat 1.csv 
1,b,7,6,7,0
2,c,0,1,8,7
3,d,6,8,4,5
4,e,8,4,2,4
~/coding/pand/merge$ cat 2.csv 
5,f,6,2,9,1
6,g,0,3,2,7
7,h,6,5,1,9
~/coding/pand/merge$ cat 3.csv 
8,i,9,1,7,1
9,j,0,9,3,9
produces

In [21]: !cat merged.csv
0,a,1,b,5,f,8,i
,,2,c,6,g,9,j
,,3,d,7,h,,
,,4,e,,,,

In [22]: pd.read_csv("merged.csv", header=None)
Out[22]: 
    0    1  2  3   4    5   6    7
0   0    a  1  b   5    f   8    i
1 NaN  NaN  2  c   6    g   9    j
2 NaN  NaN  3  d   7    h NaN  NaN
3 NaN  NaN  4  e NaN  NaN NaN  NaN

I think this is the correct alignment.