3 python csv numpy matplotlib scipy
I'm having trouble reading a tab-delimited CSV file in Python. I use the following function:
def csv2array(filename, skiprows=0, delimiter='\t', raw_header=False, missing=None, with_header=True):
    """
    Parse a file name into an array. Return the array and additional header lines. By default,
    parse the header lines into dictionaries, assuming the parameters are numeric,
    using 'parse_header'.
    """
    f = open(filename, 'r')
    skipped_rows = []
    for n in range(skiprows):
        header_line = f.readline().strip()
        if raw_header:
            skipped_rows.append(header_line)
        else:
            skipped_rows.append(parse_header(header_line))
    f.close()
    if missing:
        data = genfromtxt(filename, dtype=None, names=with_header,
                          deletechars='', skiprows=skiprows, missing=missing)
    else:
        if delimiter != '\t':
            data = genfromtxt(filename, dtype=None, names=with_header, delimiter=delimiter,
                              deletechars='', skiprows=skiprows)
        else:
            data = genfromtxt(filename, dtype=None, names=with_header,
                              deletechars='', skiprows=skiprows)
    if data.ndim == 0:
        data = array([data.item()])
    return (data, skipped_rows)
The problem is that genfromtxt complains about my file, for example with this error:
Line #27100 (got 12 columns instead of 16)
I'm not sure where these errors come from. Any ideas?
Here is a sample file that causes the problem:
#Gene 120-1 120-3 120-4 30-1 30-3 30-4 C-1 C-2 C-5 genesymbol genedesc
ENSMUSG00000000001 7.32 9.5 7.76 7.24 11.35 8.83 6.67 11.35 7.12 Gnai3 guanine nucleotide binding protein alpha
ENSMUSG00000000003 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 Pbsn probasin
Is there a better way to write a generic csv2array function? Thanks.
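One likely cause, judging from the function in the question: when `delimiter` is `'\t'`, no `delimiter` argument is passed to `genfromtxt`, and by default `genfromtxt` treats any run of whitespace as a separator. The spaces inside the `genedesc` field would then produce a different number of columns from row to row. The sketch below only illustrates that effect with plain string splitting. The two rows are rebuilt from the sample with explicit tabs, which is an assumption, since the tabs are not visible in the post.

```python
# Two data rows from the sample file, rebuilt with explicit tabs between fields
# (assumption: the original file is tab-separated, as the question states).
rows = [
    "\t".join(["ENSMUSG00000000001", "7.32", "9.5", "7.76", "7.24", "11.35",
               "8.83", "6.67", "11.35", "7.12", "Gnai3",
               "guanine nucleotide binding protein alpha"]),
    "\t".join(["ENSMUSG00000000003"] + ["0.0"] * 9 + ["Pbsn", "probasin"]),
]

# Splitting on any whitespace: spaces inside the description add extra columns.
whitespace_counts = [len(row.split()) for row in rows]

# Splitting on tabs only: every row has the same number of columns.
tab_counts = [len(row.split("\t")) for row in rows]

print(whitespace_counts)  # [16, 12]
print(tab_counts)         # [12, 12]
```

The counts 16 and 12 match the numbers in the error message, so passing `delimiter='\t'` to `genfromtxt` unconditionally should avoid the mismatch.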
Take a look at the Python csv module: http://docs.python.org/library/csv.html
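import csv

thereIsAHeader = True  # set to False if the file has no header row
fields = 16            # expected number of fields per record; adjust to your file

header = []
records = []
# newline='' lets the csv module handle line endings itself (Python 3)
with open("myfile.csv", newline='') as f:
    reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    if thereIsAHeader:
        header = next(reader)
    for row, record in enumerate(reader):
        if len(record) != fields:
            # report the index of the bad record, not the record itself
            print("Skipping malformed record %i, contains %i fields (%i expected)" %
                  (row, len(record), fields))
        else:
            records.append(record)
# do numpy stuff.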
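Building on the reader loop above, here is a minimal sketch of a more general loader that uses only the standard library. The names `read_tsv` and `to_number` are hypothetical, and the in-memory sample is a shortened stand-in for the file in the question. It assumes the header row defines the expected width unless `expected_fields` is given.

```python
import csv
import io


def to_number(text):
    """Convert a field to float when possible, otherwise keep it as text."""
    try:
        return float(text)
    except ValueError:
        return text


def read_tsv(handle, expected_fields=None):
    """Read tab-separated text; return (header, rows, skipped_line_numbers)."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader, [])
    if header:
        header[0] = header[0].lstrip("#")  # the sample header starts with '#'
    if expected_fields is None:
        expected_fields = len(header)  # assume the header defines the width
    rows, skipped = [], []
    for record in reader:
        if len(record) != expected_fields:
            skipped.append(reader.line_num)  # remember where the bad row was
            continue
        rows.append([to_number(value) for value in record])
    return header, rows, skipped


# Small in-memory sample modelled on the file in the question, plus one short row.
sample = (
    "#Gene\t120-1\t30-1\tgenesymbol\tgenedesc\n"
    "ENSMUSG00000000001\t7.32\t7.24\tGnai3\tguanine nucleotide binding protein alpha\n"
    "ENSMUSG00000000003\t0.0\t0.0\tPbsn\n"
)
header, rows, skipped = read_tsv(io.StringIO(sample))
print(header)
print(rows)
print(skipped)
```

For a real file, the handle would come from `open(filename, newline='')`. Note that `float` also accepts strings such as `nan` and `inf`, so a text column containing those values would be converted too.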