Load a CSV file into NumPy and access columns by name

use*_*422 9 python csv arrays numpy

I have a CSV file with a header row. Given this test.csv file:

"A","B","C","D","E","F","timestamp"
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291111964948E12
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291113113366E12
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291120650486E12

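I simply want to load it as a matrix/ndarray with 3 rows and 7 columns, and I also want to access the column vectors by column name. If I use genfromtxt (as shown below), I get an ndarray with 3 rows (one per line) and no columns.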

import numpy as np

r = np.genfromtxt('test.csv', delimiter=',', dtype=None, names=True)
print(r)
print(r.shape)

[ (611.88243, 9089.5601000000006, 5133.0, 864.07514000000003, 1715.3747599999999, 765.22776999999996, 1291111964948.0)
 (611.88243, 9089.5601000000006, 5133.0, 864.07514000000003, 1715.3747599999999, 765.22776999999996, 1291113113366.0)
 (611.88243, 9089.5601000000006, 5133.0, 864.07514000000003, 1715.3747599999999, 765.22776999999996, 1291120650486.0)]
(3,)

I can get a column vector from its column name like this:

print(r['A'])
  [ 611.88243  611.88243  611.88243]

If I use loadtxt instead, I get an array with 3 rows and 7 columns, but I cannot access the columns by column name (as shown below).

numpy.loadtxt(open("test.csv","rb"),delimiter=",",skiprows=1)

I get:

  [ [611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291111964948E12]
    [611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291113113366E12]
    [611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291120650486E12] ]

Is there a way in Python to satisfy both requirements at once (access columns by column name like np.genfromtxt and have a matrix like np.loadtxt)?

unu*_*tbu 9

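Using NumPy alone, the options you show are your only options: either an ndarray of homogeneous dtype with shape (3,7), or a structured array of (potentially) heterogeneous dtype and shape (3,).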
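
To make the two options concrete, here is a minimal sketch that loads the data once as a structured array and then derives a plain 2-D array from it. It assumes a NumPy version that provides numpy.lib.recfunctions.structured_to_unstructured (1.16 or later); the inline csv_text is a stand-in for test.csv with the header quotes omitted for simplicity, and the variable names are only illustrative.

```python
import io

import numpy as np
from numpy.lib import recfunctions as rfn

# Inline stand-in for test.csv (assumption: header quotes omitted).
csv_text = """A,B,C,D,E,F,timestamp
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291111964948E12
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291113113366E12
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291120650486E12
"""

# Structured array: one record per row, fields addressable by name.
structured = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True)

# Homogeneous 2-D array built from the same records.
plain = rfn.structured_to_unstructured(structured)

print(structured.shape)  # (3,)
print(plain.shape)       # (3, 7)
```

The result is still two separate objects: structured['A'] gives access by name, plain gives the (3,7) matrix, and neither offers both at once.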

If you really want a data structure with labeled columns and shape (3,7) (and lots of other goodies), you could use a pandas DataFrame:

In [67]: import pandas as pd
In [68]: df = pd.read_csv('data'); df
Out[68]: 
           A          B     C          D           E          F     timestamp
0  611.88243  9089.5601  5133  864.07514  1715.37476  765.22777  1.291112e+12
1  611.88243  9089.5601  5133  864.07514  1715.37476  765.22777  1.291113e+12
2  611.88243  9089.5601  5133  864.07514  1715.37476  765.22777  1.291121e+12    

In [70]: df['A']
Out[70]: 
0    611.88243
1    611.88243
2    611.88243
Name: A, dtype: float64

In [71]: df.shape
Out[71]: (3, 7)
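
If a plain ndarray is still needed downstream, the DataFrame can hand one back. The following is a small sketch, assuming a pandas version that has DataFrame.to_numpy (0.24 or later); csv_text is an inline stand-in for the CSV file, and the variable names are illustrative.

```python
import io

import pandas as pd

# Inline stand-in for the CSV file from the question.
csv_text = '''"A","B","C","D","E","F","timestamp"
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291111964948E12
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291113113366E12
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291120650486E12
'''

df = pd.read_csv(io.StringIO(csv_text))

matrix = df.to_numpy()      # 2-D ndarray holding all columns
col_a = df["A"].to_numpy()  # 1-D ndarray for a single named column

print(matrix.shape)  # (3, 7)
print(col_a.shape)   # (3,)
```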

A pure NumPy/Python alternative would be to use a dict to map the column names to indices:

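import csv

import numpy as np

filename = 'test.csv'  # assumed: the CSV file from the question

# Read the header row and map each column name to its index
with open(filename, newline='') as f:
    reader = csv.reader(f)
    columns = next(reader)
    colmap = dict(zip(columns, range(len(columns))))

# Load the numeric data, skipping the header row
arr = np.matrix(np.loadtxt(filename, delimiter=",", skiprows=1))
print(arr[:, colmap['A']])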

yields

[[ 611.88243]
 [ 611.88243]
 [ 611.88243]]

This way, arr is a NumPy matrix whose columns can be accessed by label using the syntax

arr[:, colmap[column_name]]
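
The same idea also works with a regular ndarray instead of np.matrix, in which case a selected column comes back as a 1-D array rather than a (3,1) matrix. This is only a sketch: csv_text is an inline stand-in for test.csv, and column is a hypothetical helper introduced here for illustration.

```python
import csv
import io

import numpy as np

# Inline stand-in for test.csv.
csv_text = '''"A","B","C","D","E","F","timestamp"
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291111964948E12
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291113113366E12
611.88243,9089.5601,5133.0,864.07514,1715.37476,765.22777,1.291120650486E12
'''

# The csv module strips the quotes around the header names.
header = next(csv.reader(io.StringIO(csv_text)))
colmap = {name: index for index, name in enumerate(header)}

# Plain 2-D float array; the header row is skipped.
arr = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)


def column(name):
    """Hypothetical helper: return the column labelled `name` as a 1-D array."""
    return arr[:, colmap[name]]


print(arr.shape)           # (3, 7)
print(column("A").shape)   # (3,)
```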