Fastest way to do data type conversion with csv.DictReader in Python

oli*_*oli 8 python csv dictionary type-conversion

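I am working with a CSV file in Python that will have about 100,000 rows when in use. Each row has a set of dimensions (as strings) and a single metric (a float).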

Since csv.DictReader and csv.reader only return values as strings, I am currently iterating over all rows and converting the one numeric value to a float.

for i in csvDict:
    i[col] = float(i[col])

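Is there a better way to do this that anyone could suggest? I have been playing around with various combinations of map, izip, and itertools, and have searched extensively for samples that do this more efficiently, but unfortunately without much success.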

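In case it helps: I am doing this on App Engine. I believe what I am doing may be causing this error: "Exceeded soft process size limit with 267.789 MB after servicing 11 requests total". I only get it when the CSV is quite large.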

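Edit: My goal. I am parsing this CSV so that I can use it as a data source for the Google Visualizations API. The final data set will be loaded into a gviz DataTable for querying. Types must be specified when this table is constructed. If anyone knows of a good gviz csv->datatable converter in Python, that would also solve my problem!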

Edit2: My code

I believe my problem has to do with the way I try to fixCsvTypes(). Also, data_table.LoadData() expects an iterable object.

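import csv
import datetime
import StringIO  # Python 2 module, as used in __init__ below

import gviz_api

class GvizFromCsv(object):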
  """Convert CSV to Gviz ready objects."""

  def __init__(self, csvFile, dateTimeFormat=None):
    self.fileObj = StringIO.StringIO(csvFile)
    self.csvDict = list(csv.DictReader(self.fileObj))
    self.dateTimeFormat = dateTimeFormat
    self.headers = {}
    self.ParseHeaders()
    self.fixCsvTypes()

  def IsNumber(self, st):
    try:
      float(st)
      return True
    except ValueError:
      return False

  def IsDate(self, st):
    try:
      datetime.datetime.strptime(st, self.dateTimeFormat)
      return True  # without this the method returned None, so no column was ever typed 'date'
    except ValueError:
      return False

  def ParseHeaders(self):
    """Attempts to figure out header types for gviz, based on first row"""
    for k, v in self.csvDict[0].items():
      if self.IsNumber(v):
        self.headers[k] = 'number'
      elif self.dateTimeFormat and self.IsDate(v):
        self.headers[k] = 'date'
      else:
        self.headers[k] = 'string'

  def fixCsvTypes(self):
    """Only fixes numbers."""
    update_to_numbers = []
    for k,v in self.headers.items():
      if v == 'number':
        update_to_numbers.append(k)
    for i in self.csvDict:
      for col in update_to_numbers:
        i[col] = float(i[col])

  def CreateDataTable(self):
    """creates a gviz data table"""
    data_table = gviz_api.DataTable(self.headers)
    data_table.LoadData(self.csvDict)
    return data_table

eyq*_*uem 2

I first processed the CSV file with a regex, but since the data in the file is arranged very strictly on each line, we can simply use the split() function:

import gviz_api

scheme = [('col1','string','SURNAME'),('col2','number','ONE'),('col3','number','TWO')]
data_table = gviz_api.DataTable(scheme)

#  --- lines in surnames.csv are : --- 
#  surname,percent,cumulative percent,rank\n
#  SMITH,1.006,1.006,1,\n
#  JOHNSON,0.810,1.816,2,\n
#  WILLIAMS,0.699,2.515,3,\n

with open('surnames.csv') as f:

    def transf(surname,x,y):
        return (surname,float(x),float(y))

    f.readline()
    # to skip the first line surname,percent,cumulative percent,rank\n

    data_table.LoadData( transf(*line.split(',')[0:3]) for line in f )
    # to populate the data table by iterating in the CSV file

Or, without defining a function:

import gviz_api

scheme = [('col1','string','SURNAME'),('col2','number','ONE'),('col3','number','TWO')]
data_table = gviz_api.DataTable(scheme)

#  --- lines in surnames.csv are : --- 
#  surname,percent,cumulative percent,rank\n
#  SMITH,1.006,1.006,1,\n
#  JOHNSON,0.810,1.816,2,\n
#  WILLIAMS,0.699,2.515,3,\n

with open('surnames.csv') as f:

    f.readline()
    # to skip the first line surname,percent,cumulative percent,rank\n

    data_table.LoadData( [el if n==0 else float(el) for n,el in enumerate(line.split(',')[0:3])] for line in f )
    # to populate the data table by iterating in the CSV file

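For a moment I believed I had to populate the data table one row at a time, because I was using a regex and needed to obtain the matched groups before converting the numeric strings to float. With split(), everything can be done in a single instruction through LoadData().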
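
One caveat with a plain split(',') is that it breaks on quoted fields that themselves contain commas. As a minimal sketch, assuming Python 3 and assuming the numeric values sit in the second and third columns as in the sample above, the same lazy row-by-row conversion can be written on top of csv.reader. The names typed_rows, numeric_columns and width are hypothetical and not part of gviz_api or the csv module.

```python
import csv
import io


def typed_rows(lines, numeric_columns=(1, 2), width=3):
    """Lazily yield one list per CSV row, converting selected columns to float.

    `lines` is any iterable of text lines (an open file, io.StringIO, ...).
    `numeric_columns` and `width` are illustrative parameters, not library API.
    """
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row, if there is one
    for row in reader:
        if not row:  # csv.reader yields [] for blank lines
            continue
        yield [float(value) if index in numeric_columns else value
               for index, value in enumerate(row[:width])]


# Small in-memory sample; the second data row has a quoted field with a comma.
sample = io.StringIO(
    "surname,percent,cumulative percent,rank\n"
    "SMITH,1.006,1.006,1,\n"
    '"SMITH, JR",0.810,1.816,2,\n'
)
rows = list(typed_rows(sample))
```

Because typed_rows is a generator, it could presumably be handed to data_table.LoadData() in place of the generator expression used above, so that only one row is held in memory at a time; that last step is not verified here.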

So your code can be shortened. By the way, I don't see why it needs to define a class. A function seems sufficient to me:

def GvizFromCsv(filename):
  """ creates a gviz data table from a CSV file """

  data_table = gviz_api.DataTable([('col1','string','SURNAME'),
                                   ('col2','number','ONE'    ),
                                   ('col3','number','TWO'    ) ])

  #  --- with such a table schema , lines in the file must be like that: ---  
  #  blah, number, number, ...anything else...\n 
  #  SMITH,1.006,1.006, ...anything else...\n 
  #  JOHNSON,0.810,1.816, ...anything else...\n 
  #  WILLIAMS,0.699,2.515, ...anything else...\n

  with open(filename) as f:
    data_table.LoadData( [el if n==0 else float(el) for n,el in enumerate(line.split(',')[0:3])]
                         for line in f )
  return data_table

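Now you have to check whether the way you read the CSV data from the other API can be plugged into this code, so that the data table is still populated by iteration.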
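
If the CSV arrives as an in-memory string, as in the question's class, one way to keep that iterative principle is a generator around csv.DictReader. The following is only a sketch under assumptions: it targets Python 3 (io.StringIO instead of the Python 2 StringIO module), it guesses column types from the first data row exactly as ParseHeaders does, and the names is_number and iter_typed_dicts are hypothetical.

```python
import csv
import io


def is_number(text):
    """Return True if `text` can be parsed by float()."""
    try:
        float(text)
        return True
    except (TypeError, ValueError):
        return False


def iter_typed_dicts(csv_text):
    """Lazily yield DictReader rows with numeric-looking columns as floats.

    Column types are guessed from the first data row only, so a later row
    holding non-numeric text in such a column raises ValueError.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    first = next(reader, None)
    if first is None:  # no data rows at all
        return
    numeric = [key for key, value in first.items() if is_number(value)]

    def convert(row):
        for key in numeric:
            row[key] = float(row[key])
        return row

    yield convert(first)
    for row in reader:
        yield convert(row)


sample_text = "surname,percent\nSMITH,1.006\nJOHNSON,0.810\n"
converted = list(iter_typed_dicts(sample_text))
```

Unlike list(csv.DictReader(...)) followed by a second conversion pass, this never holds a full list of rows unless the caller builds one. Whether LoadData() accepts these dictionaries directly depends on how the table description is declared, which is not checked here.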