Asked by oli*_*oli · Tags: python, csv, dictionary, type-conversion
I'm working with CSV files in Python; in practice they will have around 100,000 rows. Each row has a set of dimensions (as strings) and one metric (a float).
Since csv.DictReader and csv.reader only return values as strings, I'm currently iterating over all rows and converting the one numeric value to a float:
for i in csvDict:
    i[col] = float(i[col])
Can anyone suggest a better way to do this? I've been playing with various combinations of map, izip, and itertools, and have searched extensively for examples of doing this more efficiently, but unfortunately without much success.
In case it helps: I'm doing this on App Engine. I believe what I'm doing may be causing this error: Exceeded soft process size limit of 267.789 MB after servicing 11 requests total. I only get it when the CSV is very large.
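If memory is the real constraint, one option is to convert rows lazily with a generator instead of materializing the whole list first. A minimal sketch (the sample data and the column name `metric` are hypothetical, not from the original code):

```python
import csv
import io

# Hypothetical sample CSV; in the real case this would be the uploaded file.
raw = "city,metric\nLondon,1.5\nParis,2.25\n"

def typed_rows(fileobj, float_cols):
    """Yield DictReader rows one at a time, with the given columns cast to float."""
    for row in csv.DictReader(fileobj):
        for col in float_cols:
            row[col] = float(row[col])
        yield row

rows = list(typed_rows(io.StringIO(raw), ["metric"]))
```

Because `typed_rows` yields one converted row at a time, a consumer that accepts any iterable never needs the whole 100,000-row list in memory at once.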
Edit (my goal): I'm parsing this CSV so I can use it as a data source for the Google Visualizations API. The final dataset gets loaded into a gviz DataTable for querying, and the types must be specified when that table is constructed. My problem would also be solved if anyone knows of a good gviz CSV-to-DataTable converter in Python!
Edit 2 (my code):
I believe my problem relates to the way I'm trying to fix the types in fixCsvTypes(). Also, data_table.LoadData() requires an iterable object.
import csv
import datetime
import StringIO

import gviz_api


class GvizFromCsv(object):
    """Convert CSV to gviz-ready objects."""

    def __init__(self, csvFile, dateTimeFormat=None):
        self.fileObj = StringIO.StringIO(csvFile)
        self.csvDict = list(csv.DictReader(self.fileObj))
        self.dateTimeFormat = dateTimeFormat
        self.headers = {}
        self.ParseHeaders()
        self.fixCsvTypes()

    def IsNumber(self, st):
        try:
            float(st)
            return True
        except ValueError:
            return False

    def IsDate(self, st):
        try:
            datetime.datetime.strptime(st, self.dateTimeFormat)
            return True  # without this, the method always returns None (falsy)
        except ValueError:
            return False

    def ParseHeaders(self):
        """Attempts to figure out header types for gviz, based on the first row."""
        for k, v in self.csvDict[0].items():
            if self.IsNumber(v):
                self.headers[k] = 'number'
            elif self.dateTimeFormat and self.IsDate(v):
                self.headers[k] = 'date'
            else:
                self.headers[k] = 'string'

    def fixCsvTypes(self):
        """Only fixes numbers."""
        update_to_numbers = []
        for k, v in self.headers.items():
            if v == 'number':
                update_to_numbers.append(k)
        for i in self.csvDict:
            for col in update_to_numbers:
                i[col] = float(i[col])

    def CreateDataTable(self):
        """Creates a gviz data table."""
        data_table = gviz_api.DataTable(self.headers)
        data_table.LoadData(self.csvDict)
        return data_table
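The header-sniffing in ParseHeaders can be exercised on its own, independent of gviz_api. A stdlib-only sketch (the function name `infer_type` is mine, not from the original code):

```python
import datetime

def infer_type(value, datetime_format=None):
    """Guess a gviz column type for one cell, mirroring the ParseHeaders logic."""
    try:
        float(value)
        return 'number'   # anything float() accepts is treated as a number
    except ValueError:
        pass
    if datetime_format:
        try:
            datetime.datetime.strptime(value, datetime_format)
            return 'date'  # parses under the supplied format
        except ValueError:
            pass
    return 'string'        # fallback for everything else
```

Like ParseHeaders, this only looks at one cell, so a column whose first row happens to be numeric but later rows are not would be misclassified; sniffing more rows is a straightforward extension.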
I first attacked the CSV file with regular expressions, but since the data in the file is laid out so strictly on every line, we can simply use the split() function:
import gviz_api

scheme = [('col1','string','SURNAME'), ('col2','number','ONE'), ('col3','number','TWO')]
data_table = gviz_api.DataTable(scheme)

# --- lines in surnames.csv are: ---
# surname,percent,cumulative percent,rank\n
# SMITH,1.006,1.006,1,\n
# JOHNSON,0.810,1.816,2,\n
# WILLIAMS,0.699,2.515,3,\n

def transf(surname, x, y):
    return (surname, float(x), float(y))

with open('surnames.csv') as f:
    f.readline()  # skip the header line: surname,percent,cumulative percent,rank\n
    # populate the data table by iterating over the CSV file
    data_table.LoadData(transf(*line.split(',')[0:3]) for line in f)
Or, without defining a function:
import gviz_api

scheme = [('col1','string','SURNAME'), ('col2','number','ONE'), ('col3','number','TWO')]
data_table = gviz_api.DataTable(scheme)

# --- lines in surnames.csv are: ---
# surname,percent,cumulative percent,rank\n
# SMITH,1.006,1.006,1,\n
# JOHNSON,0.810,1.816,2,\n
# WILLIAMS,0.699,2.515,3,\n

with open('surnames.csv') as f:
    f.readline()  # skip the header line: surname,percent,cumulative percent,rank\n
    # populate the data table by iterating over the CSV file
    data_table.LoadData([el if n == 0 else float(el)
                         for n, el in enumerate(line.split(',')[0:3])]
                        for line in f)
For a moment I believed I had to populate the data table one row at a time, because I was using a regular expression and needed to grab the matched groups before converting the number strings to floats. With split(), everything can be done in a single LoadData() instruction.
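The difference between the two routes can be seen on a single sample line; the regex below is my own illustration of the approach described, not code from the original answer:

```python
import re

line = "SMITH,1.006,1.006,1,\n"  # one row in the format shown above

# regex route: capture the first three comma-separated fields, then convert
m = re.match(r"([^,]*),([^,]*),([^,]*)", line)
by_regex = (m.group(1), float(m.group(2)), float(m.group(3)))

# split() route: the same result with less machinery
by_split = tuple(el if n == 0 else float(el)
                 for n, el in enumerate(line.split(",")[0:3]))
```

Both produce the same (string, float, float) tuple, but the split() version needs no pattern and drops straight into a generator expression.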
So your code can be shortened. By the way, I don't see why it needs to define a class; a function seems sufficient to me:
import gviz_api

def GvizFromCsv(filename):
    """Creates a gviz data table from a CSV file."""
    data_table = gviz_api.DataTable([('col1','string','SURNAME'),
                                     ('col2','number','ONE'),
                                     ('col3','number','TWO')])
    # --- with such a table schema, lines in the file must look like: ---
    # blah, number, number, ...anything else...\n
    # SMITH,1.006,1.006, ...anything else...\n
    # JOHNSON,0.810,1.816, ...anything else...\n
    # WILLIAMS,0.699,2.515, ...anything else...\n
    with open(filename) as f:
        data_table.LoadData([el if n == 0 else float(el)
                             for n, el in enumerate(line.split(',')[0:3])]
                            for line in f)
    return data_table
Now you'll have to check whether the way CSV data is read from the other API can be plugged into this code while keeping the principle of populating the data table by iteration.
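To keep that iteration principle with rows coming from another API, it is enough to wrap whatever the API returns in a generator of typed tuples. A sketch with made-up dict rows standing in for the other API's output:

```python
# Hypothetical stand-in for rows fetched from another API
api_rows = [
    {"surname": "SMITH", "percent": "1.006", "cumulative": "1.006"},
    {"surname": "JOHNSON", "percent": "0.810", "cumulative": "1.816"},
]

# A generator of (string, float, float) tuples, in the same shape that
# data_table.LoadData(...) consumed in the snippets above.
typed = ((r["surname"], float(r["percent"]), float(r["cumulative"]))
         for r in api_rows)
result = list(typed)
```

Because the generator is consumed lazily, swapping the list for a paginated API cursor changes nothing downstream.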
Viewed: 6,125 times