alv*_*vas 8 python csv dictionary numpy pandas
I have a tab-separated file with 1 billion lines (imagine 200 columns rather than 3):
abc -0.123 0.6524 0.325
foo -0.9808 0.874 -0.2341
bar 0.23123 -0.123124 -0.1232
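I want to create a dictionary where the string in the first column is the key and the remaining columns are the values. I have been doing it like this, but it is computationally expensive: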
import io
dictionary = {}
with io.open('bigfile', 'r') as fin:
    for line in fin:
        kv = line.strip().split()
        k, v = kv[0], kv[1:]
        dictionary[k] = list(map(float, v))
How can I get the desired dictionary faster? Ideally the values would be numpy arrays rather than lists of floats.
You can use pandas to load the data into a df, construct a new df in the shape you need, and then call to_dict:
In [99]:
import io
import pandas as pd

t = """abc -0.123 0.6524 0.325
foo -0.9808 0.874 -0.2341
bar 0.23123 -0.123124 -0.1232"""
df = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None)
# use the keys as column labels; transpose the values so that
# each key gets its own row of numbers
df = pd.DataFrame(columns=df[0], data=df.iloc[:, 1:].values.T)
df.to_dict()
Out[99]:
{'abc': {0: -0.12300000000000001,
  1: 0.65239999999999998,
  2: 0.32500000000000001},
 'bar': {0: 0.23123000000000002, 1: -0.123124, 2: -0.1232},
 'foo': {0: -0.98080000000000001, 1: 0.87400000000000011, 2: -0.2341}}
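Since the question asks for NumPy arrays as the values, here is a minimal sketch of one way to get them from the loaded frame. It assumes the same three-row sample; the names `sample`, `keys`, `values` and `dictionary` are only illustrative. Each key is paired with the matching row of the 2-D array returned by `to_numpy`:

```python
import io

import numpy as np
import pandas as pd

# Illustrative sample; a real file path could be passed to read_csv instead.
sample = """abc -0.123 0.6524 0.325
foo -0.9808 0.874 -0.2341
bar 0.23123 -0.123124 -0.1232"""

df = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None)

keys = df[0].tolist()                          # first column: the dictionary keys
values = df.iloc[:, 1:].to_numpy(dtype=float)  # shape (n_rows, n_value_columns)

# Iterating a 2-D array yields its rows, so zip pairs each key with its row.
dictionary = dict(zip(keys, values))
```

The rows are views into one 2-D array, so no separate Python list is built per key.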
Edit
A more dynamic approach that avoids building the temporary df:
In [121]:
t = """abc -0.123 0.6524 0.325
foo -0.9808 0.874 -0.2341
bar 0.23123 -0.123124 -0.1232"""
# determine the number of cols, we'll use this in usecols
col_len = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None, nrows=1).shape[1]
# read the first col, we'll use these keys as the row labels
keys = pd.read_csv(io.StringIO(t), sep=r'\s+', usecols=[0], header=None)[0].values
# now read only the value columns, label each row with its key,
# and transpose so that every key maps to its own row
df = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None, usecols=list(range(1, col_len)))
df.index = keys
df.T.to_dict()
Out[121]:
{'abc': {1: -0.12300000000000001,
  2: 0.65239999999999998,
  3: 0.32500000000000001},
 'bar': {1: 0.23123000000000002, 2: -0.123124, 3: -0.1232},
 'foo': {1: -0.98080000000000001, 2: 0.87400000000000011, 3: -0.2341}}
Further update
In fact you don't need the preliminary reads at all: passing index_col=0 makes the first column the row labels, so the keys and the number of value columns both come from a single read:
In [128]:
t = """abc -0.123 0.6524 0.325
foo -0.9808 0.874 -0.2341
bar 0.23123 -0.123124 -0.1232"""
# index_col=0 turns the first column into the row labels
df = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None, index_col=0)
# transpose so that every key maps to its own row of values
df.T.to_dict()
Out[128]:
{'abc': {1: -0.12300000000000001,
  2: 0.65239999999999998,
  3: 0.32500000000000001},
 'bar': {1: 0.23123000000000002, 2: -0.123124, 3: -0.1232},
 'foo': {1: -0.98080000000000001, 2: 0.87400000000000011, 3: -0.2341}}
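For a file with a billion rows, reading everything in one call may not fit in memory. The following is a hedged sketch rather than a benchmarked solution: `read_csv` accepts a `chunksize` argument and then yields DataFrames piece by piece. The helper name `load_in_chunks` and the chunk size are assumptions chosen for illustration, and the tiny sample stands in for the real file:

```python
import io

import numpy as np
import pandas as pd


def load_in_chunks(source, chunksize=100_000):
    """Build {key: 1-D float array} from a whitespace-separated source.

    `source` is a path or file-like object; `chunksize` is the number of
    rows parsed per step (hypothetical default, tune for your machine).
    """
    result = {}
    reader = pd.read_csv(source, sep=r"\s+", header=None,
                         index_col=0, chunksize=chunksize)
    for chunk in reader:
        values = chunk.to_numpy(dtype=float)     # one 2-D block per chunk
        result.update(zip(chunk.index, values))  # key -> row of that block
    return result


sample = """abc -0.123 0.6524 0.325
foo -0.9808 0.874 -0.2341
bar 0.23123 -0.123124 -0.1232"""

# chunksize=2 forces two chunks on the three-row sample
dictionary = load_in_chunks(io.StringIO(sample), chunksize=2)
```

The resulting dictionary still has to fit in memory; chunking only bounds the size of the intermediate DataFrame.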