Tas*_*lrs 5 python machine-learning hdf5 neural-network tensorflow
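I have a large dataset (300,000 examples x 33,000 features), which of course does not fit in memory. The data are stored in HDF5 format. The values are mostly zeros (sparse data). They look like this: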
Attr1 52 52 52 52 52 52 52 52 ...
Attr2 umb umb umb umb umb umb umb umb ...
CellID TGC-1 TGG-1 CAG-1 TTC-1 GTG-1 GTA-1 CAA-1 CAC-1 ...
Acc Gene ...
243485 RP11-.3 0 0 0 0 0 0 0 0 ...
237613 FAM138A 0 0 0 0 0 0 0 0 ...
186092 OR4F5 0 0 0 0 0 0 0 0 ...
238009 RP11-.7 0 0 0 0 0 0 0 0 ...
239945 RP11-.8 0 0 0 0 0 0 0 0 ...
279457 FO538.2 0 0 0 0 0 0 0 0 ...
228463 AP006.2 0 0 0 0 0 0 0 0 ...
... ... ... ... ... ... ... ... ... ...
I did the following to load the whole dataset into TensorFlow (loompy is just a package that uses hdf5 under the hood):
import tensorflow as tf
import numpy as np
import loompy as lp

batch_size = 1000

# Read the shape, dtype and labels once; `filename` is the path to the loom file.
with lp.connect(filename, 'r') as ds:
    ds_shape = (batch_size, ds.shape[0])
    ds_dtype = ds[0:1, 0:1].dtype
    labels = np.asarray([ds.ca.CellID, ds.ca.Attr1]).T
    labels_shape = (batch_size, 1)

data_placeholder = tf.placeholder(ds_dtype, ds_shape)
labels_placeholder = tf.placeholder(labels[:,1].dtype, labels_shape)
dataset = tf.data.Dataset.from_tensor_slices((data_placeholder, labels_placeholder))
dataset = dataset.prefetch(batch_size)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    with lp.connect(filename, 'r') as ds:
        # Feed the file one block of `batch_size` cells at a time.
        for i in range(0, ds.shape[1], batch_size):
            batch = ds[0 : ds_shape[1], i : i + batch_size].T
            batch_labels = np.asarray([ds.ca.CellID[i : i + batch_size],
                                       ds.ca.Attr1[i : i + batch_size]]).T[:,1]
            sess.run(iterator.initializer, feed_dict = {data_placeholder: batch,
                                                        labels_placeholder: batch_labels.reshape(batch_size, 1)})
            for _ in range(batch_size):
                print(sess.run(next_element))
Output:
(array([0, 0, 0, ..., 0, 0, 0], dtype=int32), array([b'52'], dtype=object))
(array([0, 0, 0, ..., 0, 0, 0], dtype=int32), array([b'52'], dtype=object))
...
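However, this way I cannot split my data into training, test and evaluation sets. Also, I can only shuffle the examples within each batch, which is not effective because most of the time the examples in a batch belong to the same class.
How can I handle this kind of data so that I can load it as train, test and eval sets, and perform shuffling etc. (preferably making as much use of my TitanX GPU as possible)?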
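In case anyone is still interested in this topic, here is my solution to the problem. In the end I stuck with the Loompy file format, since it is really convenient for what I am doing (take a look at Loompy). To import such a large volume of data into my model, I used the from_generator() function of the tf.data.Dataset TensorFlow API. I also created a generator that yields the data as needed.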
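As a minimal sketch of that idea (not the input function from this answer), the generator below yields one example at a time for a chosen list of column indices, so only the requested columns are ever read. It uses only the standard library. `FakeStore`, `column()` and `label()` are hypothetical stand-ins for an open Loompy connection, `ds[:, i]` and a column attribute, and `make_generator` is an illustrative helper name.

```python
import random


class FakeStore:
    """Hypothetical stand-in for an on-disk matrix (genes x cells)."""

    def __init__(self, n_genes, n_cells):
        self.shape = (n_genes, n_cells)

    def column(self, i):
        # A real store would read column i from disk here (e.g. ds[:, i]).
        return [(i + g) % 3 for g in range(self.shape[0])]

    def label(self, i):
        # Hypothetical per-cell label, analogous to a column attribute.
        return i % 2


def make_generator(store, indices):
    """Return a zero-argument callable that yields (features, label) pairs."""

    def gen():
        for i in indices:
            yield store.column(i), store.label(i)

    return gen


store = FakeStore(n_genes=5, n_cells=10)

# Shuffle the indices globally, then split them; no data is read yet.
indices = list(range(store.shape[1]))
random.Random(0).shuffle(indices)
train_idx, test_idx = indices[:8], indices[8:]

train_gen = make_generator(store, train_idx)
first_features, first_label = next(train_gen())
print(len(first_features), first_label in (0, 1))
```

A zero-argument callable such as `train_gen` is the kind of object that `tf.data.Dataset.from_generator()` takes as its first argument. Because the split and the global shuffle operate on the index list rather than on the data, they cost very little memory.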
Below is what my input function looks like:
import loompy as lp
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

# FILE, ae, TRAIN_RT, TEST_RT, EVAL_RT and _int2float are used below but are
# not shown in this answer; they are assumed to be defined elsewhere.
# `ae` appears to be an autoencoder flag (the input is also the target).
model_input_name = ""
input_size = 10000
batch_size = 32
epochs = 10

# Input functions for train, test and eval sets.
def train_input_fn():
    return _input_fn('TRAIN')

def test_input_fn():
    return _input_fn('TEST')

def eval_input_fn():
    return _input_fn('EVAL')

# General purpose input function
def _input_fn(mode = 'TRAIN'):
    """
    Arguments
        mode : 'TRAIN', 'TEST', 'EVAL'
    """

    # A generator to yield data and labels from the given FILE,
    # based on the indices assigned to the "indices" variable.
    # If you change the labels, remember to update the from_generator()
    # parameters below, to reflect their datatype.
    def gen():
        with lp.connect(FILE, 'r') as ds:
            if ae:
                for i in indices:
                    yield {model_input_name: ds[:, i]}, ds[:, i]
            else:
                for i in indices:
                    yield {model_input_name: ds[:, i]}, ds.ca.x_CellType[i]

    # Get the indices for train, test and eval sets
    train_idx, test_idx, eval_idx = train_test_set_idx_split(TRAIN_RT, TEST_RT, EVAL_RT)

    # Check condition and assign the respective set to the "indices" variable
    if mode == 'TRAIN':
        indices = train_idx
    elif mode == 'TEST':
        indices = test_idx
    elif mode == 'EVAL':
        indices = eval_idx
    else:
        raise ValueError("Wrong mode choice: {}".format(mode))

    dataset = tf.data.Dataset.from_generator(gen, ({model_input_name: tf.int64}, tf.int64),
                                             output_shapes=({model_input_name: [input_size,]}, []))

    # Shuffle, batch, map, prefetch and repeat your dataset.
    # If you need to do some preprocessing on the data, create your function on
    # the cell above, and call it within a map() function.
    dataset = dataset.shuffle(buffer_size=batch_size*50)
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(_reshape_labels)
    dataset = dataset.map(_int2float)

    # Map on whatever other functions you need
    dataset = dataset.map( ... )

    dataset = dataset.prefetch(2)
    dataset = dataset.repeat(epochs)
    iterator = dataset.make_one_shot_iterator()
    return iterator.get_next()

# Get train, test, eval indices for the given dataset
def train_test_set_idx_split(train_rt, test_rt, eval_rt):
    """ This function returns indices for the train, test and evaluation sets,
    given an input Dataset.
    Arguments:
        train_rt: ratio of the train dataset
        test_rt: ratio of the test dataset
        eval_rt: ratio of the evaluation dataset
    Returns:
        train_idx: indices (of the given dataset) for the train dataset
        test_idx: indices (of the given dataset) for the test dataset
        eval_idx: indices (of the given dataset) for the evaluation dataset
    Note:
        This function will work correctly as long as (test_rt == eval_rt) is True.
        If you need (test_rt != eval_rt), you need something more sophisticated.
    """
    with lp.connect(FILE, 'r') as ds:
        idx = np.arange(ds.shape[1])
    # First split off the train set, then halve the remainder into test and eval.
    train_idx, test_idx = train_test_split(idx, train_size=train_rt, test_size=test_rt+eval_rt)
    test_idx, eval_idx = train_test_split(test_idx, train_size=0.5, test_size=0.5)
    return train_idx, test_idx, eval_idx

# Reshape labels as needed
def _reshape_labels(data, labels):
    return data, tf.reshape(labels, (-1,1))
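The note in train_test_set_idx_split says the two-step split only works when the test and evaluation ratios are equal. One possible way around that, sketched here under the assumption that the three ratios sum to 1, is to shuffle the indices once and cut them at cumulative boundaries. `split_indices` and its `seed` parameter are hypothetical names, and the sketch swaps in the standard library's `random` module instead of `train_test_split`.

```python
import random


def split_indices(n, train_rt, test_rt, eval_rt, seed=0):
    """Shuffle range(n) once and cut it into train/test/eval index lists."""
    if abs(train_rt + test_rt + eval_rt - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(round(n * train_rt))
    n_test = int(round(n * test_rt))
    # Whatever is left after the train and test cuts goes to eval.
    return idx[:n_train], idx[n_train:n_train + n_test], idx[n_train + n_test:]


train_idx, test_idx, eval_idx = split_indices(1000, 0.7, 0.2, 0.1)
print(len(train_idx), len(test_idx), len(eval_idx))
```

Since the three lists are disjoint slices of one shuffled permutation, no example can end up in more than one set, and fixing `seed` keeps the split reproducible across the separate train, test and eval input functions.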