Rob*_*Rob 80 python csv tensorflow
I'm relatively new to the world of TensorFlow, and pretty baffled by how you'd actually read CSV data into usable example/label tensors in TensorFlow. The examples in the TensorFlow tutorial on reading CSV data are fairly fragmented and only get you part of the way to being able to train on CSV data.
Here's the code I've pieced together, based on that CSV tutorial:
from __future__ import print_function
import tensorflow as tf

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

filename = "csv_test_data.csv"

# setup text reader
file_length = file_len(filename)
filename_queue = tf.train.string_input_producer([filename])
reader = tf.TextLineReader(skip_header_lines=1)
_, csv_row = reader.read(filename_queue)

# setup CSV decoding
record_defaults = [[0], [0], [0], [0], [0]]
col1, col2, col3, col4, col5 = tf.decode_csv(csv_row, record_defaults=record_defaults)

# turn features back into a tensor
features = tf.stack([col1, col2, col3, col4])

print("loading, " + str(file_length) + " line(s)\n")

with tf.Session() as sess:
    tf.initialize_all_variables().run()

    # start populating filename queue
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    for i in range(file_length):
        # retrieve a single instance
        example, label = sess.run([features, col5])
        print(example, label)

    coord.request_stop()
    coord.join(threads)
    print("\ndone loading")
And here is a brief example from the CSV file I'm loading - pretty basic data - 4 feature columns and 1 label column:
0,0,0,0,0
0,15,0,0,0
0,30,0,0,0
0,45,0,0,0
All the code above does is print each example from the CSV file, one by one, which, while nice, is pretty useless for training.
What I'm struggling with here is how you'd actually turn those individual examples, loaded one by one, into a training dataset. For example, here's a notebook I was working with in the Udacity Deep Learning course. I basically want to take the CSV data I'm loading and plop it into something like train_dataset and train_labels:
def reformat(dataset, labels):
    dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
    # Map 2 to [0.0, 1.0, 0.0 ...], 3 to [0.0, 0.0, 1.0 ...]
    labels = (np.arange(num_labels) == labels[:, None]).astype(np.float32)
    return dataset, labels

train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)
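(For reference, the most direct way I can picture doing that is just accumulating the per-row tensors from my first snippet into numpy arrays and reusing the same one-hot trick. A rough sketch - it assumes the features/col5 tensors and the running session from above, and num_labels is a made-up class count:)
import numpy as np

# accumulate rows one at a time into Python lists (assumes the session,
# queue runners, and `features`/`col5` tensors from the first snippet)
rows, raw_labels = [], []
for i in range(file_length):
    example, label = sess.run([features, col5])
    rows.append(example)
    raw_labels.append(label)

train_dataset = np.array(rows, dtype=np.float32)
num_labels = 2  # hypothetical number of classes
train_labels = (np.arange(num_labels) == np.array(raw_labels)[:, None]).astype(np.float32)
print(train_dataset.shape, train_labels.shape)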
I've tried using tf.train.shuffle_batch like this, but it just inexplicably hangs:
for i in range(file_length):
    # retrieve a single instance
    example, label = sess.run([features, colRelevant])
    example_batch, label_batch = tf.train.shuffle_batch([example, label], batch_size=file_length, capacity=file_length, min_after_dequeue=10000)
    print(example, label)
So, to sum up, here is my question: loading examples one at a time still doesn't get me a usable training set, and having to know the number of lines to process up front (the for i in range(file_length) line of code above) just feels inelegant.
EDIT: As Yaroslav points out below, I was likely mixing up the imperative and graph-construction parts here, and it's starting to become clearer. I was able to pull together the following code, which I think is closer to what would typically be done when training a model from CSV (excluding any model-training code):
from __future__ import print_function
import numpy as np
import tensorflow as tf
import math as math
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('dataset')
args = parser.parse_args()

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

def read_from_csv(filename_queue):
    reader = tf.TextLineReader(skip_header_lines=1)
    _, csv_row = reader.read(filename_queue)
    record_defaults = [[0], [0], [0], [0], [0]]
    colHour, colQuarter, colAction, colUser, colLabel = tf.decode_csv(csv_row, record_defaults=record_defaults)
    features = tf.stack([colHour, colQuarter, colAction, colUser])
    label = tf.stack([colLabel])
    return features, label

def input_pipeline(batch_size, num_epochs=None):
    filename_queue = tf.train.string_input_producer([args.dataset], num_epochs=num_epochs, shuffle=True)
    example, label = read_from_csv(filename_queue)
    min_after_dequeue = 10000
    capacity = min_after_dequeue + 3 * batch_size
    example_batch, label_batch = tf.train.shuffle_batch(
        [example, label], batch_size=batch_size, capacity=capacity,
        min_after_dequeue=min_after_dequeue)
    return example_batch, label_batch

file_length = file_len(args.dataset) - 1
examples, labels = input_pipeline(file_length, 1)

with tf.Session() as sess:
    tf.initialize_all_variables().run()
    tf.initialize_local_variables().run()  # the num_epochs counter is a local variable in newer TF 1.x

    # start populating filename queue
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    try:
        while not coord.should_stop():
            example_batch, label_batch = sess.run([examples, labels])
            print(example_batch)
    except tf.errors.OutOfRangeError:
        print('Done training, epoch reached')
    finally:
        coord.request_stop()

    coord.join(threads)
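(And for completeness, a minimal sketch of where actual training code would plug in - the linear model, learning rate, and epoch count here are purely made up for illustration:)
# Hypothetical training hookup (sketch only): build the model on the batch
# tensors *before* starting the queue runners, then run train_op in the loop.
examples, labels = input_pipeline(batch_size=100, num_epochs=10)
W = tf.Variable(tf.zeros([4, 1]))
b = tf.Variable(tf.zeros([1]))
prediction = tf.matmul(tf.to_float(examples), W) + b
loss = tf.reduce_mean(tf.square(prediction - tf.to_float(labels)))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    tf.initialize_all_variables().run()
    tf.initialize_local_variables().run()
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    try:
        while not coord.should_stop():
            _, loss_value = sess.run([train_op, loss])
    except tf.errors.OutOfRangeError:
        print('Done training')
    finally:
        coord.request_stop()
    coord.join(threads)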
Yar*_*tov 23
I think you are mixing up the imperative and graph-construction parts here. The operation tf.train.shuffle_batch creates a new queue node, and a single node can be used to process the entire dataset. So I think you were hanging because you created a bunch of shuffle_batch queues in your for loop and didn't start queue runners for them.
Normal input pipeline usage looks like this:
1. Add nodes like shuffle_batch to the input pipeline
2. (optional, to prevent unintentional graph modification) finalize the graph
--- end of graph construction, beginning of imperative programming ---
3. tf.start_queue_runners
4. while(True): session.run()
To be more scalable (to avoid the Python GIL), you could generate all of your data using the TensorFlow pipeline. However, if performance is not critical, you can hook up a numpy array to the input pipeline by using slice_input_producer. Here's an example with some Print nodes to see what's going on (messages from Print go to stdout when the node is run):
import numpy as np
import tensorflow as tf

tf.reset_default_graph()

num_examples = 5
num_features = 2
data = np.reshape(np.arange(num_examples * num_features), (num_examples, num_features))
print(data)

(data_node,) = tf.train.slice_input_producer([tf.constant(data)], num_epochs=1, shuffle=False)
data_node_debug = tf.Print(data_node, [data_node], "Dequeueing from data_node ")
data_batch = tf.train.batch([data_node_debug], batch_size=2)
data_batch_debug = tf.Print(data_batch, [data_batch], "Dequeueing from data_batch ")

sess = tf.InteractiveSession()
sess.run(tf.initialize_all_variables())
sess.run(tf.initialize_local_variables())  # the num_epochs counter lives in local variables
tf.get_default_graph().finalize()
tf.start_queue_runners()

try:
    while True:
        print(sess.run(data_batch_debug))
except tf.errors.OutOfRangeError as e:
    print("No more inputs.")
You should see something like this:
[[0 1]
[2 3]
[4 5]
[6 7]
[8 9]]
[[0 1]
[2 3]]
[[4 5]
[6 7]]
No more inputs.
"8,9"数字没有填满整批,所以它们没有生产.也tf.Print打印到sys.stdout,所以它们在终端中单独显示给我.
PS: a minimal example of connecting batch to a manually initialized queue is in GitHub issue 2193.
Also, for debugging purposes, you may want to set a timeout on your session so that your IPython notebook doesn't hang on empty queue dequeues. I use this helper function for my sessions:
def create_session():
    config = tf.ConfigProto(log_device_placement=True)
    config.gpu_options.per_process_gpu_memory_fraction = 0.3  # don't hog all vRAM
    config.operation_timeout_in_ms = 60000  # terminate on long hangs
    # create interactive session to register a default session
    sess = tf.InteractiveSession("", config=config)
    return sess
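(It's a drop-in replacement for tf.InteractiveSession(); a run() call that blocks longer than the timeout then errors out instead of hanging:)
sess = create_session()
print(sess.run(data_batch_debug))  # times out instead of hanging on an empty queue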
Scalability notes:
1. tf.constant inlines a copy of your data into the graph. There's a fundamental limit of 2GB on the size of the graph definition, so that's an upper limit on the size of your data.
2. You can get around that limit by using v = tf.Variable and saving the data into it by running v.assign_op with a tf.placeholder on the right-hand side and feeding a numpy array to the placeholder (feed_dict).
3. That still creates two copies of the data, so to save memory you could make your own version of slice_input_producer which operates on numpy arrays and uploads rows one at a time using feed_dict.
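(A rough sketch of the variable-plus-placeholder trick from point 2 - the names here are illustrative, reusing `data` from the example above:)
# Sketch of point 2: keep the data out of the GraphDef by assigning it into a
# variable through a placeholder (illustrative, not a fixed recipe).
data_placeholder = tf.placeholder(tf.int64, shape=(num_examples, num_features))
v = tf.Variable(tf.zeros((num_examples, num_features), dtype=tf.int64), trainable=False)
assign_op = v.assign(data_placeholder)

sess = tf.InteractiveSession()
sess.run(tf.initialize_all_variables())
sess.run(assign_op, feed_dict={data_placeholder: data})  # the array never enters the graph definition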
Nag*_*raj 13
Or you can try this. This code loads the Iris dataset into TensorFlow using pandas and numpy, and prints the output of a simple one-neuron layer in a session. Hope it helps with a basic understanding.... [I haven't added the one-hot decoding of labels].
import tensorflow as tf
import numpy
import pandas as pd

df = pd.read_csv('/home/nagarjun/Desktop/Iris.csv', usecols=[0, 1, 2, 3, 4], skiprows=[0], header=None)
d = df.values
l = pd.read_csv('/home/nagarjun/Desktop/Iris.csv', usecols=[5], header=None)
labels = l.values
data = numpy.float32(d)
labels = numpy.array(l, 'str')
# print(data, labels)

# tensorflow
x = tf.placeholder(tf.float32, shape=(150, 5))
x = data
w = tf.random_normal([100, 150], mean=0.0, stddev=1.0, dtype=tf.float32)
y = tf.nn.softmax(tf.matmul(w, x))

with tf.Session() as sess:
    print(sess.run(y))
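(To fill in the one-hot encoding that was left out, one rough way - assuming the labels array of species strings from the snippet above:)
# Sketch of the missing one-hot step: map each species string to an index,
# then compare against a range of class indices (assumes `labels` from above).
classes = numpy.unique(labels)  # e.g. the 3 Iris species
indices = numpy.array([numpy.where(classes == s)[0][0] for s in labels.ravel()])
one_hot = (numpy.arange(len(classes)) == indices[:, None]).astype(numpy.float32)
print(one_hot.shape)  # (150, 3)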