TensorFlow: How to use "tf.data" instead of "load_csv_without_header"?

Question

roi*_*hik 1 python pycharm deep-learning tensorflow tensorflow-datasets

Two years ago I wrote some TensorFlow code, and as part of the data loading I used the function "load_csv_without_header". Now, when I run the code, I get this message:

WARNING:tensorflow:From C:\Users\Roi\Desktop\Code_Win_Ver\code_files\Tensor_Flow\version1\build_database_tuple.py:124: load_csv_without_header (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data instead.


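How can I use 'tf.data' instead of the current function? My CSV has no header, so how can I get the same format with the same dtypes using tf.data? I am using TF version 1.8.0 on Python 3.5.

Thanks for your help!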

Answer 1

onl*_*tom 5

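Using `tf.data` to work with a `csv` file:

From TensorFlow's official documentation:

The tf.data module contains a collection of classes that allows you to easily load data, manipulate it, and pipe it into your model.

The tf.data.Dataset API is intended to be the new standard for interacting with data in TensorFlow. It represents "a sequence of elements, in which each element contains one or more Tensor objects". For a CSV, an element is simply one row, i.e. one training example, represented as a pair of tensors that correspond to the data (our x) and the label (the "target") respectively.

With this API, the main way to extract each row (or, more precisely, each element) of a TensorFlow dataset (tf.data.Dataset) is to use an iterator, and TensorFlow provides an API named tf.data.Iterator for that purpose. To return the next row, for example, we call get_next() on the iterator.

Now on to the code that turns the csv into a TensorFlow dataset.

Method 1: `tf.data.TextLineDataset()` and `tf.decode_csv()`

With more recent versions of TensorFlow's Estimator API, instead of load_csv_without_header you read the CSV with the more general tf.data.TextLineDataset(you_train_path). If the file has a header row you can chain skip() onto it to skip the first line, but in your case that is not necessary.

You can then use tf.decode_csv() to decode each line of the CSV into its respective fields.

Code solution: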

import tensorflow as tf
train_path = 'data_input/iris_training.csv'
# if no header, remove .skip()
trainset = tf.data.TextLineDataset(train_path).skip(1)

# Metadata describing the text columns
COLUMNS = ['SepalLength', 'SepalWidth',
           'PetalLength', 'PetalWidth',
           'label']
FIELD_DEFAULTS = [[0.0], [0.0], [0.0], [0.0], [0]]
def _parse_line(line):
    # Decode the line into its fields
    fields = tf.decode_csv(line, FIELD_DEFAULTS)

    # Pack the result into a dictionary
    features = dict(zip(COLUMNS,fields))

    # Separate the label from the features
    label = features.pop('label')

    return features, label

trainset = trainset.map(_parse_line)
print(trainset)


You will get:

<MapDataset shapes: ({
    SepalLength: (), 
    SepalWidth: (), 
    PetalLength: (), 
    PetalWidth: ()}, ()), 
types: ({
    SepalLength: tf.float32, 
    SepalWidth: tf.float32, 
    PetalLength: tf.float32, 
    PetalWidth: tf.float32}, tf.int32)>


You can verify the output classes:

({'PetalLength': tensorflow.python.framework.ops.Tensor,
  'PetalWidth': tensorflow.python.framework.ops.Tensor,
  'SepalLength': tensorflow.python.framework.ops.Tensor,
  'SepalWidth': tensorflow.python.framework.ops.Tensor},
 tensorflow.python.framework.ops.Tensor)


You can also step through the dataset by calling get_next on an iterator:

# The concrete values shown below assume eager execution is enabled
x = trainset.make_one_shot_iterator()
x.get_next()
# Output:
({'PetalLength': <tf.Tensor: id=165, shape=(), dtype=float32, numpy=1.3>,
  'PetalWidth': <tf.Tensor: id=166, shape=(), dtype=float32, numpy=0.2>,
  'SepalLength': <tf.Tensor: id=167, shape=(), dtype=float32, numpy=4.4>,
  'SepalWidth': <tf.Tensor: id=168, shape=(), dtype=float32, numpy=3.2>},
 <tf.Tensor: id=169, shape=(), dtype=int32, numpy=0>)

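Since the file in the question has no header, here is a minimal sketch of Method 1 adapted to that case. It assumes TensorFlow 2.x rather than the 1.8 used in the question: there, `tf.decode_csv` is exposed as `tf.io.decode_csv`, and a dataset can be iterated directly instead of going through `make_one_shot_iterator()`. The sample rows, the file name and `parse_line` are made up for illustration. The four features are stacked into one vector so that each element is a (feature vector, label) pair.

```python
import os
import tempfile

import tensorflow as tf

# Made-up rows in the Iris layout: four float features, then an integer label.
ROWS = "6.4,2.8,5.6,2.2,2\n5.0,2.3,3.3,1.0,1\n"

# One default per column; the dtype of each default sets the dtype of that field.
RECORD_DEFAULTS = [[0.0], [0.0], [0.0], [0.0], [0]]


def parse_line(line):
    """Decode one CSV line into a (features, label) pair."""
    fields = tf.io.decode_csv(line, RECORD_DEFAULTS)
    features = tf.stack(fields[:-1])  # float32 vector with 4 entries
    label = fields[-1]                # int32 scalar
    return features, label


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "iris_no_header.csv")  # hypothetical file name
    with open(path, "w") as f:
        f.write(ROWS)

    # No .skip(1) here because the file has no header row.
    dataset = tf.data.TextLineDataset(path).map(parse_line)

    # Pull everything into plain Python values while the file still exists.
    parsed = [(x.numpy().tolist(), int(y.numpy())) for x, y in dataset]

for features, label in parsed:
    print(features, label)
```

To get different dtypes, change the defaults: for example `[0]` for an int32 column or `[0.0]` for a float32 one.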

Method 2: `from_tensor_slices()` to construct a dataset object from numpy or pandas

train, test = tf.keras.datasets.mnist.load_data()
mnist_x, mnist_y = train

mnist_ds = tf.data.Dataset.from_tensor_slices(mnist_x)
print(mnist_ds)
# returns: <TensorSliceDataset shapes: (28,28), types: tf.uint8>


Another (more detailed) example:

import numpy as np
import pandas as pd
import tensorflow as tf

california_housing_dataframe = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",")
# Define the input feature: total_rooms
my_feature = california_housing_dataframe[["total_rooms"]]

# Configure a numeric feature column for total_rooms
feature_columns = [tf.feature_column.numeric_column("total_rooms")]

# Define the label
targets = california_housing_dataframe["median_house_value"]

# Convert pandas data into a dict of np arrays.
features = {key: np.array(value) for key, value in dict(my_feature).items()}

# Construct a dataset of (features, target) pairs;
# batching/repeating can be chained onto it afterwards.
ds = tf.data.Dataset.from_tensor_slices((features, targets))

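To keep the same format and dtypes as before, one option is to read the header-less CSV into NumPy arrays first and then hand them to `from_tensor_slices()`. The sketch below assumes, as far as I recall, that the old helper returned a `data` array and a `target` array and took `target_dtype`, `features_dtype` and `target_column` arguments. `load_headerless_csv` and `to_dataset` are hypothetical names, not TensorFlow functions, and the two sample rows are made up.

```python
import csv
import io

import numpy as np


def load_headerless_csv(csv_file, target_dtype, features_dtype, target_column=-1):
    """Read a header-less CSV into (features, target) NumPy arrays.

    Hypothetical helper: the argument names mirror the old contrib function,
    but this is plain Python and NumPy, not TensorFlow code.
    """
    features, target = [], []
    for row in csv.reader(csv_file):
        if not row:  # ignore blank lines
            continue
        target.append(row.pop(target_column))  # pull the label out of the row
        features.append(row)                   # the remaining cells are features
    return (np.asarray(features, dtype=features_dtype),
            np.asarray(target, dtype=target_dtype))


def to_dataset(features, target):
    """Wrap the arrays in a tf.data.Dataset of (features, target) pairs."""
    import tensorflow as tf  # imported here so the loader above works without TensorFlow
    return tf.data.Dataset.from_tensor_slices((features, target))


# Two made-up rows in the Iris layout, standing in for a real file.
sample = io.StringIO("6.4,2.8,5.6,2.2,2\n5.0,2.3,3.3,1.0,1\n")
features, target = load_headerless_csv(sample, target_dtype=np.int32,
                                       features_dtype=np.float32)
print(features.shape, features.dtype)  # (2, 4) float32
print(target.shape, target.dtype)      # (2,) int32
```

With a real file, pass `open(path, newline="")` instead of the `StringIO` object; `to_dataset(features, target)` then gives the same kind of dataset as the `from_tensor_slices()` examples above.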

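I also strongly recommend this article and this one, both from the official documentation; it is safe to say that they cover most, if not all, use cases and will help you migrate away from the deprecated load_csv_without_header() function.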
