Tag: tensorflow-datasets

TensorFlow Dataset.from_tensor_slices takes too long

I have the following code:

data = np.load("data.npy")
print(data) # Makes sure the array gets loaded in memory
dataset = tf.contrib.data.Dataset.from_tensor_slices((data))

The file "data.npy" is 3.3 GB. Reading it with NumPy takes a few seconds, but the next line, which creates the TensorFlow dataset object, takes a very long time to execute. Why is that? What is it doing under the hood?

python numpy tensorflow tensorflow-datasets

nik*_*iko

2017-10-21

7 votes

1 answer

3498 views

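TensorFlow custom Estimator: restoring a model after a small change in model_fn

I am using tf.estimator.Estimator to develop my model.

I wrote a model_fn and trained it for 50,000 iterations. Now I want to make a small change to my model_fn, for example adding a new layer.

I don't want to train from scratch. I want to restore all the old variables from the 50,000-iteration checkpoint and continue training from there. When I try to do this, I get a NotFoundError.

How can this be done with tf.estimator.Estimator?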

tensorflow tensorflow-datasets tensorflow-estimator

mtn*_*gld

2018-01-05

7 votes

1 answer

1933 views

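TensorFlow: reading video frames from TFRecords files

TL;DR: my question is how to load compressed video frames from TFRecords.

I am building a data pipeline for training deep learning models on a large video dataset (Kinetics). For this I am using TensorFlow, more specifically the tf.data.Dataset and TFRecordDataset structures. The dataset contains roughly 300,000 videos of 10 seconds each, so there is a large amount of data to deal with. During training I want to randomly sample 64 consecutive frames from a video, so fast random sampling is important. To achieve this, several data loading scenarios are possible during training:

1. Sample from video. Load the videos using ffmpeg or OpenCV and sample frames. Not ideal, because seeking in videos is tricky and decoding video streams is much slower than decoding JPG.
2. JPG images. Preprocess the dataset by extracting all video frames as JPG. This generates a huge number of files, which is probably not going to be fast because of random access.
3. Data containers. Preprocess the dataset into TFRecords or HDF5 files. This requires more work to get the pipeline ready, but is most likely the fastest of these options.

I have decided to go for option (3) and use TFRecord files to store the preprocessed version of the dataset. However, this is not as straightforward as it seems, for example:

Compression. Storing the video frames as uncompressed byte data in the TFRecords would require a huge amount of disk space. Therefore, I extract all the video frames, apply JPG compression, and store the compressed bytes as TFRecords.
Video data. We are dealing with video, so each example in the TFRecords file will be quite large and contain many video frames (typically 250-300 for a 10-second video, depending on the frame rate).

I have written the following code to preprocess the video dataset and write the video frames to TFRecord files (each about 5 GB in size):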

def _int64_feature(value):
    """Wrapper for inserting int64 features into Example proto."""
    if not isinstance(value, list):
        value = [value]
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

def _bytes_feature(value):
    """Wrapper for inserting bytes features into Example proto."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


with tf.python_io.TFRecordWriter(output_file) as writer:

  # Read and resize all video frames, np.uint8 of size [N,H,W,3]
  frames = ... …
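
The sampling requirement described above (64 consecutive frames drawn at random from a video of 250-300 frames) can be sketched independently of the storage format. This is a minimal illustration in plain Python, not the asker's code: the function name sample_clip_indices and the 250-frame example are assumptions made for the sketch. If the frames are stored as individually JPG-compressed byte strings, only the frames at the returned indices would need to be decoded.

```python
import random


def sample_clip_indices(num_frames, clip_length=64, rng=random):
    """Return the indices of `clip_length` consecutive frames, start chosen uniformly.

    Hypothetical helper for illustration only.
    """
    if num_frames < clip_length:
        raise ValueError("video has fewer frames than the requested clip length")
    # Valid start offsets are 0 .. num_frames - clip_length (both inclusive).
    start = rng.randint(0, num_frames - clip_length)
    return list(range(start, start + clip_length))


# Assumed example: a 10-second video at 25 fps has 250 frames.
indices = sample_clip_indices(250)
```

Passing a seeded random.Random instance as rng makes the sampling reproducible, which can help when debugging a pipeline.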


python deep-learning tensorflow tfrecord tensorflow-datasets

ver*_*man

2018-02-17

7 votes

2 answers

4039 views

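Why does the TensorFlow Estimator API take inputs as a lambda?

The tf.estimator API takes "input functions" that return Datasets. For example, Estimator.train() takes an input_fn (documentation).

In the examples I have seen, whenever this function is supplied manually, it is an argument-less lambda.

Doesn't that mean the function always returns the same value? Or is it called multiple times with no arguments? I can't find documentation on this. Why don't functions like train() simply take a Dataset explicitly?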
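
A zero-argument callable does not have to return the same value every time: it can capture configuration from the enclosing scope and build a fresh object whenever the caller decides to invoke it. The sketch below is a plain-Python analogy of that pattern, not the Estimator implementation; make_batches and train are hypothetical names introduced only for illustration.

```python
def make_batches(values, batch_size):
    """Hypothetical stand-in for code that builds a fresh input pipeline."""
    return [values[i:i + batch_size] for i in range(0, len(values), batch_size)]


def train(input_fn):
    """Toy consumer: it calls input_fn itself, with no arguments, when it is ready."""
    batches = input_fn()
    return sum(len(batch) for batch in batches)


values = list(range(10))
# The lambda takes no arguments; `values` and `batch_size` are captured
# from the enclosing scope at the point where the lambda is written.
seen = train(lambda: make_batches(values, batch_size=4))
```

The caller of train() keeps full control over the parameters, while train() controls when, and in which context, the pipeline is actually constructed.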

python tensorflow tensorflow-datasets tensorflow-estimator

jkf*_*kff

lucky-day

7 votes

1 answer

626 views

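Tensorflow/models uses COCO 90 class IDs although COCO has only 80 categories

The labelmaps of TensorFlow's object_detection project contain 90 classes, although COCO has only 80 categories. Therefore the num_classes parameter in all the sample configs is set to 90.

If I now download and use the COCO 2017 dataset, do I need to set this parameter to 80 or leave it at 90?

If 80 (since COCO has 80 classes), I need to adjust the labelmap, so the standard mscoco_label_map.pbtxt is not correct, right?

I would be really thankful if someone could shed some light on this :)

Here are the standard 80 COCO classes: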

person
bicycle
car
motorbike
aeroplane
bus
train
truck
boat
traffic light
fire hydrant
stop sign
parking meter
bench
bird
cat
dog
horse
sheep
cow
elephant
bear
zebra
giraffe
backpack
umbrella
handbag
tie
suitcase
frisbee
skis
snowboard
sports ball
kite
baseball bat
baseball glove
skateboard
surfboard
tennis racket
bottle
wine glass …
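
The 80-versus-90 mismatch can be illustrated with a small sketch. It assumes that the COCO category IDs run from 1 to 90 with ten IDs left unused, and that the unused IDs are the ones listed in UNUSED_IDS; that set is an assumption to verify against the "categories" entries of the annotation file actually in use. The names UNUSED_IDS, used_ids and id_to_index are introduced here only for illustration.

```python
# Assumption: the ten IDs in 1..90 that do not appear in the COCO 2017 annotations.
UNUSED_IDS = {12, 26, 29, 30, 45, 66, 68, 69, 71, 83}

# Sparse IDs that actually occur in the annotations.
used_ids = [cat_id for cat_id in range(1, 91) if cat_id not in UNUSED_IDS]

# Map each sparse COCO ID to a contiguous index 0..79, as a model with
# exactly 80 output classes would need.
id_to_index = {cat_id: index for index, cat_id in enumerate(used_ids)}
```

Under this assumption, a label map that keeps the original sparse IDs needs room for the largest ID (90), whereas a remapped label map needs only 80 entries.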


tensorflow tfrecord tensorflow-datasets

gus*_*avz

2018-06-05

7 votes

1 answer

6653 views

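TensorFlow DataSet `from_generator` with variable batch size

I'm trying to use the TensorFlow Dataset API to read an HDF5 file, using the from_generator method. Everything works fine unless the batch size does not divide evenly into the number of events. I don't quite see how to make a flexible batch using the API.

If things don't divide evenly, you get errors like: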

2018-08-31 13:47:34.274303: W tensorflow/core/framework/op_kernel.cc:1263] Invalid argument: ValueError: `generator` yielded an element of shape (1, 28, 28, 1) where an element of shape (11, 28, 28, 1) was expected.
Traceback (most recent call last):

  File "/Users/perdue/miniconda3/envs/py3a/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py", line 206, in __call__
    ret = func(*args)

  File "/Users/perdue/miniconda3/envs/py3a/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 452, in generator_py_func
    "of shape %s was expected." % (ret_array.shape, expected_shape))

ValueError: `generator` yielded an element of shape (1, 28, 28, 1) where …
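
The shapes in this error message are what a generator produces when it yields whole batches and the number of events is not a multiple of the batch size. The sketch below reproduces only the shape arithmetic in plain Python; the function name batch_shapes and the figure of 23 events are assumptions chosen to match the (11, 28, 28, 1) and (1, 28, 28, 1) shapes in the log.

```python
def batch_shapes(num_events, batch_size, event_shape=(28, 28, 1)):
    """Yield the shape of each batch that a simple batching generator would produce."""
    for start in range(0, num_events, batch_size):
        # The final batch holds only the leftover events.
        size = min(batch_size, num_events - start)
        yield (size,) + event_shape


# Assumed example: 23 events with a batch size of 11 leave one event over.
shapes = list(batch_shapes(num_events=23, batch_size=11))
```

A fixed output shape of (11, 28, 28, 1) cannot describe the last element here. from_generator accepts shapes with unknown dimensions, so declaring the leading dimension as unknown, for example tf.TensorShape([None, 28, 28, 1]), is one way to admit a shorter final batch.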


python tensorflow tensorflow-datasets

Gab*_*due

lucky-day

7 votes

1 answer

6127 views

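Parallelize tf.from_generator using tf.contrib.data.parallel_interleave

I have a bunch of JSON array files (AVRO, to be precise), and each of them yields multiple samples for training a Keras model. Using ideas from @GPhilo and @jsimsa, I was able to come up with this to parallelize my input pipeline. I am unable to figure out how to design generator(n) to divide the work of processing files. The code fails inside parse_file(f), because the function expects a string file path and not a Tensor: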

N = num_cores = 2
files_to_process = ["f1.avro", "f2.avro", "f3.avro"]
shuffle_size = prefetch_buffer = 1000
batch_size = 512

def generator(n):
    size = math.ceil(len(files_to_process) / N)
    start_index = n * size
    end_index = start_index + size

    def gen():
        # for f in files_to_process[start_index:end_index]:
        for f in tf.slice(files_to_process, start_index, size):
            yield f

    return gen

def dataset(n):
    return tf.data.Dataset.from_generator(generator(n), (tf.string,))

def process_file(f):
    examples_x, examples_y = parse_file(f) …
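
The division of work that generator(n) is meant to perform can be sketched with ordinary list slicing, as in the commented-out line of the snippet above. Slicing a Python list yields Python strings, whereas tf.slice yields a Tensor. This is a minimal sketch under that reading of the problem; the helper name shard is hypothetical.

```python
import math


def shard(files, num_shards, n):
    """Return the n-th contiguous slice of `files`, using plain Python slicing."""
    size = math.ceil(len(files) / num_shards)
    return files[n * size:(n + 1) * size]


files_to_process = ["f1.avro", "f2.avro", "f3.avro"]
# Two workers: the first gets two files, the second gets the remaining one.
shards = [shard(files_to_process, 2, n) for n in range(2)]
```

Each element of a shard is still a str, so a function that expects a file path can consume it directly.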


python keras tensorflow tensorflow-datasets

Nit*_*tin

2018-09-05

7 votes

1 answer

1974 views

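tf.data memory leak

I'm creating a tf.data.Dataset inside a for loop, and I noticed that the memory is not freed after each iteration as one would expect.

Is there a way to request that TensorFlow free the memory?

I tried using tf.reset_default_graph() and I tried calling del on the relevant Python objects, but this does not work.

The only thing that seems to work is gc.collect(). Unfortunately, gc.collect() does not work on some more complex examples.

Fully reproducible code: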

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import psutil
%matplotlib inline

memory_used = []
for i in range(500):
    data = tf.data.Dataset.from_tensor_slices(
                    np.random.uniform(size=(10, 500, 500)))\
                    .prefetch(64)\
                    .repeat(-1)\
                    .batch(3)
    data_it = data.make_initializable_iterator()
    next_element = data_it.get_next()

    with tf.Session() as sess:
        sess.run(data_it.initializer)
        sess.run(next_element)
    memory_used.append(psutil.virtual_memory().used / 2 ** 30)
    tf.reset_default_graph()

plt.plot(memory_used)
plt.title('Evolution of memory')
plt.xlabel('iteration')
plt.ylabel('memory used (GB)')


python memory-leaks tensorflow tensorflow-datasets

BiB*_*iBi

2019-03-18

7 votes

1 answer

4650 views

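TensorFlow model.fit() using a Dataset generator

I am using the Dataset API to generate training data and sort it into batches for a NN.

Here is a minimal working example of my code: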

import tensorflow as tf
import numpy as np
import random


def my_generator():
    while True:
        x = np.random.rand(4, 20)
        y = random.randint(0, 11)
        label = tf.one_hot(y, depth=12)
        yield x.reshape(4, 20, 1), label

def my_input_fn():
    dataset = tf.data.Dataset.from_generator(lambda: my_generator(),
                                             output_types=(tf.float64, tf.int32))

    dataset = dataset.batch(32)
    iterator = dataset.make_one_shot_iterator()
    batch_features, batch_labels = iterator.get_next()

    return batch_features, batch_labels


if __name__ == "__main__":
    tf.enable_eager_execution()

    model = tf.keras.Sequential([tf.keras.layers.Flatten(input_shape=(4, 20, 1)),
                                 tf.keras.layers.Dense(128, activation=tf.nn.relu),
                                 tf.keras.layers.Dense(12, activation=tf.nn.softmax)])

    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    data_generator = my_input_fn()
    model.fit(data_generator)

The code …

python tensorflow tensorflow-datasets tf.keras tensorflow2.0

ber*_*lem

2019-03-27

7 votes

1 answer

8380 views

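How to convert a "tensor" to a "numpy" array in TensorFlow?

I am trying to convert a tensor to numpy in TensorFlow 2.0. Since TF 2.0 has eager execution enabled, it should work by default, and it does work in a normal runtime. When I execute the code inside the tf.data.Dataset API, it gives an error:

"AttributeError: 'Tensor' object has no attribute 'numpy'"

I have tried ".numpy()" after the TensorFlow variable, and for ".eval()" I am unable to get a default session.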

from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
# tf.executing_eagerly()
import os
import time
import matplotlib.pyplot as plt
from IPython.display import clear_output
from model.utils import  get_noise
import cv2


def random_noise(input_image):
  img_out = get_noise(input_image)
  return img_out


def load_denoising(image_file):
  image = tf.io.read_file(image_file)
  image = tf.image.decode_png(image)
  real_image = image
  input_image = random_noise(image.numpy())
  input_image = tf.cast(input_image, tf.float32)
  real_image = tf.cast(real_image, tf.float32)
  return input_image, real_image


def …


python tensorflow tensorflow-datasets tensorflow2.0

vik*_*ena

2019-05-10

7 votes

1 answer

9329 views