Tensorflow Object Detection API killed - OOM. How to reduce shuffle buffer size?

Question

dpa*_*don 7 python tensorflow tfrecord object-detection-api

System information

OS platform and distribution: CentOS 7.5.1804
TensorFlow installed from: pip install tensorflow-gpu
TensorFlow version: tensorflow-gpu 1.8.0
CUDA/cuDNN version: 9.0/7.1.2
GPU model and memory: GeForce GTX 1080 Ti, 11264MB
Exact command to reproduce:

python train.py --logtostderr --train_dir=./models/train --pipeline_config_path=mask_rcnn_inception_v2_coco.config

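Describe the problem

I am trying to train a Mask-RCNN model on my own dataset (fine-tuning from a model trained on COCO), but the process is killed as soon as the shuffle buffer is filled.

Before that happens, nvidia-smi shows memory usage of about 10669MB/11175MB, but GPU utilization of only 1%.

I have tried adjusting the following train_config settings: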

batch_size: 1    
batch_queue_capacity: 10    
num_batch_queue_threads: 4    
prefetch_queue_capacity: 5

And for train_input_reader:

num_readers: 1
queue_capacity: 10
min_after_dequeue: 5

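I believe my issue is similar to TensorFlow Object Detection API - Out of Memory, but I am using a GPU rather than a CPU.

The images I am training on are fairly large (2048*2048), but I would like to avoid downsizing them because the objects to be detected are very small. My training set consists of 400 images (in a .tfrecord file).

Is there a way to reduce the size of the shuffle buffer, to see whether that reduces the memory requirement?

Traceback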

INFO:tensorflow:Restoring parameters from ./models/train/model.ckpt-0
INFO:tensorflow:Restoring parameters from ./models/train/model.ckpt-0
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path ./models/train/model.ckpt
INFO:tensorflow:Saving checkpoint to path ./models/train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:global_step/sec: 0
2018-06-19 12:21:33.487840: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:94] Filling up shuffle buffer (this may take a while): 97 of 2048
2018-06-19 12:21:43.547326: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:94] Filling up shuffle buffer (this may take a while): 231 of 2048
2018-06-19 12:21:53.470634: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:94] Filling up shuffle buffer (this may take a while): 381 of 2048
2018-06-19 12:21:57.030494: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:129] Shuffle buffer filled.
Killed

Answer 1

小智 5

You can try the following steps:

1. Set batch_size=1 (or try your own value)

2. Change the default value: optional uint32 shuffle_buffer_size = 11 [default = 256] (or try your own value). The code is here:

models/research/object_detection/protos/input_reader.proto

Line 40 in ce03903

 optional uint32 shuffle_buffer_size = 11 [default = 2048];

The original setting is:

optional uint32 shuffle_buffer_size = 11 [default = 2048];

The default value is 2048, which is too large for batch_size=1 and should be reduced accordingly. I think it consumes a lot of RAM.
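
As a rough sanity check of the RAM concern, the sketch below estimates how much host memory a shuffle buffer would hold if every element were a decoded 2048*2048 RGB image stored as uint8. That is an assumption made only for illustration: depending on where the shuffle sits in the input pipeline, the buffer may instead hold serialized records (smaller) or also carry instance masks (larger). The function name is hypothetical.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def estimate_shuffle_buffer_bytes(buffer_size, height, width,
                                  channels=3, bytes_per_value=1):
    """Upper-bound estimate of memory held by a buffer of decoded images.

    Assumes every buffered element is one dense image of the given shape;
    this is an illustrative assumption, not a measurement.
    """
    bytes_per_image = height * width * channels * bytes_per_value
    return buffer_size * bytes_per_image


# Compare the proto default (2048) with the smaller value suggested above (256).
for size in (2048, 256):
    total = estimate_shuffle_buffer_bytes(size, 2048, 2048)
    print(f"shuffle_buffer_size={size}: ~{total / GIB:.2f} GiB")
```

Under these assumptions a full buffer of 2048 elements is far beyond typical host RAM, while 256 elements is much more modest. A bare "Killed" message like the one in the question usually means the process was terminated for exhausting host memory rather than GPU memory.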

3. Recompile the Protobuf libraries

From tensorflow/models/research/

protoc object_detection/protos/*.proto --python_out=.
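
A possible alternative to editing the proto default and recompiling: because shuffle_buffer_size is declared as a field in input_reader.proto, it should be possible to set it directly inside the train_input_reader block of the pipeline config. Treat that as an assumption to verify against your checkout. The sketch below is a plain-text helper (hypothetical name, standard library only) that adds or updates that line; the example paths are placeholders, and it assumes no braces appear inside quoted strings in the block.

```python
import re


def set_shuffle_buffer_size(config_text, new_size):
    """Return config_text with shuffle_buffer_size set in train_input_reader."""
    match = re.search(r"train_input_reader\s*:?\s*\{", config_text)
    if match is None:
        raise ValueError("no train_input_reader block found")

    # Walk forward to the closing brace that matches the block's opening brace.
    depth, pos = 1, match.end()
    while pos < len(config_text) and depth > 0:
        if config_text[pos] == "{":
            depth += 1
        elif config_text[pos] == "}":
            depth -= 1
        pos += 1
    if depth != 0:
        raise ValueError("unbalanced braces in train_input_reader block")

    body_start, body_end = match.end(), pos - 1
    body = config_text[body_start:body_end]
    setting = f"shuffle_buffer_size: {new_size}"
    field = re.compile(r"shuffle_buffer_size\s*:\s*\d+")
    if field.search(body):
        # Update the existing value in place.
        body = field.sub(setting, body, count=1)
    else:
        # Add the field right after the opening brace.
        body = f"\n  {setting}" + body
    return config_text[:body_start] + body + config_text[body_end:]


# Placeholder config fragment; the paths are made up for the example.
example = """train_input_reader {
  label_map_path: "label_map.pbtxt"
  tf_record_input_reader {
    input_path: "train.record"
  }
}
"""
print(set_shuffle_buffer_size(example, 256))
```

If the field is accepted by your version of the API, this keeps the change local to one training run instead of altering the library default for every pipeline.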

