Parallelism isn't reducing the time in dataset map

Question

Kra*_*mar 9 tensorflow tensorflow-datasets

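TF's Dataset map function supports parallel calls, but I see no improvement from passing num_parallel_calls to map. The run time is no better with num_parallel_calls=10 than with num_parallel_calls=1. Here is a simple piece of code: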

import time

import tensorflow as tf  # TF 1.x API (tf.py_func, tf.Session)

# NOTE: squarer() is a user-defined Python function; its definition is not shown here.

def test_two_custom_function_parallelism(num_parallel_calls=1, batch=False,
                                         batch_size=1, repeat=1, num_iterations=10):
    tf.reset_default_graph()
    start = time.time()
    dataset_x = tf.data.Dataset.range(1000).map(
        lambda x: tf.py_func(squarer, [x], [tf.int64]),
        num_parallel_calls=num_parallel_calls).repeat(repeat)
    if batch:
        dataset_x = dataset_x.batch(batch_size)
    dataset_y = tf.data.Dataset.range(1000).map(
        lambda x: tf.py_func(squarer, [x], [tf.int64]),
        num_parallel_calls=num_parallel_calls).repeat(repeat)
    if batch:
        # Fixed: batch dataset_y (the original batched dataset_x a second time).
        dataset_y = dataset_y.batch(batch_size)
    # Fixed: create the iterators outside the `if` so X and Y exist when batch=False,
    # and draw Y from dataset_y rather than dataset_x.
    X = dataset_x.make_one_shot_iterator().get_next()
    Y = dataset_y.make_one_shot_iterator().get_next()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        i = 0
        while True:
            try:
                res = sess.run([X, Y])
                i += 1
                if i == num_iterations:
                    break
            except tf.errors.OutOfRangeError:
                # Fixed: stop once the data is exhausted instead of looping forever.
                break

Here are the timings:

%timeit test_two_custom_function_parallelism(num_iterations=1000, 
 num_parallel_calls=2, batch_size=2, batch=True)
370ms

%timeit test_two_custom_function_parallelism(num_iterations=1000, 
 num_parallel_calls=5, batch_size=2, batch=True)
372ms

%timeit test_two_custom_function_parallelism(num_iterations=1000, 
 num_parallel_calls=10, batch_size=2, batch=True)
384ms

I used %timeit in a Jupyter notebook. What am I doing wrong?

Answer 1

mrr*_*rry 22

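The problem here is that the only operation in the Dataset.map() function is a tf.py_func() op. This op calls back into the local Python interpreter to run a function in the same process. Increasing num_parallel_calls will increase the number of TensorFlow threads that attempt to call back into Python concurrently. However, Python has something called the "Global Interpreter Lock" (GIL) that prevents multiple threads from executing Python code at once. As a result, all but one of these parallel calls will be blocked waiting to acquire the GIL, and there will be almost no parallel speedup (and perhaps even a slight slowdown).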
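The effect of the lock can be seen without TensorFlow at all. The following is a minimal sketch, assuming a standard CPython build with the GIL enabled; squarer_sum() is a hypothetical CPU-bound stand-in, not the asker's squarer(). It runs the same work four times serially and then on four threads. On such a build the two timings are typically close, although the exact numbers depend on the machine.

```python
import threading
import time


def squarer_sum(n):
    # Hypothetical CPU-bound stand-in: pure Python arithmetic that holds the GIL.
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_in_threads(num_threads, n):
    # Run squarer_sum(n) once per thread and return (elapsed seconds, results).
    results = [None] * num_threads

    def worker(idx):
        results[idx] = squarer_sum(n)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start, results


N = 200_000

# Same amount of work, one call after another on the main thread.
serial_start = time.perf_counter()
serial_results = [squarer_sum(N) for _ in range(4)]
serial_elapsed = time.perf_counter() - serial_start

# Same amount of work, spread over four threads that must share the GIL.
threaded_elapsed, threaded_results = run_in_threads(4, N)

print("serial:   %.3fs" % serial_elapsed)
print("threaded: %.3fs" % threaded_elapsed)
```

The threads produce the same results as the serial loop; only the wall-clock time is of interest here, and it is what num_parallel_calls cannot improve when every call ends up in Python.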

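Your code example doesn't include the definition of the squarer() function, but it may be possible to replace tf.py_func() with pure TensorFlow ops, which are implemented in C++ and can execute in parallel. For example (just guessing by the name) you could replace it with a call to tf.square(x), and you might then enjoy some parallel speedup.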
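A minimal sketch of that replacement, assuming squarer() really does just square its input. The helper name first_squares() is made up for illustration, and the tf.compat.v1 calls are used only so the same snippet runs under both TF 1.x (1.13 or later) and TF 2.x; in the question's TF 1.x code the equivalent calls are make_one_shot_iterator() and tf.Session().

```python
import tensorflow as tf


def first_squares(num_parallel_calls, count):
    # Build the pipeline in an explicit graph so it runs in graph mode on TF 1.x and 2.x.
    with tf.Graph().as_default():
        dataset = tf.data.Dataset.range(1000).map(
            tf.square,  # native TensorFlow op: no callback into the Python interpreter
            num_parallel_calls=num_parallel_calls)
        next_item = tf.compat.v1.data.make_one_shot_iterator(dataset).get_next()
        with tf.compat.v1.Session() as sess:
            return [int(sess.run(next_item)) for _ in range(count)]


squares = first_squares(num_parallel_calls=4, count=5)
print(squares)
```

Because the mapped function no longer touches Python, the parallel calls are not serialized on the interpreter lock. Whether that gives a measurable speedup still depends on how much work each element needs.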

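Note, however, that if the amount of work in the function is small, like squaring a single integer, the speedup might not be very large. Parallel Dataset.map() is more useful for heavier operations, like parsing a TFRecord with tf.parse_single_example() or performing some image distortions as part of a data augmentation pipeline.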
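As an illustration of the heavier kind of work mentioned above, here is a rough sketch of parsing serialized tf.train.Example records inside a parallel map. The record layout (a single int64 feature named "x") and the helper names are assumptions made up for this example, and it uses tf.io.parse_single_example, the name under which newer TensorFlow releases expose tf.parse_single_example.

```python
import tensorflow as tf


def make_serialized_example(value):
    # Hypothetical record layout: one int64 feature named "x".
    feature = tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
    example = tf.train.Example(features=tf.train.Features(feature={"x": feature}))
    return example.SerializeToString()


def parse_records(serialized, num_parallel_calls):
    # Parse every record with a native TensorFlow op inside Dataset.map().
    with tf.Graph().as_default():
        spec = {"x": tf.io.FixedLenFeature([], tf.int64)}
        dataset = tf.data.Dataset.from_tensor_slices(serialized).map(
            lambda record: tf.io.parse_single_example(record, spec),
            num_parallel_calls=num_parallel_calls)
        next_item = tf.compat.v1.data.make_one_shot_iterator(dataset).get_next()
        with tf.compat.v1.Session() as sess:
            return [int(sess.run(next_item)["x"]) for _ in range(len(serialized))]


records = [make_serialized_example(v) for v in range(8)]
parsed = parse_records(records, num_parallel_calls=4)
print(parsed)
```

With real records that carry images or many features, each parse costs far more than squaring an integer, which is where num_parallel_calls is more likely to pay off.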

Archived:	7 years, 7 months ago
Viewed:	1982 times
Last active:	5 years, 10 months ago