eLi*_*lie 6 classification neural-network python-3.x tensorflow google-cloud-tpu
I tried to implement an Estimator-based TensorFlow model using the TPUEstimator API, and it fails. During training it hits this error:
InvalidArgumentError (see above for traceback): No OpKernel was registered to support Op 'CrossReplicaSum' with these attrs. Registered devices: [CPU], Registered kernels: <no registered kernels>
[[Node: CrossReplicaSum_5 = CrossReplicaSum[T=DT_FLOAT](gradients/dense_2/BiasAdd_grad/tuple/control_dependency_1)]]
There is also a warning at the start, but I'm not sure whether it is related:
WARNING:tensorflow:CrossShardOptimizer should be used within a tpu_shard_context, but got unset number_of_shards. Assuming 1.
Here is the relevant part of the model function:
def model_fn(features, labels, mode, params):
    """A simple NN with two hidden layers of 10 nodes each."""
    input_layer = tf.feature_column.input_layer(features, params['feature_columns'])
    dense1 = tf.layers.dense(inputs=input_layer, units=10, activation=tf.nn.relu, kernel_initializer=tf.glorot_uniform_initializer())
    dense2 = tf.layers.dense(inputs=dense1, units=10, activation=tf.nn.relu, kernel_initializer=tf.glorot_uniform_initializer())
    logits = tf.layers.dense(inputs=dense2, units=4)
    reshaped_logits = tf.reshape(logits, [-1, 1, 4])
    onehot_labels = tf.one_hot(indices=tf.cast(labels, tf.int32), depth=4)
    loss = tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels, logits=reshaped_logits)
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.contrib.tpu.CrossShardOptimizer(tf.train.AdagradOptimizer(learning_rate=0.05))
        train_op = optimizer.minimize(
            loss=loss,
            global_step=tf.train.get_global_step())
I am attempting a local CPU run of the TPUEstimator by setting the --use_tpu flag to False. The TPUEstimator is instantiated and train is called like so:
estimator_classifier = tf.contrib.tpu.TPUEstimator(
    model_fn=model_fn,
    model_dir="/tmp/estimator_classifier_logs",
    config=tf.contrib.tpu.RunConfig(
        session_config=tf.ConfigProto(
            allow_soft_placement=True, log_device_placement=True),
        tpu_config=tf.contrib.tpu.TPUConfig()
    ),
    train_batch_size=DEFAULT_BATCH_SIZE,
    use_tpu=False,
    params={
        'feature_columns': feature_columns
    }
)
tensors_to_log = {"probabilities": "softmax_tensor"}
logging_hook = tf.train.LoggingTensorHook(tensors=tensors_to_log, every_n_iter=50)
estimator_classifier.train(
    input_fn=data_factory.make_tpu_train_input_fn(train_x, train_y, DEFAULT_BATCH_SIZE),
    steps=DEFAULT_STEPS,
    hooks=[logging_hook]
)
What does this error mean, and how do I troubleshoot it?
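The context is unclear.
Is your job running in a Cloud TPU environment, or in some other environment that has TPU hardware?
If not, this is expected. TPUEstimator is designed mainly for the Cloud TPU environment, where the backend workers have all the kernels properly linked into the TensorFlow server. CrossReplicaSum is one of the kernels registered for the TPU device, not for the CPU.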
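For a local CPU run, one common pattern is to apply `CrossShardOptimizer` only when the model actually runs on a TPU, since that wrapper is what adds the `CrossReplicaSum` op to the gradients. Below is a minimal sketch of the pattern, written without a TensorFlow import so that it runs anywhere. The helper name `build_optimizer`, the `fake_wrapper` stand-in, and the `params['use_tpu']` entry are assumptions for illustration and are not part of the question's code.

```python
def build_optimizer(base_optimizer, use_tpu, cross_shard_wrapper):
    """Return base_optimizer, wrapped for cross-replica aggregation only on TPU.

    cross_shard_wrapper is any callable that takes an optimizer and returns
    a wrapped one; in TF 1.x that would be tf.contrib.tpu.CrossShardOptimizer.
    """
    if use_tpu:
        return cross_shard_wrapper(base_optimizer)
    return base_optimizer


# Hypothetical use inside the question's model_fn (TF 1.x), assuming the
# caller adds a 'use_tpu' entry to params:
#   optimizer = build_optimizer(
#       tf.train.AdagradOptimizer(learning_rate=0.05),
#       params['use_tpu'],
#       tf.contrib.tpu.CrossShardOptimizer)

# Stand-in demonstration with plain Python objects instead of real optimizers.
def fake_wrapper(opt):
    return ("cross_shard", opt)


cpu_optimizer = build_optimizer("adagrad", False, fake_wrapper)
tpu_optimizer = build_optimizer("adagrad", True, fake_wrapper)
```

With `use_tpu=False` the base optimizer is returned untouched, so no TPU-only op would be added to the graph.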
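If so, is your master address set correctly? According to the log, there appears to be no TPU device in your TensorFlow session master. If you are running the job on Cloud TPU, you can do the following: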
with tf.Session('<replace_with_your_worker_address>') as sess:
    print(sess.list_devices())
You should see at least one device like "/<some_thing_varies_in_your_env>/device:TPU:0".
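The device check above can also be automated by filtering the device names for the TPU marker. This is a rough sketch: the helper `tpu_device_names` and the sample names are made up for illustration, and in a real TF 1.x session the input would come from something like `[d.name for d in sess.list_devices()]`.

```python
def tpu_device_names(device_names):
    """Return the subset of device names that refer to a TPU device."""
    return [name for name in device_names if "/device:TPU:" in name]


# Illustrative names only; the job/replica/task prefix varies by environment.
sample_names = [
    "/job:worker/replica:0/task:0/device:CPU:0",
    "/job:worker/replica:0/task:0/device:TPU:0",
    "/job:worker/replica:0/task:0/device:TPU:1",
]

found = tpu_device_names(sample_names)
if found:
    print("TPU devices visible to the session:", found)
else:
    print("No TPU devices visible; check the master/worker address.")
```

An empty result corresponds to the situation in the question's log, where only the CPU is registered.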