I am new to machine learning. While trying to do digit recognition on a TPU, I ran into the following problem.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)
with strategy.scope():
    Model = Sequential([
        InputLayer((28, 28, 1)),
        Dropout(0.1),
        Conv2D(128, 3, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        MaxPooling2D(2, 2),
        Conv2D(64, 3, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        MaxPooling2D(2, 2),
        Flatten(),
        Dense(128, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        Dense(10, activation='softmax')
    ])
with strategy.scope():
    Model.compile(optimizer='adam',
                  loss='categorical_crossentropy', metrics='accuracy')
CancelledError: 4 root error(s) found.
(0) Cancelled: Operation was cancelled
[[node IteratorGetNextAsOptional_1 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
(1) Cancelled: Iterator was cancelled
[[node IteratorGetNextAsOptional_6 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
(2) Cancelled: Operation was cancelled
[[node IteratorGetNextAsOptional_3 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
(3) Cancelled: Iterator was cancelled
[[node IteratorGetNextAsOptional_5 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
0 successful operations.
5 derived errors ignored. [Op:__inference_train_function_23675]
Function call stack:
train_function -> train_function -> train_function -> train_function
Then I ran it again:
UnavailableError: 9 root error(s) found.
(0) Unavailable: failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[cond_11/switch_pred/_107/_78]]
(1) Unavailable: failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[TPUReplicate/_compile/_7290104207349758044/_4/_178]]
(2) Unavailable: failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[tpu_compile_succeeded_assert/_13543899577889784813/_5/_281]]
(3) Unavailable: failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[strided_slice_37 ... [truncated] [Op:__inference_train_function_6939]
Function call stack:
train_function -> train_function -> train_function -> train_function
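A strategy.scope() must be missing somewhere.
I got this working in other notebooks, but those all used tf.data.Dataset.
Even so, I still cannot figure this one out.
The full code is at https://www.kaggle.com/dacianpeng/digit-hello-world?scriptVersionId=72464286
Version 6 is the TPU version. It was modified from Version 5 only with the code above.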
小智 0
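You appear to be storing the training data locally, which causes the problem, because a TPU can only access data in GCS.
"TPUs read training data exclusively from GCS (Google Cloud Storage)"; see the details here.
You can also check this Stack Overflow post: Colab TPU Error when Calling model.fit() : UnimplementedError.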
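Since the question mentions that other notebooks worked with tf.data.Dataset, one thing to try is feeding this model through a dataset as well. The following is only a minimal sketch, assuming the digits are already loaded into in-memory NumPy arrays. The names global_batch_size, make_tpu_dataset, x_train and y_train are hypothetical and do not come from the original notebook, and it has not been verified that this resolves the errors above.

```python
def global_batch_size(per_replica_batch, num_replicas):
    """Batch size across the whole TPU: per-replica batch times replica count."""
    return per_replica_batch * num_replicas


def make_tpu_dataset(x, y, batch_size):
    """Wrap in-memory arrays in a tf.data.Dataset with fixed-size batches."""
    # Imported lazily so the helper above can be used without TensorFlow.
    import tensorflow as tf

    ds = tf.data.Dataset.from_tensor_slices((x, y))
    ds = ds.shuffle(len(x))
    # drop_remainder=True gives every batch the same static shape,
    # which is generally recommended when training on a TPU.
    ds = ds.batch(batch_size, drop_remainder=True)
    return ds.prefetch(tf.data.AUTOTUNE)


# Hypothetical usage with the strategy and Model from the question:
# batch = global_batch_size(16, strategy.num_replicas_in_sync)
# Model.fit(make_tpu_dataset(x_train, y_train, batch), epochs=5)
```

The per-replica batch of 16 is an arbitrary illustration; tf.data.AUTOTUNE assumes TensorFlow 2.4 or later.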