Kaggle TPU Unavailable: failed to connect to all addresses

Dac*_*eng 5 tensorflow tpu

I'm new to machine learning. While trying to do digit recognition on a TPU, I ran into the following problem.

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)
with strategy.scope():
    Model = Sequential([

        InputLayer((28, 28, 1)),
        Dropout(0.1),
        Conv2D(128, 3, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        MaxPooling2D(2, 2),
        Conv2D(64, 3, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        MaxPooling2D(2, 2),
        Flatten(),
        Dense(128, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        Dense(10, activation='softmax')

    ])

with strategy.scope():
    Model.compile(optimizer='adam',
                  loss='categorical_crossentropy', metrics=['accuracy'])
CancelledError: 4 root error(s) found.
  (0) Cancelled:  Operation was cancelled
     [[node IteratorGetNextAsOptional_1 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
  (1) Cancelled:  Iterator was cancelled
     [[node IteratorGetNextAsOptional_6 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
  (2) Cancelled:  Operation was cancelled
     [[node IteratorGetNextAsOptional_3 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
  (3) Cancelled:  Iterator was cancelled
     [[node IteratorGetNextAsOptional_5 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
0 successful operations.
5 derived errors ignored. [Op:__inference_train_function_23675]

Function call stack:
train_function -> train_function -> train_function -> train_function

Then I ran it again:

UnavailableError: 9 root error(s) found.
  (0) Unavailable:  failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[cond_11/switch_pred/_107/_78]]
  (1) Unavailable:  failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[TPUReplicate/_compile/_7290104207349758044/_4/_178]]
  (2) Unavailable:  failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[tpu_compile_succeeded_assert/_13543899577889784813/_5/_281]]
  (3) Unavailable:  failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[strided_slice_37 ... [truncated] [Op:__inference_train_function_6939]

Function call stack:
train_function -> train_function -> train_function -> train_function

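There must be a `with strategy.scope():` missing somewhere.

I got this working in other notebooks, but they all used tf.data.Dataset.

Even so, I still can't figure this one out.

The full code is at https://www.kaggle.com/dacianpeng/digit-hello-world?scriptVersionId=72464286

Version 6 is the TPU version. It is Version 5 modified only with the code above.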

小智 0

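It looks like you are storing the training data locally, which causes the problem, because the TPU can only access data in GCS.

TPUs read training data exclusively from GCS (Google Cloud Storage). See the details here.

You can also check this Stack Overflow post: Colab TPU Error when Calling model.fit() : UnimplementedError.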
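Since the question mentions that other notebooks worked when they used tf.data.Dataset, here is a minimal sketch of feeding in-memory arrays through a tf.data pipeline instead of reading files from local disk on the TPU side. The helper name `make_dataset`, the batch size of 64, and the random placeholder arrays are assumptions for illustration and are not taken from the linked notebook. Whether this resolves the exact error above is not confirmed.

```python
import numpy as np
import tensorflow as tf


def make_dataset(images, labels, global_batch_size, training=True):
    """Hypothetical helper: build a tf.data pipeline from in-memory arrays."""
    # Scale pixels to [0, 1] and one-hot encode labels to match
    # the 'categorical_crossentropy' loss used in the question.
    images = images.astype("float32") / 255.0
    labels = tf.keras.utils.to_categorical(labels, num_classes=10)

    ds = tf.data.Dataset.from_tensor_slices((images, labels))
    if training:
        ds = ds.shuffle(len(images))
    # drop_remainder=True keeps every batch the same static shape.
    ds = ds.batch(global_batch_size, drop_remainder=True)
    return ds.prefetch(tf.data.AUTOTUNE)


# Placeholder data standing in for the digit images (assumption: 28x28x1 uint8).
rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=(256, 28, 28, 1), dtype=np.uint8)
y = rng.integers(0, 10, size=(256,))

train_ds = make_dataset(x, y, global_batch_size=64)
```

The resulting `train_ds` would then be passed to `Model.fit(train_ds, epochs=...)` for the model built and compiled under `strategy.scope()`. `from_tensor_slices` copies the arrays into the pipeline, so this approach only suits datasets that fit comfortably in memory. Larger datasets are usually stored as files in GCS, as described above.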