How can I make a GCE instance stop when its deployed container finishes?

Ada*_*dam 7 containers google-compute-engine

I have a Docker container that performs a single large computation. The computation needs a lot of memory and takes about 12 hours to run.

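I can create a suitably sized Google Compute Engine VM and use the "Deploy a container image to this VM instance" option to run this job perfectly. However, once the job finishes, the container exits but the VM keeps running (and keeps being billed).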

How can I make the VM exit/stop/delete itself when the container exits?

When the VM is in this zombie state, only the Stackdriver containers are running:

$ docker ps
CONTAINER ID        IMAGE                                                                COMMAND                  CREATED             STATUS              PORTS               NAMES
bfa2feb03180        gcr.io/stackdriver-agents/stackdriver-logging-agent:0.2-1.5.33-1-1   "/entrypoint.sh /u..."   17 hours ago        Up 17 hours                             stackdriver-logging-agent
161439a487c2        gcr.io/stackdriver-agents/stackdriver-metadata-agent:0.2-0.0.17-2    "/bin/sh -c /opt/s..."   17 hours ago        Up 17 hours         8000/tcp            stackdriver-metadata-agent

I create the VM like this:

gcloud beta compute --project=abc instances create-with-container vm-name \
                    --zone=us-central1-c --machine-type=custom-1-65536-ext \
                    --network=default --network-tier=PREMIUM --metadata=google-logging-enabled=true \
                    --maintenance-policy=MIGRATE \
                    --service-account=xyz \
                    --scopes=https://www.googleapis.com/auth/cloud-platform \
                    --image=cos-stable-69-10895-71-0 --image-project=cos-cloud --boot-disk-size=10GB \
                    --boot-disk-type=pd-standard --boot-disk-device-name=vm-name \
                    --container-image=gcr.io/abc/my-image --container-restart-policy=on-failure \
                    --container-command=python3 \
                    --container-arg="a" --container-arg="b" --container-arg="c" \
                    --labels=container-vm=cos-stable-69-10895-71-0

Vin*_*ent 6

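When you create the VM, you need to grant it write access to Compute Engine so that the instance can be deleted from inside it. You should also set container environment variables such as gce_zone and gce_project_id at this point. You will need them to delete the instance.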

gcloud beta compute instances create-with-container {NAME} \
    --container-env=gce_zone={ZONE},gce_project_id={PROJECT_ID} \
    --service-account={SERVICE_ACCOUNT} \
    --scopes=https://www.googleapis.com/auth/compute,...
    ...

Then, in the container, whenever you determine that the task is done:

  1. Request an API token (for convenience I'm using curl and the DEFAULT GCE service account)
curl "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token" -H "Metadata-Flavor: Google"

This responds with JSON that looks like

{
  "access_token": "foobarbaz...",
  "expires_in": 1234,
  "token_type": "Bearer"
}
  2. Take that access token and hit the instances.delete API endpoint (note the environment variables)
curl -XDELETE -H 'Authorization: Bearer {TOKEN}' https://www.googleapis.com/compute/v1/projects/$gce_project_id/zones/$gce_zone/instances/$HOSTNAME
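
The two steps above can also be sketched in Python using only the standard library. This is a rough sketch, not part of the original answer: the function names `fetch_token` and `build_delete_request` are hypothetical, and it assumes the container was started with the `gce_zone` and `gce_project_id` environment variables shown earlier and that `HOSTNAME` matches the instance name, as the curl command does.

```python
import json
import os
import urllib.request

TOKEN_URL = ("http://metadata.google.internal/computeMetadata/v1/"
             "instance/service-accounts/default/token")


def fetch_token():
    """Step 1: ask the metadata server for an access token (only works on GCE)."""
    req = urllib.request.Request(TOKEN_URL, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))["access_token"]


def build_delete_request(token, env=os.environ):
    """Step 2: build (but do not send) the instances.delete request."""
    url = ("https://www.googleapis.com/compute/v1/projects/{project}"
           "/zones/{zone}/instances/{name}").format(
               project=env["gce_project_id"],  # set via --container-env
               zone=env["gce_zone"],           # set via --container-env
               name=env["HOSTNAME"])           # assumed to equal the instance name
    return urllib.request.Request(
        url, method="DELETE",
        headers={"Authorization": "Bearer " + token})
```

Passing the result of `build_delete_request(fetch_token())` to `urllib.request.urlopen` would send the request, which deletes the VM the code is running on, so it belongs at the very end of the job.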

  • Has anyone else run into a "Request had insufficient authentication scopes" error when trying this? (2 upvotes)
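
One possible cause of the scope error in the comment above is a VM created without a Compute Engine write scope. A small sketch for checking this from inside the VM follows; the helper names are hypothetical, and it only inspects access scopes, so the service account's IAM permissions would still need to be checked separately.

```python
import urllib.request

SCOPES_URL = ("http://metadata.google.internal/computeMetadata/v1/"
              "instance/service-accounts/default/scopes")

# Scopes that permit instances.delete: full Compute Engine access or the
# broader cloud-platform scope. The read-only compute scope is not enough.
WRITE_SCOPES = {
    "https://www.googleapis.com/auth/compute",
    "https://www.googleapis.com/auth/cloud-platform",
}


def can_delete_instances(scopes_text):
    """Return True if the newline-separated scope list includes a write scope."""
    granted = set(scopes_text.split())
    return bool(granted & WRITE_SCOPES)


def fetch_scopes():
    """Read the VM's granted scopes from the metadata server (only works on GCE)."""
    req = urllib.request.Request(SCOPES_URL, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8")
```

On the VM, `can_delete_instances(fetch_scopes())` returning False would suggest recreating the instance with the `--scopes` value from the answer.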

Ada*_*dam 5

I wrote a self-contained Python function based on Vincent's answer.

def kill_vm():
    """
    If we are running inside a GCE VM, kill it.
    """
    # based on /sf/ask/3692383271/
    import json
    import logging
    import requests

    # get the token
    r = json.loads(
        requests.get("http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token",
                     headers={"Metadata-Flavor": "Google"})
            .text)

    token = r["access_token"]

    # get instance metadata
    # based on https://cloud.google.com/compute/docs/storing-retrieving-metadata
    project_id = requests.get("http://metadata.google.internal/computeMetadata/v1/project/project-id",
                              headers={"Metadata-Flavor": "Google"}).text

    name = requests.get("http://metadata.google.internal/computeMetadata/v1/instance/name",
                        headers={"Metadata-Flavor": "Google"}).text

    zone_long = requests.get("http://metadata.google.internal/computeMetadata/v1/instance/zone",
                             headers={"Metadata-Flavor": "Google"}).text
    zone = zone_long.split("/")[-1]

    # shut ourselves down
    logging.info("Calling API to delete this VM, {zone}/{name}".format(zone=zone, name=name))

    requests.delete("https://www.googleapis.com/compute/v1/projects/{project_id}/zones/{zone}/instances/{name}"
                    .format(project_id=project_id, zone=zone, name=name),
                    headers={"Authorization": "Bearer {token}".format(token=token)})
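
The zone handling in the function is worth seeing in isolation: the metadata server returns the zone as a full resource path, and only the last path segment is used in the API URL. A tiny sketch, with a made-up project number and a hypothetical helper name:

```python
def zone_from_metadata(zone_path):
    """Extract the short zone name from the metadata server's zone value."""
    # The value has the form "projects/<project-number>/zones/<zone>".
    return zone_path.strip().split("/")[-1]


# Hypothetical example value; the project number is made up.
print(zone_from_metadata("projects/123456789012/zones/us-central1-c"))
# prints: us-central1-c
```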

A simple atexit hook gets me the behavior I want:

import atexit
atexit.register(kill_vm)
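
The docstring of kill_vm says "if we are running inside a GCE VM", but the function itself does not check this, so running the same image on a laptop would raise a connection error at exit. A best-effort guard could look like the sketch below. The name `on_gce` is hypothetical, and the check relies on the assumption that the metadata server answers with a `Metadata-Flavor: Google` response header.

```python
import urllib.request

METADATA_ROOT = "http://metadata.google.internal/"


def on_gce(url=METADATA_ROOT, timeout=2):
    """Best-effort check: does something that looks like the GCE metadata server answer?"""
    req = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            # The metadata server is assumed to identify itself with this header.
            return resp.headers.get("Metadata-Flavor") == "Google"
    except OSError:
        # DNS failure, refused connection, timeout, or HTTP error: treat as not on GCE.
        return False
```

With this in place, the hook could be registered conditionally, for example `if on_gce(): atexit.register(kill_vm)`.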

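  • Understood. I suspect these calls are served by localhost or something very close to it, since they return immediately. Besides, that is only 3 calls per VM shutdown, and 2 of them are unavoidable anyway. So I'm fine with it :) The advantage is that the method is self-contained and needs no special care at deployment time. (2 upvotes)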

dap*_*hez 5

After wrestling with this for a while, here is a complete solution that works well.

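This solution does not use the "Deploy a container image to this VM instance" option. Instead, it uses a startup script, which is more flexible. You still use a Container-Optimized OS instance.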

  1. Create the startup script:
#!/usr/bin/env bash

# get image name and container parameters from the metadata
IMAGE_NAME=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/image_name -H "Metadata-Flavor: Google")

CONTAINER_PARAM=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/container_param -H "Metadata-Flavor: Google")

# This is needed if you are using private images in GCP Container Registry
# (possibly also for the gcp log driver?)
sudo HOME=/home/root /usr/bin/docker-credential-gcr configure-docker

# Run! The logs will go to Stackdriver
sudo HOME=/home/root  docker run --log-driver=gcplogs ${IMAGE_NAME} ${CONTAINER_PARAM}

# Get the zone
zoneMetadata=$(curl "http://metadata.google.internal/computeMetadata/v1/instance/zone" -H "Metadata-Flavor:Google")
# Split on / and get the 4th element to get the actual zone name
IFS=$'/'
zoneMetadataSplit=($zoneMetadata)
ZONE="${zoneMetadataSplit[3]}"

# Run compute delete on the current instance. Need to run in a container 
# because COS machines don't come with gcloud installed 
docker run --entrypoint "gcloud" google/cloud-sdk:alpine compute instances delete ${HOSTNAME}  --delete-disks=all --zone=${ZONE}
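  2. Put the script somewhere publicly accessible. For example, upload it to Cloud Storage and create a public URL. You cannot use a gs:// URI for a COS startup script.

  3. Launch the instance using startup-script-url, passing the image name and parameters, for example: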

gcloud compute --project=PROJECT_NAME instances create INSTANCE_NAME  \
--zone=ZONE --machine-type=TYPE \
--metadata=image_name=IMAGE_NAME,\
container_param="PARAM1 PARAM2 PARAM3",\
startup-script-url=PUBLIC_SCRIPT_URL \
--maintenance-policy=MIGRATE --service-account=SERVICE_ACCOUNT \
--scopes=https://www.googleapis.com/auth/cloud-platform --image-family=cos-stable \
--image-project=cos-cloud --boot-disk-size=10GB --boot-disk-device-name=DISK_NAME

(You may want to restrict the scopes; the example uses full access for simplicity.)
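
If the launch command is issued from a script rather than typed by hand, assembling it as an argument list avoids the shell quoting around container_param. The following is only a sketch: `build_create_command` is a hypothetical helper, it reuses a subset of the flags from the command above, and it assumes none of the parameters contain commas, which would interfere with how the --metadata value is split.

```python
def build_create_command(project, name, zone, machine_type, image_name,
                         container_params, script_url, service_account):
    """Assemble the `gcloud compute instances create` call as an argv list."""
    metadata = ",".join([
        "image_name=" + image_name,
        # Space-separated; the startup script passes this on to `docker run`.
        "container_param=" + " ".join(container_params),
        "startup-script-url=" + script_url,
    ])
    return [
        "gcloud", "compute", "--project=" + project, "instances", "create", name,
        "--zone=" + zone, "--machine-type=" + machine_type,
        "--metadata=" + metadata,
        "--maintenance-policy=MIGRATE",
        "--service-account=" + service_account,
        "--scopes=https://www.googleapis.com/auth/cloud-platform",
        "--image-family=cos-stable", "--image-project=cos-cloud",
        "--boot-disk-size=10GB",
    ]
```

The returned list can be handed to `subprocess.run(argv, check=True)` on a machine where gcloud is installed and authenticated.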