Running ECS task in a cluster within a private subnet remains in provisioning status

ain*_*sti 5 amazon-s3 amazon-web-services amazon-ecs terraform

We want to build an ECS cluster with the following characteristics:

  1. It must run inside a VPC, so we need the awsvpc network mode.
  2. It must use GPU instances, so we can't use Fargate.
  3. It must provision instances dynamically, so we need a capacity provider.
  4. It will run tasks (batch jobs) that are triggered directly through the AWS ECS API, so we don't need a service, only a task definition.
  5. These tasks must have access to S3 (internet), so according to the AWS documentation the instances must be placed inside a private subnet (a reference to docs).

We've already read this post on Stack Overflow, which says that we need to set up a private subnet whose route table points to a NAT gateway in a public subnet, and that this public subnet should route to an internet gateway. We already have this configuration. We also have an S3 VPC endpoint configured in the route table.

Below you can see the relevant parts of the cluster configuration in Terraform (for the sake of simplicity I only include the relevant parts):


# Launch template
resource "aws_launch_template" "train-launch-template" {
  name_prefix   = "${var.project_name}-launch-template-${var.env}"
  image_id      = "ami-01f62a207c1d180d2"
  instance_type = "m5.large"
  key_name      = "XXXXXX"
  iam_instance_profile {
    name = aws_iam_instance_profile.ecs-instance-profile.name
  }
  user_data = base64encode(data.template_file.user_data.rendered)

  network_interfaces {
    associate_public_ip_address = false
    security_groups = [aws_security_group.ecs_service.id]
  }
}


# Task definition
resource "aws_ecs_task_definition" "task" {
  family                   = "${var.project_name}-${var.env}-train-task"
  execution_role_arn       = data.aws_iam_role.ecs_task_execution_role.arn
  task_role_arn            = aws_iam_role.ecs_train_task_role.arn
  requires_compatibilities = ["EC2"]
  cpu                      = var.ecs_cpu
  network_mode             = "awsvpc"
  memory                   = var.ecs_memory
  container_definitions    = data.template_file.app_definition.rendered

  tags = {
    Stage   = var.env_tag
    Project = var.project_name_tag
  }
}


# Cluster
resource "aws_ecs_cluster" "cluster" {
  name = "${var.project_name}-${var.env}-train-ecs-cluster"
  capacity_providers = [aws_ecs_capacity_provider.train-capacity-provider.name]
  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.train-capacity-provider.name
  }
  tags = {
    Project = var.project_name_tag
    Stage   = var.env_tag
  }
}

We have also configured all the roles that the instances and the task need to access the required resources (S3, ECR, ECS).

The AMI is an ECS-optimized image (the latest version published in eu-west-1 at the time of writing).

In the launch template we've removed the public IP from the instances, for the reason explained in this link.

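We arrived at this configuration while trying to make it work, but we keep hitting the same problem: when the task is triggered, the capacity provider launches an instance, but the task is never placed on the container instance and stays in PROVISIONING status indefinitely.

With the same configuration but with the instances in a public subnet, the task is placed on the container instance. However, as the first link warns, the task then has no internet access.

We need some light, or a trail to follow. Thanks in advance.

UPDATE: As requested, I've added the rest of the auto scaling configuration: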

resource "aws_autoscaling_group" "train-autoscaling" {
  availability_zones = ["eu-west-1b"]
  desired_capacity   = 0
  max_size           = 10
  min_size           = 0
  protect_from_scale_in = true
  

  launch_template {
    id      = aws_launch_template.train-launch-template.id
    version = "$Latest"
  }

  tags = [
    {
      key = "Project",
      value = var.project_name_tag
      propagate_at_launch = true
    },
    {
      key = "Stage",
      value = var.env_tag
      propagate_at_launch = true
    }
  ]
}

resource "aws_ecs_capacity_provider" "train-capacity-provider" {
  name = "${var.project_name}-${var.env}-train-capacity-provider"

  auto_scaling_group_provider {
    auto_scaling_group_arn         = aws_autoscaling_group.train-autoscaling.arn
    managed_termination_protection = "ENABLED"

    managed_scaling {
      status                    = "ENABLED"
      target_capacity           = 100
      maximum_scaling_step_size = 1
      minimum_scaling_step_size = 1
    }
  }
}

data "template_file" "user_data" {
  template = "${file("${path.module}/user_data.sh")}"

  vars = {
    cluster_name = "${var.project_name}-${var.env}-train-ecs-cluster"
  }
}

UPDATE 2 (AWS console information):

Container instance running: (screenshot)

Container instance details: (screenshot)

Pending tasks: (screenshot)

Pending task details: (screenshot)

UPDATE 3:

After 30 minutes, the task stopped with the following message (the task failed to start): (screenshot)

UPDATE 4:

Logs from the container instance.

ecs-agent.log

level=info time=2020-08-28T11:09:21Z msg="Loading configuration" module=agent.go
level=info time=2020-08-28T11:09:21Z msg="Amazon ECS agent Version: 1.44.1, Commit: 1f05fbf0" module=agent.go
level=info time=2020-08-28T11:09:21Z msg="Image excluded from cleanup: amazon/amazon-ecs-pause:0.1.0" module=docker_image_manager.go
level=info time=2020-08-28T11:09:21Z msg="Image excluded from cleanup: amazon/amazon-ecs-pause:0.1.0" module=docker_image_manager.go
level=info time=2020-08-28T11:09:21Z msg="Image excluded from cleanup: amazon/amazon-ecs-agent:latest" module=docker_image_manager.go
level=info time=2020-08-28T11:09:21Z msg="Creating root ecs cgroup: /ecs" module=init_linux.go
level=info time=2020-08-28T11:09:21Z msg="Creating cgroup /ecs" module=cgroup_controller_linux.go
level=info time=2020-08-28T11:09:21Z msg="Event stream ContainerChange start listening..." module=eventstream.go
level=info time=2020-08-28T11:09:21Z msg="Loading state!" module=state_manager.go
level=info time=2020-08-28T11:09:23Z msg="Registering Instance with ECS" module=agent.go
level=info time=2020-08-28T11:09:23Z msg="Remaining mem: 7680" module=client.go
level=info time=2020-08-28T11:09:23Z msg="Registered container instance with cluster!" module=client.go
level=info time=2020-08-28T11:09:23Z msg="Registration completed successfully. I am running as 'arn:aws:ecs:eu-west-1:XXXXXXXXXXXXXXXX:container-instance/foqum-read-dev-train-ecs-cluster/95559f936f8d44de9373595009fcd588' in cluster 'foqum-read-dev-train-ecs-cluster'" module=agent.go
level=info time=2020-08-28T11:09:23Z msg="Beginning Polling for updates" module=agent.go
level=info time=2020-08-28T11:09:23Z msg="Initializing stats engine" module=engine.go
level=info time=2020-08-28T11:09:23Z msg="Event stream DeregisterContainerInstance start listening..." module=eventstream.go
level=info time=2020-08-28T11:09:23Z msg="Establishing a Websocket connection to https://ecs-t-X.eu-west-1.amazonaws.com/ws?agentHash=1f05fbf0&agentVersion=1.44.1&cluster=XXXXXXXXX-cluster&containerInstance=arn%3Aaws%3Aecs%3Aeu-west-1%3AXXXXXXXX%3Acontainer-instance%2FXXXXXXXX-cluster%2F95559fXXXXXXde9373595009fcd588&dockerVersion=19.03.6-ce" module=client.go
level=info time=2020-08-28T11:09:23Z msg="NO_PROXY set:XXX.254.169.XXXX,XXXX.254.XXX.2,/var/run/docker.sock" module=client.go
level=info time=2020-08-28T11:09:23Z msg="Establishing a Websocket connection to https://ecs-a-X.eu-west-1.amazonaws.com/ws?agentHash=1f05fbf0&agentVersion=1.44.1&clusterArn=XXXXX-ecs-cluster&containerInstanceArn=arn%3Aaws%3Aecs%3Aeu-west-1%XXXXXX%3Acontainer-instance%2FXXXXX-ecs-cluster%2F9XXXXX6f8d44de9373595009fcd588&dockerVersion=DockerVersion%3A+19.03.6-ce&sendCredentials=true&seqNum=1" module=client.go
level=info time=2020-08-28T11:09:23Z msg="Connected to TCS endpoint" module=handler.go
level=info time=2020-08-28T11:09:23Z msg="Connected to ACS endpoint" module=acs_handler.go
level=info time=2020-08-28T11:20:04Z msg="TCS Websocket connection closed for a valid reason" module=handler.go
level=info time=2020-08-28T11:20:04Z msg="Establishing a Websocket connection to https://ecs-t-X.eu-west-1.amazonaws.com/ws?agentHash=1f05fbf0&agentVersion=1.44.1&cluster=XXXXXXXecs-cluster&containerInstance=arn%3Aaws%3Aecs%3Aeu-west-1%3AXXXXXX3Acontainer-instance%2FZZZXXXXX-ecs-cluster%2F95XXX936f8d44de9373595009fcd588&dockerVersion=19.03.6-ce" module=client.go
level=info time=2020-08-28T11:20:04Z msg="Connected to TCS endpoint" module=handler.go

ecs-init.log

2020-08-28T11:09:19Z [INFO] pre-start
2020-08-28T11:09:20Z [INFO] start
2020-08-28T11:09:20Z [INFO] No existing agent container to remove.
2020-08-28T11:09:20Z [INFO] Starting Amazon Elastic Container Service Agent

ain*_*sti 3

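Finally!! Mystery solved!

The problem was not in the cluster configuration. When calling run_task through the ECS API, you need to specify the subnets in which the task should run.

Our code was setting this field to one of the public subnets. That is why the task was placed once we moved the container instances to the availability zone of that public subnet.

Changing this call in our code so that it passes the private subnet lets the task be placed correctly, with internet access.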
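
For illustration, here is a minimal sketch of such a call, assuming the code uses boto3 (the post only says run_task is called through the ECS API). The function names and all IDs in the usage note are hypothetical placeholders, not values from our setup. With the awsvpc network mode, the task's network interface is created in one of the subnets passed here, so these should be the private subnets where the container instances run.

```python
from typing import Any, Dict, List


def build_run_task_args(
    cluster: str,
    task_definition: str,
    private_subnet_ids: List[str],
    security_group_ids: List[str],
) -> Dict[str, Any]:
    """Build keyword arguments for the ECS RunTask call (awsvpc mode).

    No launchType or capacityProviderStrategy is set, so the cluster's
    default capacity provider strategy applies.
    """
    if not private_subnet_ids:
        raise ValueError("at least one private subnet ID is required")
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "networkConfiguration": {
            "awsvpcConfiguration": {
                # Pass the PRIVATE subnets (the ones routed through the
                # NAT gateway), not the public subnet that hosts the NAT.
                "subnets": list(private_subnet_ids),
                "securityGroups": list(security_group_ids),
            }
        },
    }


def run_training_task(
    cluster: str,
    task_definition: str,
    private_subnet_ids: List[str],
    security_group_ids: List[str],
) -> Dict[str, Any]:
    """Call ECS RunTask via boto3 (assumed client library)."""
    import boto3  # imported lazily so the builder has no AWS dependency

    ecs = boto3.client("ecs")
    return ecs.run_task(
        **build_run_task_args(
            cluster, task_definition, private_subnet_ids, security_group_ids
        )
    )
```

A call might look like `run_training_task("my-cluster", "my-train-task", ["subnet-PRIVATE"], ["sg-XXXX"])`, where every argument is a placeholder to be replaced with real names and IDs.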