EC2 Java StartInstancesRequest goes from "pending" to "stopping" to "stopped"

all*_*tic 0 java amazon-web-services aws-sdk

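I have the following situation:

  • A dedicated-tenancy m4.large running RHEL6
  • Starting it manually from the AWS console works fine
  • A Lambda function (written in Java) that tries to start it fails, because the instance state goes: stopped -> pending -> stopping -> stopped

I have a Lambda function that logs every EC2 state change in the VPC, as follows: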

'use strict';
exports.handler = (event, context, callback) => {
  console.log('LogEC2InstanceStateChange');
  console.log('Received event:', JSON.stringify(event, null, 2));
  callback(null, 'Finished');
}
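
For context, a logging function like this is usually wired to EC2 state-change events through a CloudWatch Events (EventBridge) rule. The question does not show how that is set up, so the following is only a minimal CloudFormation sketch. It assumes the logging function is declared in the same template under the hypothetical logical ID LogEC2StateChangeFunction, and the rule and permission names are made up as well. Note that a pattern like this matches state changes for every instance in the account and region, not just one VPC.

```yaml
Resources:
  # Hypothetical rule: matches "EC2 Instance State-change Notification" events.
  EC2StateChangeRule:
    Type: 'AWS::Events::Rule'
    Properties:
      EventPattern:
        source:
          - aws.ec2
        detail-type:
          - EC2 Instance State-change Notification
      Targets:
        # LogEC2StateChangeFunction is an assumed logical ID, not from the question.
        - Arn: !GetAtt LogEC2StateChangeFunction.Arn
          Id: log-ec2-state-change
  # Allows the rule above to invoke the logging function.
  EC2StateChangeRulePermission:
    Type: 'AWS::Lambda::Permission'
    Properties:
      FunctionName: !Ref LogEC2StateChangeFunction
      Action: 'lambda:InvokeFunction'
      Principal: events.amazonaws.com
      SourceArn: !GetAtt EC2StateChangeRule.Arn
```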

There is another Lambda function, written in Java, that tries to start EC2 instances on a schedule. It is a lot of code, but the core of it is this:

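import com.amazonaws.services.ec2.model.StartInstancesRequest;
import com.amazonaws.services.ec2.model.StartInstancesResult;
import com.amazonaws.services.lambda.runtime.Context;
import java.util.List;

public void handleRequest(Object input, Context context) {
  final List<String> instancesToStart = getInstancesToStart(); // implementation not shown
  try {
    // Pass the list directly: withInstanceIds accepts a Collection<String>,
    // and casting the Object[] from toArray() to String[] can throw ClassCastException.
    StartInstancesRequest startRequest = new StartInstancesRequest().withInstanceIds(instancesToStart);
    context.getLogger().log("StartInstancesRequest: " + startRequest);
    StartInstancesResult res = ec2.startInstances(startRequest); // ec2: EC2 client field, not shown
    context.getLogger().log("StartInstancesResult: " + res);
  }
  catch (Exception e) {
    logException(e); // logs the stack trace string through the Lambda logger
  }
}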

The instancesToStart list is populated with instance IDs such as i-0abcdef1234567890.

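I use CloudFormation to create the Lambda functions and all the required IAM roles and so on. Here is the part that describes the role and permissions for the Java-based Lambda function that does the work: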

Resources:
  EC2SchedulerRole:
    Type: 'AWS::IAM::Role'
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action:
              - 'sts:AssumeRole'
      Path: /
  EC2SchedulerPolicy:
    DependsOn:
      - EC2SchedulerRole
    Type: 'AWS::IAM::Policy'
    Properties:
      PolicyName: ec2-scheduler-role
      Roles:
        - !Ref EC2SchedulerRole
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Action:
              - 'logs:*'
            Resource:
              - 'arn:aws:logs:*:*:*'
          - Effect: Allow
            Action:
              - 'ec2:DescribeInstanceAttribute'
              - 'ec2:DescribeInstanceStatus'
              - 'ec2:DescribeInstances'
              - 'ec2:StartInstances'
              - 'ec2:StopInstances'
              - 'ec2:DeleteTags'
            Resource:
              - '*'

What ends up happening, according to the CloudWatch logs of the first function (the script that logs instance state transitions), is this:

Received event:
{
    "version": "0",
    "id": "<guid>",
    "detail-type": "EC2 Instance State-change Notification",
    "source": "aws.ec2",
    "account": "12345678",
    "time": "2019-06-20T19:01:35Z",
    "region": "us-east-1",
    "resources": [
        "arn:aws:ec2:us-east-1:12345678:instance/i-0abcdef12345678"
    ],
    "detail": {
        "instance-id": "i-0abcdef12345678",
        "state": "pending"
    }
}

Received event:
{
    "version": "0",
    "id": "<guid>",
    "detail-type": "EC2 Instance State-change Notification",
    "source": "aws.ec2",
    "account": "12345678",
    "time": "2019-06-20T19:01:37Z",
    "region": "us-east-1",
    "resources": [
        "arn:aws:ec2:us-east-1:12345678:instance/i-0abcdef12345678"
    ],
    "detail": {
        "instance-id": "i-0abcdef12345678",
        "state": "stopping"
    }
}

Received event:
{
    "version": "0",
    "id": "<guid>",
    "detail-type": "EC2 Instance State-change Notification",
    "source": "aws.ec2",
    "account": "12345678",
    "time": "2019-06-20T19:01:37Z",
    "region": "us-east-1",
    "resources": [
        "arn:aws:ec2:us-east-1:12345678:instance/i-0abcdef12345678"
    ],
    "detail": {
        "instance-id": "i-0abcdef12345678",
        "state": "stopped"
    }
}

According to the CloudWatch logs of the "worker" function (the one that actually tries to start the instance), we get:

StartInstancesRequest: {InstanceIds: [i-0abcdef12345678],}
StartInstancesResult: {StartingInstances: [{CurrentState: {Code: 0,Name: pending},InstanceId: i-0abcdef12345678,PreviousState: {Code: 80,Name: stopped}}]}

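So from the point of view of the Java-based Lambda doing the work, it does everything it needs to: it issues the command to start the EC2 instance. But when the instance actually tries to start, it goes from "pending" to "stopping" to "stopped". If the permissions were missing, it would not even get this far, would it?

If this were a problem with the instance itself (hardware, for example), I would expect starting it manually from the AWS console to fail too. But it does not. Manual starts succeed!

So what is going on? How can I diagnose this further? Is it a permissions problem, or is the instance broken?

I am 99% sure this is not caused by a lack of available capacity in the Availability Zone, because whenever I start the instance manually, it always works. This is not a transient problem, nor something that only started recently. It has been going on for months, during which manual starts have succeeded 100% of the time and script-based starts 0% of the time.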

Say*_*dal 5

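Starting the EBS volumes may be the problem. As you mentioned, the EC2 instance has 3 KMS-encrypted EBS volumes. You have to provide the KMS permission (kms:CreateGrant) in order to start your instance: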

{
        "Sid": "GrantAccess",
        "Effect": "Allow",
        "Action": "kms:CreateGrant",
        "Resource": "arn:aws:kms:::key/1234"
}
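
Since the question defines its permissions in CloudFormation, the same statement could be attached to the scheduler role as an extra policy. This is a sketch only: the logical ID EC2SchedulerKmsPolicy and the policy name are made up, key/1234 is the placeholder from the statement above, and the ARN assumes the key lives in the same account and region as the stack. The condition is optional hardening that limits the permission to grants created on behalf of AWS services such as EBS.

```yaml
Resources:
  # Hypothetical extra policy attached to the EC2SchedulerRole from the question.
  EC2SchedulerKmsPolicy:
    Type: 'AWS::IAM::Policy'
    Properties:
      PolicyName: ec2-scheduler-kms
      Roles:
        - !Ref EC2SchedulerRole
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Sid: GrantAccess
            Effect: Allow
            Action:
              - 'kms:CreateGrant'
            Resource:
              # Placeholder key ID: replace 1234 with the key that encrypts the volumes.
              - !Sub 'arn:aws:kms:${AWS::Region}:${AWS::AccountId}:key/1234'
            # Optional: only allow grants requested by AWS services (for example EBS).
            Condition:
              Bool:
                'kms:GrantIsForAWSResource': true
```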