AWS CloudWatch agent configuration file is deleted after boot

Dan*_*iel 7 amazon-web-services amazon-cloudwatch

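The problem

I am simply trying to install the CloudWatch agent on an Amazon Linux 2 instance at boot, using AWS user data. For some reason, once cloud-init finishes, all services are restarted and the configuration file I placed in the CloudWatch agent folder no longer exists.

I am using a custom AMI pre-built with Packer, and my configuration file is placed at /opt/aws/amazon-cloudwatch-agent/etc/custom/amazon-cloudwatch-agent.json from an Ansible template. This is the configuration file I want to use; it contains all the metrics and logs I want to ship. At boot, once the agent is installed, I copy it to /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json.

Here is my user data script: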

#!/bin/bash
yum install amazon-cloudwatch-agent -y
cp /opt/aws/amazon-cloudwatch-agent/etc/custom/amazon-cloudwatch-agent.json /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
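A slightly more defensive variant of the copy step, as a minimal sketch. It only assumes the two paths from the question, and `stage_config` is a hypothetical helper name, not part of the agent tooling. It fails loudly if the baked-in config is missing, which may help tell an AMI problem apart from a later overwrite.

```shell
# Hypothetical helper: copy the baked-in config into place,
# and report an error instead of silently continuing if it is missing.
stage_config() {
    local src="$1" dst="$2"
    if [ ! -r "$src" ]; then
        echo "custom config not readable: $src" >&2
        return 1
    fi
    cp -- "$src" "$dst" && echo "staged $dst"
}

# In user data this would be called with the two paths from the question:
# stage_config /opt/aws/amazon-cloudwatch-agent/etc/custom/amazon-cloudwatch-agent.json \
#              /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
```

Anything written to stderr here should end up in the cloud-init output log, so a missing file would be visible without logging in to inspect the directory.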

What happens

Once boot is complete, I can see that the script ran correctly. If I run cat /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log, I see the following:

2021/07/16 13:33:46 I! I! Detected the instance is EC2
2021/07/16 13:33:46 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json ...
/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json does not exist or cannot read. Skipping it.
2021/07/16 13:33:46 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/file_amazon-cloudwatch-agent.json ...
Valid Json input schema.
I! Detecting run_as_user...
No csm configuration found.
Configuration validation first phase succeeded

2021/07/16 13:33:46 I! Config has been translated into TOML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml
2021/07/16 13:33:46 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json ...
2021/07/16 13:33:46 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/file_amazon-cloudwatch-agent.json ...
2021/07/16 13:33:46 I! Detected runAsUser: root
2021/07/16 13:33:46 I! Changing ownership of [/opt/aws/amazon-cloudwatch-agent/logs /opt/aws/amazon-cloudwatch-agent/etc /opt/aws/amazon-cloudwatch-agent/var] to root:root
2021-07-16T13:33:46Z I! Starting AmazonCloudWatchAgent 1.247347.4
2021-07-16T13:33:46Z I! Loaded inputs: netstat diskio logfile mem net processes swap cpu disk
2021-07-16T13:33:46Z I! Loaded aggregators:
2021-07-16T13:33:46Z I! Loaded processors: delta ec2tagger
2021-07-16T13:33:46Z I! Loaded outputs: cloudwatch cloudwatchlogs
2021-07-16T13:33:46Z I! Tags enabled: host=ip-XX-XX-X-XXX.eu-west-1.compute.internal
2021-07-16T13:33:46Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"ip-XX-XX-X-XXX.eu-west-1.compute.internal", Flush Interval:1s
2021-07-16T13:33:46Z I! [logagent] starting
2021-07-16T13:33:46Z I! [logagent] found plugin cloudwatchlogs is a log backend
2021-07-16T13:33:46Z I! [logagent] found plugin logfile is a log collection
2021-07-16T13:33:46Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
=======> 2021-07-16T13:33:46Z I! cloudwatch: get unique roll up list [[AutoScalingGroupName] [InstanceId InstanceType] []]
2021-07-16T13:33:46Z I! cloudwatch: publish with ForceFlushInterval: 30s, Publish Jitter: 11s
2021-07-16T13:33:46Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeded
2021-07-16T13:33:46Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
=======> 2021-07-16T13:33:47Z I! [logagent] piping log from APP-DEV-php-errors-logs/XX.XX.X.XXX(/var/log/php-fpm/error.log) to cloudwatchlogs
2021-07-16T13:33:54Z I! Profiler is stopped during shutdown
2021-07-16T13:33:54Z I! [agent] Hang on, flushing any cached metrics before shutdown
2021/07/16 13:33:55 I! I! Detected the instance is EC2
2021/07/16 13:33:55 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json ...
/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json does not exist or cannot read. Skipping it.
2021/07/16 13:33:55 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/default ...
Valid Json input schema.
I! Detecting run_as_user...
No csm configuration found.
No log configuration found.
Configuration validation first phase succeeded

2021/07/16 13:33:55 I! Config has been translated into TOML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml
2021/07/16 13:33:55 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json ...
2021/07/16 13:33:55 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/default ...
2021/07/16 13:33:55 I! Detected runAsUser: cwagent
2021/07/16 13:33:55 I! Changing ownership of [/opt/aws/amazon-cloudwatch-agent/logs /opt/aws/amazon-cloudwatch-agent/etc /opt/aws/amazon-cloudwatch-agent/var] to 994:992
2021/07/16 13:33:55 I! Set HOME: /home/cwagent
2021-07-16T13:33:55Z I! Starting AmazonCloudWatchAgent 1.247348.0
2021-07-16T13:33:55Z I! Loaded inputs: disk mem
2021-07-16T13:33:55Z I! Loaded aggregators:
2021-07-16T13:33:55Z I! Loaded processors: ec2tagger
2021-07-16T13:33:55Z I! Loaded outputs: cloudwatch
2021-07-16T13:33:55Z I! Tags enabled: host=ip-XX-XX-X-XXX.eu-west-1.compute.internal
2021-07-16T13:33:55Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"ip-XX-XX-X-XXX.eu-west-1.compute.internal", Flush Interval:1s
2021-07-16T13:33:55Z I! [logagent] starting
2021-07-16T13:33:55Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
=======> 2021-07-16T13:33:55Z I! cloudwatch: get unique roll up list []
2021-07-16T13:33:55Z I! cloudwatch: publish with ForceFlushInterval: 1m0s, Publish Jitter: 26s
2021-07-16T13:33:55Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeded
2021-07-16T13:33:55Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
2021-07-16T13:39:07Z I! [processors.ec2tagger] ec2tagger: Refresh is no longer needed, stop refreshTicker.

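As you can see, the initial command from user data ran fine, and the custom metrics and logs were collected (see the =======> markers in front of the relevant lines). A few seconds later, however, after cloud-init finishes, systemd somehow restarts the CloudWatch agent, and by then the amazon-cloudwatch-agent.json file no longer exists on the filesystem, so the agent runs with the default parameters.

If I manually rerun the command after boot, everything works, but of course I need this to happen automatically when an instance is launched by auto scaling.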
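To tell quickly which configuration the most recent agent start picked up, the log excerpt above can be reduced to the last file read from the amazon-cloudwatch-agent.d directory. This is only a sketch: `last_config_source` is a hypothetical helper name, and it assumes the log keeps the "Reading json config file path:" wording shown above.

```shell
# Hypothetical helper: print the last config file under amazon-cloudwatch-agent.d
# that the agent reported reading, based on the log wording shown above.
last_config_source() {
    local log="$1"
    grep -o 'Reading json config file path: [^ ]*' "$log" \
        | sed 's/^Reading json config file path: //' \
        | grep '/amazon-cloudwatch-agent\.d/' \
        | tail -n 1
}

# Example call, using the log path that appears in the agent's own output:
# last_config_source /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log
```

A result ending in file_amazon-cloudwatch-agent.json would point to the custom config, while one ending in default would point to the fallback seen in the second start.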

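What I tried

Starting the CloudWatch agent directly with systemd, making the configuration file read-only, and only fetching the config while letting the system start the agent by itself. The problem persists in every case.

Thanks for your help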

Dan*_*iel 7

Workaround

The pre-installed ssm-agent conflicts with the CloudWatch agent. Uninstall ssm-agent during the Packer build:

sudo yum erase amazon-ssm-agent --assumeyes

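Explanation

I finally found that the freshly installed CloudWatch agent conflicts with the SSM agent that is installed by default in the Amazon Linux 2 image. In fact, I first tried an ugly workaround: using sed in user data to replace the ExecStart line of the amazon-cloudwatch-agent service: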

sed -i '/ExecStart/c\ExecStart=/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/opt/aws/amazon-cloudwatch-agent/etc/custom/amazon-cloudwatch-agent.json' /etc/systemd/system/amazon-cloudwatch-agent.service

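That way, when the service restarted after instance boot, it would use my custom configuration. However, I later found that the service file was also replaced after cloud-init finished.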
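One way to confirm that the unit file is being rewritten, rather than guessing, is to record a checksum right after the sed call and compare it once cloud-init has finished. A minimal sketch, assuming the unit path from the sed command above; `snapshot_file` and `file_changed` are hypothetical helper names, and the stamp path in the usage comment is arbitrary.

```shell
# Hypothetical helpers: detect whether a file was modified or removed
# after a checksum of it was recorded.
snapshot_file() {
    local target="$1" stamp="$2"
    cksum < "$target" > "$stamp"
}

file_changed() {
    local target="$1" stamp="$2"
    # Exit 0 when the file is gone or its checksum differs from the recorded one.
    if [ ! -r "$target" ]; then
        return 0
    fi
    [ "$(cksum < "$target")" != "$(cat "$stamp")" ]
}

# In user data, right after the sed call (stamp path is an assumption):
# snapshot_file /etc/systemd/system/amazon-cloudwatch-agent.service /var/tmp/cwagent-unit.cksum
# Later, from a shell on the instance:
# file_changed /etc/systemd/system/amazon-cloudwatch-agent.service /var/tmp/cwagent-unit.cksum \
#     && echo "unit file was replaced"
```

`cksum` is used only because it is available everywhere; any checksum tool would do for this purpose.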

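Looking at the system messages, I noticed that ssm-agent was performing some kind of configuration reload after cloud-init finished, so I suspected it might be the culprit. I ended up uninstalling it in the Packer build that creates the AMI, so that it is not present when the instance boots, and my configuration finally stopped being overwritten.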
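The kind of check described here can be sketched as a simple filter over the system log. This assumes the messages end up in /var/log/messages, which may differ on other setups; `agent_events` is a hypothetical helper name.

```shell
# Hypothetical helper: print syslog lines that mention either agent or cloud-init,
# so their relative order after boot can be read off in one place.
agent_events() {
    local messages="${1:-/var/log/messages}"
    grep -E 'amazon-ssm-agent|amazon-cloudwatch-agent|cloud-init' "$messages"
}

# Example call on the instance (reading the file usually requires root):
# sudo bash -c "$(declare -f agent_events); agent_events"
```

Reading the output in timestamp order should show whether ssm-agent activity falls between the end of cloud-init and the second agent start seen in the agent log.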

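Note that I do not have a deep understanding of how ssm-agent works, and there may be a proper way to set up the CloudWatch agent through some SSM configuration. Since we are not using SSM at the moment and I did not have time to look into that option, I settled for this compromise.

If anyone can suggest a cleaner solution that keeps ssm-agent and works in an automated way, it would be much appreciated.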