Bru*_*eis 6 amazon-web-services amazon-cloudwatch
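I have a backup script that runs every 2 hours. I want to use CloudWatch to track successful executions of this script, and CloudWatch alarms to get notified whenever the script runs into problems.
The script puts a data point on a CloudWatch metric after every successful backup: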
mon-put-data --namespace Backup --metric-name $metric --unit Count --value 1
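I have an alarm that goes to the ALARM state whenever the statistic "Sum" on the metric is less than 2 over a 6-hour period.
To test this setup, after a day I stopped putting data into the metric (i.e., I commented out the mon-put-data command). Good: eventually the alarm went to the ALARM state and I got an email notification, as expected.
The problem is that, some time later, the alarm went back to the OK state, even though no new data had been added to the metric!
The two transitions (OK => ALARM, then ALARM => OK) were logged, and the logs are reproduced in this question. Note that although both show "period: 21600" (i.e. 6 hours), the second one shows a 12-hour time span between startDate and queryDate. I can see that this might explain the transition, but I cannot understand why CloudWatch would use a 12-hour time span to compute a 6-hour statistic!
What am I missing here? How do I configure the alarm to achieve what I want (i.e., get notified if backups are not being made)?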
{
"Timestamp": "2013-03-06T15:12:01.069Z",
"HistoryItemType": "StateUpdate",
"AlarmName": "alarm-backup-svn",
"HistoryData": {
"version": "1.0",
"oldState": {
"stateValue": "OK",
"stateReason": "Threshold Crossed: 1 datapoint (3.0) was not less than the threshold (3.0).",
"stateReasonData": {
"version": "1.0",
"queryDate": "2013-03-05T21:12:44.081+0000",
"startDate": "2013-03-05T15:12:00.000+0000",
"statistic": "Sum",
"period": 21600,
"recentDatapoints": [
3
],
"threshold": 3
}
},
"newState": {
"stateValue": "ALARM",
"stateReason": "Threshold Crossed: 1 datapoint (1.0) was less than the threshold (2.0).",
"stateReasonData": {
"version": "1.0",
"queryDate": "2013-03-06T15:12:01.052+0000",
"startDate": "2013-03-06T09:12:00.000+0000",
"statistic": "Sum",
"period": 21600,
"recentDatapoints": [
1
],
"threshold": 2
}
}
},
"HistorySummary": "Alarm updated from OK to ALARM"
}
The second one, which I simply cannot understand:
{
"Timestamp": "2013-03-06T17:46:01.063Z",
"HistoryItemType": "StateUpdate",
"AlarmName": "alarm-backup-svn",
"HistoryData": {
"version": "1.0",
"oldState": {
"stateValue": "ALARM",
"stateReason": "Threshold Crossed: 1 datapoint (1.0) was less than the threshold (2.0).",
"stateReasonData": {
"version": "1.0",
"queryDate": "2013-03-06T15:12:01.052+0000",
"startDate": "2013-03-06T09:12:00.000+0000",
"statistic": "Sum",
"period": 21600,
"recentDatapoints": [
1
],
"threshold": 2
}
},
"newState": {
"stateValue": "OK",
"stateReason": "Threshold Crossed: 1 datapoint (3.0) was not less than the threshold (2.0).",
"stateReasonData": {
"version": "1.0",
"queryDate": "2013-03-06T17:46:01.041+0000",
"startDate": "2013-03-06T05:46:00.000+0000",
"statistic": "Sum",
"period": 21600,
"recentDatapoints": [
3
],
"threshold": 2
}
}
},
"HistorySummary": "Alarm updated from ALARM to OK"
}
小智 5
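This behavior (your alarm not transitioning to the INSUFFICIENT_DATA state) occurs because CloudWatch considers "pre-timestamped" metric data points. For a 6-hour alarm, if no data exists in the currently open 6-hour window, it takes data from the previous 6-hour window (hence the 12-hour time span you see above).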
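The 12-hour span can be checked directly from the startDate and queryDate values in the newState of the two history entries in the question. A minimal sketch in Python; the helper name span_hours is made up for illustration and is not part of any CloudWatch tooling:

```python
from datetime import datetime

# Timestamp layout of the startDate/queryDate fields in the alarm history.
FMT = "%Y-%m-%dT%H:%M:%S.%f%z"


def span_hours(start_date: str, query_date: str) -> float:
    """Hours between startDate and queryDate (hypothetical helper)."""
    start = datetime.strptime(start_date, FMT)
    query = datetime.strptime(query_date, FMT)
    return (query - start).total_seconds() / 3600


# newState of the OK => ALARM transition
to_alarm = span_hours("2013-03-06T09:12:00.000+0000", "2013-03-06T15:12:01.052+0000")
# newState of the ALARM => OK transition
to_ok = span_hours("2013-03-06T05:46:00.000+0000", "2013-03-06T17:46:01.041+0000")

print(round(to_alarm), round(to_ok))  # 6 12
```

The first evaluation covers a single 6-hour period, while the second reaches back 12 hours, which is consistent with the alarm picking up older data points from the previous window.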
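To increase the "fidelity" of your alarm, reduce the alarm period to 1 hour (3600 seconds) and increase the number of evaluation periods to however many periods of failure you want to alarm on. This will ensure your alarm transitions to INSUFFICIENT_DATA as you expect.
How do I configure the alarm to achieve what I want (i.e., get notified if backups are not being made)?
A possible architecture for your alarm is to publish 1 if your job succeeds and 0 if it fails. Then create an alarm with a threshold of < 1 over 3 periods of 3600 seconds, which means the alarm goes into the ALARM state if the job is failing (i.e. running, but failing). If you also set an INSUFFICIENT_DATA action on that alarm, you will also be notified if your job is not running at all.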
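As a rough sketch of that setup, the snippet below builds the request parameters in Python, assuming the boto3 CloudWatch client methods put_metric_data and put_metric_alarm. The metric name BackupSuccess, the "Sum" statistic, and the SNS topic ARN are assumptions for illustration, not values from the answer. Nothing is sent to AWS unless the commented-out calls at the end are enabled.

```python
def backup_datapoint(succeeded: bool) -> dict:
    """Parameters for put_metric_data: publish 1 on success, 0 on failure."""
    return {
        "Namespace": "Backup",
        "MetricData": [
            {
                "MetricName": "BackupSuccess",  # placeholder metric name
                "Value": 1.0 if succeeded else 0.0,
                "Unit": "Count",
            }
        ],
    }


def backup_alarm(topic_arn: str) -> dict:
    """Parameters for put_metric_alarm: < 1 over 3 periods of 3600 seconds."""
    return {
        "AlarmName": "alarm-backup-svn",
        "Namespace": "Backup",
        "MetricName": "BackupSuccess",  # must match the published metric
        "Statistic": "Sum",  # assumption; the answer does not name a statistic
        "Period": 3600,
        "EvaluationPeriods": 3,
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",
        # Notify when the job runs but fails...
        "AlarmActions": [topic_arn],
        # ...and when no data arrives at all (job not running).
        "InsufficientDataActions": [topic_arn],
    }


# Placeholder ARN; replace with a real SNS topic before use.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:backup-alerts"

# With boto3 installed and credentials configured, the calls would be:
# import boto3
# cloudwatch = boto3.client("cloudwatch")
# cloudwatch.put_metric_data(**backup_datapoint(succeeded=True))
# cloudwatch.put_metric_alarm(**backup_alarm(TOPIC_ARN))
```

Pointing both AlarmActions and InsufficientDataActions at the same topic gives one notification channel for "backup failed" and "backup never ran"; separate topics would work equally well if the two cases need different handling.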
Hope that makes sense.