EC2 VPC intermittent outbound connection timeouts

Question

Dan*_*lB6 5 amazon-ec2 amazon-web-services amazon-elb amazon-vpc

My production web service consists of:

An Auto Scaling group
A Network Load Balancer (ELB)
2 EC2 instances as web servers

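This configuration ran fine until yesterday, when one of the EC2 instances began experiencing timeouts to RDS and ElastiCache. The other instance kept running without problems.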

While investigating, I noticed that outbound connections in general sometimes experience large delays:

[ec2-user@ip-10-0-5-9 logs]$ time curl -s www.google.com > /dev/null

real    0m7.147s -- 7 seconds
user    0m0.007s
sys     0m0.000s
[ec2-user@ip-10-0-5-9 logs]$ time curl -s www.google.com > /dev/null

real    0m3.114s
user    0m0.007s
sys     0m0.000s
[ec2-user@ip-10-0-5-9 logs]$ time curl -s www.google.com > /dev/null

real    0m0.051s
user    0m0.006s
sys     0m0.000s
[ec2-user@ip-10-0-5-9 logs]$ time curl -s www.google.com > /dev/null

real    1m6.309s -- over a minute!
user    0m0.009s
sys     0m0.000s

[ec2-user@ip-10-0-5-9 logs]$ traceroute -n -m 1 www.google.com
traceroute to www.google.com (172.217.7.196), 1 hops max, 60 byte packets
 1  * * *
[ec2-user@ip-10-0-5-9 logs]$ traceroute -n -m 1 www.google.com
traceroute to www.google.com (172.217.7.196), 1 hops max, 60 byte packets
 1  216.182.226.174  17.706 ms * *
[ec2-user@ip-10-0-5-9 logs]$ traceroute -n -m 1 www.google.com
traceroute to www.google.com (172.217.8.4), 1 hops max, 60 byte packets
 1  216.182.226.174  20.364 ms * *
[ec2-user@ip-10-0-5-9 logs]$ traceroute -n -m 1 www.google.com
traceroute to www.google.com (172.217.7.132), 1 hops max, 60 byte packets
 1  216.182.226.170  12.680 ms  12.671 ms *
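
The repeated `time curl` runs above can be automated to see how often the delay occurs. The following is a minimal sketch, not part of the original question: the function names, the 1-second "slow" threshold, and the choice of port 80 are assumptions made for illustration.

```python
import socket
import statistics
import time


def time_connect(host, port, timeout):
    """Return seconds taken to open a TCP connection, or None on failure.

    The measured time includes DNS resolution. The timeout applies to each
    connection attempt, not to the DNS lookup itself.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:  # covers timeouts, DNS errors, refused connections
        return None


def summarize(samples, slow_threshold=1.0):
    """Summarize a list of connect times (None means the attempt failed)."""
    ok = [s for s in samples if s is not None]
    return {
        "attempts": len(samples),
        "failed": len(samples) - len(ok),
        "slow": sum(1 for s in ok if s >= slow_threshold),
        "median": statistics.median(ok) if ok else None,
        "max": max(ok) if ok else None,
    }


if __name__ == "__main__":
    # Hypothetical probe settings: 20 attempts, one per second.
    samples = []
    for _ in range(20):
        samples.append(time_connect("www.google.com", 80, timeout=10.0))
        time.sleep(1)
    print(summarize(samples))
```

Running the same probe on the "good" and the "bad" node at the same time would show whether the failures are specific to one instance.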


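Further analysis showed that if I manually detach the "bad" instance from the Auto Scaling group, which removes it as a load balancer target, the problem goes away immediately. As soon as I add it back, the problem returns.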

The nodes are m5.xlarge and appear to have plenty of spare capacity, so I don't believe this is a resource issue.

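Update: it seems to be related to the load on the node. I put the load back on last night and it looked stable, but this morning, as the load grew, outbound traffic (DB, etc.) started failing. I really don't understand how outbound traffic is being affected. The other, identical node has no problems, whether it carries 100% of the traffic or 50%.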

traceroute to 54.14.xx.xx (54.14.xx.xx), 1 hops max, 60 byte packets
 1  216.182.226.174  18.691 ms 216.182.226.166  18.341 ms 216.182.226.174  18.660 ms
traceroute to 54.14.xx.xx (54.14.xx.xx), 1 hops max, 60 byte packets
 1  * * *


What is the 216.182.226.166 IP? Is it related to the VPC IGW?

Node stats:

m5.xlarge
CPU ~7.5%
Load averages: 0.18, 0.29, 0.29
Network in: ~8M bytes/min

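Update: With only 1 of the 2 nodes attached to the load balancer, things appear to run stably, with all traffic on one node. After I add the second node to the load balancer, after some time (hours to days) one of the nodes starts showing the outbound connection problems described above (timeouts connecting to the database, Google, etc.). While it is in this state, the other node works fine. Replacing the "bad" node or reinstating it in the load balancer gets things running again for a while. The images use Amazon Linux 2 (4.14.114-103.97.amzn2.x86_64).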
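
Since the problem only shows up after hours or days under load, one thing worth comparing between the two nodes is how many TCP sockets sit in each state. This is only a diagnostic sketch and not something the original post tried: it assumes a Linux host where `/proc/net/tcp` is readable, and the function name is hypothetical.

```python
from collections import Counter
from pathlib import Path

# Hex state codes used in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Count IPv4 TCP sockets per state from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # Field 3 is the state; unknown codes are kept as raw hex.
        counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts


if __name__ == "__main__":
    text = Path("/proc/net/tcp").read_text()
    for state, n in count_tcp_states(text).most_common():
        print(f"{state:12s} {n}")
```

A large and growing count of `SYN_SENT` or `TIME_WAIT` sockets on the "bad" node only would suggest looking at connection handling rather than CPU or bandwidth.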

Answer 1

小智 0

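You are probably using a NAT gateway/instance to reach the internet. If not, you may need to provide more information about the architecture. You may be using Direct Connect and routing internet traffic through your on-premises network.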

Please read up on system limits and on inbound connections to ephemeral ports.

https://docs.aws.amazon.com/vpc/latest/userguide/vpc-recommended-nacl-rules.html
https://aws.amazon.com/premiumsupport/knowledge-center/resolve-connection-nat-instance/
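
The ephemeral-port point can be checked on the instance itself: replies to outbound connections arrive on the local ephemeral port, so a network ACL has to allow inbound traffic on that whole range. The sketch below is an illustration only. The allowed range of 1024-65535 is an assumed example value; substitute the range from your own NACL inbound rule. The function names are hypothetical.

```python
from pathlib import Path


def parse_port_range(text):
    """Parse the 'low high' format of ip_local_port_range."""
    low, high = text.split()
    return int(low), int(high)


def uncovered_ports(local_range, allowed_range):
    """Return the sub-ranges of local_range that fall outside allowed_range."""
    lo, hi = local_range
    a_lo, a_hi = allowed_range
    gaps = []
    if lo < a_lo:
        gaps.append((lo, min(hi, a_lo - 1)))
    if hi > a_hi:
        gaps.append((max(lo, a_hi + 1), hi))
    return gaps


if __name__ == "__main__":
    # Assumption: replace with the port range your NACL inbound rule allows.
    nacl_allowed = (1024, 65535)
    raw = Path("/proc/sys/net/ipv4/ip_local_port_range").read_text()
    local = parse_port_range(raw)
    gaps = uncovered_ports(local, nacl_allowed)
    print(f"ephemeral range {local}, not covered by NACL: {gaps or 'none'}")
```

If any gap is reported, return traffic for connections that happen to pick a port in that gap would be dropped, which would look like an intermittent timeout.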
