In order to try FrankenPHP and increase service isolation, we decided to split our API service off of
our monolithic EC2 instances. (The instances carry several applications side-by-side under PHP-FPM, and use
Apache to route requests to them based on the Host header. Each app isn’t supposed to meddle in its
neighbors’ affairs, but there’s no technical barrier there.)
I finally got a working deployment, and I learned a lot along the way. The documentation was a bit scattered, and searching for the error messages was nearly useless, so I wanted to pull all of the things that tripped me up together into a single post. It’s the Swiss Cheese Model, except that everything has to line up for the process to succeed, rather than fail.
- Networking problems
- ‘Force Redeployment’ is the normal course of operation
- The health check is not optional
- Logs are obscured by default
- The ports have to be correct (Podman vs. build args)
- The VPC Endpoint for an API Gateway “Private API” is not optional
- There are many moving parts
Let’s take a deeper look.
Networking problems
Initially, I thought it would be nice to avoid giving every container a public IP address, creating VPC Endpoints to allow them to access AWS services. However, a VPC Interface Endpoint—the only option for the vast majority of services—turns out not to be free, and a VPC Gateway Endpoint only operates with public IP addresses. The primary advantage of a Gateway endpoint is the waiving of data-transfer fees.
By the time I had created enough Interface endpoints to provide all the necessary services to the container, it would have been more expensive than a NAT Gateway. I am also not ready to take on creating my first IPv6-only environment, so I returned to giving the containers a public IPv4 address. We can reconsider when we have enough public IPs to make the alternatives cost competitive, or when we decide it’s time for IPv6.
(A current problem with IPv6 is that it is apparently not yet supported by every AWS service, which makes determining whether we can actually use it a messy project in itself. It’s not a trivial “Yes.”)
Compounding my lack of understanding about Gateway vs. Interface endpoints, I misconfigured the networking in other ways to start with. For one, I didn’t have any “allow outbound” rules on the security groups; fixing that alone didn’t help, because it turned out I had also forgotten to assign that security group to the service’s networking configuration, as sketched below.
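For concreteness, here’s a minimal sketch of those two fixes with the AWS CLI. All IDs and names are placeholders, and I’m assuming an awsvpc-networked service; adapt to taste.

    # Allow all outbound traffic from the task security group.
    aws ec2 authorize-security-group-egress \
        --group-id sg-0123456789abcdef0 \
        --ip-permissions '[{"IpProtocol":"-1","IpRanges":[{"CidrIp":"0.0.0.0/0"}]}]'

    # Attach that security group to the service's awsvpc network configuration.
    aws ecs update-service \
        --cluster api-cluster --service api-service \
        --network-configuration 'awsvpcConfiguration={subnets=[subnet-0123456789abcdef0],securityGroups=[sg-0123456789abcdef0],assignPublicIp=ENABLED}'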
In all cases, these network configuration errors resulted in I/O timeouts trying to pull the image from ECR. The error messages pointed to the DKR endpoint as the problem, even when the real problem was a lack of access to S3. (It had a Gateway endpoint for historical reasons, but not an Interface endpoint, which blocked the S3 traffic for a container with no public IP.)
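For reference, this is roughly the set of endpoints a task with no public IP needs before an ECR pull can succeed. It’s a sketch with placeholder IDs and region, not our exact configuration; the logs endpoint is only there because most task definitions ship logs to CloudWatch.

    # Interface endpoints that ECR pulls rely on (plus logs for the awslogs driver).
    for svc in ecr.api ecr.dkr logs; do
        aws ec2 create-vpc-endpoint \
            --vpc-id vpc-0123456789abcdef0 \
            --vpc-endpoint-type Interface \
            --service-name "com.amazonaws.us-east-1.$svc" \
            --subnet-ids subnet-0123456789abcdef0 \
            --security-group-ids sg-0123456789abcdef0 \
            --private-dns-enabled
    done
    # Image layers also come from S3, so that path has to work as well: either a
    # Gateway endpoint route in the task subnet's route table, or an S3 Interface
    # endpoint.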
‘Force Redeployment’ is the normal course of operation
Normally, the mildly violent ‘force’ language is reserved for exceptional situations. However, it appears that using it is the only way to make configuration changes of any kind take effect in ECS.
I was able to find that ECS added “version consistency” in July 2024: ECS resolves the image tag (such as :latest) when the service is deployed, and uses the resulting immutable digest for all future operations on that deployment. That’s totally reasonable, but certainly not made clear in the console.
It also seems that the service doesn’t notice task configuration updates, despite having an option to change the task revision in the “Quick Update” action. It was extremely confusing to ask for a new health check configuration, have the console claim the new revision was in use, and yet, see the same old logs show up in CloudWatch.
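The CLI spelling of the same action, with placeholder names, looks like this:

    # Re-resolve the image tag and roll the service, even with "no" changes.
    aws ecs update-service \
        --cluster api-cluster \
        --service api-service \
        --force-new-deployment

    # To move to a specific task definition revision at the same time,
    # add: --task-definition api-task:42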
The health check is not optional
The ECS console showed the health check configuration as “optional,” but in practice this is completely
false. Without a health check, my container’s health remained in the UNKNOWN state until it was replaced.
(Upon reflection, it probably has something to do with my only container being marked as an Essential
container. I added or kept that label without much thought, because it’s a single-container service. It’s
not going to do anything at all without that container.)
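A quick way to see that UNKNOWN state for yourself is to ask ECS directly; this is a sketch with placeholder cluster and task identifiers.

    # Show the health status ECS actually reports for the task and its containers.
    aws ecs describe-tasks \
        --cluster api-cluster \
        --tasks arn:aws:ecs:us-east-1:123456789012:task/api-cluster/0123456789abcdef \
        --query 'tasks[].{task:healthStatus,containers:containers[].{name:name,health:healthStatus}}'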
Logs are obscured by default
Complicating the mess with health checking and redeployments, health check results are not logged by default.
Hence the common advice to redirect their output into the container’s main task log with >>/proc/1/fd/1 2>&1. That’s the
standard output of PID 1, which is the main service in the container.
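Concretely, the CMD-SHELL health check command ends up looking something like this; the /health path and the port are assumptions of mine, not details of this deployment.

    # Runs inside the container; the redirect copies output to PID 1's stdout,
    # which is what ends up in the task's CloudWatch log stream.
    curl -fsS http://localhost:80/health >> /proc/1/fd/1 2>&1 || exit 1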
It would be nice to have “include health check logs in CloudWatch” as an option under the health check configuration, just to make debugging easier as a new user.
Also complicating debugging, my container doesn’t have access logs, so it wasn’t possible to see what the health check was, or wasn’t, actually doing on the server level.
The ports have to be correct (Podman vs. build args)
I used some build args to create containers with minor differences between development and production. Unfortunately, one of those differences was the listening port: it was supposed to be 8008 in dev (the default) and 80 in production (explicitly requested).
Unbeknownst to me, the Podman 4.9.3 version included in Ubuntu 24.04 LTS (Noble Numbat)
doesn’t seem to honor --build-arg=LISTEN_PORT=80 from the command line if the Containerfile has a default
set, as in ARG LISTEN_PORT=8008. This would ultimately bite me when the ALB health checks on the traffic
port never worked.
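If you hit something similar, it’s worth verifying what actually got baked into the image rather than trusting the build command. The sketch below assumes the Containerfile also EXPOSEs the port, which is my assumption for illustration, not necessarily how this image is written.

    # Containerfile excerpt; the ARG default is what masked the override:
    #   ARG LISTEN_PORT=8008
    #   EXPOSE ${LISTEN_PORT}

    podman build --build-arg LISTEN_PORT=80 -t api:prod -f Containerfile .

    # If the override took, this should mention 80/tcp, not 8008/tcp.
    podman image inspect --format '{{.Config.ExposedPorts}}' api:prod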
Worse, the ALB failures caused the service to roll out new containers extremely rapidly, so there wasn’t much chance to get into them and see what was going wrong. (At some point I had enabled ECS Exec, but gave up when it never worked, probably because I hadn’t actually deployed it yet. However, it would have remained technically difficult to get in and see the problem before the ALB had the container stopped.)
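For completeness, the ECS Exec setup I never finished looks roughly like this; names are placeholders, and the task role also needs the SSM messages permissions for it to work.

    # Enable Exec on the service and roll new tasks so it takes effect.
    aws ecs update-service \
        --cluster api-cluster --service api-service \
        --enable-execute-command --force-new-deployment

    # Then open a shell in a running container, before the ALB kills it.
    aws ecs execute-command \
        --cluster api-cluster \
        --task 0123456789abcdef0123456789abcdef \
        --container api \
        --interactive \
        --command "/bin/sh"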
The VPC Endpoint for an API Gateway “Private API” is not optional
I set up a private API within API Gateway, and kind of assumed that AWS would do the magic to make it accessible to its own VPC (and/or subnets). After all, the console says that the VPC Endpoint association is “optional.” However, in this configuration, the execute-api hostname would not resolve.
The documentation is extremely detailed about solving errors when connecting to the API, but it leaves out the possibility of getting NXDOMAIN for the hostname. In that case, one must create a VPC Endpoint and associate it with the Private API.
I would guess that the endpoint association actually is optional for a “Regional API,” which is why it is marked as optional in the API Gateway console. To be fair, the directions for associating a VPC Endpoint with a Private API are in the documentation, but they’re easy to skim over when the console has labeled the association optional.
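In CLI terms, the two pieces that make the Private API’s hostname resolve are roughly these; all IDs are placeholders.

    # An Interface endpoint for execute-api in the VPC that needs to call the API.
    aws ec2 create-vpc-endpoint \
        --vpc-id vpc-0123456789abcdef0 \
        --vpc-endpoint-type Interface \
        --service-name com.amazonaws.us-east-1.execute-api \
        --subnet-ids subnet-0123456789abcdef0 \
        --security-group-ids sg-0123456789abcdef0 \
        --private-dns-enabled

    # Associate that endpoint with the Private API so its hostname resolves.
    aws apigateway update-rest-api \
        --rest-api-id a1b2c3d4e5 \
        --patch-operations op=add,path=/endpointConfiguration/vpcEndpointIds,value=vpce-0123456789abcdef0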
There are many moving parts
In order to launch this as a test, instead of rolling production over immediately, we allocated a new domain name for it. Consequently, there’s a new TLS certificate loaded as an SNI certificate on the load balancer’s listener. There’s a new Listener Rule in the load balancer to route the new name to the containers, with a transform that presents the old host name to the container. The container is configured with the production name as its virtual host, and wasn’t prepared to serve traffic under the new one.
Because ECS auto-registers tasks into the load balancer, the load balancer needs an IP-address target group, and the load balancer needs to be registered with the ECS service. Registering the load balancer (and changing the security group associations) aren’t operations available in the console; they must be done with the AWS CLI instead.
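Those CLI-only pieces look roughly like the sketch below (ARNs and names are placeholders); the security group change mentioned above goes through the same update-service call’s --network-configuration option.

    # An IP-mode target group, since awsvpc tasks register by IP address.
    aws elbv2 create-target-group \
        --name api-targets --protocol HTTP --port 80 \
        --vpc-id vpc-0123456789abcdef0 --target-type ip

    # Attach the target group to the ECS service (not offered in the console).
    aws ecs update-service \
        --cluster api-cluster --service api-service \
        --load-balancers targetGroupArn=arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/api-targets/0123456789abcdef,containerName=api,containerPort=80 \
        --force-new-deployment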
Having a Private API and its associated VPC Interface Endpoint was specific to our setup (one API calls another to get its work done), but also added to the infrastructure demand.
It’s also clear that there is unfinished work for running this in production. Container logs are sent to CloudWatch Logs, which is separate from our central syslog aggregation and alerting system. Also, since the image is the artifact, we need to integrate building and consuming images into the deployment pipeline.