Why we chose ECS Fargate over Kubernetes

Every microservices project eventually has the orchestration conversation. By 2024 the realistic options had narrowed to two: Kubernetes — EKS, if you live on AWS — or ECS Fargate. We chose Fargate, we’re well over a year into running production on it, and we’d choose it again. The reasoning is below, along with the three operational notes we wish the documentation had led with.

What Kubernetes is for

Kubernetes earned its position. Enormous ecosystem, excellent tooling, deep integration with every major cloud, and machinery for genuinely hard problems: complex scheduling, pod disruption budgets, horizontal and cluster autoscaling, service mesh integration. For an organization with a dedicated platform engineering team, it’s often the right answer, and this post isn’t arguing otherwise.

What we actually have

Four product microservices and a platform service. Bursty but unexceptional traffic. Stateless services. No inter-service traffic patterns that justify a mesh, and no teams of engineers competing for cluster resources.

Running EKS anyway would mean one of two postures. Managed node groups: provision EC2 instances, pay for them whether they’re busy or not, and take on node lifecycle — instance types, nodes per availability zone, node updates. Or EKS Fargate profiles: AWS runs the compute, but you’re still running a Kubernetes control plane at $0.10/hour — $73/month before the first container starts — and you still need someone who understands Kubernetes RBAC, the VPC CNI networking plugin, and the version-upgrade treadmill.

That last requirement is the real cost. The API server, etcd, and admission controllers are a complex distributed system in their own right, and they fail in interesting ways. Working out why a pod is stuck in Pending — node affinity? taints? resource requests and limits? — is not a glance at a dashboard; it’s an investigation.

What Fargate replaces it with

ECS Fargate deletes the control plane from the design. You write a task definition (container image, CPU and memory, environment variables, IAM role) and a service (how many running copies, which load balancer). AWS manages the compute underneath.
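
To give a sense of how little there is to write, here is a rough sketch of a task definition and a service as Python dictionaries, shaped like the keyword arguments boto3’s register_task_definition and create_service accept. The account ID, names, ARNs, subnet and security group IDs, and image URI are all placeholders invented for illustration, not our real configuration.

```python
import json

# Hypothetical account ID; every name and ARN below is a placeholder.
ACCOUNT = "123456789012"

task_definition = {
    "family": "orders-api",
    "requiresCompatibilities": ["FARGATE"],
    "networkMode": "awsvpc",   # the network mode Fargate tasks use
    "cpu": "512",              # CPU units: 512 = 0.5 vCPU
    "memory": "1024",          # MiB
    "executionRoleArn": f"arn:aws:iam::{ACCOUNT}:role/orders-api-execution",
    "taskRoleArn": f"arn:aws:iam::{ACCOUNT}:role/orders-api-task",
    "containerDefinitions": [
        {
            "name": "app",
            "image": f"{ACCOUNT}.dkr.ecr.us-east-1.amazonaws.com/orders-api:1.0.0",
            "essential": True,
            "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
            "environment": [{"name": "LOG_LEVEL", "value": "info"}],
        }
    ],
}

service = {
    "cluster": "main",
    "serviceName": "orders-api",
    "taskDefinition": "orders-api",
    "desiredCount": 2,         # how many running copies
    "launchType": "FARGATE",
    "loadBalancers": [
        {
            "targetGroupArn": (
                f"arn:aws:elasticloadbalancing:us-east-1:{ACCOUNT}:"
                "targetgroup/orders-api/0123456789abcdef"
            ),
            "containerName": "app",
            "containerPort": 8080,
        }
    ],
    "networkConfiguration": {
        "awsvpcConfiguration": {
            "subnets": ["subnet-aaaa1111", "subnet-bbbb2222"],
            "securityGroups": ["sg-cccc3333"],
            "assignPublicIp": "ENABLED",  # assumes tasks on public subnets
        }
    },
}

if __name__ == "__main__":
    print(json.dumps(task_definition, indent=2))
    print(json.dumps(service, indent=2))
```

Nothing here talks to AWS; the dictionaries only show the shape of the two objects. In practice the same fields usually live in Terraform, CloudFormation, or CDK rather than in a script.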

The operational surface that disappears: no node groups, no Kubernetes version upgrades, no RBAC to configure, no admission controllers to debug. Deploying is updating the task definition with a new image and triggering a deployment; in a rolling update ECS starts the new task, waits for it to pass its health check, drains connections from the old one, and, with the deployment circuit breaker’s rollback option enabled, rolls back automatically if the new task never becomes healthy. For stateless HTTP services, that’s the entire requirements list.
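
The rollback behaviour is a setting on the service, not something ECS infers. Below is a minimal sketch of the relevant fields, again as a dictionary in the shape boto3’s update_service takes. The cluster, service, and revision names are hypothetical, and the percentages are commonly used rolling-update values, not a statement of what we run.

```python
def rolling_deploy_request(cluster, service, task_definition):
    """Build keyword arguments for an ECS update_service call (nothing is sent)."""
    return {
        "cluster": cluster,
        "service": service,
        "taskDefinition": task_definition,  # the new revision, e.g. "orders-api:42"
        "deploymentConfiguration": {
            # Keep every old task serving until replacements are healthy.
            "minimumHealthyPercent": 100,
            # Allow up to twice the desired count while both versions run.
            "maximumPercent": 200,
            # Stop a failing deployment and return to the last good revision.
            "deploymentCircuitBreaker": {"enable": True, "rollback": True},
        },
    }


def task_bounds(desired, min_healthy_pct, max_pct):
    """Lower and upper task counts during a deployment.

    The lower bound rounds up and the upper bound rounds down.
    """
    lower = -(-desired * min_healthy_pct // 100)  # ceiling division
    upper = desired * max_pct // 100              # floor division
    return lower, upper


if __name__ == "__main__":
    print(rolling_deploy_request("main", "orders-api", "orders-api:42"))
    print(task_bounds(2, 100, 200))
```

With two desired tasks and these percentages, the service never drops below two healthy tasks and never runs more than four, which is why the new task has to come up before the old one is drained.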

The bill

Fargate pricing is pure pay-per-use — CPU-seconds and memory-GB-seconds while tasks run, no per-cluster fee. At our size (0.5 vCPU / 1 GB per task), four product services plus the platform service across two availability zones runs $50–80/month in task compute. The EKS control plane alone is $73/month, before any node compute joins the invoice.
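
The arithmetic behind those figures is simple enough to sketch. The $0.10/hour control plane fee comes from the text above; the Fargate rates and the task count in the example call are assumed placeholders, not quoted prices or our actual bill, because rates vary by region, CPU architecture, and purchase option.

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year divided by 12


def eks_control_plane_monthly(hourly_fee=0.10):
    """Flat per-cluster fee, owed before any workload runs."""
    return hourly_fee * HOURS_PER_MONTH


def fargate_monthly(tasks, vcpu, memory_gb, vcpu_hour_rate, gb_hour_rate):
    """Compute-only cost of `tasks` identical tasks running all month."""
    per_task_hour = vcpu * vcpu_hour_rate + memory_gb * gb_hour_rate
    return tasks * per_task_hour * HOURS_PER_MONTH


if __name__ == "__main__":
    print(f"EKS control plane: ${eks_control_plane_monthly():.2f}/month")

    # ASSUMED inputs: one always-on task per service and placeholder rates.
    # Substitute the current rates for your region and architecture.
    estimate = fargate_monthly(
        tasks=5, vcpu=0.5, memory_gb=1,
        vcpu_hour_rate=0.032, gb_hour_rate=0.0036,
    )
    print(f"Fargate tasks (assumed rates): ${estimate:.2f}/month")
```

The point of the comparison is structural: the Fargate figure scales with the tasks you run, while the control plane fee is a fixed floor.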

NAT gateways would add another $100+/month across three AZs. We avoid that by running on public subnets — a choice covered in our infrastructure organization post — which works with ECS Fargate and doesn’t work cleanly with the networking assumptions the EKS VPC CNI prefers.

The tradeoffs

Scheduling is blunter. No fine-grained placement policies — we can’t express “run this task on an instance with a specific GPU” or build complex anti-affinity rules. For stateless HTTP services, we’ve never wanted to.

The ecosystem is smaller. Istio and Linkerd are Kubernetes-native; tracing infrastructure like Jaeger and Zipkin integrates more naturally there via DaemonSet. We need none of it today; if that changes, we’ll be evaluating alternatives rather than picking from a menu.

Autoscaling is less sophisticated. Kubernetes HPA can scale on custom Prometheus metrics; ECS scales on CloudWatch metrics. CloudWatch covers our actual cases — CPU, memory, request count via ALB metrics — but it is the less flexible tool.
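
For the cases we care about, ECS scaling is a target-tracking policy on a predefined metric. Here is a sketch of the request body in the shape Application Auto Scaling’s put_scaling_policy expects. The cluster name, service name, target values, and load balancer label are hypothetical.

```python
def target_tracking_policy(cluster, service, metric_type, target_value,
                           resource_label=None):
    """Build put_scaling_policy keyword arguments for an ECS service.

    metric_type is one of the predefined ECS metrics:
      ECSServiceAverageCPUUtilization
      ECSServiceAverageMemoryUtilization
      ALBRequestCountPerTarget   (also needs resource_label)
    """
    spec = {"PredefinedMetricType": metric_type}
    if resource_label is not None:
        # Identifies the load balancer and target group for ALB metrics.
        spec["ResourceLabel"] = resource_label
    return {
        "PolicyName": f"{service}-{metric_type}",
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_value,
            "PredefinedMetricSpecification": spec,
        },
    }


if __name__ == "__main__":
    # Hypothetical: hold average CPU near 60 percent.
    print(target_tracking_policy(
        "main", "orders-api", "ECSServiceAverageCPUUtilization", 60.0))
    # Hypothetical: hold roughly 500 requests per target.
    print(target_tracking_policy(
        "main", "orders-api", "ALBRequestCountPerTarget", 500.0,
        resource_label="app/main-alb/0123456789abcdef/"
                       "targetgroup/orders-api/fedcba9876543210"))
```

The service also has to be registered as a scalable target, with minimum and maximum task counts, before a policy like this can attach to it.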

When we’d reconsider

We wrote down exactly when we’d revisit:

Around 15 microservices. Per-service ECS configuration compounds; somewhere around fifteen services, a unified control plane starts earning its overhead. We’re at five.
A multi-cloud requirement. Kubernetes is the portable abstraction. We’re committed to AWS today; commitments have been known to change.
Complex inter-service traffic. If we need mTLS between services, per-service traffic policies, or circuit breaking at the infrastructure layer, a service mesh on Kubernetes gets attractive. Today, service-to-service auth is handled at the application layer with a shared secret plus internal load-balancer isolation.
A dedicated platform team. Kubernetes overhead is acceptable when people are paid to absorb it. When the same engineers build product, infrastructure cognitive load is a tax on everything.

Until one of those fires, Fargate is the answer.

Notes from a year of running it

Health checks must not check dependencies. ECS considers a deployment successful when the new task passes its health check, and rolls back when it doesn’t — so a health check that queries the database will fail during database unavailability and convert someone else’s outage into your deploy’s rollback. Ours is a GET /health that returns 200 if the process is running and checks nothing external. It also needs to be fast — under 30 seconds — and reliable.
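
A dependency-free health endpoint can be this small. The sketch below uses only Python’s standard library and is an illustration of the idea, not our service code; the port and the response body are arbitrary choices.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # Deliberately no database, cache, or downstream calls here:
            # the only claim is "this process is up and answering HTTP".
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep frequent health-check polling out of the request log


def make_server(port=8080):
    """Create the server (not yet serving) on all interfaces."""
    return ThreadingHTTPServer(("0.0.0.0", port), HealthHandler)
```

Calling make_server().serve_forever() starts it. In a real service the route would sit alongside the application’s other handlers, and dependency status, if you want it visible, belongs on a separate endpoint that the load balancer does not poll.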

There are two IAM roles, and you need both. The execution role is used by ECS itself, before and around your code: it pulls the container image, ships logs, and reads any secrets the task definition injects at startup. The task role is what the running application uses to call AWS APIs (S3, Secrets Manager, and so on). They’re separate, they’re configured independently, and mixing them up is the first thing to check when a task can’t read something it obviously should.
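
A sketch of how the split looks in policy terms, written as Python dictionaries. Both roles trust the same ECS tasks service principal; what differs is the permissions attached. The bucket name, secret ARN, and account ID are hypothetical, and the action lists are a plausible minimum, not our actual policies.

```python
# Both roles are assumed by the same service principal.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ecs-tasks.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Execution role: what ECS needs in order to start the container.
EXECUTION_ROLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {   # pull the image from ECR and write container logs
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
            ],
            "Resource": "*",
        },
        {   # secrets injected through the task definition (hypothetical ARN)
            "Effect": "Allow",
            "Action": ["secretsmanager:GetSecretValue"],
            "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:orders-api/*",
        },
    ],
}

# Task role: what the application code may call at runtime.
TASK_ROLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::example-orders-bucket/*",  # hypothetical bucket
    }],
}


def actions(policy):
    """Collect every action a policy document allows."""
    found = set()
    for statement in policy["Statement"]:
        if statement["Effect"] == "Allow":
            found.update(statement["Action"])
    return found
```

A quick diagnostic follows from the split: if the task never starts (image pull or secret injection fails), look at the execution role; if it starts and then gets AccessDenied from its own SDK calls, look at the task role.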

CPU and memory come in fixed pairs. Fargate accepts only specific CPU/memory combinations, not arbitrary values; AWS documents the valid pairs. We run 0.5 vCPU / 1024 MB for most services: the smallest memory size Fargate allows at 0.5 vCPU, one CPU step above the 0.25 vCPU minimum, and it handles our traffic comfortably.
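
A small validator makes the constraint concrete. The table below reflects the documented combinations up to 4 vCPU as we understand them at the time of writing; larger sizes exist and are left out, and the AWS documentation remains the authority if the list changes.

```python
def valid_memory_mib(cpu_units):
    """Memory sizes (MiB) accepted for a given CPU size (1024 units = 1 vCPU).

    Covers sizes up to 4 vCPU only; returns an empty list for anything else.
    """
    table = {
        256: [512, 1024, 2048],                   # 0.25 vCPU
        512: list(range(1024, 4096 + 1, 1024)),   # 0.5 vCPU: 1 to 4 GB
        1024: list(range(2048, 8192 + 1, 1024)),  # 1 vCPU: 2 to 8 GB
        2048: list(range(4096, 16384 + 1, 1024)), # 2 vCPU: 4 to 16 GB
        4096: list(range(8192, 30720 + 1, 1024)), # 4 vCPU: 8 to 30 GB
    }
    return table.get(cpu_units, [])


def is_valid_fargate_size(cpu_units, memory_mib):
    """True if the CPU/memory pair appears in the table above."""
    return memory_mib in valid_memory_mib(cpu_units)


if __name__ == "__main__":
    print(is_valid_fargate_size(512, 1024))  # the size used in this post
    print(is_valid_fargate_size(512, 512))   # too little memory for 0.5 vCPU
```

Running a check like this in CI against task definitions catches an invalid pair before the deploy does.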