
Why We Chose ECS Fargate Over Kubernetes

The first serious infrastructure conversation on any new microservices project eventually arrives at container orchestration. In 2024, the options crystallized around two choices: Kubernetes (EKS on AWS, or self-managed) and ECS Fargate. We chose ECS Fargate and have been happy with the decision for well over a year of production operations.

This post explains the reasoning. It’s honest about the tradeoffs — ECS Fargate isn’t better in all scenarios — but clear about why it was the right choice for us at this stage.

Why Kubernetes Comes Up

Kubernetes has become the default orchestrator for serious microservices work. It has a massive ecosystem, excellent tooling, strong community support, and deep integration with every major cloud provider. It handles complex scheduling, rolling deployments, pod disruption budgets, horizontal pod autoscaling, cluster autoscaling, and service mesh integration. If you need any of those things, Kubernetes gives you a path.

For organizations with dedicated platform engineering teams, Kubernetes is often the right answer. The control it provides scales to very large, very complex deployments.

Why Kubernetes Is Overkill for Us

We have four product microservices and a platform service. Our traffic patterns are bursty but not extreme. Our services are stateless. We don’t have inter-service traffic patterns that require a service mesh. We don’t have teams of engineers who will fight over cluster resources.

Running EKS requires either:

Managed node groups. You provision EC2 instances, pay for them whether you’re using all their capacity or not, and manage the lifecycle of the Kubernetes nodes. You need to think about which instance types to use, how many nodes per availability zone, and what happens when nodes need updates.

EKS Fargate profiles. AWS manages the underlying compute for Kubernetes pods. Better, but you still have a Kubernetes control plane to pay for and work against. With either option the control plane costs $0.10/hour, about $73/month, before you’ve run a single container. You also need a cluster admin who understands Kubernetes RBAC, networking plugins (VPC CNI), and the constant stream of Kubernetes version upgrades.

EKS also has an operational overhead tax: the Kubernetes API server, etcd, and admission controllers are a complex distributed system. They fail in interesting ways. Diagnosing pod scheduling failures requires understanding node affinity, taints and tolerations, and resource requests and limits. Understanding why a pod is stuck in Pending is a whole investigative process.

What ECS Fargate Gives Us

ECS Fargate removes the control plane from the list of things we operate and pay for: the ECS scheduler is a service AWS runs, with nothing to version or upgrade. You define a task definition (which container image, how much CPU and memory, which environment variables, which IAM role) and a service (how many running copies, what load balancer to attach to). AWS manages the underlying compute.
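To make those two objects concrete, here is a minimal sketch of a task definition and a service, written as plain Python dictionaries in the shape of the boto3 `register_task_definition` and `create_service` keyword arguments. Every name, ARN, subnet ID, and port below is a hypothetical placeholder, not our actual configuration.

```python
# Sketch only: all names, ARNs, and IDs are hypothetical placeholders.

def build_task_definition(image: str) -> dict:
    """Return register_task_definition kwargs for one Fargate task."""
    return {
        "family": "example-service",
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",  # Fargate tasks use awsvpc networking
        "cpu": "512",             # 0.5 vCPU, expressed in CPU units
        "memory": "1024",         # MiB
        "executionRoleArn": "arn:aws:iam::123456789012:role/example-execution-role",
        "taskRoleArn": "arn:aws:iam::123456789012:role/example-task-role",
        "containerDefinitions": [
            {
                "name": "app",
                "image": image,
                "essential": True,
                "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
                "environment": [{"name": "LOG_LEVEL", "value": "info"}],
            }
        ],
    }


def build_service(task_definition_arn: str, desired_count: int = 2) -> dict:
    """Return create_service kwargs: how many copies, behind which load balancer."""
    return {
        "cluster": "example-cluster",
        "serviceName": "example-service",
        "taskDefinition": task_definition_arn,
        "desiredCount": desired_count,
        "launchType": "FARGATE",
        "loadBalancers": [
            {
                "targetGroupArn": (
                    "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
                    "targetgroup/example/0123456789abcdef"
                ),
                "containerName": "app",
                "containerPort": 8080,
            }
        ],
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": ["subnet-aaaa1111", "subnet-bbbb2222"],
                "securityGroups": ["sg-cccc3333"],
                # In a public subnet without a NAT gateway, the task needs
                # a public IP to reach the image registry.
                "assignPublicIp": "ENABLED",
            }
        },
    }
```

With boto3 installed and credentials configured, these dictionaries could be passed as `ecs.register_task_definition(**build_task_definition(image))` and `ecs.create_service(**build_service(arn))`, where `ecs = boto3.client("ecs")`.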

The operational surface area is dramatically smaller:

  • No node groups to manage
  • No Kubernetes version upgrades
  • No RBAC to configure
  • No admission controller to debug

Deploying a new version of a service means registering a new revision of the task definition with the new container image and triggering a new ECS deployment. ECS starts the new task, verifies it passes the health check, and drains connections from the old task. With the deployment circuit breaker enabled and set to roll back, ECS automatically reverts to the last working deployment if the new tasks keep failing their health checks. This is everything we need.
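The deployment step can be sketched as two small pieces: a helper that swaps the image in a task definition, and the `deploymentConfiguration` block that turns on rollback through the ECS deployment circuit breaker. The helper name and the container name are illustrative assumptions; the configuration keys follow the ECS service API.

```python
import copy


def with_new_image(task_definition: dict, container_name: str, image: str) -> dict:
    """Return a copy of the task definition with one container's image replaced."""
    updated = copy.deepcopy(task_definition)  # leave the caller's dict untouched
    matched = False
    for container in updated["containerDefinitions"]:
        if container["name"] == container_name:
            container["image"] = image
            matched = True
    if not matched:
        raise ValueError(f"no container named {container_name!r}")
    return updated


# deploymentConfiguration for create_service / update_service.
ROLLING_DEPLOYMENT = {
    # Stop a deployment whose tasks keep failing, and roll back to the
    # last deployment that completed successfully.
    "deploymentCircuitBreaker": {"enable": True, "rollback": True},
    # Keep the full desired count serving while new tasks start.
    "minimumHealthyPercent": 100,
    "maximumPercent": 200,
}
```

If the starting dictionary comes from `describe_task_definition`, its read-only fields (such as the ARN and revision number) need to be dropped before it is registered again.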

The Cost Comparison

ECS Fargate pricing is pure pay-per-use: CPU-seconds and memory-GB-seconds consumed while tasks are running. There’s no per-cluster fee.

At our scale (each service task runs at 0.5 vCPU / 1 GB), the monthly cost for four product services plus the platform service in two availability zones is in the range of $50-80/month in task compute. Compare that with the EKS control plane alone at $73/month, before any node compute.
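The arithmetic behind this comparison is simple enough to sketch. The EKS figure uses the $0.10/hour fee quoted above; the Fargate rates are deliberately left as parameters, because they vary by region, CPU architecture, and Spot usage, and the function names are illustrative.

```python
HOURS_PER_MONTH = 730  # average month: 8760 hours / 12


def eks_control_plane_monthly(hourly_fee: float = 0.10) -> float:
    """Fixed monthly cost of one EKS cluster before any compute."""
    return hourly_fee * HOURS_PER_MONTH


def fargate_monthly(
    tasks: int,
    vcpu: float,
    memory_gb: float,
    vcpu_hour_rate: float,
    gb_hour_rate: float,
) -> float:
    """Monthly cost of always-on Fargate tasks at the given per-hour rates."""
    per_task_hour = vcpu * vcpu_hour_rate + memory_gb * gb_hour_rate
    return tasks * per_task_hour * HOURS_PER_MONTH


# The control plane fee is flat: 0.10 * 730 is about 73 dollars per month.
control_plane = round(eks_control_plane_monthly(), 2)
```

To estimate a Fargate bill, call `fargate_monthly` with the task count, the 0.5 vCPU / 1 GB size, and the per-vCPU-hour and per-GB-hour rates from the current AWS pricing page for your region.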

NAT gateways would add another $100+/month for three AZs. As discussed in our infrastructure organization post, we run on public subnets to avoid this cost. This works with ECS Fargate, provided each task is assigned a public IP so it can pull its image, and doesn’t work cleanly with the VPC CNI networking assumptions that EKS prefers.

The Tradeoffs We Accept

Less flexibility in scheduling. ECS on Fargate doesn’t support the fine-grained scheduling policies Kubernetes does. We can’t express “run this task on an instance with a specific GPU” or implement complex anti-affinity rules. For our workloads (stateless HTTP services), this is fine.

Smaller ecosystem. Kubernetes has a much larger ecosystem of controllers, operators, and tools. Service meshes like Istio and Linkerd are Kubernetes-native. Distributed tracing infrastructure (Jaeger, Zipkin) integrates more naturally with Kubernetes, where collection agents can run as DaemonSets. We don’t need any of these today, but we’d need to evaluate alternatives if we did.

Horizontal scaling is less sophisticated. Kubernetes HPA can scale on custom metrics from Prometheus. ECS can scale on CloudWatch metrics, which covers our cases (CPU, memory, request count via ALB metrics) but isn’t as flexible.
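For reference, ECS service scaling is configured through Application Auto Scaling. The sketch below builds the keyword arguments for a target-tracking policy in the shape of the boto3 `put_scaling_policy` call; the cluster name, service name, target values, and cooldowns are hypothetical.

```python
from typing import Optional


def target_tracking_policy(
    cluster: str,
    service: str,
    metric_type: str,
    target_value: float,
    resource_label: Optional[str] = None,
) -> dict:
    """Return put_scaling_policy kwargs for an ECS service.

    metric_type is a predefined metric such as
    "ECSServiceAverageCPUUtilization", "ECSServiceAverageMemoryUtilization",
    or "ALBRequestCountPerTarget". The ALB metric also needs a
    resource_label identifying the load balancer and target group.
    """
    spec = {"PredefinedMetricType": metric_type}
    if resource_label is not None:
        spec["ResourceLabel"] = resource_label
    return {
        "PolicyName": f"{service}-{metric_type}",
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_value,
            "PredefinedMetricSpecification": spec,
            "ScaleOutCooldown": 60,   # seconds; illustrative values
            "ScaleInCooldown": 300,
        },
    }


# Hypothetical example: hold average CPU near 60 percent.
cpu_policy = target_tracking_policy(
    "example-cluster", "example-service", "ECSServiceAverageCPUUtilization", 60.0
)
```

The service has to be registered as a scalable target (with minimum and maximum task counts) before a policy like this can be attached.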

The Revisit Triggers

We have explicit criteria for when we’d revisit this decision:

~15 microservices. As the service count grows, the per-service overhead of ECS configuration compounds until Kubernetes’s unified control plane starts looking attractive. We’re at five services; fifteen is a reasonable threshold to revisit.

Multi-cloud requirement. If we need to run on GCP or Azure, Kubernetes provides a more portable abstraction than ECS. Our current commitment to AWS makes ECS the right choice; that commitment could change.

Complex inter-service traffic. If we need mTLS between services, per-service traffic policies, or circuit breaking at the infrastructure layer, a service mesh on Kubernetes starts to look attractive. Right now we handle service-to-service auth at the application layer (shared secret + internal load balancer isolation).
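As a rough illustration of the application-layer approach, a shared-secret check can be as small as the sketch below. The header name and function name are hypothetical, and this is not our actual implementation; the one detail worth copying is the constant-time comparison.

```python
import hmac


def is_authorized(
    headers: dict,
    expected_secret: str,
    header_name: str = "X-Internal-Token",  # hypothetical header name
) -> bool:
    """Check a caller-presented shared secret against the expected value.

    Assumes `headers` is a plain dict with exact-case keys; most web
    frameworks provide a case-insensitive header mapping instead.
    """
    presented = headers.get(header_name, "")
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(presented.encode(), expected_secret.encode())
```

A check like this only makes sense alongside the network isolation described above, since the secret alone does nothing to keep outside traffic away from internal endpoints.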

A dedicated platform team. If we hire engineers whose full-time job is managing infrastructure, Kubernetes’s operational overhead becomes acceptable because there are people paid to absorb it. For a small team where engineers are also building product, minimizing infrastructure cognitive load is valuable.

Until one of these triggers fires, ECS Fargate is the answer.

Practical Deployment Notes

A few things that aren’t obvious from the documentation:

Task health checks and deployment success. ECS considers a deployment successful when the new task passes its health check. The health check must be fast (under 30 seconds) and reliable. A health check that queries the database will fail during database unavailability and potentially cause unnecessary rollbacks. We use a simple GET /health that returns 200 if the process is running, without checking external dependencies.
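A liveness-only endpoint of this kind can be sketched with the Python standard library. The port and handler name are assumptions, and a real service would expose the route through whatever web framework it already uses.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /health with 200 whenever the process is up."""

    def do_GET(self):
        if self.path == "/health":
            # No database or downstream calls: a dependency outage
            # should not mark the task unhealthy.
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Silence per-request logging so health polling does not flood logs.
        pass


def make_server(port: int = 8080) -> ThreadingHTTPServer:
    """Build a server bound to all interfaces on the given port."""
    return ThreadingHTTPServer(("0.0.0.0", port), HealthHandler)
```

Calling `make_server().serve_forever()` starts it. The tradeoff of this design is that the check reports only that the process is alive, so dependency failures have to be caught by separate monitoring.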

Task IAM roles vs. execution roles. ECS tasks have two IAM roles. The execution role gives ECS permission to pull the container image and read secrets. The task role gives the running application permission to call AWS APIs (S3, Secrets Manager, etc.). These are separate and must be configured independently.
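The split is easier to see side by side. In this sketch both roles trust the same ECS tasks service principal but carry different permissions. The bucket name and secret ARN are hypothetical, and the documents are plain dictionaries in IAM policy JSON shape.

```python
# Both roles are assumed by ECS tasks, so they share one trust policy.
ECS_TASKS_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Execution role: used by ECS itself before and around the container.
# The AWS-managed policy covers image pulls and log delivery.
EXECUTION_ROLE_MANAGED_POLICY = (
    "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
)

# Injecting secrets into the container needs an extra statement on the
# execution role (hypothetical secret ARN).
EXECUTION_ROLE_SECRETS_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["secretsmanager:GetSecretValue"],
            "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:example-*",
        }
    ],
}

# Task role: used by the application code at runtime
# (hypothetical example: read objects from one bucket).
TASK_ROLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-bucket/*",
        }
    ],
}
```

A common symptom of mixing the two up is a task that starts fine but gets access-denied errors from its own AWS calls, which usually means the permission was attached to the execution role instead of the task role.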

Fargate CPU and memory sizing. Fargate requires specific CPU/memory combinations. You can’t specify arbitrary values. The valid combinations are documented by AWS. We use 0.5 vCPU / 1024 MB for most services, which is the smallest memory option at the 0.5 vCPU tier (one CPU step above the 0.25 vCPU minimum) and handles our traffic comfortably.
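A small validator makes the constraint concrete. The table below is transcribed from the AWS documentation as we understand it and covers only the tiers up to 4 vCPU; larger tiers exist, and the AWS documentation remains the authority if the two disagree.

```python
# CPU units (1024 = 1 vCPU) mapped to the memory values (MiB) Fargate accepts.
# Assumption: transcribed from AWS docs; verify before relying on it.
FARGATE_SIZES = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}


def valid_memory_mib(cpu_units: int) -> list[int]:
    """Return the memory sizes allowed for a CPU tier, or [] if unknown."""
    return FARGATE_SIZES.get(cpu_units, [])


def is_valid_size(cpu_units: int, memory_mib: int) -> bool:
    """True if the CPU/memory pair appears in the table above."""
    return memory_mib in valid_memory_mib(cpu_units)
```

For example, `is_valid_size(512, 1024)` accepts the size we run, while `is_valid_size(512, 512)` is rejected because 512 MiB is only available at the 0.25 vCPU tier.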

smplkit runs all product services on ECS Fargate, with no Kubernetes control plane. The decision reduces infrastructure overhead at our scale and, because we deploy into public subnets, avoids NAT gateway costs.