We were running a Next.js app on a single EC2 instance with Docker Compose. Every deploy took the site down for a few seconds while the new container booted. One server, one Docker Compose file, and a deployment process that required stopping the old container before the new one could start.
This is how we got to zero downtime without adding any infrastructure or cost.
The problem with a standard Docker Compose deploy
The setup was simple: a Next.js app running inside Docker Compose on EC2, with nginx as a reverse proxy. Every deploy looked like this:
- Build the Docker image locally
- Tag it with a version number
- Push to AWS ECR
- SSH into the instance
- Edit the image tag in docker-compose.prod.yml by hand
- Run docker compose up
That last step is the problem. docker compose up stops the old container first, then starts the new one. Your site is down for however long the new container takes to boot, typically 5 to 30 seconds for a Next.js app. Every single deploy.
Production cannot run like that.
What blue-green deployment actually means
Blue-green is a simple idea. You run two versions of your app simultaneously and switch traffic between them.
- One container is live (serving real traffic)
- One container is idle (either the previous version or about to become the new one)
When you deploy:
- Start the new version in the idle container
- Wait until it is healthy
- Switch the load balancer (or in our case, nginx) to point at the new container
- Shut down the old one gracefully
nginx switches traffic before you touch the old container. There is no gap where neither container is running.
On Kubernetes this is built-in. On ECS Fargate you use a target group. On a single EC2 instance, you build it yourself. It turns out it is not that complicated.
Why we stayed on EC2
The obvious next step from a manual EC2 setup is ECS Fargate. But that migration meant replacing nginx and Certbot with an Application Load Balancer and ACM, moving Redis off the instance to ElastiCache, and rewriting the deployment pipeline from scratch.
That is a lot of moving parts when the existing stack is working fine. The EC2 approach gives us zero-downtime deploys using only the tools already in the stack. We can migrate to ECS when traffic and team size make it worth the effort.
The architecture
Two app containers exist at all times: nextjs_blue and nextjs_green. Only one is live at any point. A file on disk (.active_color) tracks which one.
EC2 Instance
├── nextjs_blue OR nextjs_green (only one live - managed by deploy.sh)
├── redis (docker-compose.prod.yml)
├── nginx (docker-compose.prod.yml)
│   └── upstream.conf (rewritten on each deploy to switch containers)
└── certbot (docker-compose.prod.yml)
nginx routes via the Docker container name over the internal Docker network, not via the host port. The host ports (3000 and 3001) exist only for health checking during the deploy. They are not exposed publicly.
# nginx/conf.d/upstream.conf
upstream nextjs_upstream {
    server nextjs_blue:3000;  # or nextjs_green:3000 - swapped on each deploy
}
Six design decisions worth knowing
1. Never tag images as :latest on ECR
A mutable :latest tag is ambiguous: the Docker daemon on the instance can reuse a stale local copy and silently run the old image. Every image gets tagged with the git SHA instead:
staging-a1b2c3d4e5f6
Unambiguous. Traceable. Never stale.
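For a manual build outside CI, the same convention looks roughly like this (a sketch; ECR_REGISTRY and ECR_REPO stand in for your registry URL and repository name):

# Tag with the short git SHA instead of :latest
# (ECR_REGISTRY and ECR_REPO are placeholders, not values from the pipeline)
GIT_SHA=$(git rev-parse --short=12 HEAD)
IMAGE_TAG="staging-${GIT_SHA}"
docker build -t "$ECR_REGISTRY/$ECR_REPO:$IMAGE_TAG" ./app
docker push "$ECR_REGISTRY/$ECR_REPO:$IMAGE_TAG"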
2. IAM instance role instead of static credentials
The original setup had AWS credentials sitting as plain environment variables inside docker-compose.prod.yml on the server.
An IAM Instance Role attached to the EC2 instance replaces all of that. The AWS SDK picks up credentials automatically from the instance metadata endpoint. No keys in any file anywhere. Revoking access is just detaching the role.
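If you want to confirm the role is attached and serving credentials, the instance metadata endpoint shows it directly. A quick verification sketch (IMDSv2):

# Request an IMDSv2 session token, then list the role attached to the instance
TOKEN=$(curl -sX PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/meta-data/iam/security-credentials/"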
For GitHub Actions (which needs to push images to ECR), we created a separate IAM user with a policy scoped only to ECR push on that specific repository, not the full AmazonEC2ContainerRegistryFullAccess, which also grants delete permissions.
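The exact policy is not reproduced here, but a push-only policy scoped to one repository has roughly this shape (the region, account ID, repository name, and policy name below are placeholders):

# Sketch of an IAM policy limited to pushing a single ECR repository.
# ecr:GetAuthorizationToken must be granted on "*"; the push actions are
# scoped to one repository ARN (placeholder values shown).
cat > ecr-push-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow", "Action": "ecr:GetAuthorizationToken", "Resource": "*" },
    {
      "Effect": "Allow",
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:InitiateLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:CompleteLayerUpload",
        "ecr:PutImage"
      ],
      "Resource": "arn:aws:ecr:<region>:<account-id>:repository/<repo-name>"
    }
  ]
}
EOF
aws iam create-policy --policy-name ecr-push-only --policy-document file://ecr-push-policy.json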
3. NEXT_PUBLIC_* variables belong at build time
This one catches a lot of Next.js developers out. NEXT_PUBLIC_* variables get inlined into the compiled JavaScript bundle during next build. By the time the container starts, those values are already baked into the output files. Passing them as runtime environment variables to docker run does nothing.
The Dockerfile makes this explicit:
# Stage 2: build
FROM node:24-alpine AS builder
WORKDIR /app
# Receive NEXT_PUBLIC_* vars as build args
ARG NEXT_PUBLIC_APP_ENV
ENV NEXT_PUBLIC_APP_ENV=$NEXT_PUBLIC_APP_ENV
COPY --from=deps /app/node_modules ./node_modules
COPY . .
RUN npm run build # values are compiled into the bundle here
GitHub Actions passes them as --build-arg at build time. The deploy script on EC2 only injects genuinely runtime-only variables (REDIS_URL).
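To make the distinction concrete, here is a minimal sketch (image name and values are illustrative):

# Build time: the value is compiled into the JS bundle by `next build`
docker build --build-arg NEXT_PUBLIC_APP_ENV=staging -t app:staging-a1b2c3 ./app

# Runtime: too late for NEXT_PUBLIC_* - the bundle was already built with
# "staging", so this -e flag has no effect on the client-side code
docker run -e NEXT_PUBLIC_APP_ENV=production app:staging-a1b2c3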
4. docker stop --timeout
docker stop --timeout 30 sends SIGTERM to the Node.js process. The HTTP server stops accepting new connections and works through any requests still in flight. After 30 seconds, if the process has not exited, Docker sends SIGKILL.
For a typical web request, 30 seconds is more than enough. Most requests complete in milliseconds. The timeout is just a ceiling.
Tune this based on your workload. If your app handles long-running operations like file uploads or video processing, bump the timeout to give those requests room to finish. --timeout 120 is a reasonable starting point for upload-heavy flows. You could also let the old container keep running a bit longer and remove it manually once you are confident active requests have drained.
This is deterministic. The old container does not die until Node.js signals it is done, or the timeout fires. You always know what happened and why.
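One way to see which path was taken after the fact (a small sketch; in our script the old container is removed right after the stop, so this only applies if you delay the docker rm):

# Exit code 0 (or 143) means Node exited cleanly after SIGTERM;
# 137 means the timeout fired and Docker escalated to SIGKILL
docker inspect --format '{{.State.ExitCode}}' nextjs_blue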
5. docker-compose.prod.yml owns infrastructure, deploy.sh owns the app
Splitting responsibility cleanly:
- docker-compose.prod.yml starts and manages the stable infrastructure: redis, nginx, certbot. These containers rarely change and never need blue-green treatment.
- deploy.sh manages the app containers: nextjs_blue and nextjs_green. It creates them with docker run directly (not Compose), so it has full control over naming and the swap sequence.
# docker-compose.prod.yml - infrastructure only.
# The app containers (nextjs_blue / nextjs_green) are NOT here.
# They are managed by deploy.sh.
services:
  redis:
    image: redis:7
    container_name: redis_server
    restart: always
    networks:
      - app_net

  nginx:
    build: ./nginx
    container_name: nginx_proxy
    restart: always
    networks:
      - app_net
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx/conf.d:/etc/nginx/conf.d  # upstream.conf is rewritten by deploy.sh on each deploy

  certbot:
    image: certbot/certbot
    restart: unless-stopped
    # auto-renews SSL certs every 12h

networks:
  app_net:
    name: app_net
Both sets of containers join the same explicit Docker network so nginx can reach the app containers by name.
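A quick way to check that they actually share the network:

# List the containers attached to app_net - nginx_proxy, redis_server and the
# live nextjs container should all appear here
docker network inspect app_net --format '{{range .Containers}}{{.Name}} {{end}}'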
6. GitHub environments for secrets
Secrets for staging live in a GitHub Environment named staging. Secrets for production will live in prod. Same secret names, different values. The workflow references the environment by name. Swapping is a one-line change. No renamed secrets, no duplicate workflow files.
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    environment: staging   # change this one line for prod
    steps:
      - name: Build and push Docker image
        env:
          IMAGE_TAG: staging-${{ github.sha }}
        run: |
          # secrets.NEXT_PUBLIC_APP_ENV resolves to "staging" here,
          # and to "production" in the prod environment - same key, different value
          docker build \
            --build-arg NEXT_PUBLIC_APP_ENV=${{ secrets.NEXT_PUBLIC_APP_ENV }} \
The GitHub Actions workflow
The full pipeline triggers on every push to the staging branch:
name: Deploy - Staging

on:
  push:
    branches:
      - staging

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    environment: staging   # secrets scoped to this GitHub Environment
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Login to Amazon ECR
        id: ecr-login
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build and push Docker image
        id: build
        env:
          ECR_REGISTRY: ${{ secrets.ECR_REGISTRY }}
          ECR_REPO: ${{ secrets.ECR_REPO }}
          IMAGE_TAG: staging-${{ github.sha }}   # never :latest
        run: |
          docker build \
            --build-arg NEXT_PUBLIC_APP_ENV=${{ secrets.NEXT_PUBLIC_APP_ENV }} \
            -t $ECR_REGISTRY/$ECR_REPO:$IMAGE_TAG \
            ./app
          docker push $ECR_REGISTRY/$ECR_REPO:$IMAGE_TAG
          echo "image_tag=$IMAGE_TAG" >> "$GITHUB_OUTPUT"   # exposed to the deploy step below

      - name: Deploy to EC2 via SSH
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.EC2_HOST }}
          username: ${{ secrets.EC2_USER }}
          key: ${{ secrets.EC2_SSH_KEY }}
          script: |
            IMAGE_TAG=${{ steps.build.outputs.image_tag }} \
            ECR_REGISTRY=${{ secrets.ECR_REGISTRY }} \
            ECR_REPO=${{ secrets.ECR_REPO }} \
            AWS_REGION=${{ secrets.AWS_REGION }} \
            bash ~/scripts/deploy.sh
The last step is the entire deploy: SSH into the instance, pass the image tag, run one script.
The deploy script
deploy.sh lives on the EC2 instance. GitHub Actions calls it over SSH with the image tag as an environment variable. Here is the full logic:
#!/usr/bin/env bash
# Blue-green deploy script.
# Called by GitHub Actions via SSH. Also safe to run manually.
#
# Required env vars:
#   IMAGE_TAG      e.g. staging-a1b2c3d
#   ECR_REGISTRY   your ECR registry URL
#   ECR_REPO       the ECR repository name
#   AWS_REGION     region used for the ECR login
set -euo pipefail
DOCKER_NETWORK="app_net"
UPSTREAM_CONF="$HOME/app/nginx/conf.d/upstream.conf"
STATE_FILE="$HOME/app/.active_color"
HEALTH_RETRIES=36 # 36 x 5s = 3 min max before rollback
HEALTH_INTERVAL=5
# Resolve which container is active and which is idle
ACTIVE_COLOR=$(cat "$STATE_FILE" 2>/dev/null || echo "none")
if [[ "$ACTIVE_COLOR" == "blue" ]]; then
    NEW_COLOR="green"; NEW_HOST_PORT=3001
    OLD_COLOR="blue";  OLD_HOST_PORT=3000
else
    NEW_COLOR="blue";  NEW_HOST_PORT=3000
    OLD_COLOR="green"; OLD_HOST_PORT=3001
fi
# [1/6] Login to ECR using IAM Instance Role - no static credentials needed
aws ecr get-login-password --region "$AWS_REGION" \
| docker login --username AWS --password-stdin "$ECR_REGISTRY"
# [2/6] Pull the new image - always a specific SHA tag, never :latest
docker pull "$ECR_REGISTRY/$ECR_REPO:$IMAGE_TAG"
# [3/6] Start the idle container on a host port for health checking
docker stop "nextjs_${NEW_COLOR}" 2>/dev/null || true
docker rm "nextjs_${NEW_COLOR}" 2>/dev/null || true
# Host port bound to localhost only - it exists purely for the health check below
docker run -d \
    --name "nextjs_${NEW_COLOR}" \
    --network "$DOCKER_NETWORK" \
    --restart unless-stopped \
    -p "127.0.0.1:${NEW_HOST_PORT}:3000" \
    -e REDIS_URL="redis://redis:6379" \
    "$ECR_REGISTRY/$ECR_REPO:$IMAGE_TAG"
# [4/6] Health check - on failure, remove new container, old one keeps serving
for i in $(seq 1 "$HEALTH_RETRIES"); do
    if curl -sf --max-time 5 "http://localhost:${NEW_HOST_PORT}/" > /dev/null 2>&1; then
        echo "Health check passed (attempt $i)"
        break
    fi
    if [[ $i -eq $HEALTH_RETRIES ]]; then
        echo "Health check failed. Rolling back."
        docker stop "nextjs_${NEW_COLOR}" 2>/dev/null || true
        docker rm "nextjs_${NEW_COLOR}" 2>/dev/null || true
        exit 1
    fi
    sleep "$HEALTH_INTERVAL"
done
# [5/6] Switch nginx to the new container - graceful reload, no dropped connections
cat > "$UPSTREAM_CONF" <<EOF
upstream nextjs_upstream {
    server nextjs_${NEW_COLOR}:3000;
}
EOF
docker exec nginx_proxy nginx -s reload
# [6/6] Drain and stop the old container
# nginx has already stopped routing new traffic here.
# SIGTERM lets Node.js finish in-flight requests before exiting.
docker stop --timeout 30 "nextjs_${OLD_COLOR}" 2>/dev/null || true
docker rm "nextjs_${OLD_COLOR}" 2>/dev/null || true
# Record the new active color
echo "$NEW_COLOR" > "$STATE_FILE"
echo "Deploy complete. Active: ${NEW_COLOR} (${IMAGE_TAG})"
The health check is the safety net. If the new container fails to respond within 3 minutes, the script exits with an error, the new container is removed, and the old container is untouched and still serving. GitHub Actions marks the run as failed. You get a red check, the site never went down, and no manual intervention is needed.
Note that the check hits http://localhost:${NEW_HOST_PORT}/, which is the root route. If your root route is slow to respond or returns a redirect, you can get false negatives. Point it at a dedicated health endpoint like /api/health that returns a simple 200 immediately. Faster and unambiguous.
nginx -s reload is not a restart. It is a graceful reload. nginx forks new worker processes with the updated config, lets the old workers finish their current requests, then exits them. In-flight connections on the old container complete normally. New connections go straight to the new container.
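One small addition worth considering: validate the generated config before reloading, so a malformed upstream.conf can never take effect. A sketch (not part of the script above):

# Fail the deploy if the rewritten config is invalid; the reload only runs
# once the syntax check passes
docker exec nginx_proxy nginx -t
docker exec nginx_proxy nginx -s reload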
The script is also idempotent. It cleans up any leftover container from a previous failed deploy before starting the new one. Safe to run manually if needed.
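Running it by hand looks like this (placeholder values shown; a rollback is the same command pointed at the previous image tag):

# Manual deploy or rollback: set IMAGE_TAG to whichever SHA you want live
IMAGE_TAG=staging-<sha> \
ECR_REGISTRY=<your-ecr-registry> \
ECR_REPO=<your-repo> \
AWS_REGION=<your-region> \
bash ~/scripts/deploy.sh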
What the final state looks like
Push to 'staging' branch
        |
        v
GitHub Actions
|-- Build Docker image (NEXT_PUBLIC_* baked in via --build-arg)
|-- Push to ECR (tag: staging-{git-sha})
`-- SSH into EC2 -> run deploy.sh with the image tag
    |-- Pull new image
    |-- Start nextjs_green on :3001
    |-- Health check (3 min timeout, auto-rollback on failure)
    |-- nginx -s reload -> traffic switches to green
    |-- docker stop --timeout 30 nextjs_blue
    `-- Write "green" to .active_color
No manual steps. No static credentials on the server. No downtime. Same EC2 instance, no new costs.
When to move to ECS
This setup works well for a single-instance deployment. You will outgrow it when you need multiple EC2 instances behind a real load balancer, when the operational overhead of managing EC2 is no longer worth it for your team size, or when you need auto-scaling based on traffic.
Until then, this is solid. It covers the 80% case: a production app that needs to deploy reliably without gaps, without the complexity or cost of a full container orchestration platform.
The boilerplate
The full setup including deploy.sh, the GitHub Actions workflow, the nginx config, and the Dockerfile is available as a boilerplate on GitHub. Structured to be adapted to any stack, not just Next.js.
Built this while setting up our own staging environment. Happy to answer questions on any part of it in the replies.
Rahul Kashyap is CTO & Co-founder at Designare Solutions and DeepStory, based in Delhi.