We were running a Next.js app on a single EC2 instance with Docker Compose. Every deploy took the site down for a few seconds while the new container booted. One server, one Docker Compose file, and a deployment process that required stopping the old container before the new one could start.
This is how we got to zero downtime without adding any infrastructure or cost.
The problem with a standard Docker Compose deploy
The setup was simple: a Next.js app running inside Docker Compose on EC2, with nginx as a reverse proxy. Every deploy looked like this:
- Build the Docker image locally
- Tag it with a version number
- Push to AWS ECR
- SSH into the instance
- Edit the image tag in docker-compose.prod.yml by hand
- Run docker compose up
That last step is the problem. docker compose up stops the old container first, then starts the new one. Your site is down for however long the new container takes to boot, typically 5 to 30 seconds for a Next.js app. Every single deploy.
Production cannot run like that.
What blue-green deployment actually means
Blue-green is a simple idea. You run two versions of your app simultaneously and switch traffic between them.
- One container is live (serving real traffic)
- One container is idle (either the previous version or about to become the new one)
When you deploy:
- Start the new version in the idle container
- Wait until it is healthy
- Switch the load balancer (or in our case, nginx) to point at the new container
- Shut down the old one gracefully
nginx switches traffic before you touch the old container. There is no gap where neither container is running.
On Kubernetes this is built-in. On ECS Fargate you use a target group. On a single EC2 instance, you build it yourself. It turns out it is not that complicated.
Why we stayed on EC2
The obvious next step from a manual EC2 setup is ECS Fargate. But that migration meant replacing nginx and Certbot with an Application Load Balancer and ACM, moving Redis off the instance to ElastiCache, and rewriting the deployment pipeline from scratch.
That is a lot of moving parts when the existing stack is working fine. The EC2 approach gives us zero-downtime deploys using only the tools already in the stack. We can migrate to ECS when traffic and team size make it worth the effort.
The architecture
Two app containers exist at all times: nextjs_blue and nextjs_green. Only one is live at any point. A file on disk (.active_color) tracks which one.
EC2 Instance
├── nextjs_blue OR nextjs_green (only one live - managed by deploy.sh)
├── redis (docker-compose.prod.yml)
├── nginx (docker-compose.prod.yml)
│   └── upstream.conf (rewritten on each deploy to switch containers)
└── certbot (docker-compose.prod.yml)
nginx routes via the Docker container name over the internal Docker network, not via the host port. The host ports (3000 and 3001) exist only for health checking during the deploy. They are not exposed publicly.
# nginx/conf.d/upstream.conf
upstream nextjs_upstream {
    server nextjs_blue:3000;  # or nextjs_green:3000 - swapped on each deploy
}
Six design decisions worth knowing
1. Never tag images as :latest on ECR
A mutable :latest tag is ambiguous: the Docker daemon on the instance can reuse a stale local copy and silently run the old image. Every image gets tagged with the git SHA instead:
staging-a1b2c3d4e5f6
Unambiguous. Traceable. Never stale.
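For a manual build outside CI, the same convention looks roughly like this (a sketch; ECR_REGISTRY and ECR_REPO stand in for your registry URL and repository name):

# Tag with the short git SHA instead of :latest
# (ECR_REGISTRY and ECR_REPO are placeholders, not values from the pipeline)
GIT_SHA=$(git rev-parse --short=12 HEAD)
IMAGE_TAG="staging-${GIT_SHA}"
docker build -t "$ECR_REGISTRY/$ECR_REPO:$IMAGE_TAG" ./app
docker push "$ECR_REGISTRY/$ECR_REPO:$IMAGE_TAG"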
2. IAM instance role instead of static credentials
The original setup had AWS credentials sitting as plain environment variables inside docker-compose.prod.yml on the server.
An IAM Instance Role attached to the EC2 instance replaces all of that. The AWS SDK picks up credentials automatically from the instance metadata endpoint. No keys in any file anywhere. Revoking access is just detaching the role.
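If you want to confirm the role is attached and serving credentials, the instance metadata endpoint shows it directly. A quick verification sketch (IMDSv2):

# Request an IMDSv2 session token, then list the role attached to the instance
TOKEN=$(curl -sX PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/meta-data/iam/security-credentials/"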
For GitHub Actions (which needs to push images to ECR), we created a separate IAM user with a policy scoped only to ECR push on that specific repository, not the full AmazonEC2ContainerRegistryFullAccess, which also grants delete permissions.
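The exact policy is not reproduced here, but a push-only policy scoped to one repository has roughly this shape (the region, account ID, repository name, and policy name below are placeholders):

# Sketch of an IAM policy limited to pushing a single ECR repository.
# ecr:GetAuthorizationToken must be granted on "*"; the push actions are
# scoped to one repository ARN (placeholder values shown).
cat > ecr-push-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow", "Action": "ecr:GetAuthorizationToken", "Resource": "*" },
    {
      "Effect": "Allow",
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:InitiateLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:CompleteLayerUpload",
        "ecr:PutImage"
      ],
      "Resource": "arn:aws:ecr:<region>:<account-id>:repository/<repo-name>"
    }
  ]
}
EOF
aws iam create-policy --policy-name ecr-push-only --policy-document file://ecr-push-policy.json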
3. NEXT_PUBLIC_* variables belong at build time
This one catches a lot of Next.js developers out. NEXT_PUBLIC_* variables get inlined into the compiled JavaScript bundle during next build. By the time the container starts, those values are already baked into the output files. Passing them as runtime environment variables to docker run does nothing.
The Dockerfile makes this explicit:
# Stage 2: build
FROM node:24-alpine AS builder
WORKDIR /app
# Receive NEXT_PUBLIC_* vars as build args
ARG NEXT_PUBLIC_APP_ENV
ENV NEXT_PUBLIC_APP_ENV=$NEXT_PUBLIC_APP_ENV
COPY --from=deps /app/node_modules ./node_modules
COPY . .
RUN npm run build # values are compiled into the bundle here
GitHub Actions passes them as --build-arg at build time. The deploy script on EC2 only injects genuinely runtime-only variables (REDIS_URL).
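To make the distinction concrete, here is a minimal sketch (image name and values are illustrative):

# Build time: the value is compiled into the JS bundle by `next build`
docker build --build-arg NEXT_PUBLIC_APP_ENV=staging -t app:staging-a1b2c3 ./app

# Runtime: too late for NEXT_PUBLIC_* - the bundle was already built with
# "staging", so this -e flag has no effect on the client-side code
docker run -e NEXT_PUBLIC_APP_ENV=production app:staging-a1b2c3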
4. docker stop --timeout
docker stop --timeout 30 sends SIGTERM to the Node.js process. The HTTP server stops accepting new connections and works through any requests still in flight. After 30 seconds, if the process has not exited, Docker sends SIGKILL.
For a typical web request, 30 seconds is more than enough. Most requests complete in milliseconds. The timeout is just a ceiling.
Tune this based on your workload. If your app handles long-running operations like file uploads or video processing, bump the timeout to give those requests room to finish. --timeout 120 is a reasonable starting point for upload-heavy flows. You could also let the old container keep running a bit longer and remove it manually once you are confident active requests have drained.
This is deterministic. The old container does not die until Node.js signals it is done, or the timeout fires. You always know what happened and why.
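One way to see which path was taken after the fact (a small sketch; in our script the old container is removed right after the stop, so this only applies if you delay the docker rm):

# Exit code 0 (or 143) means Node exited cleanly after SIGTERM;
# 137 means the timeout fired and Docker escalated to SIGKILL
docker inspect --format '{{.State.ExitCode}}' nextjs_blue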
5. docker-compose.prod.yml owns infrastructure, deploy.sh owns the app
Splitting responsibility cleanly:
- docker-compose.prod.yml starts and manages the stable infrastructure: redis, nginx, certbot. These containers rarely change and never need blue-green treatment.
- deploy.sh manages the app containers: nextjs_blue and nextjs_green. It creates them with docker run directly (not Compose), so it has full control over naming and the swap sequence.
# docker-compose.prod.yml - infrastructure only.
# The app containers (nextjs_blue / nextjs_green) are NOT here.
# They are managed by deploy.sh.
services:
  redis:
    image: redis:7
    container_name: redis_server
    restart: always
    networks:
      - app_net

  nginx:
    build: ./nginx
    container_name: nginx_proxy
    restart: always
    networks:
      - app_net
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx/conf.d:/etc/nginx/conf.d  # upstream.conf is rewritten by deploy.sh on each deploy

  certbot:
    image: certbot/certbot
    restart: unless-stopped
    # auto-renews SSL certs every 12h

networks:
  app_net:
    name: app_net
Both sets of containers join the same explicit Docker network so nginx can reach the app containers by name.
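A quick way to check that they actually share the network:

# List the containers attached to app_net - nginx_proxy, redis_server and the
# live nextjs container should all appear here
docker network inspect app_net --format '{{range .Containers}}{{.Name}} {{end}}'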
6. GitHub environments for secrets
Secrets for staging live in a GitHub Environment named staging. Secrets for production will live in prod. Same secret names, different values. The workflow references the environment by name. Swapping is a one-line change. No renamed secrets, no duplicate workflow files.
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    environment: staging   # change this one line for prod
    steps:
      - name: Build and push Docker image
        env:
          IMAGE_TAG: staging-${{ github.sha }}
        run: |
          # secrets.NEXT_PUBLIC_APP_ENV resolves to "staging" here,
          # and to "production" in the prod environment - same key, different value
          docker build \
            --build-arg NEXT_PUBLIC_APP_ENV=${{ secrets.NEXT_PUBLIC_APP_ENV }} \
The GitHub Actions workflow
The full pipeline triggers on every push to the staging branch:
name: Deploy - Staging

on:
  push:
    branches:
      - staging

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    environment: staging   # secrets scoped to this GitHub Environment
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Login to Amazon ECR
        id: ecr-login
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build and push Docker image
        id: build
        env:
          ECR_REGISTRY: ${{ secrets.ECR_REGISTRY }}
          ECR_REPO: ${{ secrets.ECR_REPO }}
          IMAGE_TAG: staging-${{ github.sha }}   # never :latest
        run: |
          docker build \
            --build-arg NEXT_PUBLIC_APP_ENV=${{ secrets.NEXT_PUBLIC_APP_ENV }} \
            -t $ECR_REGISTRY/$ECR_REPO:$IMAGE_TAG \
            ./app
          docker push $ECR_REGISTRY/$ECR_REPO:$IMAGE_TAG
          echo "image_tag=$IMAGE_TAG" >> "$GITHUB_OUTPUT"   # exposed to the deploy step below

      - name: Deploy to EC2 via SSH
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.EC2_HOST }}
          username: ${{ secrets.EC2_USER }}
          key: ${{ secrets.EC2_SSH_KEY }}
          script: |
            IMAGE_TAG=${{ steps.build.outputs.image_tag }} \
            ECR_REGISTRY=${{ secrets.ECR_REGISTRY }} \
            ECR_REPO=${{ secrets.ECR_REPO }} \
            AWS_REGION=${{ secrets.AWS_REGION }} \
            bash ~/scripts/deploy.sh
The last step is the entire deploy: SSH into the instance, pass the image tag, run one script.
The deploy script
deploy.sh lives on the EC2 instance. GitHub Actions calls it over SSH with the image tag as an environment variable. Here is the full logic:
#!/usr/bin/env bash
# Blue-green deploy script.
# Called by GitHub Actions via SSH. Also safe to run manually.
#
# Required env vars:
#   IMAGE_TAG      e.g. staging-a1b2c3d
#   ECR_REGISTRY   your ECR registry URL
#   ECR_REPO       the ECR repository name
#   AWS_REGION     region used for the ECR login
set -euo pipefail
DOCKER_NETWORK="app_net"
UPSTREAM_CONF="$HOME/app/nginx/conf.d/upstream.conf"
STATE_FILE="$HOME/app/.active_color"
HEALTH_RETRIES=36 # 36 x 5s = 3 min max before rollback
HEALTH_INTERVAL=5
# Resolve which container is active and which is idle
ACTIVE_COLOR=$(cat "$STATE_FILE" 2>/dev/null || echo "none")
if [[ "$ACTIVE_COLOR" == "blue" ]]; then
    NEW_COLOR="green"; NEW_HOST_PORT=3001
    OLD_COLOR="blue";  OLD_HOST_PORT=3000
else
    NEW_COLOR="blue";  NEW_HOST_PORT=3000
    OLD_COLOR="green"; OLD_HOST_PORT=3001
fi
# [1/6] Login to ECR using IAM Instance Role - no static credentials needed
aws ecr get-login-password --region "$AWS_REGION" \
| docker login --username AWS --password-stdin "$ECR_REGISTRY"
# [2/6] Pull the new image - always a specific SHA tag, never :latest
docker pull "$ECR_REGISTRY/$ECR_REPO:$IMAGE_TAG"
# [3/6] Start the idle container on a host port for health checking
docker stop "nextjs_${NEW_COLOR}" 2>/dev/null || true
docker rm "nextjs_${NEW_COLOR}" 2>/dev/null || true
# Host port bound to localhost only - it exists purely for the health check below
docker run -d \
    --name "nextjs_${NEW_COLOR}" \
    --network "$DOCKER_NETWORK" \
    --restart unless-stopped \
    -p "127.0.0.1:${NEW_HOST_PORT}:3000" \
    -e REDIS_URL="redis://redis:6379" \
    "$ECR_REGISTRY/$ECR_REPO:$IMAGE_TAG"
# [4/6] Health check - on failure, remove new container, old one keeps serving
for i in $(seq 1 "$HEALTH_RETRIES"); do
    if curl -sf --max-time 5 "http://localhost:${NEW_HOST_PORT}/" > /dev/null 2>&1; then
        echo "Health check passed (attempt $i)"
        break
    fi
    if [[ $i -eq $HEALTH_RETRIES ]]; then
        echo "Health check failed. Rolling back."
        docker stop "nextjs_${NEW_COLOR}" 2>/dev/null || true
        docker rm "nextjs_${NEW_COLOR}" 2>/dev/null || true
        exit 1
    fi
    sleep "$HEALTH_INTERVAL"
done
# [5/6] Switch nginx to the new container - graceful reload, no dropped connections
cat > "$UPSTREAM_CONF" <<EOF
upstream nextjs_upstream {
    server nextjs_${NEW_COLOR}:3000;
}
EOF
docker exec nginx_proxy nginx -s reload
# [6/6] Drain and stop the old container
# nginx has already stopped routing new traffic here.
# SIGTERM lets Node.js finish in-flight requests before exiting.
docker stop --timeout 30 "nextjs_${OLD_COLOR}" 2>/dev/null || true
docker rm "nextjs_${OLD_COLOR}" 2>/dev/null || true
# Record the new active color
echo "$NEW_COLOR" > "$STATE_FILE"
echo "Deploy complete. Active: ${NEW_COLOR} (${IMAGE_TAG})"
The health check is the safety net. If the new container fails to respond within 3 minutes, the script exits with an error, the new container is removed, and the old container is untouched and still serving. GitHub Actions marks the run as failed. You get a red check, the site never went down, and no manual intervention is needed.
Note that the check hits http://localhost:${NEW_HOST_PORT}/, which is the root route. If your root route is slow to respond or returns a redirect, you can get false negatives. Point it at a dedicated health endpoint like /api/health that returns a simple 200 immediately. Faster and unambiguous.
nginx -s reload is not a restart. It is a graceful reload. nginx forks new worker processes with the updated config, lets the old workers finish their current requests, then exits them. In-flight connections on the old container complete normally. New connections go straight to the new container.
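One small addition worth considering: validate the generated config before reloading, so a malformed upstream.conf can never take effect. A sketch (not part of the script above):

# Fail the deploy if the rewritten config is invalid; the reload only runs
# once the syntax check passes
docker exec nginx_proxy nginx -t
docker exec nginx_proxy nginx -s reload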
The script is also idempotent. It cleans up any leftover container from a previous failed deploy before starting the new one. Safe to run manually if needed.
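Running it by hand looks like this (placeholder values shown; a rollback is the same command pointed at the previous image tag):

# Manual deploy or rollback: set IMAGE_TAG to whichever SHA you want live
IMAGE_TAG=staging-<sha> \
ECR_REGISTRY=<your-ecr-registry> \
ECR_REPO=<your-repo> \
AWS_REGION=<your-region> \
bash ~/scripts/deploy.sh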
What the final state looks like
Push to 'staging' branch
        |
        v
GitHub Actions
|-- Build Docker image (NEXT_PUBLIC_* baked in via --build-arg)
|-- Push to ECR (tag: staging-{git-sha})
`-- SSH into EC2 -> run deploy.sh with the image tag
    |-- Pull new image
    |-- Start nextjs_green on :3001
    |-- Health check (3 min timeout, auto-rollback on failure)
    |-- nginx -s reload -> traffic switches to green
    |-- docker stop --timeout 30 nextjs_blue
    `-- Write "green" to .active_color
No manual steps. No static credentials on the server. No downtime. Same EC2 instance, no new costs.
When to move to ECS
This setup works well for a single-instance deployment. You will outgrow it when you need multiple EC2 instances behind a real load balancer, when the operational overhead of managing EC2 is no longer worth it for your team size, or when you need auto-scaling based on traffic.
Until then, this is solid. It covers the 80% case: a production app that needs to deploy reliably without gaps, without the complexity or cost of a full container orchestration platform.
The boilerplate
The full setup including deploy.sh, the GitHub Actions workflow, the nginx config, and the Dockerfile is available as a boilerplate on GitHub. Structured to be adapted to any stack, not just Next.js.
Built this while setting up our own staging environment. Happy to answer questions on any part of it in the replies.
Rahul Kashyap is CTO & Co-founder at Designare Solutions and DeepStory, based in Delhi.