Disclaimer: This blog post is automatically generated from project documentation and technical proposals using AI assistance. The content represents our development journey and architectural decisions. Code examples are simplified illustrations and may not reflect the exact production implementation.
The Port Explosion Problem
I was reviewing our Docker Compose file when Caroline pointed out a problem: “We’re exposing way too many ports to the host.”
She was right. Our current configuration looked like this:
```yaml
traefik:
  ports:
    - '8080:80'
    - '8443:443'

postgres:
  ports:
    - '5432:5432'

clickhouse:
  ports:
    - '8123:8123'
    - '9000:9000'

nats:
  ports:
    - '4222:4222'
    - '8222:8222'
```
“That’s seven host-published ports before we even count the application services,” Caroline said. “And we’re not even using standard HTTP/HTTPS ports.”
Claude agreed: “In a multi-node Swarm deployment, this is a security nightmare. Anyone who knows your IP can probe those ports.”
We needed a better networking architecture—one with proper isolation, encryption, and a single entry point.
The Overlay Network Solution
Caroline suggested Docker Swarm’s overlay networks: “We create isolated networks for different tiers, and only expose Traefik on standard ports.”
Claude sketched out the architecture:
```text
Internet (80/443) → Traefik → Overlay Networks → Services
```
The key insight: only Traefik touches the internet. Everything else communicates over encrypted overlay networks.
Network Tier Design
We designed four logical tiers:
- web_tier: API, WebSocket, Events, UI (public-facing services)
- infrastructure_tier: PostgreSQL, ClickHouse, NATS (databases and message bus)
- job_tier: Sink, Realtime, Projections (background workers)
- ml_tier: Text Classifier, Speech-to-Text (ML inference services)
Caroline noted: “Services can belong to multiple networks. The API needs access to both web_tier (for Traefik) and infrastructure_tier (for databases).”
Implementation: Network Definitions
We started with the network definitions in compose.yaml:
```yaml
networks:
  web_tier:
    driver: overlay
    attachable: false
    driver_opts:
      encrypted: 'true'
    labels:
      - 'tier=web'
  infrastructure_tier:
    driver: overlay
    attachable: false
    driver_opts:
      encrypted: 'true'
    labels:
      - 'tier=infrastructure'
  job_tier:
    driver: overlay
    attachable: false
    driver_opts:
      encrypted: 'true'
    labels:
      - 'tier=job'
  ml_tier:
    driver: overlay
    attachable: false
    driver_opts:
      encrypted: 'true'
    labels:
      - 'tier=ml'
```
Claude explained the key properties:
- driver: overlay - Multi-node network spanning the entire Swarm
- attachable: false - Only services in the stack can attach (no ad-hoc containers)
- driver_opts: encrypted: 'true' - IPSec encryption for all traffic between nodes (in a compose file the overlay encryption flag lives under driver_opts)
- labels - Helps with monitoring and filtering
Traefik: The Single Entry Point
Traefik is the only service exposed to the internet:
```yaml
traefik:
  ports:
    - target: 80
      published: 80
      protocol: tcp
      mode: host
    - target: 443
      published: 443
      protocol: tcp
      mode: host
  networks:
    - web_tier            # Routes to public-facing services
    - infrastructure_tier # Dashboard access (internal only)
```
Caroline noted: “We use mode: host for better performance—direct port binding instead of ingress load balancing.”
I asked: “Why does Traefik need infrastructure_tier access?”
“For the dashboard,” Claude answered. “We’ll configure it to be accessible only via specific Host headers, not public.”
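That wiring didn't make it into this post, but a minimal sketch looks like the following. The hostname and the basic-auth hash are placeholders, and api@internal is Traefik's built-in dashboard service (it requires the dashboard to be enabled in the static config):

```yaml
traefik:
  deploy:
    labels: # Swarm-mode Traefik reads labels from the service spec (deploy.labels)
      - 'traefik.enable=true'
      # Only respond to an internal hostname, never a public one
      - 'traefik.http.routers.dashboard.rule=Host(`traefik.internal.example.com`)'
      - 'traefik.http.routers.dashboard.service=api@internal'
      # Basic auth as a second layer; generate the hash with htpasswd
      # (note the doubled $$ to escape $ in compose files)
      - 'traefik.http.routers.dashboard.middlewares=dashboard-auth'
      - 'traefik.http.middlewares.dashboard-auth.basicauth.users=admin:$$apr1$$examplehash'
```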
Service Network Assignments
Next, we assigned each service to the appropriate networks:
```yaml
# Web tier services
api:
  networks:
    - web_tier            # Traefik routes here
    - infrastructure_tier # Needs database/NATS
  # NO ports exposed!

websocket:
  networks:
    - web_tier
    - infrastructure_tier
    - ml_tier             # Needs ML services

events:
  networks:
    - web_tier
    - infrastructure_tier

ui:
  networks:
    - web_tier
    - infrastructure_tier
```
Caroline emphasized: “Notice we removed all ports configurations from these services. Traefik handles external access via HTTP Host headers.”
Infrastructure Services
For databases and message bus:
```yaml
postgres:
  networks:
    - infrastructure_tier
  # NO ports exposed to host

clickhouse:
  networks:
    - infrastructure_tier

nats:
  networks:
    - infrastructure_tier

otel-collector:
  networks:
    - infrastructure_tier
    - web_tier # Collects metrics from all tiers
    - job_tier
    - ml_tier
```
I asked: “How do we access PostgreSQL for debugging if it’s not exposed?”
Claude answered: “You exec into a container on the same network: docker exec -it $(docker ps -q -f name=api) psql -h postgres -U admin”
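Because the networks are not attachable, you can't start a throwaway container on them; you go through a container that's already there. A few illustrative variations on the same pattern (the client binaries must actually exist in the image you exec into):

```sh
# PostgreSQL via the API container (both sit on infrastructure_tier)
docker exec -it $(docker ps -q -f name=api) psql -h postgres -U admin

# ClickHouse via a job-tier worker
docker exec -it $(docker ps -q -f name=sink) clickhouse-client --host clickhouse

# NATS monitoring endpoint (assumes wget is present in the image)
docker exec -it $(docker ps -q -f name=api) wget -qO- http://nats:8222/varz
```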
Background Workers
Job tier services process events asynchronously:
```yaml
sink:
  networks:
    - job_tier
    - infrastructure_tier # Needs NATS and ClickHouse

realtime:
  networks:
    - job_tier
    - infrastructure_tier

projections:
  networks:
    - job_tier
    - infrastructure_tier
```
ML Services
Machine learning services are completely isolated:
```yaml
text-classifier:
  networks:
    - ml_tier
  # NO ports, NO infrastructure access

speech-to-text:
  networks:
    - ml_tier
```
Caroline explained: “ML services don’t need database access. They only receive requests from the WebSocket service over ml_tier.”
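For illustration, here is roughly what that call looks like from the WebSocket service. The endpoint path, port, and response shape are assumptions, not the production contract:

```js
// Hypothetical sketch: the WebSocket service calling the classifier over ml_tier.
// 'text-classifier' resolves via Swarm DNS; no IPs or host-exposed ports involved.
async function classify(text) {
  const res = await fetch('http://text-classifier:8000/classify', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text }),
  });
  if (!res.ok) throw new Error(`classifier returned ${res.status}`);
  return res.json(); // e.g. { label, score } -- assumed shape
}
```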
Service Discovery Magic
Caroline demonstrated how service discovery works:
```js
// Old way (hardcoded IPs)
const nats = connect({ servers: ['nats://192.168.1.10:4222'] });

// New way (DNS-based discovery)
const nats = connect({ servers: ['nats://nats:4222'] });
const db = new Pool({ host: 'postgres', port: 5432 });
const clickhouse = createClient({ host: 'clickhouse' });
```
“Docker Swarm provides automatic DNS resolution,” she explained. “The service name (nats, postgres) resolves to a virtual IP, and Swarm’s built-in load balancer spreads connections across the replicas behind it.”
I tested it:
```sh
docker exec -it $(docker ps -q -f name=api) sh
ping postgres
# PING postgres (10.0.3.2): 56 data bytes
# 64 bytes from 10.0.3.2: seq=0 ttl=64 time=0.123 ms
```
“It just works,” I said.
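One detail worth knowing: the bare service name resolves to a single virtual IP, while the special tasks.&lt;service&gt; name returns one record per replica (assuming a DNS tool like nslookup is available in the image):

```sh
# Inside any container on the same overlay network
nslookup postgres        # one answer: the service's virtual IP (VIP)
nslookup tasks.postgres  # one answer per running replica
```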
Traefik Configuration
Claude helped us configure Traefik for automatic service discovery:
```yaml
# configs/traefik/traefik.yml
providers:
  docker:
    endpoint: 'unix:///var/run/docker.sock'
    exposedByDefault: false
    network: web_tier # Only discover services on web_tier
    swarmMode: true   # Enable Swarm mode
    watch: true       # Automatically detect changes
```
Caroline added labels to each public-facing service:
```yaml
api:
  deploy:
    labels: # In Swarm mode Traefik reads service labels, so these sit under deploy
      - 'traefik.enable=true'
      - 'traefik.http.routers.api.rule=Host(`api.example.com`)'
      - 'traefik.http.services.api.loadbalancer.server.port=3000'

websocket:
  deploy:
    labels:
      - 'traefik.enable=true'
      - 'traefik.http.routers.ws.rule=Host(`stream.example.com`)'
      - 'traefik.http.services.ws.loadbalancer.server.port=8010'
```
“Now Traefik routes based on Host headers,” Caroline explained. “No need to remember port numbers.”
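Since the tests below are all HTTPS, Traefik also terminates TLS at the entry point. A minimal sketch with Let's Encrypt; the email and storage path are placeholders, and this is not our exact production file:

```yaml
# configs/traefik/traefik.yml (sketch)
entryPoints:
  web:
    address: ':80'
  websecure:
    address: ':443'

certificatesResolvers:
  letsencrypt:
    acme:
      email: ops@example.com          # placeholder
      storage: /letsencrypt/acme.json # persist certificates across restarts
      httpChallenge:
        entryPoint: web
```

Each router then opts in with a traefik.http.routers.&lt;name&gt;.tls.certresolver=letsencrypt label.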
Security Benefits
Attack Surface Reduction
Caroline showed me the firewall configuration:
Before (old architecture):
```text
# Exposed ports
80   (Traefik HTTP)
443  (Traefik HTTPS)
5432 (PostgreSQL)
8123 (ClickHouse HTTP)
9000 (ClickHouse native)
4222 (NATS)
3000 (API)
8010 (WebSocket)
8020 (Events)
```
After (overlay networks):
```text
# Exposed ports
80  (Traefik HTTP)
443 (Traefik HTTPS)
```
“We went from 9 exposed ports to 2,” Caroline said. “That’s a 78% reduction in attack surface.”
Network Isolation
Claude explained the isolation benefits:
```text
❌ API cannot access ml_tier (different network)
❌ ML services cannot access infrastructure_tier (different network)
❌ Internet cannot reach PostgreSQL (no public port)
✅ API can access postgres (both on infrastructure_tier)
✅ WebSocket can access text-classifier (both on ml_tier)
```
“Even if an attacker compromises the API service,” Claude said, “they can’t access ML services or job tier workers because those are on separate networks.”
Encrypted Communication
Caroline highlighted the encryption benefits:
```yaml
networks:
  web_tier:
    driver: overlay
    driver_opts:
      encrypted: 'true' # IPSec encryption for inter-node traffic
```
“All traffic between nodes is encrypted,” she explained. “If someone captures packets between your DigitalOcean droplets, they see encrypted data.”
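You can verify the flag on a deployed network. Assuming the stack is named scores, something like:

```sh
# The overlay driver options should include the "encrypted" key
docker network inspect scores_web_tier --format '{{json .Options}}'
```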
Simplified Firewall Rules
I checked our DigitalOcean firewall configuration:
Before:
```text
Allow TCP 80   from 0.0.0.0/0
Allow TCP 443  from 0.0.0.0/0
Allow TCP 5432 from 0.0.0.0/0   # Dangerous!
Allow TCP 8123 from 0.0.0.0/0   # Dangerous!
Allow TCP 4222 from 0.0.0.0/0   # Dangerous!
Allow TCP 3000 from 0.0.0.0/0
Allow TCP 8010 from 0.0.0.0/0
```
After:
```text
Allow TCP 80   from 0.0.0.0/0
Allow TCP 443  from 0.0.0.0/0
Allow TCP 2377 from <manager-ips>  # Swarm management
Allow TCP 7946 from <swarm-ips>    # Swarm discovery
Allow UDP 7946 from <swarm-ips>    # Swarm discovery
Allow UDP 4789 from <swarm-ips>    # Overlay network (VXLAN)
```
Caroline smiled: “Only 2 public ports. The rest are Swarm-specific and restricted to internal IPs.”
Scalability and Load Balancing
Claude demonstrated automatic load balancing:
```yaml
api:
  deploy:
    replicas: 3 # 3 API instances
```
“Traefik automatically discovers all 3 replicas,” Claude explained. “It load balances requests across them using round-robin.”
Caroline added: “And if you scale up or down, Traefik updates automatically—no config changes needed.”
```sh
# Scale API to 5 replicas
docker service scale scores_api=5

# Traefik automatically detects:
# - api.1 on node1
# - api.2 on node2
# - api.3 on node3
# - api.4 on node1
# - api.5 on node2
```
High Availability Configuration
Caroline configured Traefik for high availability:
```yaml
traefik:
  deploy:
    replicas: 3
    update_config:
      parallelism: 1     # Update one at a time
      delay: 10s         # Wait 10s between updates
      order: start-first # Start new before stopping old
```
“This ensures zero downtime during Traefik updates,” she explained. “New containers start, wait for health checks, then old containers stop.”
I asked: “What if a Traefik instance crashes?”
“Docker Swarm automatically restarts it,” Claude answered. “And the other 2 instances continue serving traffic.”
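Worth noting: start-first only delivers zero downtime if Swarm can tell when a new task is healthy. A sketch of the health check we'd pair with it; the intervals are illustrative, and traefik healthcheck requires the ping endpoint to be enabled in the static config:

```yaml
traefik:
  healthcheck:
    test: ['CMD', 'traefik', 'healthcheck', '--ping']
    interval: 10s
    timeout: 3s
    retries: 3
```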
Cost Savings
Caroline did the math on load balancer costs:
Before (using DigitalOcean Load Balancers):
```text
Load Balancer for API:       $12/month
Load Balancer for WebSocket: $12/month
Load Balancer for UI:        $12/month
Total:                       $36/month
```
After (using Traefik):
```text
Traefik (runs on droplets): $0/month extra
Total:                      $0/month
```
“We save $36/month by using Traefik instead of external load balancers,” Caroline said. “Plus, we get better performance because there’s no extra network hop.”
Development/Production Parity
I tested the configuration locally:
```sh
# Local development
docker swarm init
docker stack deploy -c compose.yaml scores

# Production (DigitalOcean)
docker swarm init
docker stack deploy -c compose.yaml scores
```
“Same command, same configuration,” I noted. “No more docker-compose.dev.yaml vs docker-compose.prod.yaml.”
Caroline agreed: “This eliminates ‘works on my machine’ problems. If it works locally, it works in production.”
Deployment Checklist
Claude provided a production deployment checklist:
1. Create Droplets
```sh
# 3 manager nodes (HA)
doctl compute droplet create manager1 --size s-2vcpu-4gb --image ubuntu-22-04-x64
doctl compute droplet create manager2 --size s-2vcpu-4gb --image ubuntu-22-04-x64
doctl compute droplet create manager3 --size s-2vcpu-4gb --image ubuntu-22-04-x64

# 3 worker nodes (web tier)
doctl compute droplet create worker1 --size s-2vcpu-4gb --image ubuntu-22-04-x64
doctl compute droplet create worker2 --size s-2vcpu-4gb --image ubuntu-22-04-x64
doctl compute droplet create worker3 --size s-2vcpu-4gb --image ubuntu-22-04-x64
```
2. Initialize Swarm
```sh
# On manager1
docker swarm init --advertise-addr <manager1-ip>

# On manager2 & manager3
docker swarm join --token <manager-token> <manager1-ip>:2377

# On worker nodes
docker swarm join --token <worker-token> <manager1-ip>:2377
```
3. Label Nodes
```sh
# Infrastructure tier (managers run databases)
docker node update --label-add tier=infrastructure manager1
docker node update --label-add tier=infrastructure manager2
docker node update --label-add tier=infrastructure manager3

# Web tier (workers run public-facing services)
docker node update --label-add tier=web worker1
docker node update --label-add tier=web worker2
docker node update --label-add tier=web worker3
```
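The labels only take effect once services declare matching placement constraints in compose.yaml, sketched here for two of the services:

```yaml
postgres:
  deploy:
    placement:
      constraints:
        - node.labels.tier == infrastructure # pin databases to manager nodes

api:
  deploy:
    placement:
      constraints:
        - node.labels.tier == web # public-facing services land on workers
```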
4. Deploy Stack
```sh
docker stack deploy -c compose.yaml scores
```
Caroline added: “That’s it. Docker Swarm creates all the overlay networks, starts services on the appropriate nodes, and Traefik starts routing traffic.”
Testing and Validation
We tested the deployment systematically:
Service Discovery
```sh
# From the API container
docker exec -it $(docker ps -q -f name=api) sh
ping postgres         # ✅ Should succeed (same network)
ping nats             # ✅ Should succeed
ping text-classifier  # ❌ Should fail (different network)
```
External Access
```sh
# From the internet
curl https://api.example.com/health  # ✅ Should succeed
curl https://example.com             # ✅ Should succeed (UI)
curl http://<manager-ip>:5432        # ❌ Should fail (PostgreSQL not exposed)
```
Network Isolation
```sh
# Try to connect to PostgreSQL from outside
psql -h <manager-ip> -U admin -d scores
# psql: error: connection to server at "<ip>", port 5432 failed
```
“Perfect,” Caroline said. “PostgreSQL is only accessible from inside the infrastructure_tier network.”
Observability
Caroline configured Traefik metrics:
```yaml
metrics:
  otlp:
    addEntryPointsLabels: true
    addRoutersLabels: true
    addServicesLabels: true
```
“Now we get detailed metrics for every route,” she explained:
- traefik_service_requests_total{service="api"} - total requests
- traefik_service_request_duration_seconds{service="api"} - request latency
- traefik_service_open_connections{service="websocket"} - active WebSocket connections
All metrics flow into our OpenTelemetry collector and appear in Grafana dashboards.
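For context, a minimal collector pipeline for this setup might look like the following. The file path and the Prometheus exporter are assumptions; our actual pipeline config isn't shown in this post:

```yaml
# Hypothetical configs/otel/config.yaml: accept OTLP from Traefik and the
# services, expose a Prometheus scrape endpoint that feeds the Grafana dashboards
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```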
Lessons Learned
After deploying to production, we reflected on the experience:
What Worked Well
- Single entry point - Only Traefik exposed, everything else internal
- Network isolation - Services can only access what they need
- Automatic service discovery - No hardcoded IPs
- Zero downtime deployments - Rolling updates with health checks
- Cost savings - No external load balancers needed
Challenges
- Debugging - Had to learn docker exec patterns for accessing services
- Network troubleshooting - docker network inspect became essential
- Initial setup - Swarm initialization took some trial and error
- Node placement - Had to think carefully about which services run where
Security Improvements
Caroline summarized the security wins:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Exposed ports | 9 ports | 2 ports | 78% reduction |
| Public database access | ❌ Yes | ✅ No | Blocked |
| Encrypted communication | ❌ No | ✅ Yes | IPSec enabled |
| Attack surface | High | Low | Significantly reduced |
Mermaid Diagram: Network Architecture
```mermaid
graph TB
    subgraph Internet["🌐 Internet"]
        USERS[Users]
    end

    subgraph PublicPorts["📡 Public Ports (80/443)"]
        TRAEFIK[Traefik<br/>Load Balancer]
    end

    subgraph WebTier["🎨 Web Tier Network (Overlay, Encrypted)"]
        API[API<br/>3 replicas]
        WS[WebSocket<br/>3 replicas]
        EVENTS[Events<br/>3 replicas]
        UI[UI<br/>3 replicas]
    end

    subgraph InfraTier["💾 Infrastructure Tier (Overlay, Encrypted)"]
        PG[(PostgreSQL)]
        CH[(ClickHouse)]
        NATS[(NATS)]
        OTEL[OpenTelemetry]
    end

    subgraph JobTier["⚙️ Job Tier (Overlay, Encrypted)"]
        SINK[Sink<br/>2 replicas]
        RT[Realtime<br/>2 replicas]
        PROJ[Projections<br/>1 replica]
    end

    subgraph MLTier["🤖 ML Tier (Overlay, Encrypted)"]
        TEXT[Text Classifier]
        SPEECH[Speech to Text]
    end

    USERS -->|HTTP/HTTPS| TRAEFIK
    TRAEFIK -.->|Host: api.example.com| API
    TRAEFIK -.->|Host: stream.example.com| WS
    TRAEFIK -.->|Host: events.example.com| EVENTS
    TRAEFIK -.->|Host: example.com| UI

    API --> PG
    API --> NATS
    WS --> PG
    WS --> TEXT
    WS --> SPEECH
    EVENTS --> NATS
    SINK --> NATS
    SINK --> CH
    RT --> NATS
    RT --> CH
    PROJ --> PG

    OTEL -.->|Metrics| API
    OTEL -.->|Metrics| WS
    OTEL -.->|Metrics| SINK

    classDef internet fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    classDef public fill:#fff3e0,stroke:#f57c00,stroke-width:3px
    classDef web fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px
    classDef infra fill:#fce4ec,stroke:#c2185b,stroke-width:3px
    classDef job fill:#f3e5f5,stroke:#7b1fa2,stroke-width:3px
    classDef ml fill:#fff9c4,stroke:#f57f17,stroke-width:3px

    class USERS internet
    class TRAEFIK public
    class API,WS,EVENTS,UI web
    class PG,CH,NATS,OTEL infra
    class SINK,RT,PROJ job
    class TEXT,SPEECH ml
```
Takeaways
- Overlay networks enable multi-node isolation - Services on different networks can’t communicate, even on the same host.
- Traefik as the single entry point - Simplifies firewall rules, reduces attack surface, centralizes TLS.
- Service discovery just works - DNS-based resolution with automatic load balancing across replicas.
- Encrypted by default - IPSec encryption for all inter-node traffic via the overlay driver's encrypted option.
- Zero-cost load balancing - Traefik runs on your existing droplets, no need for external load balancers.
Caroline summed it up: “This architecture is production-ready. We have security, scalability, and observability.”
Claude agreed: “And it scales from a single-node dev environment to a multi-node production cluster with zero config changes.”
I was just happy our PostgreSQL database was no longer exposed to the internet.