Disclaimer: This blog post is automatically generated from project documentation and technical proposals using AI assistance. The content represents our development journey and architectural decisions. Code examples are simplified illustrations and may not reflect the exact production implementation.
The Port Explosion Problem
I was reviewing our Docker Compose file when Caroline pointed out a problem: “We’re exposing way too many ports to the host.”
She was right. Our current configuration looked like this:
```yaml
traefik:
  ports:
    - '8080:80'
    - '8443:443'

postgres:
  ports:
    - '5432:5432'

clickhouse:
  ports:
    - '8123:8123'
    - '9000:9000'

nats:
  ports:
    - '4222:4222'
    - '8222:8222'
```
“That’s seven host-published ports before we even count the application services,” Caroline said. “And we’re not even using standard HTTP/HTTPS ports.”
Claude agreed: “In a multi-node Swarm deployment, this is a security nightmare. Anyone who knows your IP can probe those ports.”
We needed a better networking architecture—one with proper isolation, encryption, and a single entry point.
The Overlay Network Solution
Caroline suggested Docker Swarm’s overlay networks: “We create isolated networks for different tiers, and only expose Traefik on standard ports.”
Claude sketched out the architecture:
```text
Internet (80/443) → Traefik → Overlay Networks → Services
```
The key insight: only Traefik touches the internet. Everything else communicates over encrypted overlay networks.
Network Tier Design
We designed four logical tiers:
- web_tier: API, WebSocket, Events, UI (public-facing services)
- infrastructure_tier: PostgreSQL, ClickHouse, NATS (databases and message bus)
- job_tier: Sink, Realtime, Projections (background workers)
- ml_tier: Text Classifier, Speech-to-Text (ML inference services)
Caroline noted: “Services can belong to multiple networks. The API needs access to both web_tier (for Traefik) and infrastructure_tier (for databases).”
Implementation: Network Definitions
We started with the network definitions in compose.yaml:
```yaml
networks:
  web_tier:
    driver: overlay
    attachable: false
    driver_opts:
      encrypted: 'true'
    labels:
      - 'tier=web'
  infrastructure_tier:
    driver: overlay
    attachable: false
    driver_opts:
      encrypted: 'true'
    labels:
      - 'tier=infrastructure'
  job_tier:
    driver: overlay
    attachable: false
    driver_opts:
      encrypted: 'true'
    labels:
      - 'tier=job'
  ml_tier:
    driver: overlay
    attachable: false
    driver_opts:
      encrypted: 'true'
    labels:
      - 'tier=ml'
```
Claude explained the key properties:
- driver: overlay - Multi-node network spanning the entire Swarm
- attachable: false - Only services in the stack can attach (no ad-hoc containers)
- driver_opts: encrypted: 'true' - IPSec encryption for all traffic between nodes (in a compose file the overlay encryption flag lives under driver_opts)
- labels - Helps with monitoring and filtering
Traefik: The Single Entry Point
Traefik is the only service exposed to the internet:
```yaml
traefik:
  ports:
    - target: 80
      published: 80
      protocol: tcp
      mode: host
    - target: 443
      published: 443
      protocol: tcp
      mode: host
  networks:
    - web_tier            # Routes to public-facing services
    - infrastructure_tier # Dashboard access (internal only)
```
Caroline noted: “We use mode: host for better performance—direct port binding instead of ingress load balancing.”
I asked: “Why does Traefik need infrastructure_tier access?”
“For the dashboard,” Claude answered. “We’ll configure it to be accessible only via specific Host headers, not public.”
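That wiring didn't make it into this post, but a minimal sketch looks like the following. The hostname and the basic-auth hash are placeholders, and api@internal is Traefik's built-in dashboard service (it requires the dashboard to be enabled in the static config):

```yaml
traefik:
  deploy:
    labels: # Swarm-mode Traefik reads labels from the service spec (deploy.labels)
      - 'traefik.enable=true'
      # Only respond to an internal hostname, never a public one
      - 'traefik.http.routers.dashboard.rule=Host(`traefik.internal.example.com`)'
      - 'traefik.http.routers.dashboard.service=api@internal'
      # Basic auth as a second layer; generate the hash with htpasswd
      # (note the doubled $$ to escape $ in compose files)
      - 'traefik.http.routers.dashboard.middlewares=dashboard-auth'
      - 'traefik.http.middlewares.dashboard-auth.basicauth.users=admin:$$apr1$$examplehash'
```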
Service Network Assignments
Next, we assigned each service to the appropriate networks:
```yaml
# Web tier services
api:
  networks:
    - web_tier            # Traefik routes here
    - infrastructure_tier # Needs database/NATS
  # NO ports exposed!

websocket:
  networks:
    - web_tier
    - infrastructure_tier
    - ml_tier             # Needs ML services

events:
  networks:
    - web_tier
    - infrastructure_tier

ui:
  networks:
    - web_tier
    - infrastructure_tier
```
Caroline emphasized: “Notice we removed all ports configurations from these services. Traefik handles external access via HTTP Host headers.”
Infrastructure Services
For databases and message bus:
```yaml
postgres:
  networks:
    - infrastructure_tier
  # NO ports exposed to host

clickhouse:
  networks:
    - infrastructure_tier

nats:
  networks:
    - infrastructure_tier

otel-collector:
  networks:
    - infrastructure_tier
    - web_tier # Collects metrics from all tiers
    - job_tier
    - ml_tier
```
I asked: “How do we access PostgreSQL for debugging if it’s not exposed?”
Claude answered: “You exec into a container on the same network: docker exec -it $(docker ps -q -f name=api) psql -h postgres -U admin”
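Because the networks are not attachable, you can't start a throwaway container on them; you go through a container that's already there. A few illustrative variations on the same pattern (the client binaries must actually exist in the image you exec into):

```sh
# PostgreSQL via the API container (both sit on infrastructure_tier)
docker exec -it $(docker ps -q -f name=api) psql -h postgres -U admin

# ClickHouse via a job-tier worker
docker exec -it $(docker ps -q -f name=sink) clickhouse-client --host clickhouse

# NATS monitoring endpoint (assumes wget is present in the image)
docker exec -it $(docker ps -q -f name=api) wget -qO- http://nats:8222/varz
```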
Background Workers
Job tier services process events asynchronously:
```yaml
sink:
  networks:
    - job_tier
    - infrastructure_tier # Needs NATS and ClickHouse

realtime:
  networks:
    - job_tier
    - infrastructure_tier

projections:
  networks:
    - job_tier
    - infrastructure_tier
```
ML Services
Machine learning services are completely isolated:
```yaml
text-classifier:
  networks:
    - ml_tier
  # NO ports, NO infrastructure access

speech-to-text:
  networks:
    - ml_tier
```
Caroline explained: “ML services don’t need database access. They only receive requests from the WebSocket service over ml_tier.”
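For illustration, here is roughly what that call looks like from the WebSocket service. The endpoint path, port, and response shape are assumptions, not the production contract:

```js
// Hypothetical sketch: the WebSocket service calling the classifier over ml_tier.
// 'text-classifier' resolves via Swarm DNS; no IPs or host-exposed ports involved.
async function classify(text) {
  const res = await fetch('http://text-classifier:8000/classify', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text }),
  });
  if (!res.ok) throw new Error(`classifier returned ${res.status}`);
  return res.json(); // e.g. { label, score } -- assumed shape
}
```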
Service Discovery Magic
Caroline demonstrated how service discovery works:
```js
// Old way (hardcoded IPs)
const nats = connect({ servers: ['nats://192.168.1.10:4222'] });

// New way (DNS-based discovery)
const nats = connect({ servers: ['nats://nats:4222'] });
const db = new Pool({ host: 'postgres', port: 5432 });
const clickhouse = createClient({ host: 'clickhouse' });
```
“Docker Swarm provides automatic DNS resolution,” she explained. “The service name (nats, postgres) resolves to a virtual IP, and Swarm’s built-in load balancer spreads connections across the replicas behind it.”
I tested it:
```sh
docker exec -it $(docker ps -q -f name=api) sh
ping postgres
# PING postgres (10.0.3.2): 56 data bytes
# 64 bytes from 10.0.3.2: seq=0 ttl=64 time=0.123 ms
```
“It just works,” I said.
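One detail worth knowing: the bare service name resolves to a single virtual IP, while the special tasks.&lt;service&gt; name returns one record per replica (assuming a DNS tool like nslookup is available in the image):

```sh
# Inside any container on the same overlay network
nslookup postgres        # one answer: the service's virtual IP (VIP)
nslookup tasks.postgres  # one answer per running replica
```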
Traefik Configuration
Claude helped us configure Traefik for automatic service discovery:
```yaml
# configs/traefik/traefik.yml
providers:
  docker:
    endpoint: 'unix:///var/run/docker.sock'
    exposedByDefault: false
    network: web_tier # Only discover services on web_tier
    swarmMode: true   # Enable Swarm mode
    watch: true       # Automatically detect changes
```
Caroline added labels to each public-facing service:
```yaml
api:
  deploy:
    labels: # In Swarm mode Traefik reads service labels, so these sit under deploy
      - 'traefik.enable=true'
      - 'traefik.http.routers.api.rule=Host(`api.example.com`)'
      - 'traefik.http.services.api.loadbalancer.server.port=3000'

websocket:
  deploy:
    labels:
      - 'traefik.enable=true'
      - 'traefik.http.routers.ws.rule=Host(`stream.example.com`)'
      - 'traefik.http.services.ws.loadbalancer.server.port=8010'
```
“Now Traefik routes based on Host headers,” Caroline explained. “No need to remember port numbers.”
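Since the tests below are all HTTPS, Traefik also terminates TLS at the entry point. A minimal sketch with Let's Encrypt; the email and storage path are placeholders, and this is not our exact production file:

```yaml
# configs/traefik/traefik.yml (sketch)
entryPoints:
  web:
    address: ':80'
  websecure:
    address: ':443'

certificatesResolvers:
  letsencrypt:
    acme:
      email: ops@example.com          # placeholder
      storage: /letsencrypt/acme.json # persist certificates across restarts
      httpChallenge:
        entryPoint: web
```

Each router then opts in with a traefik.http.routers.&lt;name&gt;.tls.certresolver=letsencrypt label.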
Security Benefits
Attack Surface Reduction
Caroline showed me the firewall configuration:
Before (old architecture):
```text
# Exposed ports
80   (Traefik HTTP)
443  (Traefik HTTPS)
5432 (PostgreSQL)
8123 (ClickHouse HTTP)
9000 (ClickHouse native)
4222 (NATS)
3000 (API)
8010 (WebSocket)
8020 (Events)
```
After (overlay networks):
```text
# Exposed ports
80  (Traefik HTTP)
443 (Traefik HTTPS)
```
“We went from 9 exposed ports to 2,” Caroline said. “That’s a 78% reduction in attack surface.”
Network Isolation
Claude explained the isolation benefits:
```text
❌ API cannot access ml_tier (different network)
❌ ML services cannot access infrastructure_tier (different network)
❌ Internet cannot reach PostgreSQL (no public port)
✅ API can access postgres (both on infrastructure_tier)
✅ WebSocket can access text-classifier (both on ml_tier)
```
“Even if an attacker compromises the API service,” Claude said, “they can’t access ML services or job tier workers because those are on separate networks.”
Encrypted Communication
Caroline highlighted the encryption benefits:
```yaml
networks:
  web_tier:
    driver: overlay
    driver_opts:
      encrypted: 'true' # IPSec encryption for inter-node traffic
```
“All traffic between nodes is encrypted,” she explained. “If someone captures packets between your DigitalOcean droplets, they see encrypted data.”
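You can verify the flag on a deployed network. Assuming the stack is named scores, something like:

```sh
# The overlay driver options should include the "encrypted" key
docker network inspect scores_web_tier --format '{{json .Options}}'
```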
Simplified Firewall Rules
I checked our DigitalOcean firewall configuration:
Before:
```text
Allow TCP 80   from 0.0.0.0/0
Allow TCP 443  from 0.0.0.0/0
Allow TCP 5432 from 0.0.0.0/0   # Dangerous!
Allow TCP 8123 from 0.0.0.0/0   # Dangerous!
Allow TCP 4222 from 0.0.0.0/0   # Dangerous!
Allow TCP 3000 from 0.0.0.0/0
Allow TCP 8010 from 0.0.0.0/0
```
After:
```text
Allow TCP 80   from 0.0.0.0/0
Allow TCP 443  from 0.0.0.0/0
Allow TCP 2377 from <manager-ips>  # Swarm management
Allow TCP 7946 from <swarm-ips>    # Swarm discovery
Allow UDP 7946 from <swarm-ips>    # Swarm discovery
Allow UDP 4789 from <swarm-ips>    # Overlay network (VXLAN)
```
Caroline smiled: “Only 2 public ports. The rest are Swarm-specific and restricted to internal IPs.”
Scalability and Load Balancing
Claude demonstrated automatic load balancing:
```yaml
api:
  deploy:
    replicas: 3 # 3 API instances
```
“Traefik automatically discovers all 3 replicas,” Claude explained. “It load balances requests across them using round-robin.”
Caroline added: “And if you scale up or down, Traefik updates automatically—no config changes needed.”
```sh
# Scale API to 5 replicas
docker service scale scores_api=5

# Traefik automatically detects:
# - api.1 on node1
# - api.2 on node2
# - api.3 on node3
# - api.4 on node1
# - api.5 on node2
```
High Availability Configuration
Caroline configured Traefik for high availability:
```yaml
traefik:
  deploy:
    replicas: 3
    update_config:
      parallelism: 1     # Update one at a time
      delay: 10s         # Wait 10s between updates
      order: start-first # Start new before stopping old
```
“This ensures zero downtime during Traefik updates,” she explained. “New containers start, wait for health checks, then old containers stop.”
I asked: “What if a Traefik instance crashes?”
“Docker Swarm automatically restarts it,” Claude answered. “And the other 2 instances continue serving traffic.”
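Worth noting: start-first only delivers zero downtime if Swarm can tell when a new task is healthy. A sketch of the health check we'd pair with it; the intervals are illustrative, and traefik healthcheck requires the ping endpoint to be enabled in the static config:

```yaml
traefik:
  healthcheck:
    test: ['CMD', 'traefik', 'healthcheck', '--ping']
    interval: 10s
    timeout: 3s
    retries: 3
```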
Cost Savings
Caroline did the math on load balancer costs:
Before (using DigitalOcean Load Balancers):
```text
Load Balancer for API:       $12/month
Load Balancer for WebSocket: $12/month
Load Balancer for UI:        $12/month
Total:                       $36/month
```
After (using Traefik):
```text
Traefik (runs on droplets): $0/month extra
Total:                      $0/month
```
“We save $36/month by using Traefik instead of external load balancers,” Caroline said. “Plus, we get better performance because there’s no extra network hop.”
Development/Production Parity
I tested the configuration locally:
```sh
# Local development
docker swarm init
docker stack deploy -c compose.yaml scores

# Production (DigitalOcean)
docker swarm init
docker stack deploy -c compose.yaml scores
```
“Same command, same configuration,” I noted. “No more docker-compose.dev.yaml vs docker-compose.prod.yaml.”
Caroline agreed: “This eliminates ‘works on my machine’ problems. If it works locally, it works in production.”
Deployment Checklist
Claude provided a production deployment checklist:
1. Create Droplets
```sh
# 3 manager nodes (HA)
doctl compute droplet create manager1 --size s-2vcpu-4gb --image ubuntu-22-04-x64
doctl compute droplet create manager2 --size s-2vcpu-4gb --image ubuntu-22-04-x64
doctl compute droplet create manager3 --size s-2vcpu-4gb --image ubuntu-22-04-x64

# 3 worker nodes (web tier)
doctl compute droplet create worker1 --size s-2vcpu-4gb --image ubuntu-22-04-x64
doctl compute droplet create worker2 --size s-2vcpu-4gb --image ubuntu-22-04-x64
doctl compute droplet create worker3 --size s-2vcpu-4gb --image ubuntu-22-04-x64
```
2. Initialize Swarm
```sh
# On manager1
docker swarm init --advertise-addr <manager1-ip>

# On manager2 & manager3
docker swarm join --token <manager-token> <manager1-ip>:2377

# On worker nodes
docker swarm join --token <worker-token> <manager1-ip>:2377
```
3. Label Nodes
```sh
# Infrastructure tier (managers run databases)
docker node update --label-add tier=infrastructure manager1
docker node update --label-add tier=infrastructure manager2
docker node update --label-add tier=infrastructure manager3

# Web tier (workers run public-facing services)
docker node update --label-add tier=web worker1
docker node update --label-add tier=web worker2
docker node update --label-add tier=web worker3
```
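The labels only take effect once services declare matching placement constraints in compose.yaml, sketched here for two of the services:

```yaml
postgres:
  deploy:
    placement:
      constraints:
        - node.labels.tier == infrastructure # pin databases to manager nodes

api:
  deploy:
    placement:
      constraints:
        - node.labels.tier == web # public-facing services land on workers
```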
4. Deploy Stack
```sh
docker stack deploy -c compose.yaml scores
```
Caroline added: “That’s it. Docker Swarm creates all the overlay networks, starts services on the appropriate nodes, and Traefik starts routing traffic.”
Testing and Validation
We tested the deployment systematically:
Service Discovery
```sh
# From the API container
docker exec -it $(docker ps -q -f name=api) sh
ping postgres         # ✅ Should succeed (same network)
ping nats             # ✅ Should succeed
ping text-classifier  # ❌ Should fail (different network)
```
External Access
```sh
# From the internet
curl https://api.example.com/health  # ✅ Should succeed
curl https://example.com             # ✅ Should succeed (UI)
curl http://<manager-ip>:5432        # ❌ Should fail (PostgreSQL not exposed)
```
Network Isolation
```sh
# Try to connect to PostgreSQL from outside
psql -h <manager-ip> -U admin -d scores
# psql: error: connection to server at "<ip>", port 5432 failed
```
“Perfect,” Caroline said. “PostgreSQL is only accessible from inside the infrastructure_tier network.”
Observability
Caroline configured Traefik metrics:
```yaml
metrics:
  otlp:
    addEntryPointsLabels: true
    addRoutersLabels: true
    addServicesLabels: true
```
“Now we get detailed metrics for every route,” she explained:
- traefik_service_requests_total{service="api"} - total requests
- traefik_service_request_duration_seconds{service="api"} - request latency
- traefik_service_open_connections{service="websocket"} - active WebSocket connections
All metrics flow into our OpenTelemetry collector and appear in Grafana dashboards.
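For context, a minimal collector pipeline for this setup might look like the following. The file path and the Prometheus exporter are assumptions; our actual pipeline config isn't shown in this post:

```yaml
# Hypothetical configs/otel/config.yaml: accept OTLP from Traefik and the
# services, expose a Prometheus scrape endpoint that feeds the Grafana dashboards
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```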
Lessons Learned
After deploying to production, we reflected on the experience:
What Worked Well
- Single entry point - Only Traefik exposed, everything else internal
- Network isolation - Services can only access what they need
- Automatic service discovery - No hardcoded IPs
- Zero downtime deployments - Rolling updates with health checks
- Cost savings - No external load balancers needed
Challenges
- Debugging - Had to learn docker exec patterns for accessing services
- Network troubleshooting - docker network inspect became essential
- Initial setup - Swarm initialization took some trial and error
- Node placement - Had to think carefully about which services run where
Security Improvements
Caroline summarized the security wins:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Exposed ports | 9 ports | 2 ports | 78% reduction |
| Public database access | ❌ Yes | ✅ No | Blocked |
| Encrypted communication | ❌ No | ✅ Yes | IPSec enabled |
| Attack surface | High | Low | Significantly reduced |
Mermaid Diagram: Network Architecture
```mermaid
graph TB
    subgraph Internet["🌐 Internet"]
        USERS[Users]
    end

    subgraph PublicPorts["📡 Public Ports (80/443)"]
        TRAEFIK[Traefik<br/>Load Balancer]
    end

    subgraph WebTier["🎨 Web Tier Network (Overlay, Encrypted)"]
        API[API<br/>3 replicas]
        WS[WebSocket<br/>3 replicas]
        EVENTS[Events<br/>3 replicas]
        UI[UI<br/>3 replicas]
    end

    subgraph InfraTier["💾 Infrastructure Tier (Overlay, Encrypted)"]
        PG[(PostgreSQL)]
        CH[(ClickHouse)]
        NATS[(NATS)]
        OTEL[OpenTelemetry]
    end

    subgraph JobTier["⚙️ Job Tier (Overlay, Encrypted)"]
        SINK[Sink<br/>2 replicas]
        RT[Realtime<br/>2 replicas]
        PROJ[Projections<br/>1 replica]
    end

    subgraph MLTier["🤖 ML Tier (Overlay, Encrypted)"]
        TEXT[Text Classifier]
        SPEECH[Speech to Text]
    end

    USERS -->|HTTP/HTTPS| TRAEFIK
    TRAEFIK -.->|Host: api.example.com| API
    TRAEFIK -.->|Host: stream.example.com| WS
    TRAEFIK -.->|Host: events.example.com| EVENTS
    TRAEFIK -.->|Host: example.com| UI

    API --> PG
    API --> NATS
    WS --> PG
    WS --> TEXT
    WS --> SPEECH
    EVENTS --> NATS
    SINK --> NATS
    SINK --> CH
    RT --> NATS
    RT --> CH
    PROJ --> PG

    OTEL -.->|Metrics| API
    OTEL -.->|Metrics| WS
    OTEL -.->|Metrics| SINK

    classDef internet fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    classDef public fill:#fff3e0,stroke:#f57c00,stroke-width:3px
    classDef web fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px
    classDef infra fill:#fce4ec,stroke:#c2185b,stroke-width:3px
    classDef job fill:#f3e5f5,stroke:#7b1fa2,stroke-width:3px
    classDef ml fill:#fff9c4,stroke:#f57f17,stroke-width:3px

    class USERS internet
    class TRAEFIK public
    class API,WS,EVENTS,UI web
    class PG,CH,NATS,OTEL infra
    class SINK,RT,PROJ job
    class TEXT,SPEECH ml
```
Takeaways
- Overlay networks enable multi-node isolation - Services on different networks can’t communicate, even on the same host.
- Traefik as the single entry point - Simplifies firewall rules, reduces attack surface, centralizes TLS.
- Service discovery just works - DNS-based resolution with automatic load balancing across replicas.
- Encrypted by default - IPSec encryption for all inter-node traffic via the overlay driver's encrypted option.
- Zero-cost load balancing - Traefik runs on your existing droplets, no need for external load balancers.
Caroline summed it up: “This architecture is production-ready. We have security, scalability, and observability.”
Claude agreed: “And it scales from a single-node dev environment to a multi-node production cluster with zero config changes.”
I was just happy our PostgreSQL database was no longer exposed to the internet.