Deploying llama.cpp as an API Server on Docker Swarm
In a previous post, we covered running Qwen3 locally with llama.cpp. Now let’s take it to production by deploying the llama-server (OpenAI-compatible API) as a Docker Swarm service.
This gives you a self-hosted LLM endpoint with automatic restarts, resource limits, health checks, and easy scaling — all without Kubernetes.
Architecture
┌─────────────────────────────────────────────────┐
│                Docker Swarm Node                │
│                                                 │
│  ┌─────────────┐    ┌────────────────────────┐  │
│  │   Traefik   │───▶│      llama-server      │  │
│  │   Reverse   │    │         :8081          │  │
│  │    Proxy    │    │ OpenAI-compatible API  │  │
│  └─────────────┘    └────────────────────────┘  │
│                              │                  │
│                     ┌────────┴───────┐          │
│                     │    /models/    │          │
│                     │  (bind mount)  │          │
│                     └────────────────┘          │
└─────────────────────────────────────────────────┘
Step 1: Create the Dockerfile
We’ll build llama.cpp from source inside a multi-stage Docker build:
# Stage 1: Build llama.cpp
FROM ubuntu:22.04 AS builder
RUN apt-get update && apt-get install -y \
    build-essential \
    cmake \
    git \
    && rm -rf /var/lib/apt/lists/*
RUN git clone https://github.com/ggml-org/llama.cpp /build/llama.cpp
WORKDIR /build/llama.cpp
RUN cmake -B build && cmake --build build --config Release -j $(nproc)
# Stage 2: Runtime
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y \
    curl \
    && rm -rf /var/lib/apt/lists/*
# Copy only the server binary
COPY --from=builder /build/llama.cpp/build/bin/llama-server /usr/local/bin/llama-server
# Create non-root user
RUN useradd -r -s /bin/false llama
# Models directory
RUN mkdir -p /models && chown llama:llama /models
VOLUME /models
USER llama
EXPOSE 8081
ENTRYPOINT ["llama-server"]
CMD ["-m", "/models/model.gguf", "--host", "0.0.0.0", "--port", "8081", "-t", "16", "-c", "8192", "--temp", "0.6"]
Build and push to your registry:
docker build -t your-registry.com/llama-server:latest .
docker push your-registry.com/llama-server:latest
Step 2: Download the Model
Place the GGUF model on the Swarm node(s) where the service will run:
mkdir -p /opt/llama/models
curl -L -o /opt/llama/models/model.gguf \
"https://huggingface.co/Qwen/Qwen3-8B-GGUF/resolve/main/Qwen3-8B-Q4_K_M.gguf"
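Multi-gigabyte downloads can truncate or corrupt silently, so it's worth verifying the file against the checksum published on the model's Hugging Face page before deploying. A minimal helper (the expected digest is left as a placeholder, not the real Qwen3 checksum):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream-hash a large file without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            digest.update(block)
    return digest.hexdigest()

# Compare against the checksum published on the model's Hugging Face page:
# assert sha256_of("/opt/llama/models/model.gguf") == "<expected-hex-digest>"
```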
Step 3: Docker Compose for Swarm
Create docker-compose.llama.yml:
version: "3.8"

services:
  llama-server:
    image: your-registry.com/llama-server:latest
    command:
      - "-m"
      - "/models/model.gguf"
      - "--host"
      - "0.0.0.0"
      - "--port"
      - "8081"
      - "-t"
      - "16"
      - "-c"
      - "8192"
      - "--temp"
      - "0.6"
      - "--top-k"
      - "20"
      - "--top-p"
      - "0.95"
    volumes:
      - /opt/llama/models:/models:ro
    deploy:
      replicas: 1
      resources:
        limits:
          cpus: "16"
          memory: 12G
        reservations:
          cpus: "8"
          memory: 6G
      restart_policy:
        condition: on-failure
        delay: 10s
        max_attempts: 5
      placement:
        constraints:
          - node.role == manager
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.llama.rule=Host(`llm.yourdomain.com`)"
        - "traefik.http.routers.llama.entrypoints=websecure"
        - "traefik.http.routers.llama.tls.certresolver=letsencrypt"
        - "traefik.http.services.llama.loadbalancer.server.port=8081"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8081/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    networks:
      - proxy

networks:
  proxy:
    external: true
Key Swarm Configuration Explained
| Setting | Purpose |
|---|---|
| `resources.limits.cpus: "16"` | Cap CPU usage so the model doesn't starve other services |
| `resources.limits.memory: 12G` | The model uses ~6 GB; the rest is headroom for context |
| `restart_policy` | Auto-restart on crashes, with a delay between attempts |
| `placement.constraints` | Pin the task to the node(s) where the model file lives |
| `healthcheck` | Swarm restarts the container if the `/health` endpoint fails |
| `volumes: ...:ro` | Mount models read-only for security |
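Where does the memory headroom above the ~6 GB of weights go? Mostly to the KV cache, whose size you can estimate from the model's layer and head geometry. A rough sketch, using assumed Qwen3-8B figures for illustration (check the GGUF metadata for your model's actual values):

```python
def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, elem_bytes: int = 2) -> int:
    """Rough KV-cache size: K and V tensors per layer, per cached token (f16 = 2 bytes)."""
    return 2 * n_layers * n_kv_heads * head_dim * elem_bytes * n_ctx

# Assumed geometry for illustration (36 layers, 8 KV heads, head dim 128):
gib = kv_cache_bytes(8192, n_layers=36, n_kv_heads=8, head_dim=128) / 2**30
print(f"{gib:.2f} GiB")  # on the order of 1 GiB at the 8192-token context above
```

On top of that come compute buffers and per-slot state, which is why the limit leaves several gigabytes of slack rather than sitting right at the model size.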
Step 4: Deploy to Swarm
The compose file references an external `proxy` network, so create it once if it doesn't already exist, then deploy the stack:
docker network create --driver overlay proxy
docker stack deploy -c docker-compose.llama.yml llama
Check the service status:
# Service status
docker service ls --filter name=llama
# Logs
docker service logs llama_llama-server --tail 50 -f
# Task status
docker service ps llama_llama-server
Step 5: Test the API
Once the health check passes, test the OpenAI-compatible endpoint:
curl https://llm.yourdomain.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain Docker Swarm in 3 sentences."}
    ],
    "temperature": 0.6,
    "max_tokens": 500
  }'
You can also list available models:
curl https://llm.yourdomain.com/v1/models
Step 6: Integrate with Your Services
Since the API is OpenAI-compatible, any SDK or tool that works with OpenAI will work here. Just change the base URL:
Python (openai SDK):
from openai import OpenAI

client = OpenAI(
    base_url="http://llama_llama-server:8081/v1",  # internal Swarm DNS
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="qwen3",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.6
)

print(response.choices[0].message.content)
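Swarm's on-failure restarts mean internal callers can hit a brief connection error while a task comes back up. A small retry helper smooths this over (a generic sketch, not part of the openai SDK; names here are illustrative):

```python
import time

def with_retries(call, attempts: int = 3, delay: float = 2.0,
                 retry_on: tuple = (ConnectionError,)):
    """Retry `call` on transient errors, waiting `delay` seconds between tries.

    With the openai SDK you would pass retry_on=(openai.APIConnectionError,).
    """
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)

# usage, with the client from above:
# reply = with_retries(lambda: client.chat.completions.create(...))
```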
Go:
// imports needed: bytes, encoding/json, fmt, io, log, net/http
req := map[string]any{
	"model": "qwen3",
	"messages": []map[string]string{
		{"role": "user", "content": "Hello!"},
	},
	"temperature": 0.6,
}
body, _ := json.Marshal(req) // static payload; Marshal cannot fail here
resp, err := http.Post(
	"http://llama_llama-server:8081/v1/chat/completions",
	"application/json",
	bytes.NewReader(body),
)
if err != nil {
	log.Fatal(err)
}
defer resp.Body.Close()
out, _ := io.ReadAll(resp.Body)
fmt.Println(string(out))
Note the internal Swarm DNS name: llama_llama-server (stack name + service name).
Securing the Endpoint
The llama-server has no built-in authentication. Options:

- **API key via Traefik middleware**: add a `forwardAuth` or basic auth middleware in the Traefik labels
- **Internal only**: remove the Traefik labels and expose the service only on the Docker overlay network, reachable solely by other Swarm services
- **IP allowlist**: restrict access to known IPs via Traefik's `ipWhiteList` middleware

For internal-only access (recommended), remove the Traefik labels; other services connect via `http://llama_llama-server:8081`.
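If the endpoint must stay public, Traefik's basic auth middleware is only a couple of extra labels on the service. A sketch (the hash is a placeholder; generate a real one with `htpasswd -nB user`, and remember that compose files require `$` escaped as `$$`):

```yaml
deploy:
  labels:
    # attach the middleware to the existing router
    - "traefik.http.routers.llama.middlewares=llama-auth"
    # user:password hash (placeholder), generated with: htpasswd -nB user
    - "traefik.http.middlewares.llama-auth.basicauth.users=user:$$2y$$05$$replace-with-real-hash"
```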
Updating the Model
To swap models, download the new file next to the old one, replace it, then restart the service (a single replica will be briefly unavailable while it reloads):
# Download new model
curl -L -o /opt/llama/models/model-new.gguf "https://..."
# Replace (mv is atomic within the same filesystem)
mv /opt/llama/models/model-new.gguf /opt/llama/models/model.gguf
# Force service restart to pick up new model
docker service update --force llama_llama-server
Monitoring
llama-server exposes metrics at /metrics in Prometheus format. Add a scrape target:
# prometheus.yml
scrape_configs:
  - job_name: llama-server
    static_configs:
      - targets: ['llama_llama-server:8081']
Key metrics to watch:
- `llamacpp:prompt_tokens_total`: prompt processing load
- `llamacpp:tokens_predicted_total`: generation throughput
- `llamacpp:prompt_seconds_total`: prompt latency tracking
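These metrics can feed a simple alert rule, for example one that fires when generation throughput flatlines. A sketch (the threshold and windows are illustrative and will also fire during genuinely idle periods, so tune them to your traffic):

```yaml
# alert-rules.yml (threshold and windows are illustrative)
groups:
  - name: llama-server
    rules:
      - alert: LlamaThroughputStalled
        expr: rate(llamacpp:tokens_predicted_total[5m]) == 0
        for: 15m
        labels:
          severity: warning
```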
Conclusion
Docker Swarm gives you a lightweight way to run llama.cpp in production with health checks, resource limits, and automatic restarts — all in a single compose file. Combined with Traefik for TLS and routing, you get a secure, self-hosted LLM API that’s a drop-in replacement for OpenAI.
Deployed on a Contabo VPS running Docker Swarm with Traefik reverse proxy. See the companion article for the local setup guide.