Deploying llama.cpp as an API Server on Docker Swarm

Juaji Admin · March 27, 2026 · 4 min read


In a previous post, we covered running Qwen3 locally with llama.cpp. Now let’s take it to production by deploying the llama-server (OpenAI-compatible API) as a Docker Swarm service.

This gives you a self-hosted LLM endpoint with automatic restarts, resource limits, health checks, and easy scaling — all without Kubernetes.

Architecture

┌──────────────────────────────────────────────────┐
│  Docker Swarm Node                               │
│                                                  │
│  ┌─────────────┐     ┌────────────────────────┐  │
│  │  Traefik    │────▶│  llama-server          │  │
│  │  Reverse    │     │  :8081                 │  │
│  │  Proxy      │     │  OpenAI-compatible API │  │
│  └─────────────┘     └────────────────────────┘  │
│                                  │               │
│                       ┌──────────┴──────┐        │
│                       │  /models/       │        │
│                       │  (bind mount)   │        │
│                       └─────────────────┘        │
└──────────────────────────────────────────────────┘

Step 1: Create the Dockerfile

We’ll build llama.cpp from source inside a multi-stage Docker build:

# Stage 1: Build llama.cpp
FROM ubuntu:22.04 AS builder

RUN apt-get update && apt-get install -y \
    build-essential \
    cmake \
    git \
    && rm -rf /var/lib/apt/lists/*

RUN git clone https://github.com/ggml-org/llama.cpp /build/llama.cpp
WORKDIR /build/llama.cpp
RUN cmake -B build && cmake --build build --config Release -j $(nproc)

# Stage 2: Runtime
FROM ubuntu:22.04

RUN apt-get update && apt-get install -y \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Copy only the server binary
COPY --from=builder /build/llama.cpp/build/bin/llama-server /usr/local/bin/llama-server

# Create non-root user
RUN useradd -r -s /bin/false llama

# Models directory
RUN mkdir -p /models && chown llama:llama /models
VOLUME /models

USER llama
EXPOSE 8081

ENTRYPOINT ["llama-server"]
CMD ["-m", "/models/model.gguf", "--host", "0.0.0.0", "--port", "8081", "-t", "16", "-c", "8192", "--temp", "0.6"]

Build and push to your registry:

docker build -t your-registry.com/llama-server:latest .
docker push your-registry.com/llama-server:latest

Step 2: Download the Model

Place the GGUF model on the Swarm node(s) where the service will run:

mkdir -p /opt/llama/models
curl -L -o /opt/llama/models/model.gguf \
  "https://huggingface.co/Qwen/Qwen3-8B-GGUF/resolve/main/Qwen3-8B-Q4_K_M.gguf"
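A truncated download or an HTML error page saved as `.gguf` will make llama-server fail at startup with a confusing error. Every valid GGUF file starts with the 4-byte magic `GGUF`, so a quick sanity check before deploying is cheap (the path below matches the bind mount used in this post):

```python
def is_valid_gguf(path: str) -> bool:
    """Return True if the file begins with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Usage:
#   is_valid_gguf("/opt/llama/models/model.gguf")
```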

Step 3: Docker Compose for Swarm

Create docker-compose.llama.yml:

version: "3.8"

services:
  llama-server:
    image: your-registry.com/llama-server:latest
    command:
      - "-m"
      - "/models/model.gguf"
      - "--host"
      - "0.0.0.0"
      - "--port"
      - "8081"
      - "-t"
      - "16"
      - "-c"
      - "8192"
      - "--temp"
      - "0.6"
      - "--top-k"
      - "20"
      - "--top-p"
      - "0.95"
    volumes:
      - /opt/llama/models:/models:ro
    deploy:
      replicas: 1
      resources:
        limits:
          cpus: "16"
          memory: 12G
        reservations:
          cpus: "8"
          memory: 6G
      restart_policy:
        condition: on-failure
        delay: 10s
        max_attempts: 5
      placement:
        constraints:
          - node.role == manager
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.llama.rule=Host(`llm.yourdomain.com`)"
        - "traefik.http.routers.llama.entrypoints=websecure"
        - "traefik.http.routers.llama.tls.certresolver=letsencrypt"
        - "traefik.http.services.llama.loadbalancer.server.port=8081"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8081/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    networks:
      - proxy

networks:
  proxy:
    external: true

Key Swarm Configuration Explained

Setting                         Purpose
resources.limits.cpus: "16"     Cap CPU usage so other services aren't starved
resources.limits.memory: 12G    The model uses ~6 GB; leave headroom for context
restart_policy                  Auto-restart on crashes, with a delay between attempts
placement.constraints           Pin to the specific node(s) where the model file lives
healthcheck                     Swarm restarts the task if the /health endpoint fails
volumes: ro                     Mount models read-only for security

Step 4: Deploy to Swarm

docker stack deploy -c docker-compose.llama.yml llama

Check the service status:

# Service status
docker service ls --filter name=llama

# Logs
docker service logs llama_llama-server --tail 50 -f

# Task status
docker service ps llama_llama-server
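Loading a multi-gigabyte model can take a while, and the service isn't usable until `/health` returns 200. A small stdlib-only readiness poller is handy in deploy scripts before routing traffic (the URL below is just the one used in this post):

```python
# Poll llama-server's /health endpoint until it returns 200 (model loaded)
# or a deadline passes.
import time
import urllib.request
import urllib.error

def wait_for_health(url: str, timeout: float = 300.0, interval: float = 2.0) -> bool:
    """Poll `url` until it answers HTTP 200 or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not up yet: connection refused, or 503 while loading
        time.sleep(interval)
    return False

# Usage after `docker stack deploy`:
#   wait_for_health("https://llm.yourdomain.com/health")
```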

Step 5: Test the API

Once the health check passes, test the OpenAI-compatible endpoint:

curl https://llm.yourdomain.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain Docker Swarm in 3 sentences."}
    ],
    "temperature": 0.6,
    "max_tokens": 500
  }'

You can also list available models:

curl https://llm.yourdomain.com/v1/models

Step 6: Integrate with Your Services

Since the API is OpenAI-compatible, any SDK or tool that works with OpenAI will work here. Just change the base URL:

Python (openai SDK):

from openai import OpenAI

client = OpenAI(
    base_url="http://llama_llama-server:8081/v1",  # internal Swarm DNS
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="qwen3",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.6
)
print(response.choices[0].message.content)

Go:

req := map[string]any{
    "model": "qwen3",
    "messages": []map[string]string{
        {"role": "user", "content": "Hello!"},
    },
    "temperature": 0.6,
}
body, _ := json.Marshal(req)
resp, _ := http.Post(
    "http://llama_llama-server:8081/v1/chat/completions",
    "application/json",
    bytes.NewReader(body),
)

Note the internal Swarm DNS name: llama_llama-server (stack name + service name).

Securing the Endpoint

Apart from its optional --api-key flag, llama-server has no built-in authentication. Options:

  1. API key via Traefik middleware — add a forwardAuth or basic auth middleware in the Traefik labels
  2. Internal only — remove the Traefik labels and only expose on the Docker overlay network, accessible solely by other Swarm services
  3. IP allowlist — restrict access to known IPs via Traefik’s ipWhiteList middleware (renamed ipAllowList in Traefik v3)

For internal-only access (recommended), remove the Traefik labels and have other services connect via http://llama_llama-server:8081.
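If you go with option 1 and put Traefik's basicAuth middleware in front, clients need an Authorization header on every request. A minimal sketch (the user/password are placeholders; use whatever you configured in the middleware):

```python
# Build the Authorization header that Traefik's basicAuth middleware expects.
import base64

def basic_auth_header(user: str, password: str) -> dict:
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

# e.g. pass as default headers to the OpenAI client:
#   client = OpenAI(base_url=..., api_key="not-needed",
#                   default_headers=basic_auth_header("user", "password"))
```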

Updating the Model

To swap models with only a brief interruption (with replicas: 1, the --force update below restarts the single task once):

# Download new model
curl -L -o /opt/llama/models/model-new.gguf "https://..."

# Swap atomically
mv /opt/llama/models/model-new.gguf /opt/llama/models/model.gguf

# Force service restart to pick up new model
docker service update --force llama_llama-server
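The mv above is only atomic because source and destination live on the same filesystem. The same idea scripted in Python, in case you automate model rollouts: download next to the target, then os.replace(), so the server never sees a half-written file (the path is just this post's bind mount):

```python
# Download a new model alongside the target, then swap it in atomically.
import os
import urllib.request

def swap_model(url: str, target: str = "/opt/llama/models/model.gguf") -> None:
    tmp = target + ".new"                 # same directory => same filesystem
    urllib.request.urlretrieve(url, tmp)  # the slow part happens on the temp name
    os.replace(tmp, target)               # atomic rename on POSIX
```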

Monitoring

llama-server exposes metrics at /metrics in Prometheus format when started with the --metrics flag (add it to the command list above). Add a scrape target:

# prometheus.yml
scrape_configs:
  - job_name: llama-server
    static_configs:
      - targets: ['llama_llama-server:8081']

Key metrics to watch:

  • llamacpp:prompt_tokens_total — prompt processing load
  • llamacpp:tokens_predicted_total — generation throughput
  • llamacpp:prompt_seconds_total — latency tracking
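The Prometheus exposition format is line-oriented, so you can spot-check these counters without a full Prometheus setup. A tiny parser sketch (it ignores labeled series for brevity; the sample values are made up):

```python
# Parse plain "name value" samples from Prometheus text exposition format.
def parse_metrics(text: str) -> dict:
    """Map metric name -> float value, skipping comments and labeled series."""
    out = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        if name and "{" not in name:
            out[name] = float(value)
    return out

sample = """\
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total 1523
llamacpp:prompt_tokens_total 8096
"""
metrics = parse_metrics(sample)
print(metrics["llamacpp:tokens_predicted_total"])  # 1523.0
```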

Conclusion

Docker Swarm gives you a lightweight way to run llama.cpp in production with health checks, resource limits, and automatic restarts — all in a single compose file. Combined with Traefik for TLS and routing, you get a secure, self-hosted LLM API that’s a drop-in replacement for OpenAI.


Deployed on a Contabo VPS running Docker Swarm with Traefik reverse proxy. See the companion article for the local setup guide.
