Running Qwen3 Locally with llama.cpp on a CPU Server

Juaji Admin
March 27, 2026 · 3 min read


You don’t need a GPU to run large language models. In this guide, I’ll walk you through setting up llama.cpp with Qwen3-8B on a CPU-only Linux server — the same setup we use internally at Juaji.

Why llama.cpp?

llama.cpp is a high-performance C++ inference engine for large language models. Key advantages:

  • No GPU required — runs efficiently on CPU with AVX2/AVX-512 optimisation
  • Low memory footprint — quantised GGUF models can fit in as little as 4GB RAM
  • Simple deployment — single binary, no Python dependencies
  • OpenAI-compatible server — drop-in replacement for API integrations

Prerequisites

  • Linux x86_64 server (we’re using Ubuntu 22.04 on a Contabo VPS)
  • 16+ GB RAM (the quantised model itself uses ~5GB; the rest is headroom for the KV cache and the OS)
  • Build tools: gcc, cmake
  • ~5GB disk space for the model
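
Before building, it can help to sanity-check the box. This quick preflight (an illustrative sketch, not part of the original setup) prints available RAM and whether the CPU advertises AVX2, which llama.cpp uses for its fastest CPU kernels:

```shell
# Print total RAM as reported by free(1)
free -h | awk '/^Mem:/ {print "RAM:", $2}'

# On x86_64 Linux, AVX2 support shows up as a cpuinfo flag
grep -qm1 avx2 /proc/cpuinfo && echo "AVX2: yes" || echo "AVX2: no"
```

If AVX2 is missing, llama.cpp still runs, just noticeably slower.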

Step 1: Install Build Tools

sudo apt update
sudo apt install -y build-essential cmake

Step 2: Clone and Build llama.cpp

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j $(nproc)

The -j $(nproc) flag parallelises the build across all CPU cores. On our 24-core server, this takes about 3-4 minutes.

The compiled binaries land in ./build/bin/ — the two you’ll use most are llama-cli (interactive chat) and llama-server (OpenAI-compatible API).

Step 3: Download a Qwen3 Model

We’re using Qwen3-8B in the Q4_K_M quantisation — a good balance between speed and quality at 4.7GB:

mkdir -p ~/models
curl -L -o ~/models/Qwen3-8B-Q4_K_M.gguf \
  "https://huggingface.co/Qwen/Qwen3-8B-GGUF/resolve/main/Qwen3-8B-Q4_K_M.gguf"

Quantisation Guide

Quantisation   Size    Quality   Speed    Use Case
Q4_K_M         4.7GB   Good      Fast     General use, recommended
Q5_K_M         5.4GB   Better    Medium   When you need higher accuracy
Q8_0           8.5GB   Best      Slower   Maximum quality on CPU
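
The file sizes line up with a quick back-of-envelope check: Q4_K_M averages roughly 4.5 bits per weight (a rough figure, not an exact spec), so for an 8-billion-parameter model:

```shell
# ~8e9 params * ~4.5 bits, divided by 8 bits/byte, gives the size in GB
awk 'BEGIN { printf "%.1f GB\n", 8e9 * 4.5 / 8 / 1e9 }'
```

That lands close to the 4.7GB download; the gap is metadata plus a few tensors kept at higher precision.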

Step 4: Run Interactive Chat

./build/bin/llama-cli \
  -m ~/models/Qwen3-8B-Q4_K_M.gguf \
  --jinja \
  --color \
  -t 16 \
  -c 8192 \
  -n 4096 \
  --no-context-shift \
  --temp 0.6 \
  --top-k 20 \
  --top-p 0.95

Key Parameters Explained

Flag                 Purpose
-t 16                Use 16 CPU threads (adjust to your core count)
-c 8192              Context window of 8K tokens
-n 4096              Max generation length per response
--jinja              Use the model's built-in chat template
--temp 0.6           Sampling temperature (lower = more deterministic)
--no-context-shift   Stop instead of rotating context when full
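
The -t value above matches our hardware. As a portable starting point (our rule of thumb, not an official recommendation), you can derive it from the core count:

```shell
# Use about two thirds of the cores, leaving the rest for the OS and other
# services; clamp to at least 1 on tiny machines.
CORES=$(nproc)
THREADS=$(( CORES > 1 ? CORES * 2 / 3 : 1 ))
echo "suggested: -t $THREADS"
```

On a 24-core box this suggests -t 16, the value used throughout this guide.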

Once loaded, you’ll see the llama.cpp banner and a > prompt. Type your question and press Enter.

Useful commands inside the chat:

  • /clear — reset chat history
  • /regen — regenerate the last response
  • /exit or Ctrl+C — quit

Step 5: Run as an API Server

For integration with other services, run llama-server, which exposes an OpenAI-compatible API:

./build/bin/llama-server \
  -m ~/models/Qwen3-8B-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8081 \
  -t 16 \
  -c 8192 \
  --temp 0.6

Then query it like you would OpenAI:

curl http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3",
    "messages": [{"role": "user", "content": "What is nmap?"}],
    "temperature": 0.6
  }'
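
The response follows the OpenAI chat-completions schema, with the generated text under choices[0].message.content. One minimal way to pull it out (a sketch assuming python3 is on the PATH; a canned response stands in for the live server here):

```shell
# A trimmed-down example response, shaped like what llama-server returns
RESPONSE='{"choices":[{"message":{"role":"assistant","content":"nmap is a network scanner."}}]}'

# Extract just the assistant text
echo "$RESPONSE" | python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
```

In practice you would pipe the curl output straight into the same one-liner (or into jq, if you have it installed).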

Performance on Our Setup

On our 24-core Intel Broadwell Contabo VPS with 117GB RAM, using 16 threads:

  • Model load time: ~3 seconds
  • Prompt processing: ~40 tokens/second
  • Generation speed: ~8-12 tokens/second
  • Memory usage: ~5.5GB resident

This is very usable for interactive chat and internal tooling. For higher throughput, consider the smaller Qwen3-4B model or upgrading to a GPU server.

Tips and Gotchas

  1. Thread count matters: Don’t set -t higher than your physical core count. On our 24-core server, -t 16 gives the best throughput (leaving cores for the OS and other services).

  2. HTTPS downloads: If building llama.cpp with -DLLAMA_OPENSSL=ON, you can download models directly with the -hf flag instead of curl. Requires libssl-dev.

  3. Qwen3 thinking mode: Qwen3 models have a “thinking” mode where they reason step-by-step before answering. This is enabled by default via the chat template. To disable it, pass a custom template with --chat-template-file, or add /no_think to a prompt to switch it off for that turn.

  4. Running as a service: Wrap llama-server in a systemd unit or Docker container for production use.
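
Expanding on tip 4, a minimal systemd unit might look like the following. The user, paths, and flags are assumptions based on the steps above, so adjust them to your layout; the file would live at /etc/systemd/system/llama-server.service:

```ini
[Unit]
Description=llama.cpp OpenAI-compatible server
After=network.target

[Service]
ExecStart=/home/llama/llama.cpp/build/bin/llama-server \
    -m /home/llama/models/Qwen3-8B-Q4_K_M.gguf \
    --host 127.0.0.1 --port 8081 -t 16 -c 8192
Restart=on-failure
User=llama

[Install]
WantedBy=multi-user.target
```

After creating the file, run systemctl daemon-reload and systemctl enable --now llama-server. Binding to 127.0.0.1 instead of 0.0.0.0 keeps the API off the public interface, which is usually what you want in production.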

Conclusion

llama.cpp makes it remarkably easy to run state-of-the-art language models on commodity hardware. Qwen3-8B in Q4_K_M quantisation hits a sweet spot of quality and performance for CPU inference. Combined with the OpenAI-compatible server mode, you can integrate local LLM inference into your existing toolchain with minimal friction.


Running this setup on a Contabo VPS with 24 vCPUs and 117GB RAM. Your mileage may vary depending on hardware.
