Running Qwen3 Locally with llama.cpp on a CPU Server
You don’t need a GPU to run large language models. In this guide, I’ll walk you through setting up llama.cpp with Qwen3-8B on a CPU-only Linux server — the same setup we use internally at Juaji.
Why llama.cpp?
llama.cpp is a high-performance C++ inference engine for large language models. Key advantages:
- No GPU required — runs efficiently on CPU with AVX2/AVX-512 optimisation
- Low memory footprint — quantised GGUF models can fit in as little as 4GB RAM
- Simple deployment — single binary, no Python dependencies
- OpenAI-compatible server — drop-in replacement for API integrations
Prerequisites
- Linux x86_64 server (we’re using Ubuntu 22.04 on a Contabo VPS)
- 16+ GB RAM (our model uses ~5GB)
- Build tools: gcc, cmake
- ~5GB disk space for the model
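Before building, it's worth confirming how many cores you have and whether your CPU exposes AVX2, since llama.cpp's fastest CPU kernels rely on it. A quick Linux-only check:

```shell
# Logical core count: an upper bound for llama.cpp's -t flag
cores=$(nproc)
echo "cores: $cores"

# llama.cpp's fastest CPU paths use AVX2/AVX-512; check the CPU flag list
if grep -q avx2 /proc/cpuinfo; then
  echo "AVX2: yes"
else
  echo "AVX2: no (expect much slower inference)"
fi
```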
Step 1: Install Build Tools
sudo apt update
sudo apt install -y build-essential cmake
Step 2: Clone and Build llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j $(nproc)
The -j $(nproc) flag parallelises the build across all CPU cores. On our 24-core server, this takes about 3-4 minutes.
The compiled binaries land in ./build/bin/ — the two you’ll use most are llama-cli (interactive chat) and llama-server (OpenAI-compatible API).
Step 3: Download a Qwen3 Model
We’re using Qwen3-8B in the Q4_K_M quantisation — a good balance between speed and quality at 4.7GB:
mkdir -p ~/models
curl -L -o ~/models/Qwen3-8B-Q4_K_M.gguf \
"https://huggingface.co/Qwen/Qwen3-8B-GGUF/resolve/main/Qwen3-8B-Q4_K_M.gguf"
Quantisation Guide
| Quantisation | Size | Quality | Speed | Use Case |
|---|---|---|---|---|
| Q4_K_M | 4.7GB | Good | Fast | General use, recommended |
| Q5_K_M | 5.4GB | Better | Medium | When you need higher accuracy |
| Q8_0 | 8.5GB | Best | Slower | Maximum quality on CPU |
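The sizes in this table follow almost directly from bits per weight: a back-of-the-envelope estimate is parameters × bits ÷ 8, plus a little metadata overhead. Q4_K_M averages roughly 4.8 bits per weight (an approximate figure, not an official one), so for 8B parameters:

```shell
# GGUF size ≈ parameter_count * bits_per_weight / 8 bits_per_byte
awk 'BEGIN { printf "%.1f GB\n", (8e9 * 4.8 / 8) / 1e9 }'
```

which lines up with the 4.7GB download above; the same arithmetic at 8 bits per weight puts Q8_0 near its 8.5GB figure.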
Step 4: Run Interactive Chat
./build/bin/llama-cli \
-m ~/models/Qwen3-8B-Q4_K_M.gguf \
--jinja \
-co on \
-t 16 \
-c 8192 \
-n 4096 \
--no-context-shift \
--temp 0.6 \
--top-k 20 \
--top-p 0.95
Key Parameters Explained
| Flag | Purpose |
|---|---|
| `-t 16` | Use 16 CPU threads (adjust to your core count) |
| `-c 8192` | Context window of 8K tokens |
| `-n 4096` | Max generation length per response |
| `--jinja` | Use the model's built-in chat template |
| `--temp 0.6` | Sampling temperature (lower = more deterministic) |
| `--no-context-shift` | Stop instead of rotating context when full |
Once loaded, you'll see the llama.cpp banner and a `>` prompt. Type your question and press Enter.
Useful commands inside the chat:
- `/clear` — reset chat history
- `/regen` — regenerate the last response
- `/exit` or `Ctrl+C` — quit
Step 5: Run as an API Server
For integration with other services, run llama-server which provides an OpenAI-compatible API:
./build/bin/llama-server \
-m ~/models/Qwen3-8B-Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8081 \
-t 16 \
-c 8192 \
--temp 0.6
Then query it like you would OpenAI:
curl http://localhost:8081/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3",
"messages": [{"role": "user", "content": "What is nmap?"}],
"temperature": 0.6
}'
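The reply text is buried in the standard chat-completions JSON at `.choices[0].message.content`. If you have `jq` installed (an assumption, it's not part of llama.cpp), you can extract it directly. The sample body below is hand-written to show the shape rather than captured from a real server:

```shell
# Minimal chat-completions response body, trimmed to the fields we read
response='{"choices":[{"message":{"role":"assistant","content":"nmap is a network scanner."}}]}'

# -r prints the raw string instead of a quoted JSON value
echo "$response" | jq -r '.choices[0].message.content'
```

In practice you would pipe the curl command above straight into `jq -r '.choices[0].message.content'`.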
Performance on Our Setup
On our 24-core Intel Broadwell Contabo VPS with 117GB RAM, using 16 threads:
- Model load time: ~3 seconds
- Prompt processing: ~40 tokens/second
- Generation speed: ~8-12 tokens/second
- Memory usage: ~5.5GB resident
This is very usable for interactive chat and internal tooling. For higher throughput, consider the smaller Qwen3-4B model or upgrading to a GPU server.
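To translate those numbers into wall-clock latency: time ≈ prompt tokens ÷ prompt-processing speed + output tokens ÷ generation speed. Assuming a 400-token prompt and a 300-token reply (hypothetical sizes, measured speeds):

```shell
# latency ≈ prompt_tokens/prefill_tps + output_tokens/generation_tps
awk 'BEGIN { printf "%.0f s\n", 400/40 + 300/10 }'   # prints "40 s"
```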
Tips and Gotchas
- **Thread count matters:** Don't set `-t` higher than your physical core count. On our 24-core server, `-t 16` gives the best throughput (leaving cores for the OS and other services).
- **HTTPS downloads:** If building llama.cpp with `-DLLAMA_OPENSSL=ON`, you can download models directly with the `-hf` flag instead of curl. Requires `libssl-dev`.
- **Qwen3 thinking mode:** Qwen3 models have a "thinking" mode where they reason step-by-step before answering. This is enabled by default via the chat template. To disable it, pass a custom template with `--chat-template-file`.
- **Running as a service:** Wrap `llama-server` in a systemd unit or Docker container for production use.
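For the systemd option, a minimal unit might look like the sketch below. The paths, user, and unit name are assumptions; adjust them to wherever you built llama.cpp and stored the model:

```ini
# /etc/systemd/system/llama-server.service
[Unit]
Description=llama.cpp OpenAI-compatible API server
After=network.target

[Service]
User=llama
ExecStart=/opt/llama.cpp/build/bin/llama-server \
    -m /home/llama/models/Qwen3-8B-Q4_K_M.gguf \
    --host 127.0.0.1 --port 8081 \
    -t 16 -c 8192 --temp 0.6
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Then `sudo systemctl daemon-reload && sudo systemctl enable --now llama-server`.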
Conclusion
llama.cpp makes it remarkably easy to run state-of-the-art language models on commodity hardware. Qwen3-8B in Q4_K_M quantisation hits a sweet spot of quality and performance for CPU inference. Combined with the OpenAI-compatible server mode, you can integrate local LLM inference into your existing toolchain with minimal friction.
Running this setup on a Contabo VPS with 24 vCPUs and 117GB RAM. Your mileage may vary depending on hardware.