Running Qwen3 Locally with llama.cpp on a CPU Server

Juaji Admin
March 27, 2026 · 3 min read


You don’t need a GPU to run large language models. In this guide, I’ll walk you through setting up llama.cpp with Qwen3-8B on a CPU-only Linux server — the same setup we use internally at Juaji.

Why llama.cpp?

llama.cpp is a high-performance C++ inference engine for large language models. Key advantages:

  • No GPU required — runs efficiently on CPU with AVX2/AVX-512 optimisation
  • Low memory footprint — quantised GGUF models can fit in as little as 4GB RAM
  • Simple deployment — single binary, no Python dependencies
  • OpenAI-compatible server — drop-in replacement for API integrations

Prerequisites

  • Linux x86_64 server (we’re using Ubuntu 22.04 on a Contabo VPS)
  • 16+ GB RAM (the quantised model itself uses ~5GB; the rest is headroom for the KV cache and the OS)
  • Build tools: gcc, cmake
  • ~5GB disk space for the model
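
Before building, it can help to sanity-check the box. This quick preflight (an illustrative sketch, not part of the original setup) prints available RAM and whether the CPU advertises AVX2, which llama.cpp uses for its fastest CPU kernels:

```shell
# Print total RAM as reported by free(1)
free -h | awk '/^Mem:/ {print "RAM:", $2}'

# On x86_64 Linux, AVX2 support shows up as a cpuinfo flag
grep -qm1 avx2 /proc/cpuinfo && echo "AVX2: yes" || echo "AVX2: no"
```

If AVX2 is missing, llama.cpp still runs, just noticeably slower.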

Step 1: Install Build Tools

sudo apt update
sudo apt install -y build-essential cmake

Step 2: Clone and Build llama.cpp

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j $(nproc)

The -j $(nproc) flag parallelises the build across all CPU cores. On our 24-core server, this takes about 3-4 minutes.

The compiled binaries land in ./build/bin/ — the two you’ll use most are llama-cli (interactive chat) and llama-server (OpenAI-compatible API).

Step 3: Download a Qwen3 Model

We’re using Qwen3-8B in the Q4_K_M quantisation — a good balance between speed and quality at 4.7GB:

mkdir -p ~/models
curl -L -o ~/models/Qwen3-8B-Q4_K_M.gguf \
  "https://huggingface.co/Qwen/Qwen3-8B-GGUF/resolve/main/Qwen3-8B-Q4_K_M.gguf"

Quantisation Guide

Quantisation   Size    Quality   Speed    Use Case
Q4_K_M         4.7GB   Good      Fast     General use, recommended
Q5_K_M         5.4GB   Better    Medium   When you need higher accuracy
Q8_0           8.5GB   Best      Slower   Maximum quality on CPU
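
The file sizes line up with a quick back-of-envelope check: Q4_K_M averages roughly 4.5 bits per weight (a rough figure, not an exact spec), so for an 8-billion-parameter model:

```shell
# ~8e9 params * ~4.5 bits, divided by 8 bits/byte, gives the size in GB
awk 'BEGIN { printf "%.1f GB\n", 8e9 * 4.5 / 8 / 1e9 }'
```

That lands close to the 4.7GB download; the gap is metadata plus a few tensors kept at higher precision.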

Step 4: Run Interactive Chat

./build/bin/llama-cli \
  -m ~/models/Qwen3-8B-Q4_K_M.gguf \
  --jinja \
  --color \
  -t 16 \
  -c 8192 \
  -n 4096 \
  --no-context-shift \
  --temp 0.6 \
  --top-k 20 \
  --top-p 0.95

Key Parameters Explained

Flag                 Purpose
-t 16                Use 16 CPU threads (adjust to your core count)
-c 8192              Context window of 8K tokens
-n 4096              Max generation length per response
--jinja              Use the model's built-in chat template
--temp 0.6           Sampling temperature (lower = more deterministic)
--no-context-shift   Stop instead of rotating context when full
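
The -t value above matches our hardware. As a portable starting point (our rule of thumb, not an official recommendation), you can derive it from the core count:

```shell
# Use about two thirds of the cores, leaving the rest for the OS and other
# services; clamp to at least 1 on tiny machines.
CORES=$(nproc)
THREADS=$(( CORES > 1 ? CORES * 2 / 3 : 1 ))
echo "suggested: -t $THREADS"
```

On a 24-core box this suggests -t 16, the value used throughout this guide.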

Once loaded, you’ll see the llama.cpp banner and a > prompt. Type your question and press Enter.

Useful commands inside the chat:

  • /clear — reset chat history
  • /regen — regenerate the last response
  • /exit or Ctrl+C — quit

Step 5: Run as an API Server

For integration with other services, run llama-server, which exposes an OpenAI-compatible API:

./build/bin/llama-server \
  -m ~/models/Qwen3-8B-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8081 \
  -t 16 \
  -c 8192 \
  --temp 0.6

Then query it like you would OpenAI:

curl http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3",
    "messages": [{"role": "user", "content": "What is nmap?"}],
    "temperature": 0.6
  }'
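
The response follows the OpenAI chat-completions schema, with the generated text under choices[0].message.content. One minimal way to pull it out (a sketch assuming python3 is on the PATH; a canned response stands in for the live server here):

```shell
# A trimmed-down example response, shaped like what llama-server returns
RESPONSE='{"choices":[{"message":{"role":"assistant","content":"nmap is a network scanner."}}]}'

# Extract just the assistant text
echo "$RESPONSE" | python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
```

In practice you would pipe the curl output straight into the same one-liner (or into jq, if you have it installed).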

Performance on Our Setup

On our 24-core Intel Broadwell Contabo VPS with 117GB RAM, using 16 threads:

  • Model load time: ~3 seconds
  • Prompt processing: ~40 tokens/second
  • Generation speed: ~8-12 tokens/second
  • Memory usage: ~5.5GB resident

This is very usable for interactive chat and internal tooling. For higher throughput, consider the smaller Qwen3-4B model or upgrading to a GPU server.

Tips and Gotchas

  1. Thread count matters: Don’t set -t higher than your physical core count. On our 24-core server, -t 16 gives the best throughput (leaving cores for the OS and other services).

  2. HTTPS downloads: If building llama.cpp with -DLLAMA_OPENSSL=ON, you can download models directly with the -hf flag instead of curl. Requires libssl-dev.

  3. Qwen3 thinking mode: Qwen3 models have a “thinking” mode where they reason step-by-step before answering. This is enabled by default via the chat template. To disable it, pass a custom template with --chat-template-file, or add /no_think to a prompt to switch it off for that turn.

  4. Running as a service: Wrap llama-server in a systemd unit or Docker container for production use.
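
Expanding on tip 4, a minimal systemd unit might look like the following. The user, paths, and flags are assumptions based on the steps above, so adjust them to your layout; the file would live at /etc/systemd/system/llama-server.service:

```ini
[Unit]
Description=llama.cpp OpenAI-compatible server
After=network.target

[Service]
ExecStart=/home/llama/llama.cpp/build/bin/llama-server \
    -m /home/llama/models/Qwen3-8B-Q4_K_M.gguf \
    --host 127.0.0.1 --port 8081 -t 16 -c 8192
Restart=on-failure
User=llama

[Install]
WantedBy=multi-user.target
```

After creating the file, run systemctl daemon-reload and systemctl enable --now llama-server. Binding to 127.0.0.1 instead of 0.0.0.0 keeps the API off the public interface, which is usually what you want in production.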

Conclusion

llama.cpp makes it remarkably easy to run state-of-the-art language models on commodity hardware. Qwen3-8B in Q4_K_M quantisation hits a sweet spot of quality and performance for CPU inference. Combined with the OpenAI-compatible server mode, you can integrate local LLM inference into your existing toolchain with minimal friction.


Running this setup on a Contabo VPS with 24 vCPUs and 117GB RAM. Your mileage may vary depending on hardware.
