Jun 4, 2026

Build llama.cpp From Source for CPU Inference

Build llama.cpp on a CPU-only Debian box, run llama-bench for real numbers, and serve an OpenAI-compatible endpoint, with one 4.8 GB gotcha along the way.

Ollama is the easy door into local models. llama.cpp is the engine room underneath a lot of that ecosystem, and building it from source gets you the raw tools: a benchmarking harness, an OpenAI-compatible server, and a CLI, all compiled for your exact CPU. I built it on a GPU-less Debian box, ran the standard benchmark, and hit one gotcha that filled a disk. Here is the whole run.

Tested with: Debian 13 (trixie), Intel Xeon E5-2670 (6 cores), 15 GB RAM, no GPU, llama.cpp built from the latest git with CMake, model Llama-3.2-1B-Instruct-Q4_K_M.gguf.

Install the build dependencies

llama.cpp is C++ and builds with CMake. On a fresh Debian image you need the compiler toolchain, git, and cmake:

sudo apt-get update
sudo apt-get install -y git build-essential cmake

Clone and build

The project moved to the ggml-org organization. A shallow clone is plenty unless you want history:

git clone --depth 1 https://github.com/ggml-org/llama.cpp
cd llama.cpp

The build is two CMake commands: configure, then compile. The -j6 matches my six cores, so use your own core count:

cmake -B build
cmake --build build --config Release -j6
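If you would rather not hardcode the job count, a minimal sketch is to ask the system for it. This assumes a Linux box where coreutils' nproc is available (getconf is tried as a fallback), and JOBS is just a variable name chosen for this example:

```shell
# Use the number of online CPUs as the parallel job count.
# Falls back to getconf, then to 1, if nproc is not installed.
JOBS="$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"
echo "parallel jobs: ${JOBS}"
```

You can then pass -j"$JOBS" to the cmake --build command in place of -j6.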

This is the part that takes a while. On this old Xeon the compile ran for several minutes and pegged all six cores. When it finishes, your binaries are in build/bin:

$ ls build/bin | grep -E '^llama-(cli|bench|server)$'
llama-bench
llama-cli
llama-server

Three tools worth knowing: llama-cli for interactive and scripted runs, llama-server for an HTTP API, and llama-bench for honest performance numbers.

Get a model in GGUF format

llama.cpp loads standalone GGUF files; it does not pull models from Ollama's registry, so you download the file yourself. Grab a small quantized model. I used a 1B Llama 3.2 at Q4_K_M, which is a good speed-to-quality balance on CPU:

mkdir -p ~/models
wget -O ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
  https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf

That file is about 763 MiB. "Q4_K_M" means 4-bit K-quant, medium; it is the quantization most people start with because it keeps most of the quality at roughly a third of the size of 16-bit weights.
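To put a number on the size saving, here is a rough back-of-the-envelope check. It uses the 762.81 MiB file size and the 1.24 B parameter count that llama-bench reports for this model; the 16-bit baseline of 2 bytes per weight is my assumption for comparison, not something measured here. Q4_K_M is nominally 4-bit, but some tensors are stored at higher precision, so the average lands above 4 bits per weight:

```shell
# Size and parameter count as reported by llama-bench for this model.
# The 16-bit baseline (2 bytes per weight) is an assumption for comparison.
quant_report="$(awk 'BEGIN {
  size_bytes = 762.81 * 1024 * 1024   # file size in bytes
  params     = 1.24e9                 # parameter count
  printf "bits per weight: %.2f\n", size_bytes * 8 / params
  printf "fraction of 16-bit size: %.2f\n", size_bytes / (params * 2)
}')"
echo "$quant_report"
# bits per weight: 5.16
# fraction of 16-bit size: 0.32
```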

Benchmark it properly

Do not eyeball speed from a chat. llama-bench exists for exactly this and gives you repeatable numbers:

./build/bin/llama-bench -m ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf

The result on this CPU:

| model                  |       size | params | backend | threads |  test |          t/s |
| ---------------------- | ---------: | -----: | ------- | ------: | ----: | -----------: |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | CPU     |       6 | pp512 | 24.67 ± 5.90 |

The pp512 row is prompt processing: how fast llama.cpp ingests a 512-token prompt, here about 24.7 tokens/s. The other half of the story is token generation, which is slower than prompt processing on CPU: prompt tokens are processed in batches, while generation produces one token at a time and rereads the model weights for each one. For a sense of scale, in the companion Ollama test on this same box a 3B model generated at roughly 3 tokens/s. A 1B model is quicker, but the lesson holds: on CPU, reading the prompt is the cheap part and writing the answer is the bottleneck.
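It helps to turn the pp512 figure into something more tangible. The sketch below takes the 24.67 ± 5.90 tokens/s from the table and works out how long a 512-token prompt takes to ingest, plus how large the spread is relative to the mean. I am reading the ± value as the standard deviation across llama-bench's repetitions:

```shell
# Numbers copied from the pp512 row of the llama-bench table.
bench_summary="$(awk 'BEGIN {
  rate = 24.67          # mean prompt-processing speed, tokens/s
  stddev = 5.90         # the value after the +/- sign
  prompt_tokens = 512   # size of the pp512 test prompt
  printf "time to ingest %d tokens: %.1f s\n", prompt_tokens, prompt_tokens / rate
  printf "relative spread: %.0f%%\n", 100 * stddev / rate
}')"
echo "$bench_summary"
# time to ingest 512 tokens: 20.8 s
# relative spread: 24%
```

A spread of about a quarter of the mean suggests the box was not idle during the run. If you see something similar, it is probably worth rerunning on a quiet machine, possibly with more repetitions via llama-bench's -r option.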

The gotcha that filled my disk

Here is the one that cost me. llama-cli defaults to interactive conversation mode for instruct-tuned models. I tried to run a single non-interactive generation and redirected stdin from /dev/null to keep it from waiting for input. Bad idea: with stdin pointing at /dev/null, every read returned end-of-file immediately, and the CLI looped, printing an empty > prompt as fast as it could. I came back to a log file that had grown to 4.8 GB.

$ wc -c llamacli.log
4848707335 llamacli.log   # 4.8 GB of empty prompts

If you want a true one-shot from llama-cli, disable conversation mode explicitly rather than starving stdin:

./build/bin/llama-cli \
  -m ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
  -p "Explain quantization in one sentence." \
  -n 120 -no-cnv

And honestly, for anything beyond a quick poke, skip the CLI and run the server. It is the more predictable interface.
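Whichever tool you run unattended, a cheap safety net is to cap how much output can reach the log. The sketch below is one way to do that with head -c; capped_log is a hypothetical helper name of my own, not part of llama.cpp, and the demonstration uses yes as a stand-in for a runaway process:

```shell
# Keep at most max_bytes of whatever arrives on stdin, then stop reading.
# Once head exits, the writer gets SIGPIPE on its next write and terminates.
capped_log() {
  max_bytes="$1"
  out_file="$2"
  head -c "$max_bytes" > "$out_file"
}

# Stand-in for a runaway process: `yes` prints "> " forever,
# but the log stops growing at 1 MiB.
demo_log="$(mktemp)"
yes '> ' | capped_log 1048576 "$demo_log"
wc -c < "$demo_log"
```

Applied to the case above, that would look like piping the llama-cli command (with 2>&1) into capped_log 104857600 llamacli.log, so a looping process is cut off at 100 MiB instead of filling the disk.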

Run the OpenAI-compatible server

llama-server turns your model into an HTTP endpoint that speaks the OpenAI API, which means existing tools and SDKs can talk to it unchanged:

./build/bin/llama-server \
  -m ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080

Then call it like any chat completions API:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'
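If a script starts the server and calls it right away, the first request can arrive before the model has finished loading. llama-server exposes a /health endpoint for this, and a minimal sketch of a readiness check might look like the following. The function name wait_for_llama and its default arguments are my own choices for illustration:

```shell
# Poll llama-server's /health endpoint until it answers successfully.
# Arguments (hypothetical defaults): base URL, max attempts, delay in seconds.
wait_for_llama() {
  base_url="${1:-http://localhost:8080}"
  attempts="${2:-60}"
  delay="${3:-1}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors, such as 503 while loading.
    if curl -sf -o /dev/null "$base_url/health"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

Call it as wait_for_llama http://localhost:8080 60 1 before the first request, and treat a non-zero exit as "the server never came up".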

This is the real payoff of building from source. You get a self-hosted, OpenAI-shaped endpoint backed by a model file you control, compiled for your hardware, with no daemon and no account.

llama.cpp or Ollama?

Reach for Ollama when you want the easiest possible local model: it manages downloads, runs as a service, and "just works". Reach for llama.cpp when you want control: specific quantizations, the benchmark harness, build flags tuned to your CPU, or an OpenAI-compatible server without the extra layer. Many people run both: Ollama for daily driving, llama.cpp when they need to measure or tune something.

Takeaways

  • Building llama.cpp is two CMake commands once you have build-essential and cmake installed.
  • The useful binaries land in build/bin: llama-cli, llama-server, and llama-bench.
  • Use llama-bench for real numbers. On this 6-core Xeon a 1B Q4_K_M model processed prompts at about 24.7 tokens/s; generation is the slower half on CPU.
  • llama-cli is interactive by default. Do not redirect its stdin from /dev/null to make it non-interactive; pass -no-cnv instead, or you may generate a multi-gigabyte log of empty prompts.
  • llama-server gives you an OpenAI-compatible endpoint, which is the cleanest way to actually use your build.