Build llama.cpp From Source for CPU Inference

Ollama is the easy door into local models. llama.cpp is the engine room underneath a lot of that ecosystem, and building it from source gets you the raw tools: a benchmarking harness, an OpenAI-compatible server, and a CLI, all compiled for your exact CPU. I built it on a GPU-less Debian box, ran the standard benchmark, and hit one gotcha that filled a disk. Here is the whole run.

Tested with: Debian 13 (trixie), Intel Xeon E5-2670 (6 cores), 15 GB RAM, no GPU, llama.cpp built from the latest git with CMake, model Llama-3.2-1B-Instruct-Q4_K_M.gguf.

Install the build dependencies

llama.cpp is C++ and builds with CMake. On a fresh Debian image you need the compiler toolchain, git, and cmake:

sudo apt-get update
sudo apt-get install -y git build-essential cmake
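
Before moving on, it can help to confirm the toolchain is actually on PATH. This is a minimal sketch: missing_tools is a hypothetical helper name, not part of llama.cpp or Debian, and the tool list assumes build-essential is what supplies gcc, g++, and make.

```shell
# Hypothetical helper: print one line per tool that is not on PATH.
# Prints nothing when every tool is found.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
  done
}

# The tools the build steps in this article rely on.
missing_tools git gcc g++ make cmake
```

No output means everything was found; any "missing:" line points at a package to install before configuring.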

Clone and build

The repository now lives under the ggml-org organization on GitHub (it was previously under ggerganov). A shallow clone is plenty unless you want history:

git clone --depth 1 https://github.com/ggml-org/llama.cpp
cd llama.cpp

The build is two CMake commands: configure, then compile. The -j6 matches my six cores, so use your own core count:

cmake -B build
cmake --build build --config Release -j6
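
If you would rather not hard-code the core count, nproc from GNU coreutils reports it. A small sketch, assuming it runs from the llama.cpp checkout after the configure step; the JOBS variable name is only illustrative.

```shell
# Ask the OS how many cores are available; fall back to 1 if nproc is missing.
JOBS="$(nproc 2>/dev/null || echo 1)"

# Only build once the configure step has created build/CMakeCache.txt.
if [ -f build/CMakeCache.txt ]; then
  cmake --build build --config Release -j"$JOBS"
else
  echo "run 'cmake -B build' first (would build with $JOBS jobs)"
fi
```

On a shared machine you may prefer a value below the full core count so the box stays responsive during the compile.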

This is the part that takes a while. On this old Xeon the compile ran for several minutes and pegged all six cores. When it finishes, your binaries are in build/bin:

$ ls build/bin | grep -E '^llama-(cli|bench|server)$'
llama-bench
llama-cli
llama-server

Three tools worth knowing: llama-cli for interactive and scripted runs, llama-server for an HTTP API, and llama-bench for honest performance numbers.

Get a model in GGUF format

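llama.cpp loads models in GGUF format. Ollama keeps its models as hash-named blobs inside its own store, so it is simpler to download a GGUF file directly than to reuse one from there. Grab a small quantized model. I used a 1B Llama 3.2 at Q4_K_M, which is a good speed-to-quality balance on CPU: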

mkdir -p ~/models
wget -O ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
  https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf
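
A download that fails partway, or that saves an error page under the .gguf name, will only show up later as a confusing load error. GGUF files begin with the four ASCII bytes "GGUF", so a quick check is possible. This is a sketch, with is_gguf as a hypothetical helper name; it only inspects the first four bytes and does not validate the rest of the file.

```shell
# Hypothetical helper: succeed only if the file starts with the GGUF magic bytes.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

MODEL="$HOME/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf"
if is_gguf "$MODEL"; then
  echo "looks like a GGUF file"
else
  echo "not a GGUF file (missing, truncated, or an error page?)"
fi
```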

That file is about 763 MiB. "Q4_K_M" means 4-bit K-quant, medium; it is the quantization most people start with because it keeps most of the quality at roughly a third of the size of the 16-bit weights.
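
As a rough check on how much smaller the quantized file is, you can divide its size in bits by the parameter count. The sketch below uses the 762.81 MiB size and 1.24 B parameter count that llama-bench reports for this file; bits_per_weight is a hypothetical helper name, and the result is an average over the whole file, not the width of any single tensor.

```shell
# Hypothetical helper: average bits per weight = file size in bits / parameter count.
# $1 = file size in MiB, $2 = number of parameters.
bits_per_weight() {
  awk -v mib="$1" -v params="$2" \
    'BEGIN { printf "%.2f\n", mib * 1048576 * 8 / params }'
}

bits_per_weight 762.81 1.24e9   # prints 5.16
```

About 5.16 bits per weight against 16 bits for half-precision weights works out to roughly 32% of the original size. The average sits above 4 bits because a K-quant mix does not store every tensor at the same width.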

Benchmark it properly

Do not eyeball speed from a chat. llama-bench exists for exactly this and gives you repeatable numbers:

./build/bin/llama-bench -m ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf

The result on this CPU:

| model                  |       size | params | backend | threads |  test |          t/s |
| ---------------------- | ---------: | -----: | ------- | ------: | ----: | -----------: |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | CPU     |       6 | pp512 | 24.67 ± 5.90 |

The pp512 row is prompt processing: how fast llama.cpp ingests a 512-token prompt, here about 24.7 tokens/s. The other half of the story is token generation, which is typically slower than prompt processing on CPU because a prompt is processed in batches while output tokens are produced one at a time. For a sense of scale, in the companion Ollama test on this same box a 3B model generated at roughly 3 tokens/s. A 1B model is quicker, but the lesson holds: on CPU, reading the prompt is comparatively cheap, writing the answer is the bottleneck.
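
To measure both halves explicitly, llama-bench lets you set the prompt length, the generation length, the thread counts, and the number of repetitions. The sketch below assumes the -p, -n, -t, and -r options behave as described in llama-bench --help for your build, and that the paths match the ones used in this article; the BENCH and MODEL variable names are only illustrative.

```shell
BENCH=./build/bin/llama-bench
MODEL="$HOME/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf"

# Run only if both the binary and the model are where this article put them.
if [ -x "$BENCH" ] && [ -f "$MODEL" ]; then
  # -p 512: 512-token prompt test, -n 128: 128-token generation test,
  # -t 2,4,6: repeat for each thread count, -r 3: three repetitions per test.
  "$BENCH" -m "$MODEL" -p 512 -n 128 -t 2,4,6 -r 3
else
  echo "llama-bench or the model file is missing" >&2
fi
```

Sweeping the thread count is a cheap way to find out whether using every core actually helps on your CPU.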

The gotcha that filled my disk

Here is the one that cost me. llama-cli defaults to interactive conversation mode for instruct-tuned models. I tried to run a single non-interactive generation and redirected stdin from /dev/null to keep it from waiting for input. Bad idea: /dev/null does not close stdin, it makes every read return end-of-file immediately, so the CLI looped, printing an empty > prompt as fast as it could. I came back to a log file that had grown to 4.8 GB.

$ wc -c llamacli.log
4848707335 llamacli.log   # 4.8 GB of empty prompts

If you want a true one-shot from llama-cli, disable conversation mode explicitly rather than starving stdin:

./build/bin/llama-cli \
  -m ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
  -p "Explain quantization in one sentence." \
  -n 120 -no-cnv
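
Independently of llama.cpp, any unattended command whose output goes to a file can be given a time limit and a size cap so a loop like this cannot fill a disk. This is a generic sketch using timeout and head from GNU coreutils; capped_run is a hypothetical helper name, and `yes '>'` stands in for the runaway process.

```shell
# Hypothetical helper: run a command for at most $2 seconds and
# pass through at most $1 bytes of its combined stdout and stderr.
capped_run() {
  max_bytes="$1"; seconds="$2"; shift 2
  timeout "$seconds" "$@" 2>&1 | head -c "$max_bytes"
}

# Stand-in for a process that prints prompts forever.
bytes="$(capped_run 1000 5 yes '>' | wc -c)"
echo "kept $bytes bytes"

# Real use would look like (not run here):
#   capped_run 1000000 120 ./build/bin/llama-cli -m MODEL -p PROMPT -n 120 -no-cnv > llamacli.log
```

Once head has read its byte budget it exits, the writer's next write fails, and the pipeline ends, so the log stops growing at the cap instead of at the size of the disk.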

And honestly, for anything beyond a quick poke, skip the CLI and run the server. It is the more predictable interface.

Run the OpenAI-compatible server

llama-server turns your model into an HTTP endpoint that speaks the OpenAI API, which means existing tools and SDKs can talk to it unchanged:

./build/bin/llama-server \
  -m ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080

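Note that --host 0.0.0.0 makes the server listen on every network interface, and it accepts requests without a key unless you start it with --api-key; use --host 127.0.0.1 if only local clients need it. Then call it like any chat completions API: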

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'
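
The server needs a moment to load the model before it can answer, so scripts that start it and immediately send a request can fail. A small polling sketch, assuming your build exposes llama-server's /health endpoint and that it returns an error status until the model is loaded; wait_for_server is a hypothetical helper name.

```shell
# Hypothetical helper: poll a URL once per second until it returns an
# HTTP success status, giving up after $2 attempts.
wait_for_server() {
  url="$1"; tries="$2"
  while [ "$tries" -gt 0 ]; do
    curl -sf -o /dev/null "$url" && return 0
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}

# Usage (not run here): wait up to 60 seconds, then send the first request.
#   wait_for_server http://localhost:8080/health 60 && curl http://localhost:8080/v1/chat/completions ...
```

The answer text in a chat completions response sits under choices[0].message.content, following the OpenAI response shape.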

This is the real payoff of building from source. You get a self-hosted, OpenAI-shaped endpoint backed by a model file you control, compiled for your hardware, with no daemon and no account.

llama.cpp or Ollama?

Reach for Ollama when you want the easiest possible local model: it manages downloads, runs as a service, and "just works". Reach for llama.cpp when you want control: specific quantizations, the benchmark harness, build flags tuned to your CPU, or an OpenAI-compatible server without the extra layer. Many people run both: Ollama for daily driving, llama.cpp when they need to measure or tune something.

Takeaways

Building llama.cpp is two CMake commands once you have build-essential and cmake installed.
The useful binaries land in build/bin: llama-cli, llama-server, and llama-bench.
Use llama-bench for real numbers. On this 6-core Xeon a 1B Q4_K_M model processed prompts at about 24.7 tokens/s; generation is the slower half on CPU.
llama-cli is interactive by default. Do not redirect its stdin from /dev/null to make it non-interactive; pass -no-cnv instead, or you may generate a multi-gigabyte log of empty prompts.
llama-server gives you an OpenAI-compatible endpoint, which is the cleanest way to actually use your build.