Ollama is the easy door into local models. llama.cpp is the engine room underneath a lot of that ecosystem, and building it from source gets you the raw tools: a benchmarking harness, an OpenAI-compatible server, and a CLI, all compiled for your exact CPU. I built it on a GPU-less Debian box, ran the standard benchmark, and hit one gotcha that filled a disk. Here is the whole run.
Tested with: Debian 13 (trixie), Intel Xeon E5-2670 (6 cores), 15 GB RAM, no GPU, llama.cpp built from the latest git with CMake, model Llama-3.2-1B-Instruct-Q4_K_M.gguf.
Install the build dependencies
llama.cpp is C++ and builds with CMake. On a fresh Debian image you need the compiler toolchain, git, and cmake:
sudo apt-get update
sudo apt-get install -y git build-essential cmake
Clone and build
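Before cloning, it may be worth confirming that the toolchain from the install step is actually on PATH. A minimal sketch, assuming a POSIX shell; the helper name missing_tools is made up for this example and is not part of Debian or llama.cpp:

```shell
# Print each command from the argument list that is not on PATH.
# missing_tools is a hypothetical helper, not part of any package.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# The tools the build relies on; g++ is pulled in by build-essential.
missing=$(missing_tools git cmake g++)
if [ -n "$missing" ]; then
  echo "still missing: $missing"
else
  echo "toolchain looks complete"
fi
```

If anything is reported missing, rerunning the apt-get install line is the likely fix.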
The project moved to the ggml-org organization. A shallow clone is plenty unless you want history:
git clone --depth 1 https://github.com/ggml-org/llama.cpp
cd llama.cpp
The build is two CMake commands: configure, then compile. The -j6 matches my six cores, so use your own core count:
cmake -B build
cmake --build build --config Release -j6
This is the part that takes a while. On this old Xeon the compile ran for several minutes and pegged all six cores. When it finishes, your binaries are in build/bin:
$ ls build/bin | grep -E '^llama-(cli|bench|server)$'
llama-bench
llama-cli
llama-server
Three tools worth knowing: llama-cli for interactive and scripted runs, llama-server for an HTTP API, and llama-bench for honest performance numbers.
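On the hard-coded -j6 in the compile step: the job count can be derived from the machine instead. This is a sketch, not the project's recommended invocation. The helper name build_jobs is hypothetical, nproc is assumed to come from GNU coreutils, and getconf is used as a fallback where nproc is absent:

```shell
# Pick a parallel job count for the compile step.
# build_jobs is a hypothetical helper: try nproc, then getconf, then 1.
build_jobs() {
  nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
}

njobs=$(build_jobs)
# Only prints the command; drop the echo to actually run the build.
echo "cmake --build build --config Release -j$njobs"
```

On the six-core test box this should resolve to the same -j6 used above.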
Get a model in GGUF format
llama.cpp loads a GGUF file that you pass by path, rather than a model name resolved from a managed store as in Ollama. Grab a small quantized model. I used a 1B Llama 3.2 at Q4_K_M, which is a good speed-to-quality balance on CPU:
mkdir -p ~/models
wget -O ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf
That file is about 763 MiB. "Q4_K_M" means 4-bit K-quant, medium; it is the quantization most people start with because it keeps most of the quality at roughly a quarter to a third of the size of the 16-bit weights.
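A download can silently save an error page instead of the model, so a quick sanity check on the file may save a confusing load failure later. GGUF files begin with the four ASCII bytes "GGUF". The sketch below checks only that magic and nothing else about the file; is_gguf is a hypothetical helper name:

```shell
# Succeed only if the file begins with the 4-byte magic "GGUF".
# is_gguf is a hypothetical helper; tr strips NUL bytes so the shell
# does not warn when the file starts with binary data.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null | tr -d '\000')" = "GGUF" ]
}

model="$HOME/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf"
if is_gguf "$model"; then
  echo "looks like a GGUF file: $model"
else
  echo "missing or not a GGUF file: $model"
fi
```

A passing check does not prove the download is complete; comparing the size against the roughly 763 MiB mentioned above is a reasonable second step.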
Benchmark it properly
Do not eyeball speed from a chat. llama-bench exists for exactly this and gives you repeatable numbers:
./build/bin/llama-bench -m ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf
The result on this CPU:
| model | size | params | backend | threads | test | t/s |
| ---------------------- | ---------: | -----: | ------- | ------: | ----: | -----------: |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | CPU | 6 | pp512 | 24.67 ± 5.90 |
The pp512 row is prompt processing: how fast llama.cpp ingests a 512-token prompt, here about 24.7 tokens/s. The other half of the story is token generation, which is always slower than prompt processing on CPU. For a sense of scale, in the companion Ollama test on this same box a 3B model generated at roughly 3 tokens/s. A 1B model is quicker, but the lesson holds: on CPU, reading the prompt is cheap, writing the answer is the bottleneck.
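To put the pp512 figure in wall-clock terms, divide the prompt length by the measured rate. The sketch below assumes the rate stays constant as the prompt grows, which is a simplification, and it ignores the ± 5.90 spread in the measurement. prompt_seconds is a hypothetical helper and the 4096-token prompt is an invented example:

```shell
# Seconds needed to process a prompt of N tokens at a measured rate.
# prompt_seconds is a hypothetical helper; awk does the floating-point math.
prompt_seconds() {
  awk -v tokens="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", tokens / rate }'
}

prompt_seconds 512 24.67    # the pp512 workload at the measured mean rate
prompt_seconds 4096 24.67   # an invented longer prompt at the same rate
```

At the measured mean, the 512-token benchmark prompt works out to about 21 seconds before the first generated token appears.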
The gotcha that filled my disk
Here is the one that cost me. llama-cli defaults to interactive conversation mode for instruct-tuned models. I tried to run a single non-interactive generation and redirected stdin from /dev/null to keep it from waiting for input. Bad idea: reading from /dev/null returns end-of-file immediately, so the CLI looped, printing an empty > prompt as fast as it could. I came back to a log file that had grown to 4.8 GB.
$ wc -c llamacli.log
4848707335 llamacli.log # 4.8 GB of empty prompts
If you want a true one-shot from llama-cli, disable conversation mode explicitly rather than starving stdin:
./build/bin/llama-cli \
-m ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
-p "Explain quantization in one sentence." \
-n 120 -no-cnv
And honestly, for anything beyond a quick poke, skip the CLI and run the server. It is the more predictable interface.
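Independent of the flag, a size cap on the log is a cheap guard against a repeat of the 4.8 GB file. One generic approach, sketched here with the hypothetical helper run_capped, pipes the output through head so that only a fixed number of bytes is kept:

```shell
# Run a command but keep at most LIMIT bytes of its combined output.
# run_capped is a hypothetical helper: run_capped LIMIT LOGFILE CMD...
# Once head has read LIMIT bytes it exits, and the command is stopped
# by SIGPIPE on its next write. The command's exit status is discarded.
run_capped() {
  limit=$1
  log=$2
  shift 2
  { "$@" 2>&1 || true; } | head -c "$limit" > "$log"
}

# Demo with `yes`, which prints forever, standing in for a runaway CLI.
demo_log=$(mktemp)
run_capped 1000 "$demo_log" yes ">"
wc -c < "$demo_log"
rm -f "$demo_log"
```

Applied to the one-shot above, that would look like run_capped 1048576 llamacli.log ./build/bin/llama-cli followed by the same arguments, which caps the log at 1 MiB.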
Run the OpenAI-compatible server
llama-server turns your model into an HTTP endpoint that speaks the OpenAI API, which means existing tools and SDKs can talk to it unchanged:
./build/bin/llama-server \
-m ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
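--host 0.0.0.0 --port 8080
Binding to 0.0.0.0 listens on every network interface; use 127.0.0.1 instead if only local clients need access. Then call it like any chat completions API: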
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "Say hello in one sentence."}]
}'
This is the real payoff of building from source. You get a self-hosted, OpenAI-shaped endpoint backed by a model file you control, compiled for your hardware, with no daemon and no account.
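When the server is started from a script, the first request can arrive before the model has finished loading. A small retry loop covers that gap. The sketch below is generic; wait_until is a hypothetical helper. The /health path in the usage comment comes from the llama-server documentation as I understand it and is an assumption to verify against your build:

```shell
# Retry a command until it succeeds or the attempts run out.
# wait_until is a hypothetical helper: wait_until TRIES DELAY_SECONDS CMD...
wait_until() {
  tries=$1
  delay=$2
  shift 2
  while [ "$tries" -gt 0 ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    tries=$((tries - 1))
    # Sleep only if another attempt is coming.
    [ "$tries" -gt 0 ] && sleep "$delay"
  done
  return 1
}

# Assumed usage against the server started above (30 tries, 1 s apart):
#   wait_until 30 1 curl -fsS http://localhost:8080/health
# Harmless demo that needs no server:
wait_until 3 0 true && echo "ready"
```

Polling with curl -f treats any HTTP error status as a failure, so the loop keeps retrying until the endpoint answers successfully.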
llama.cpp or Ollama?
Reach for Ollama when you want the easiest possible local model: it manages downloads, runs as a service, and "just works". Reach for llama.cpp when you want control: specific quantizations, the benchmark harness, build flags tuned to your CPU, or an OpenAI-compatible server without the extra layer. Many people run both. Ollama for daily driving, llama.cpp when they need to measure or tune something.
Takeaways
- Building llama.cpp is two CMake commands once you have build-essential and cmake installed.
- The useful binaries land in build/bin: llama-cli, llama-server, and llama-bench.
- Use llama-bench for real numbers. On this 6-core Xeon a 1B Q4_K_M model processed prompts at about 24.7 tokens/s; generation is the slower half on CPU.
- llama-cli is interactive by default. Do not close its stdin to make it non-interactive; pass -no-cnv instead, or you may generate a multi-gigabyte log of empty prompts.
- llama-server gives you an OpenAI-compatible endpoint, which is the cleanest way to actually use your build.