Jun 1, 2026

Run a Local LLM with Ollama on Debian (CPU-Only)

Install Ollama, pull a model, and benchmark CPU-only inference with real tokens/s numbers from an old 6-core Xeon.

I wanted a local LLM running on a spare box with no GPU, just to see how usable CPU-only inference actually is in 2026. Ollama is the fastest way to find out: one install script, one pull, and you are chatting with a model that never leaves your machine. Here is the exact run, including the speed numbers you should expect on older hardware and the one thing that tripped me up before I even started.

Tested with: Debian 13 (trixie), Intel Xeon E5-2670 (6 cores), 15 GB RAM, no GPU, Ollama 0.24.0, model llama3.2:3b.

The gotcha before the gotcha: no curl

The official install is a one-liner that downloads a shell script with curl and pipes it to sh. On this minimal Debian image, curl was not installed, so the very first command failed:

$ curl -fsSL https://ollama.com/install.sh | sh
bash: curl: command not found

If you are on a fresh server image, install the basics first. The command below also pulls in git and ca-certificates, which you will likely want later anyway:

sudo apt-get update
sudo apt-get install -y curl git ca-certificates
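
If you provision machines from a script, the same check can run before the installer does. Here is a minimal sketch using only the Python standard library; missing_tools is a hypothetical helper name, not something Ollama or Debian ships:

```python
import shutil


def missing_tools(names):
    """Return the commands from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # curl is what the install one-liner needs; git is optional at this stage.
    missing = missing_tools(["curl", "git"])
    if missing:
        print("Install first: sudo apt-get install -y " + " ".join(missing))
    else:
        print("curl and git are present")
```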

Install Ollama

With curl present, the install is a single line:

curl -fsSL https://ollama.com/install.sh | sh

The script does more than drop a binary. It creates a dedicated ollama user, registers a systemd service, and starts it. The tail of the output is the part worth reading:

>>> Creating ollama user...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
WARNING: No NVIDIA/AMD GPU detected. Ollama will run in CPU-only mode.

That last line is not an error. It is Ollama telling you it will use the CPU, which is exactly what we are testing. Confirm the version and that the service is up:

$ ollama --version
ollama version is 0.24.0

$ systemctl is-active ollama
active

Because the installer runs Ollama as a background service listening on 127.0.0.1:11434, you do not need to start anything yourself. The ollama CLI is just a client that talks to that local API.
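
You can confirm the same thing from code rather than systemctl. The sketch below assumes the service exposes a GET /api/version endpoint that returns a small JSON object with a "version" key, which is how I read the Ollama API documentation; the function name is my own:

```python
import json
import urllib.error
import urllib.request


def ollama_version(base_url="http://127.0.0.1:11434", timeout=5):
    """Return the server's version string, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(base_url + "/api/version", timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a reply that is not JSON.
        return None


if __name__ == "__main__":
    version = ollama_version()
    print("Ollama API version:", version if version else "not reachable")
```

Returning None instead of raising keeps the check easy to drop into a health probe or a provisioning script.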

Pull a small model

On CPU, model size is everything. A 3-billion-parameter model is a sensible starting point: big enough to be coherent, small enough to load fast. I pulled Llama 3.2 3B:

$ ollama pull llama3.2:3b
...
success

$ ollama list
NAME           ID              SIZE      MODIFIED
llama3.2:3b    a80c4f17acd5    2.0 GB    10 seconds ago

The model takes two gigabytes on disk, which is the size of the quantized weight file. When Ollama runs as the service user, it stores models under /usr/share/ollama/.ollama/models.
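
To see what "quantized" buys you, here is a back-of-the-envelope sketch. It assumes the 3B label means roughly 3.2 billion parameters and reads the 2.0 GB as decimal gigabytes. Neither figure is reported by the run above, so treat the result as an estimate:

```python
# Assumption: "3B" is about 3.2 billion parameters. `ollama list` does not
# report this count, so it is an estimate, not a measured value.
PARAMS = 3.2e9
SIZE_ON_DISK_BYTES = 2.0e9  # the 2.0 GB from `ollama list`, read as decimal GB


def bits_per_weight(size_bytes, params):
    """Average number of bits stored per parameter."""
    return size_bytes * 8 / params


def size_gb(params, bits):
    """Size in decimal gigabytes at a given bit width per parameter."""
    return params * bits / 8 / 1e9


if __name__ == "__main__":
    print(f"average bits per weight: {bits_per_weight(SIZE_ON_DISK_BYTES, PARAMS):.1f}")
    print(f"same model at 16 bits:   {size_gb(PARAMS, 16):.1f} GB")
```

Under those assumptions the file averages about 5 bits per weight, against roughly 6.4 GB for the same parameter count at 16 bits. The on-disk size is rounded and the file holds more than raw weights, so this is only a rough guide.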

Run it, and measure

The --verbose flag is the one to remember. It prints timing stats after every response, which is how you turn "feels slow" into actual numbers:

$ ollama run llama3.2:3b "Write a short paragraph explaining what an embedding is in machine learning." --verbose

The model answered cleanly:

In machine learning, an embedding is a mathematical representation of a high-dimensional input space into a lower-dimensional, more compact and meaningful space. It maps each unique piece of data (such as words, images, or objects) to a dense vector in the new space, where similar inputs are mapped closer together. This allows for efficient storage and processing of complex data sets while preserving the underlying structure and relationships between the original inputs.

And then the numbers that actually matter:

total duration:       45.041526157s
load duration:        742.128419ms
prompt eval count:    38 token(s)
prompt eval duration: 1.997157037s
prompt eval rate:     19.03 tokens/s
eval count:           133 token(s)
eval duration:        41.917526794s
eval rate:            3.17 tokens/s

What those numbers tell you

Read the two rates separately. Prompt eval rate (19.03 tokens/s) is how fast the model reads your input. Eval rate (3.17 tokens/s) is how fast it writes the response. On this old Xeon, generation runs at roughly three tokens per second, which is about three words every two seconds. It is readable in real time, like watching someone type, but you would not want to generate a long document with it.
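
Both rates are just a token count divided by a duration, so you can recompute them from the other lines of the stats block. A small sketch that parses the output shown above; it only handles durations ending in ms or s, which is all this run produced, and the function names are my own:

```python
import re

# The relevant lines from the --verbose output of the run above.
STATS = """\
prompt eval count:    38 token(s)
prompt eval duration: 1.997157037s
eval count:           133 token(s)
eval duration:        41.917526794s
"""


def to_seconds(text):
    """Convert a duration such as '742.128419ms' or '41.9s' to seconds."""
    match = re.fullmatch(r"([0-9.]+)(ms|s)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    value, unit = float(match.group(1)), match.group(2)
    return value / 1000 if unit == "ms" else value


def parse_stats(block):
    """Map each 'name: value' line to the first token of its value."""
    stats = {}
    for line in block.splitlines():
        name, _, value = line.partition(":")
        if value.strip():
            stats[name.strip()] = value.split()[0]
    return stats


def rates(block):
    """Return (prompt tokens/s, generation tokens/s) from a stats block."""
    s = parse_stats(block)
    prompt = int(s["prompt eval count"]) / to_seconds(s["prompt eval duration"])
    gen = int(s["eval count"]) / to_seconds(s["eval duration"])
    return prompt, gen


if __name__ == "__main__":
    prompt_rate, gen_rate = rates(STATS)
    print(f"prompt eval rate: {prompt_rate:.2f} tokens/s")
    print(f"eval rate:        {gen_rate:.2f} tokens/s")
```

This reproduces the 19.03 and 3.17 tokens/s that Ollama printed, which is a handy sanity check when you paste stats from several runs into a script.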

That is the honest reality of CPU inference on aging server silicon: fine for short questions, classification, and tinkering; painful for anything long-form. Newer CPUs with AVX-512 and faster memory do meaningfully better, and a 1B model roughly doubles the generation speed. But if you care about throughput, this is the experiment that convinces you to budget for a GPU.
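
To put "painful for anything long-form" into minutes, divide the output length by the eval rate. A quick sketch, assuming the rate stays steady over the whole response; the 500 and 2000 token lengths are illustrative, not measured:

```python
def generation_seconds(tokens, tokens_per_second):
    """Seconds needed to generate `tokens` at a steady rate."""
    return tokens / tokens_per_second


if __name__ == "__main__":
    rate = 3.17  # eval rate measured in the run above
    for tokens in (133, 500, 2000):  # 500 and 2000 are illustrative lengths
        minutes = generation_seconds(tokens, rate) / 60
        print(f"{tokens:>5} tokens: about {minutes:.1f} min")
```

At this rate a 2,000-token answer takes about ten and a half minutes, before counting load and prompt evaluation time.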

Talking to the API directly

The CLI is convenient, but the same service answers HTTP, which is what you wire into scripts and apps:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Say hello in one sentence.",
  "stream": false
}'

That endpoint is also why tools like Open WebUI can put a chat interface in front of Ollama with almost no configuration. Point them at port 11434 on the machine running Ollama and they just work.
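
The same request from a script looks like this. It is a sketch using only the Python standard library, and it assumes the non-streaming reply carries response, eval_count, and eval_duration (in nanoseconds) fields, which is how I read the Ollama API documentation. The function names are my own:

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def generate(prompt, model="llama3.2:3b", url=OLLAMA_URL, timeout=300):
    """POST one non-streaming request and return the decoded JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    request = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


def tokens_per_second(reply):
    """Generation rate from a reply; eval_duration is assumed to be nanoseconds."""
    return reply["eval_count"] / (reply["eval_duration"] / 1e9)


if __name__ == "__main__":
    try:
        reply = generate("Say hello in one sentence.")
    except (urllib.error.URLError, OSError, ValueError) as err:
        print("Could not reach Ollama:", err)
    else:
        print(reply["response"])
        print(f"{tokens_per_second(reply):.2f} tokens/s")
```

The generous timeout is deliberate: at a few tokens per second, a non-streaming call does not return until the whole answer has been generated.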

Takeaways

  • Ollama is genuinely one command to install and one to pull a model, but check for curl first on minimal server images.
  • The installer runs Ollama as a systemd service on 127.0.0.1:11434, so there is nothing to start manually.
  • Always run with --verbose while you are evaluating hardware. The eval rate line is your real benchmark.
  • On an old 6-core Xeon with no GPU, expect around 3 tokens/s generating with a 3B model. Usable for short tasks, slow for long ones.
  • Start small (3B or even 1B) on CPU, and reach for a GPU only once the token rate is the thing holding you back.