I wanted a local LLM running on a spare box with no GPU, just to see how usable CPU-only inference actually is in 2026. Ollama is the fastest way to find out: one install script, one pull, and you are chatting with a model that never leaves your machine. Here is the exact run, including the speed numbers you should expect on older hardware and the one thing that tripped me up before I even started.
Tested with: Debian 13 (trixie), Intel Xeon E5-2670 (6 cores), 15 GB RAM, no GPU, Ollama 0.24.0, model llama3.2:3b.
The gotcha before the gotcha: no curl
The official install is a piped shell script that starts with curl. On this minimal Debian image, curl was not installed, so the very first command failed:
$ curl -fsSL https://ollama.com/install.sh | sh
bash: curl: command not found
If you are on a fresh server image, install the basics first. You will want git and build tools later anyway:
sudo apt-get update
sudo apt-get install -y curl git ca-certificates
Install Ollama
With curl present, the install is a single line:
curl -fsSL https://ollama.com/install.sh | sh
The script does more than drop a binary. It creates a dedicated ollama user, registers a systemd service, and starts it. The tail of the output is the part worth reading:
>>> Creating ollama user...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
WARNING: No NVIDIA/AMD GPU detected. Ollama will run in CPU-only mode.
That last line is not an error. It is Ollama telling you it will use the CPU, which is exactly what we are testing. Confirm the version and that the service is up:
$ ollama --version
ollama version is 0.24.0
$ systemctl is-active ollama
active
Because the installer runs Ollama as a background service listening on 127.0.0.1:11434, you do not need to start anything yourself. The ollama CLI is just a client that talks to that local API.
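If you would rather confirm the service from a script than from systemctl, something like the following should work. It is a minimal sketch that assumes the default address printed by the installer and uses the API's version endpoint; the helper names version_url and ollama_version are mine, not part of Ollama.

```python
import json
import urllib.error
import urllib.request

# Address printed by the installer; adjust it if you changed the bind address.
OLLAMA_URL = "http://127.0.0.1:11434"


def version_url(base_url: str = OLLAMA_URL) -> str:
    """Build the URL of the version endpoint from the base address."""
    return base_url.rstrip("/") + "/api/version"


def ollama_version(base_url: str = OLLAMA_URL, timeout: float = 3.0):
    """Return the server's version string, or None if nothing usable answers."""
    try:
        with urllib.request.urlopen(version_url(base_url), timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON reply.
        return None


if __name__ == "__main__":
    version = ollama_version()
    if version:
        print(f"Ollama API is up, version {version}")
    else:
        print(f"No answer from {OLLAMA_URL}")
```

Returning None instead of raising keeps the check easy to drop into a larger health script.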
Pull a small model
On CPU, model size is everything. A 3-billion-parameter model is a sensible starting point: big enough to be coherent, small enough to load fast. I pulled Llama 3.2 3B:
$ ollama pull llama3.2:3b
...
success
$ ollama list
NAME           ID              SIZE      MODIFIED
llama3.2:3b    a80c4f17acd5    2.0 GB    10 seconds ago
Two gigabytes on disk, which is the quantized weight file. Ollama stores models under /usr/share/ollama/.ollama/models when it runs as the service user.
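To see why a 3B model lands near 2 GB, here is a back-of-the-envelope calculation. It is only a sketch: the 3.2 billion parameter count and the roughly 5 bits per weight are my assumptions for illustration, not values read from the model file, and real files also carry some metadata.

```python
def estimated_weights_gb(parameters: float, bits_per_weight: float) -> float:
    """Approximate weight-file size in decimal gigabytes."""
    return parameters * bits_per_weight / 8 / 1e9


# Assumed parameter count for a "3b" model (illustrative, not measured).
PARAMETERS = 3.2e9

# Compare full 16-bit weights with two assumed quantization levels.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("~5-bit", 5)]:
    print(f"{label:>7}: {estimated_weights_gb(PARAMETERS, bits):.1f} GB")
```

Under those assumptions the 16-bit weights would be about three times larger than the 2.0 GB that ollama list reports, which is the practical reason quantized models are the default choice on a CPU box.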
Run it, and measure
The --verbose flag is the one to remember. It prints timing stats after every response, which is how you turn "feels slow" into actual numbers:
$ ollama run llama3.2:3b "Write a short paragraph explaining what an embedding is in machine learning." --verbose
The model answered cleanly:
In machine learning, an embedding is a mathematical representation of a high-dimensional input space into a lower-dimensional, more compact and meaningful space. It maps each unique piece of data (such as words, images, or objects) to a dense vector in the new space, where similar inputs are mapped closer together. This allows for efficient storage and processing of complex data sets while preserving the underlying structure and relationships between the original inputs.
And then the numbers that actually matter:
total duration: 45.041526157s
load duration: 742.128419ms
prompt eval count: 38 token(s)
prompt eval duration: 1.997157037s
prompt eval rate: 19.03 tokens/s
eval count: 133 token(s)
eval duration: 41.917526794s
eval rate: 3.17 tokens/s
What those numbers tell you
Read the two rates separately. Prompt eval rate (19.03 tokens/s) is how fast the model reads your input. Eval rate (3.17 tokens/s) is how fast it writes the response. On this old Xeon, generation runs at roughly three tokens per second, which works out to about two English words per second, since a token is usually a little shorter than a word. It is readable in real time, like watching someone type, but you would not want to generate a long document with it.
That is the honest reality of CPU inference on aging server silicon: fine for short questions, classification, and tinkering; painful for anything long-form. Newer CPUs with AVX-512 and faster memory do meaningfully better, and a 1B model roughly doubles the generation speed. But if you care about throughput, this is the experiment that convinces you to budget for a GPU.
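The rates are just counts divided by durations, so you can recompute them and project how long a longer answer would take. This sketch uses the counts and durations from the --verbose output above; the 100, 500, and 2000 token targets are arbitrary examples, and it assumes the rate stays constant, which may not hold as the context grows.

```python
def tokens_per_second(token_count: int, duration_s: float) -> float:
    """Rate as reported by --verbose: tokens divided by elapsed seconds."""
    return token_count / duration_s


def seconds_to_generate(tokens: int, rate: float) -> float:
    """Projected generation time at a constant rate (a simplification)."""
    return tokens / rate


# Counts and durations copied from the run above.
prompt_rate = tokens_per_second(38, 1.997157037)
eval_rate = tokens_per_second(133, 41.917526794)

print(f"prompt eval rate: {prompt_rate:.2f} tokens/s")
print(f"eval rate:        {eval_rate:.2f} tokens/s")

# Arbitrary response lengths, to get a feel for long-form output.
for n in (100, 500, 2000):
    minutes = seconds_to_generate(n, eval_rate) / 60
    print(f"{n:>5} tokens: about {minutes:.1f} min")
```

At this rate a 2000-token answer takes more than ten minutes, which is where "painful for anything long-form" comes from.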
Talking to the API directly
The CLI is convenient, but the same service answers HTTP, which is what you wire into scripts and apps:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Say hello in one sentence.",
  "stream": false
}'
That endpoint is also why tools like Open WebUI can put a chat interface in front of Ollama with almost no configuration. Point them at 11434 and they just work.
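The same request from a script might look like the sketch below, using only the Python standard library. It assumes the default port and the model pulled earlier. With streaming off, the reply is a single JSON object that includes eval_count and eval_duration (in nanoseconds), so you can get the generation rate without --verbose. The helper names build_request, generate, and eval_rate are my own.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"


def build_request(prompt: str, model: str = "llama3.2:3b") -> urllib.request.Request:
    """Prepare the same POST that the curl command above sends."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL + "/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3.2:3b", timeout: float = 300.0) -> dict:
    """Send the prompt and return the parsed JSON reply."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.load(resp)


def eval_rate(reply: dict) -> float:
    """Generation speed in tokens/s; eval_duration is in nanoseconds."""
    return reply["eval_count"] / (reply["eval_duration"] / 1e9)


if __name__ == "__main__":
    try:
        reply = generate("Say hello in one sentence.")
    except OSError as exc:  # covers connection refused, timeouts, HTTP errors
        print(f"Could not reach Ollama: {exc}")
    else:
        print(reply["response"])
        print(f"eval rate: {eval_rate(reply):.2f} tokens/s")
```

The generous timeout is deliberate: at around 3 tokens/s, even a short answer can take most of a minute on this hardware.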
Takeaways
- Ollama is genuinely one command to install and one to pull a model, but check for curl first on minimal server images.
- The installer runs Ollama as a systemd service on 127.0.0.1:11434, so there is nothing to start manually.
- Always run with --verbose while you are evaluating hardware. The eval rate line is your real benchmark.
- On an old 6-core Xeon with no GPU, expect around 3 tokens/s generating with a 3B model. Usable for short tasks, slow for long ones.
- Start small (3B or even 1B) on CPU, and reach for a GPU only once the token rate is the thing holding you back.