
Local AI on a Mac: what fits in 16 GB without lying to yourself

A practical map of which local AI workflows are real and which only look real, on a Mac with 16 GB of unified memory.

Published April 18, 2026 · 3 min read
  • AI
  • Local AI
  • macOS
  • Ollama
  • MLX

A 16 GB MacBook is not a workstation. It is a phone with delusions of grandeur. And yet a meaningful chunk of “local AI” is genuinely useful on one — if you stop pretending you can run frontier models offline.

This is the map I keep handy. None of it is theoretical; all of it is what I actually run on my own machine.

What 16 GB really gives you

Of those 16 GB, macOS and Chrome and Slack happily eat 6–8 GB before you open a terminal. So your useful “model budget” is around 8 GB. Anything above that and your machine starts swapping; Apple Silicon swaps quietly, right up until it does not.

That useful budget translates to (back-of-the-envelope arithmetic after the list):

  • Up to ~7B-parameter models at 4-bit quantisation (Q4_K_M GGUF, MLX 4-bit). Comfortable. Snappy.
  • 8B at 4-bit if you close everything else. Possible. Slightly tense.
  • 13B at 4-bit only if you really close everything. Not your daily driver.
  • Anything ≥ 30B: API. Always. Stop kidding yourself.
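
Where those numbers come from, as a minimal sketch: 4-bit weights plus a flat 1.2× allowance for KV cache and runtime overhead. The multiplier is my assumption, not a measurement; real usage grows with context length.

    # Back-of-the-envelope footprint for a 4-bit quantised model.
    # The 1.2x overhead for KV cache + runtime is an assumption,
    # not a measured constant.
    def model_footprint_gb(params_billion: float, bits: int = 4, overhead: float = 1.2) -> float:
        weights_gb = params_billion * 1e9 * bits / 8 / 1e9  # weights alone
        return weights_gb * overhead

    for size in (7, 8, 13, 30):
        print(f"{size}B @ 4-bit ~= {model_footprint_gb(size):.1f} GB")
    # 7B ~= 4.2, 8B ~= 4.8, 13B ~= 7.8, 30B ~= 18.0
    # -> 7-8B fits an ~8 GB budget, 13B is tight, 30B simply does not fit.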

Two stacks worth knowing

Ollama is the one I tell people to install first. It is brew install ollama, then ollama run llama3.1:8b. It picks the right quantisation for you, handles model storage, exposes a localhost API, and never asks you to think about CUDA. The cost: it is slightly slower than the per-platform optimum, because it abstracts the substrate. For 90 % of “I want a local model to draft this thing” use cases, that is fine.
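
To make “exposes a localhost API” concrete: Ollama listens on http://localhost:11434 and the generate endpoint takes a model name and a prompt. A minimal call, assuming the 8B model above is already pulled (the prompt is only an illustration):

    # Minimal call to Ollama's local HTTP API; nothing leaves the machine.
    import json, urllib.request

    payload = {
        "model": "llama3.1:8b",
        "prompt": "Rewrite this changelog entry as one plain sentence: ...",
        "stream": False,  # one JSON object back instead of a token stream
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])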

MLX is Apple’s machine-learning framework. It is faster on Apple Silicon than anything else, because it speaks Metal natively and uses the unified memory like it should be used. The cost: more setup, fewer models, more rough edges. Worth it when the model is your bottleneck (real inference workloads, fine-tuning small models locally) — overkill when it is your hands.
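
For comparison, the same kind of call through the mlx-lm package looks roughly like this. A sketch only: the model name is one of the mlx-community 4-bit conversions on Hugging Face; substitute whichever you actually run.

    # Rough equivalent on Apple's MLX stack (pip install mlx-lm).
    # Model name is illustrative; any 4-bit mlx-community conversion works.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
    print(generate(model, tokenizer,
                   prompt="Summarise these release notes in three bullets: ...",
                   max_tokens=200))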

Default to Ollama; reach for MLX when tokens per second (or per watt) is the number you actually care about.

Three workflows that actually work on 16 GB

  1. Drafting and rewriting. A 7B local model is excellent at “make this paragraph less corporate” or “summarise these 800 lines of changelog into release notes”. Quality is good enough; latency is human; nothing leaves the machine.
  2. Local code-aware retrieval. A 7B model + a small embedding model (nomic-embed-text or bge-small) running locally is enough to answer “given my notes folder, what did I write about X”. Pair it with a tiny custom CLI; do not bother with a vector database, because SQLite + cosine similarity is enough at this scale (sketched after this list).
  3. Privacy-sensitive routing. Use a local model as the first model. If the user asks to “rewrite this paragraph”, local handles it. If the user asks to “design this database”, route to Claude. The local model is a router and a privacy boundary, not a replacement; its skeleton is also sketched after this list.
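
A minimal sketch of the retrieval workflow in point 2, assuming Ollama is serving nomic-embed-text locally. The table layout and helper names are mine, not a published tool, and filling the table (walking the notes folder and inserting embeddings) is left out.

    # Notes retrieval with SQLite + cosine similarity, no vector database.
    import json, math, sqlite3, urllib.request

    def embed(text: str) -> list[float]:
        req = urllib.request.Request(
            "http://localhost:11434/api/embeddings",
            data=json.dumps({"model": "nomic-embed-text", "prompt": text}).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["embedding"]

    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    db = sqlite3.connect("notes.db")  # one row per note
    db.execute("CREATE TABLE IF NOT EXISTS notes (path TEXT, body TEXT, emb TEXT)")

    def search(query: str, k: int = 5):
        q = embed(query)
        rows = db.execute("SELECT path, body, emb FROM notes").fetchall()
        scored = [(cosine(q, json.loads(emb)), path, body) for path, body, emb in rows]
        return sorted(scored, reverse=True)[:k]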
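
And the router in point 3, reduced to its skeleton. The classification prompt and the route_to_hosted stub are placeholders for whatever hosted model you pay for; the point is the shape, not the prompt.

    # Local model as router and privacy boundary, not as replacement.
    import json, urllib.request

    def ask_local(prompt: str) -> str:
        body = json.dumps({"model": "llama3.1:8b", "prompt": prompt, "stream": False}).encode()
        req = urllib.request.Request(
            "http://localhost:11434/api/generate",
            data=body, headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["response"]

    def route_to_hosted(request: str) -> str:
        raise NotImplementedError("call your hosted model of choice here")

    def handle(request: str) -> str:
        verdict = ask_local(
            "Answer SIMPLE or COMPLEX only. Is this a small text-editing task "
            f"or a larger design/coding task?\n\n{request}"
        )
        if "SIMPLE" in verdict.upper():
            return ask_local(request)    # stays on the machine
        return route_to_hosted(request)  # leaves the machine, deliberately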

Three workflows that quietly do not work

  1. “Local agent that does my day job”. A 7B model is not Claude. The agent loop falls apart because the model’s tool-use is mediocre and its context window is small. Do not chase this. Use a hosted model with a long context window.
  2. Long-context summarisation of really long documents. A local 8B with 32k context will technically run, but quality degrades fast past 8k tokens. For 200-page PDFs, hosted.
  3. Image/video generation (anything past basic SDXL). Possible, but the result-per-watt is so much worse than a hosted endpoint that it stops being a hobby and starts being a heater.

The honest take: 16 GB is enough to make local AI a calm, private, fast layer of your day — and a poor substitute for a frontier model. Use it for the things it does well (drafts, routing, retrieval), pay for the rest, and stop reading benchmarks of 70B models you cannot actually run.
