Step 1 — Your model & text sizes
Step 2 — Your computer
Advanced tuning (optional)
Results — Minimal system & Recommended system
Section 1 — Minimal hardware by quantization
Section 2 — Recommended system
Why these numbers?
References & assumptions
- GPU capacities: RTX 4070 SUPER 12 GB; RTX 4080 SUPER 16 GB; RTX 4090 24 GB; RTX 5000 Ada 32 GB; RTX 6000 Ada 48 GB; L40S 48 GB; A100 80 GB; H100 80 GB; H200 141 GB; MI300X 192 GB; RX 7900 XTX 24 GB.
- Quantization impact (typical quality loss relative to FP16): INT8 ~0–1%; Q6/Q5 ~1–3%; Q4 ~3–6%; Q3 ~6–12%.
- KV cache ≈ 2 (K and V) × layers × hidden size × bytes/element × tokens × batch; FP16 = 2 bytes/element, FP8 = 1 byte/element. See the sketch after this list.
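To make the formula concrete, here is a back-of-envelope sketch. The layer count and hidden size are illustrative assumptions for a 7B-class model, not calculator output; the arithmetic matches the formula above (and ignores grouped-query attention, which shrinks the cache further).

```python
# KV-cache estimate per the formula above: 2 (K and V) × layers × hidden
# × bytes/element × tokens × batch. Model dimensions below are assumptions.
def kv_cache_bytes(layers, hidden, tokens, batch=1, bytes_per_elem=2):
    return 2 * layers * hidden * bytes_per_elem * tokens * batch

# Assumed 7B-class model: 32 layers, hidden size 4096, 8K-token context, FP16.
gb = kv_cache_bytes(layers=32, hidden=4096, tokens=8192) / 1024**3
print(f"KV cache ≈ {gb:.2f} GB")  # ≈ 4.00 GB at FP16; halve it for FP8
```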
Contact & links
FAQ — quick answers
Can I run a 7B model on CPU?
Yes, though generation will be slow compared with a GPU. With short prompts and Q5/Q4 quantization, 32–64 GB of RAM is often enough. For snappy chat, a mid-range GPU is recommended.
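As a rough check, here is the arithmetic (a sketch; the 4.5 bits-per-weight figure for Q4, including scales and metadata, is an assumption):

```python
# Back-of-envelope RAM estimate for a 7B model on CPU at Q4.
params = 7e9
bits_per_weight = 4.5                    # assumed Q4 average incl. metadata
weights_gb = params * bits_per_weight / 8 / 1024**3
total_gb = weights_gb * 1.2 + 3          # +20% runtime overhead, +3 GB for the OS
print(f"weights ≈ {weights_gb:.1f} GB, total ≈ {total_gb:.1f} GB")
# ≈ 3.7 GB weights, ≈ 7.4 GB total before KV cache — comfortably inside 32 GB.
```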
What is quantization?
It stores weights in fewer bits (e.g., Q5, Q4) to reduce memory, with a small accuracy trade-off. Our “Auto” pick chooses the highest-quality level that still fits within your hardware’s headroom.
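Conceptually, the “Auto” pick is a ladder walk: try the highest-quality level first and step down until one fits. This sketch is illustrative, not the calculator’s actual code; the bits-per-weight figures are typical assumed values.

```python
# Illustrative "Auto" pick: highest-quality quantization that fits the budget.
QUANT_LADDER = [("FP16", 16.0), ("INT8", 8.5), ("Q6", 6.6),
                ("Q5", 5.5), ("Q4", 4.5), ("Q3", 3.4)]  # bits/weight, assumed

def auto_pick(params, budget_gb):
    for name, bits in QUANT_LADDER:
        weights_gb = params * bits / 8 / 1024**3
        if weights_gb * 1.2 <= budget_gb:     # +20% runtime overhead, as above
            return name
    return None  # nothing fits; a smaller model is needed

print(auto_pick(params=7e9, budget_gb=12))    # "INT8" on a 12 GB budget
```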
Do I need a GPU?
Not always. For larger models, long prompts, or faster replies, a GPU helps a lot. The calculator tells you when CPU-only isn’t practical and suggests the smallest suitable GPU.
Why doesn’t it fit my GPU?
We include model weights, KV cache, +20% runtime overhead, and +3 GB for the OS, then cap usage at 85% of VRAM for stability. Try a smaller batch, fewer tokens, or a lower-bit quantization.
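A minimal sketch of that fit check, reusing the constants above (the weights and KV-cache sizes are inputs here; treating the +3 GB OS reservation as system RAM rather than VRAM is our reading, so it is left out of the VRAM budget):

```python
# VRAM fit check mirroring the budget described above.
def fits_gpu(weights_gb, kv_gb, vram_gb):
    need = (weights_gb + kv_gb) * 1.2   # +20% runtime overhead
    budget = vram_gb * 0.85             # keep ≤85% of VRAM for stability
    return need <= budget

# Example: 4.2 GB of Q4 weights + 1 GB KV cache on a 12 GB card.
print(fits_gpu(weights_gb=4.2, kv_gb=1.0, vram_gb=12))  # True (6.2 GB ≤ 10.2 GB)
```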