Step 1 — Your model & text sizes
Step 2 — Your computer
Advanced tuning (optional)
Results — Minimal system & Recommended system
Section 1 — Minimal hardware by quantization
Section 2 — Recommended system
Why these numbers?
References & assumptions
- GPU capacities: RTX 4070 SUPER 12 GB; RTX 4080 SUPER 16 GB; RTX 4090 24 GB; RTX 5000 Ada 32 GB; RTX 6000 Ada 48 GB; L40S 48 GB; A100 80 GB; H100 80 GB; H200 141 GB; MI300X 192 GB; RX 7900 XTX 24 GB.
- Quantization impact (typical quality loss relative to FP16): INT8 ~0–1%; Q6/Q5 ~1–3%; Q4 ~3–6%; Q3 ~6–12%.
- KV cache ≈ 2 (K and V) × layers × hidden size × bytes/element × tokens × batch; FP16 = 2 bytes/element, FP8 = 1 byte/element. See the sketch after this list.
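To make the formula concrete, here is a back-of-envelope sketch. The layer count and hidden size are illustrative assumptions for a 7B-class model, not calculator output; the arithmetic matches the formula above (and ignores grouped-query attention, which shrinks the cache further).

```python
# KV-cache estimate per the formula above: 2 (K and V) × layers × hidden
# × bytes/element × tokens × batch. Model dimensions below are assumptions.
def kv_cache_bytes(layers, hidden, tokens, batch=1, bytes_per_elem=2):
    return 2 * layers * hidden * bytes_per_elem * tokens * batch

# Assumed 7B-class model: 32 layers, hidden size 4096, 8K-token context, FP16.
gb = kv_cache_bytes(layers=32, hidden=4096, tokens=8192) / 1024**3
print(f"KV cache ≈ {gb:.2f} GB")  # ≈ 4.00 GB at FP16; halve it for FP8
```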
Contact & links
FAQ — quick answers
Can I run a 7B model on CPU?
Yes, though generation will be slow compared with a GPU. With short prompts and Q5/Q4 quantization, 32–64 GB of RAM is often enough. For snappy chat, a mid-range GPU is recommended.
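As a rough check, here is the arithmetic (a sketch; the 4.5 bits-per-weight figure for Q4, including scales and metadata, is an assumption):

```python
# Back-of-envelope RAM estimate for a 7B model on CPU at Q4.
params = 7e9
bits_per_weight = 4.5                    # assumed Q4 average incl. metadata
weights_gb = params * bits_per_weight / 8 / 1024**3
total_gb = weights_gb * 1.2 + 3          # +20% runtime overhead, +3 GB for the OS
print(f"weights ≈ {weights_gb:.1f} GB, total ≈ {total_gb:.1f} GB")
# ≈ 3.7 GB weights, ≈ 7.4 GB total before KV cache — comfortably inside 32 GB.
```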
What is quantization?
It stores weights in fewer bits (e.g., Q5, Q4) to reduce memory, with a small accuracy trade-off. Our “Auto” pick chooses the highest-quality level that still fits within your hardware’s headroom.
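Conceptually, the “Auto” pick is a ladder walk: try the highest-quality level first and step down until one fits. This sketch is illustrative, not the calculator’s actual code; the bits-per-weight figures are typical assumed values.

```python
# Illustrative "Auto" pick: highest-quality quantization that fits the budget.
QUANT_LADDER = [("FP16", 16.0), ("INT8", 8.5), ("Q6", 6.6),
                ("Q5", 5.5), ("Q4", 4.5), ("Q3", 3.4)]  # bits/weight, assumed

def auto_pick(params, budget_gb):
    for name, bits in QUANT_LADDER:
        weights_gb = params * bits / 8 / 1024**3
        if weights_gb * 1.2 <= budget_gb:     # +20% runtime overhead, as above
            return name
    return None  # nothing fits; a smaller model is needed

print(auto_pick(params=7e9, budget_gb=12))    # "INT8" on a 12 GB budget
```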
Do I need a GPU?
Not always. For larger models, long prompts, or faster replies, a GPU helps a lot. The calculator tells you when CPU-only isn’t practical and suggests the smallest suitable GPU.
Why doesn’t it fit my GPU?
We include model weights, KV cache, +20% runtime overhead, and +3 GB for the OS, then cap usage at 85% of VRAM for stability. Try a smaller batch, fewer tokens, or a lower-bit quantization.
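A minimal sketch of that fit check, reusing the constants above (the weights and KV-cache sizes are inputs here; treating the +3 GB OS reservation as system RAM rather than VRAM is our reading, so it is left out of the VRAM budget):

```python
# VRAM fit check mirroring the budget described above.
def fits_gpu(weights_gb, kv_gb, vram_gb):
    need = (weights_gb + kv_gb) * 1.2   # +20% runtime overhead
    budget = vram_gb * 0.85             # keep ≤85% of VRAM for stability
    return need <= budget

# Example: 4.2 GB of Q4 weights + 1 GB KV cache on a 12 GB card.
print(fits_gpu(weights_gb=4.2, kv_gb=1.0, vram_gb=12))  # True (6.2 GB ≤ 10.2 GB)
```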