LLM GPU RAM Calculator

Estimate the GPU RAM required to load large language model weights from parameter count and precision.

How to Use

Enter model size

Provide parameter count in billions (for example 7, 70, or 405). This is the size advertised for the base model weights.

Choose precision

Precision sets how many bytes each parameter uses: FP32 takes 4 bytes, FP16/BF16 take 2, INT8 takes 1, and INT4 takes about 0.5. Quantized checkpoints use fewer bytes per weight, so they need proportionally less memory.

Read the estimate

The calculator shows approximate GPU memory to hold weights. Serving traffic adds KV cache, activations, and framework overhead.

Plan deployment

Add headroom for batching, long context, and optimizer states if you fine-tune. Multi-GPU tensor parallelism splits memory across devices.
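The planning step above can be sketched with two rough rules of thumb. Both are assumptions for illustration, not exact accounting: the 16-bytes-per-parameter figure is a common estimate for full fine-tuning with Adam in mixed precision, and the tensor-parallel split ignores replicated buffers.

```python
def finetune_vram_gb(params_billions):
    # Rough rule of thumb for full fine-tuning with Adam in mixed precision:
    # fp16 weights (2 B) + fp16 gradients (2 B) + fp32 master weights,
    # momentum, and variance (4 B each) ≈ 16 bytes per parameter,
    # before activations and framework overhead.
    return params_billions * 1e9 * 16 / 1e9

def per_gpu_gb(total_gb, num_gpus):
    # Tensor parallelism shards weights roughly evenly across devices;
    # real frameworks also keep some state replicated on every GPU.
    return total_gb / num_gpus
```

For example, a 7B model fine-tuned this way needs on the order of 112 GB of optimizer-related memory, which is why multi-GPU setups or parameter-efficient methods (LoRA, QLoRA) are the norm.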

How we estimate VRAM

This tool multiplies the parameter count (in billions) by the bytes per parameter for the selected precision, converts the result to gigabytes, and applies a small fixed overhead factor to account for alignment and framework buffers around the weight tensors.
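In code, the weights-only estimate reads roughly as follows. The bytes-per-parameter table is standard; the 1.1 overhead factor is an illustrative assumption, not necessarily the exact value this tool uses.

```python
# Bytes each parameter occupies at a given precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0,
                   "int8": 1.0, "int4": 0.5}

# Assumed fixed overhead for alignment and framework buffers (illustrative).
OVERHEAD = 1.1

def weights_vram_gb(params_billions, precision):
    # Parameter count × bytes per parameter, scaled by overhead,
    # reported in decimal gigabytes (1 GB = 1e9 bytes).
    bytes_total = params_billions * 1e9 * BYTES_PER_PARAM[precision.lower()]
    return bytes_total * OVERHEAD / 1e9
```

For instance, a 7B model in FP16 comes out to roughly 15.4 GB of weights with this overhead factor, and the same model in INT4 to under 4 GB.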

Real deployments also need memory for attention caches, activations, and CUDA kernels. Treat the number as a floor for inference and add margin—especially for long context windows or large batches.
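The KV cache is usually the largest of these extra costs, and it can be sketched directly. This assumes a decoder-only transformer storing fp16 keys and values for every layer, head, and token; the example configuration below (32 layers, 32 KV heads, head dimension 128) is typical of a 7B-class model but is an assumption, not a specific model's spec.

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Two tensors (K and V), each one vector per layer, per KV head,
    # per token, per sequence in the batch; fp16 elements by default.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9
```

At a 4096-token context and batch size 1, this configuration adds roughly 2 GB on top of the weights, and it grows linearly with both context length and batch size, which is why long-context, high-batch serving needs so much extra margin.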
