ML Inference Estimator
Estimate inference latency and FPS for ML models on edge devices.
Example output: Latency: 150 ms | Max FPS: 6.7
Practical Note: Real-world FPS may be lower due to preprocessing overhead and model loading.
[Chart: Latency Comparison (All Devices)]
Model Details
MobileNet V2: Lightweight image classification. Good for real-time inference on edge devices.
What is an ML inference calculator?
An ML inference calculator estimates the computational resources needed to run machine learning model inference: the process of using a trained model to make predictions on new data. It helps you determine the required GPU memory, compute time, and throughput for deploying models in production, from small classification models to large language models (LLMs).
Model inference requirements depend on model size (number of parameters), precision (FP32, FP16, INT8), batch size, sequence length (for transformers), and target throughput. A 7-billion parameter model in FP16 needs about 14GB of GPU memory just for the weights. This tool calculates memory requirements and estimated throughput for common hardware configurations.
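The weights-only memory rule described above (parameter count times bytes per parameter) can be sketched in a few lines. The function and dictionary names here are illustrative, not part of the tool:

```python
# Bytes per parameter for the precisions discussed in this article.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}

def weights_memory_gb(num_params: float, precision: str) -> float:
    """Estimate GPU memory (GB) for model weights alone: params x bytes/param."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

# A 7-billion-parameter model in FP16: 7e9 params x 2 bytes = 14 GB
print(weights_memory_gb(7e9, "fp16"))  # -> 14.0
```

Note this covers weights only; activations, KV cache, and framework overhead come on top.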
How to use this tool
Enter the model size (number of parameters), precision (data type), batch size, and target hardware. The tool calculates GPU memory required, estimated inference latency, and throughput (tokens per second for language models, or images per second for vision models). It also suggests which GPUs can handle the workload.
Key concepts
- Parameters: the learned weights in a model. More parameters generally mean better quality but higher resource requirements.
- Precision: FP32 uses 4 bytes per parameter, FP16/BF16 uses 2 bytes, INT8 uses 1 byte. Lower precision reduces memory and increases speed with minimal quality loss.
- Batch size: processing multiple inputs simultaneously improves throughput but requires more memory.
- KV cache: for transformer models, the key-value cache grows with sequence length and consumes significant additional memory.
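The KV cache growth mentioned in the last bullet can be estimated with a standard rule of thumb: two tensors (keys and values) per layer, one entry per token per hidden unit. This is a sketch under that assumption; the example dimensions (32 layers, hidden size 4096) are typical of a 7B-class transformer, not taken from any specific model:

```python
def kv_cache_gb(n_layers: int, hidden_size: int, seq_len: int,
                batch: int, bytes_per_val: int = 2) -> float:
    """Rough KV cache size in GB: 2 (K and V) x layers x tokens x hidden x bytes."""
    return 2 * n_layers * seq_len * hidden_size * batch * bytes_per_val / 1e9

# 7B-class model at FP16 with a 4096-token context, batch size 1:
print(f"{kv_cache_gb(32, 4096, 4096, 1):.2f} GB")  # ~2.15 GB
```

Doubling either the sequence length or the batch size doubles this figure, which is why long-context serving is memory-bound even when the weights fit comfortably.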
Common inference hardware
Consumer GPUs: RTX 3090 (24GB) and RTX 4090 (24GB) are good for models up to 13B parameters in INT8. Professional: A100 (40/80GB) and H100 (80GB) are required for larger models. Cloud options: AWS, GCP, and Azure offer GPU instances by the hour. CPU inference is possible for smaller models but 10-100x slower than a GPU. Apple M-series chips offer unified memory that can run surprisingly large models.
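Matching a memory estimate against the GPUs listed above is a simple filter. A hypothetical sketch (the capacities are the nominal VRAM figures from this article; real usable VRAM is a bit lower once the driver and framework take their share):

```python
# Nominal VRAM (GB) for the GPUs discussed in this article.
GPU_VRAM_GB = {
    "RTX 3090": 24,
    "RTX 4090": 24,
    "A100 40GB": 40,
    "A100 80GB": 80,
    "H100": 80,
}

def gpus_that_fit(required_gb: float) -> list[str]:
    """Return GPUs whose nominal VRAM meets the estimated requirement."""
    return [name for name, vram in GPU_VRAM_GB.items() if vram >= required_gb]

# A 7B model in FP16 needs roughly 16-18 GB total:
print(gpus_that_fit(18))
```

In practice you would also leave headroom for batching and the KV cache rather than filling the card to the brim.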
Frequently asked questions
How much GPU memory do I need for a 7B parameter model?
At FP16 precision: 7B * 2 bytes = 14GB for weights alone, plus 2-4GB for KV cache and overhead, totaling about 16-18GB. At INT8: 7B * 1 byte = 7GB for weights, about 10-12GB total. At INT4 (GPTQ/AWQ quantization): about 4-5GB for weights, 6-8GB total. A 24GB consumer GPU (RTX 3090/4090) can comfortably run 7B models in most precisions.
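The arithmetic in the answer above can be tabulated per precision. This sketch uses the midpoints of the overhead ranges quoted (2-4GB for FP16, 3-5GB for INT8); note that raw INT4 weights come to 7B x 0.5 bytes = 3.5GB, with the quoted 4-5GB figure including the scale factors that GPTQ/AWQ store alongside the weights:

```python
def total_memory_gb(params_billions: float, bytes_per_param: float,
                    overhead_gb: float) -> tuple[float, float]:
    """Return (weights GB, weights + overhead GB) for a given precision."""
    weights = params_billions * bytes_per_param
    return weights, weights + overhead_gb

# (precision, bytes per parameter, assumed overhead midpoint in GB)
for precision, bpp, overhead in [("FP16", 2, 3), ("INT8", 1, 4), ("INT4", 0.5, 3)]:
    w, total = total_memory_gb(7, bpp, overhead)
    print(f"{precision}: {w:g} GB weights, ~{total:g} GB total")
```

The FP16 row lands at 14GB weights and about 17GB total, which is why a 24GB card handles 7B models with room to spare.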
What is quantization and how much does it help?
Quantization reduces the precision of model weights from FP32 (4 bytes) to FP16 (2 bytes), INT8 (1 byte), or INT4 (0.5 bytes). This reduces memory requirements proportionally and often speeds up inference. INT8 quantization typically preserves 99%+ of model quality while halving memory compared to FP16. INT4 saves even more memory with slightly more quality degradation. It is the most practical way to run large models on consumer hardware.