Best GPUs for LLM Training and Inference
The most important factor when selecting a GPU for large language models is VRAM capacity. In FP16, each parameter takes 2 bytes, so a 7B parameter model requires approximately 14GB of memory for inference, a 13B model approximately 26GB, and a 70B model approximately 140GB, before counting the KV cache. Training the same models requires several times more memory, because gradients, optimizer states, and activations must be held alongside the weights.
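The inference figures above follow from multiplying the parameter count by the bytes stored per parameter. A minimal sketch of that arithmetic, assuming 1 GB = 10^9 bytes and counting weights only (no KV cache or framework overhead); the function name `weight_memory_gb` is illustrative, not part of any library:

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str = "fp16") -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    # (params_billions * 1e9 params) * bytes / 1e9 bytes-per-GB simplifies to this:
    return params_billions * BYTES_PER_PARAM[precision]


for size in (7, 13, 70):
    print(f"{size}B in FP16: ~{weight_memory_gb(size):.0f} GB")
```

Passing `"int8"` or `"int4"` as the precision shows how quantization shrinks the same model, for example `weight_memory_gb(70, "int4")` gives 35 GB.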
Price per GB of VRAM is the most useful comparison metric for LLM workloads. Multi-GPU configurations add complexity: training a model too large for a single GPU depends on fast interconnects, typically NVLink within a node and InfiniBand between nodes. Not all cloud providers offer NVLink-connected GPU pairs. GpuPerHour tracks NVLink availability across providers so users can filter for multi-GPU-ready instances.
Recommended GPUs
H100 SXM5
The H100 SXM is the standard for serious LLM work. Its 80GB of HBM3 memory holds models up to roughly 30B parameters in FP16 on a single GPU; a 70B model needs about 140GB in FP16, so it must be quantized or split across two or more GPUs. NVLink connectivity makes that split efficient and enables multi-GPU training for larger models. The SXM form factor offers higher memory bandwidth than the PCIe variant, which matters for the memory-bound operations that dominate LLM workloads.
H200 SXM
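The H200 raises memory capacity from the H100's 80GB to 141GB of HBM3e, an increase of about 76%. The FP16 weights of a 70B parameter model (about 140GB) barely fit on a single GPU, leaving almost no headroom for the KV cache, so single-GPU serving at that size is most practical with 8-bit weights. Where the model does fit, a single H200 eliminates the complexity and communication overhead of multi-GPU setups. For organizations that want to simplify their inference infrastructure, the H200 offers a meaningful advantage over the H100 despite the higher per-hour cost.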
B200 SXM
The B200 is NVIDIA's Blackwell architecture GPU with 192GB of HBM3e. It is the most capable single GPU in this comparison for LLM workloads, able to hold a 70B parameter model in FP16 with room for the KV cache, or a 100B+ parameter model at 8-bit precision. For training, the B200's second-generation transformer engine delivers up to 2x the performance of the H100 on FP8 workloads. Availability is limited and pricing reflects the premium positioning.
A100 SXM4 80GB
The A100 80GB remains one of the most cost-effective options for LLM inference at moderate scale. It handles 7B to 30B parameter models comfortably in FP16, and its mature software ecosystem means fewer compatibility issues with popular frameworks like vLLM, TGI, and Ollama. The A100 typically costs 40-60% less per hour than an H100, making it the value pick for inference workloads where the latest hardware is not required.
RTX 4090
The RTX 4090 is the budget option for LLM experimentation. Its 24GB of GDDR6X is enough to run 7B parameter models for inference in FP16, or to fine-tune them using QLoRA (4-bit quantization). It is not suitable for training larger models or for production inference at scale, but for individual researchers and small teams exploring LLMs, the RTX 4090 offers the lowest entry point.
L40S
The L40S sits between the RTX 4090 and the A100 in both price and capability. Its 48GB of GDDR6 handles 13B parameter models for inference in FP16 and models up to roughly 30B with 8-bit quantization, and its Ada Lovelace architecture includes hardware support for FP8. It is a reasonable choice for teams that need more VRAM than the 4090 provides but do not need the full capability of an A100 or H100.
MI300X
AMD's MI300X offers 192GB of HBM3, matching the B200's memory capacity at a lower price point. The tradeoff is software ecosystem maturity: ROCm supports PyTorch and vLLM, but some libraries and optimizations are NVIDIA-only. For organizations comfortable with the AMD stack, the MI300X is a compelling option for inference workloads where memory capacity is the primary constraint.
Live Pricing Comparison
Prices update every 60 seconds. Data from 28 cloud GPU providers tracked by GpuPerHour.
| GPU | VRAM | From | Cheapest On | In Stock | Best For |
|---|---|---|---|---|---|
| H100 SXM5 | 80GB | $1.47/hr | Vast.ai | 0 | 70B+ model training, multi-GPU NVLink clusters |
| H200 SXM | 141GB | $1.97/hr | Vast.ai | 0 | 70B+ model training and inference without multi-GPU |
| B200 SXM | 192GB | $2.67/hr | Vast.ai | 3 | Large-scale training, 100B+ models |
| A100 SXM4 80GB | 80GB | $0.73/hr | Vast.ai | 4 | Budget LLM inference, 7B-30B models |
| RTX 4090 | 24GB | $0.20/hr | Vast.ai | 0 | Small model inference (7B), fine-tuning with QLoRA |
| L40S | 48GB | $0.47/hr | TensorDock | 5 | 13B-30B model inference, moderate training |
| MI300X | 192GB | $0.95/hr | Crusoe | 9 | Large model inference, ROCm-compatible workloads |
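The price-per-GB metric recommended earlier can be computed directly from the table above. The sketch below hard-codes the "From" prices as a snapshot (an assumption, since live prices change) and ranks the GPUs by dollars per GB of VRAM per hour; the names `offers` and `price_per_gb` are illustrative:

```python
# Snapshot of the table above: (GPU, VRAM in GB, lowest listed USD per hour).
offers = [
    ("H100 SXM5", 80, 1.47),
    ("H200 SXM", 141, 1.97),
    ("B200 SXM", 192, 2.67),
    ("A100 SXM4 80GB", 80, 0.73),
    ("RTX 4090", 24, 0.20),
    ("L40S", 48, 0.47),
    ("MI300X", 192, 0.95),
]


def price_per_gb(offer):
    """Hourly price divided by VRAM capacity (USD per GB-hour)."""
    _, vram_gb, usd_per_hour = offer
    return usd_per_hour / vram_gb


# Cheapest VRAM first.
ranked = sorted(offers, key=price_per_gb)
for name, vram_gb, usd_per_hour in ranked:
    print(f"{name:<16} ${usd_per_hour / vram_gb:.4f} per GB-hour")
```

This metric ignores compute throughput, memory bandwidth, and stock levels, so it is best used as a first filter once the VRAM requirement is known.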
Training vs Inference: Which GPU Do You Need?
Training and inference have different hardware requirements. Training runs for hours or days, processes large batches of data, and needs fast inter-GPU communication for distributed workloads. The H100 SXM, H200, and B200 are the standard choices for training because they offer NVLink connectivity, high memory bandwidth, and hardware support for mixed-precision (FP8/BF16) operations that accelerate gradient computation.
Inference runs continuously, processes one request at a time (or small batches), and prioritizes latency and throughput over raw compute. For inference, VRAM capacity is usually the binding constraint: the model must fit in memory. An A100 80GB or RTX 4090 running a quantized model (INT8 or INT4) can serve inference at a fraction of the cost of a training-grade GPU.
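Fine-tuning sits between training and inference in terms of requirements. Full fine-tuning of a 7B model needs 40-60GB of VRAM when memory-saving techniques such as 8-bit optimizers and gradient checkpointing are used. Parameter-efficient methods like QLoRA reduce this to under 24GB, making the RTX 4090 a viable option. For parameter-efficient fine-tuning of larger models (30B+), an A100 80GB or H100 is typically required; full fine-tuning at that scale needs multiple GPUs.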
Frequently Asked Questions
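How much VRAM do I need for LLM training?
VRAM requirements depend on the model size and training method. Full fine-tuning of a 7B parameter model needs roughly 40-60GB when memory-saving techniques such as 8-bit optimizers and gradient checkpointing are used. With standard mixed-precision Adam, the weights, gradients, and optimizer states alone take about 16 bytes per parameter, or roughly 112GB for a 7B model. A 70B model requires several hundred GB to over 1TB, which means multiple GPUs. Parameter-efficient methods like QLoRA reduce requirements significantly: a 7B model can be fine-tuned with under 24GB.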
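What is the cheapest GPU for running a 70B model?
Running a 70B parameter model in FP16 requires approximately 140GB of VRAM for the weights alone. Among single GPUs, the H200 (141GB) fits the weights with almost no room for the KV cache, while the MI300X and B200 (192GB each) leave headroom; at the prices listed above, the MI300X is the cheapest of the three. Alternatively, two A100 80GB GPUs (160GB combined) can run the model, though multi-GPU inference adds communication latency. With INT4 quantization, a 70B model fits in approximately 35GB, so a single L40S (48GB) or A100 80GB becomes the budget option.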
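The GPU counts in this answer come from dividing the model's weight size by the VRAM of one card and rounding up. A small sketch of that calculation; the `headroom` parameter, the fraction of each GPU reserved for the KV cache and runtime overhead, is an illustrative assumption rather than a measured value:

```python
import math


def gpus_needed(model_gb: float, vram_gb: float, headroom: float = 0.0) -> int:
    """Minimum number of identical GPUs whose usable VRAM covers the weights.

    headroom is the fraction of each GPU set aside for KV cache and
    runtime overhead (an assumed value; tune it for the real workload).
    """
    usable_gb = vram_gb * (1.0 - headroom)
    return math.ceil(model_gb / usable_gb)


print(gpus_needed(140, 141))       # 70B FP16 on H200, weights only
print(gpus_needed(140, 80))        # 70B FP16 on A100 80GB
print(gpus_needed(140, 141, 0.1))  # 70B FP16 on H200 with 10% reserved
print(gpus_needed(35, 80))         # 70B INT4 on A100 80GB
```

With 10% of each card reserved, the H200 no longer holds the FP16 weights on its own, which is why the single-GPU FP16 case is so tight.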
Can I use consumer GPUs like the RTX 4090 for LLMs?
Yes, for small models. The RTX 4090 has 24GB of VRAM, which is enough to run 7B parameter models for inference in FP16 or to fine-tune them using QLoRA. It is not suitable for 13B and larger models in FP16 (a 13B model alone needs about 26GB) unless they are quantized, nor for production inference serving at scale.
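What is the difference between H100 SXM and H100 PCIe for LLMs?
Both variants have 80GB of memory, but the SXM variant uses HBM3 at 3.35 TB/s while the PCIe variant uses HBM2e at 2.0 TB/s, giving the SXM roughly 67% higher memory bandwidth. The SXM variant also connects all GPUs in a node over NVLink, whereas the PCIe variant is limited to optional NVLink bridges between pairs of cards. For single-GPU inference, the practical difference is smaller, though memory-bound token generation still benefits from the extra bandwidth. For multi-GPU training, the full NVLink fabric of the SXM variant is essential.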
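To see why memory bandwidth matters for inference, a common rule of thumb treats single-request token generation as bandwidth-bound: every generated token requires reading all of the weights once. The sketch below turns the two bandwidth figures quoted above into a rough upper bound on tokens per second. It assumes batch size 1 and ignores KV cache reads, compute time, and kernel overhead, so real throughput will be lower; the function name is illustrative:

```python
def max_decode_tokens_per_s(bandwidth_tb_s: float, weights_gb: float) -> float:
    """Rough upper bound on batch-1 decoding speed.

    Assumes each generated token reads every weight exactly once, so the
    bound is memory bandwidth (converted to GB/s) divided by weight size (GB).
    """
    return (bandwidth_tb_s * 1000.0) / weights_gb


# A 7B model in FP16 is about 14 GB of weights.
for name, bandwidth in (("H100 SXM", 3.35), ("H100 PCIe", 2.0)):
    bound = max_decode_tokens_per_s(bandwidth, 14.0)
    print(f"{name}: at most ~{bound:.0f} tokens/s for a 7B FP16 model")
```

Batched serving amortizes each weight read across many requests, so this bound applies to single-stream latency rather than to aggregate throughput.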
How much does it cost to fine-tune a 7B model?
Fine-tuning a 7B parameter model with QLoRA on a single RTX 4090 typically takes 2-8 hours depending on dataset size and number of epochs. At the $0.20/hr RTX 4090 rate listed above, this costs roughly $0.40-$1.60 in total. Full fine-tuning on an A100 80GB takes roughly the same wall-clock time but costs more per hour.
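The total cost of a run is simply hours multiplied by the hourly rate. A short sketch using the 2-8 hour range above and the lowest RTX 4090 and A100 80GB prices from the pricing table; these rates are a snapshot used as an assumption, and live prices will differ:

```python
def job_cost_usd(hours: float, usd_per_hour: float) -> float:
    """Total rental cost for a single-GPU job."""
    return hours * usd_per_hour


# Lowest listed rates from the pricing table (snapshot, assumed for illustration).
RTX_4090_RATE = 0.20
A100_80GB_RATE = 0.73

for hours in (2, 8):
    qlora = job_cost_usd(hours, RTX_4090_RATE)
    full = job_cost_usd(hours, A100_80GB_RATE)
    print(f"{hours}h: QLoRA on RTX 4090 ${qlora:.2f}, full fine-tune on A100 ${full:.2f}")
```

Storage, data transfer, and time spent on failed or repeated runs are not included, so a real budget should leave some margin.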