meta-llama/Llama-3.3-70B-Instruct
Llama 3.3 70B dense model with NVIDIA FP8/FP4 quantized variants for Hopper and Blackwell GPUs
Overview
Llama 3.3 70B Instruct is Meta's 70-billion parameter dense language model. NVIDIA provides FP8 and FP4 quantized variants optimized for Hopper (H100/H200) and Blackwell (B200/GB200) GPUs. FP4 is Blackwell-only and provides the best VRAM efficiency.
Prerequisites
- Hardware: 1x H100/H200 (FP8), 1x B200 (FP4), or 2x H200/B200-class GPUs for BF16 (~170 GB total)
- vLLM >= 0.12.0
- CUDA Driver >= 575
- Docker with NVIDIA Container Toolkit (recommended)
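With the prerequisites in place, one way to start an OpenAI-compatible server is vLLM's official Docker image. This is a minimal sketch; the image tag, port, and `--max-model-len` value are assumptions to adjust for your setup.

```shell
# Serve the FP8 variant with vLLM's OpenAI-compatible server.
# Requires the NVIDIA Container Toolkit so --gpus all works.
docker run --gpus all --rm -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model nvidia/Llama-3.3-70B-Instruct-FP8 \
  --max-model-len 8192
```

Once the server reports it is ready, the client snippet below can talk to it at http://localhost:8000/v1.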
Client Usage

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server; the API key is not checked.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="nvidia/Llama-3.3-70B-Instruct-FP8",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response.choices[0].message.content)
```
Troubleshooting
FP4 variant not loading: FP4 is only supported on Blackwell (compute capability 10.0). Use FP8 on Hopper.
OOM with BF16 on a single GPU: BF16 weights alone are ~140 GB. Use the FP8 variant (~70 GB of weights) or FP4 variant (~35 GB of weights) to fit on a single GPU.
Configuration Matrix
| Variant | Precision | Min VRAM | Notes |
|---|---|---|---|
| Default | BF16 | 170 GB | Full-precision BF16 weights |
| FP8 | FP8 | 84 GB | NVIDIA FP8 quantization for Hopper and Blackwell |
| NVFP4 | NVFP4 | 42 GB | NVIDIA NVFP4 quantized weights for Blackwell GPUs |
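As a sanity check on the matrix, weight memory scales with bytes per parameter. A rough sketch (the Min VRAM column in the table adds headroom beyond these weights-only figures for activations and KV cache):

```python
# Back-of-envelope weight memory for a 70B-parameter dense model.
PARAMS = 70e9
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}

for precision, nbytes in BYTES_PER_PARAM.items():
    gb = PARAMS * nbytes / 1e9
    print(f"{precision}: ~{gb:.0f} GB of weights")
# BF16: ~140 GB, FP8: ~70 GB, NVFP4: ~35 GB
```

These weights-only numbers line up with the table: each Min VRAM entry is roughly the weight footprint plus ~20% headroom.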