meta-llama/Llama-3.3-70B-Instruct
Llama 3.3 70B dense model with NVIDIA FP8/FP4 quantized variants for Hopper and Blackwell GPUs
Overview
Llama 3.3 70B Instruct is Meta's 70-billion parameter dense language model. NVIDIA provides FP8 and FP4 quantized variants optimized for Hopper (H100/H200) and Blackwell (B200/GB200) GPUs. FP4 is Blackwell-only and provides the best VRAM efficiency.
Prerequisites
- Hardware: 1x H100/H200 (FP8), 1x B200 (FP4), or 2x H200/B200-class GPUs for BF16 (~170 GB total)
- vLLM >= 0.12.0
- CUDA Driver >= 575
- Docker with NVIDIA Container Toolkit (recommended)
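With the prerequisites in place, one way to start an OpenAI-compatible server is vLLM's official Docker image. This is a minimal sketch; the image tag, port, and `--max-model-len` value are assumptions to adjust for your setup.

```shell
# Serve the FP8 variant with vLLM's OpenAI-compatible server.
# Requires the NVIDIA Container Toolkit so --gpus all works.
docker run --gpus all --rm -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model nvidia/Llama-3.3-70B-Instruct-FP8 \
  --max-model-len 8192
```

Once the server reports it is ready, the client snippet below can talk to it at http://localhost:8000/v1.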
Client Usage

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server; the API key is not checked.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="nvidia/Llama-3.3-70B-Instruct-FP8",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response.choices[0].message.content)
```
Troubleshooting
FP4 variant not loading: FP4 is only supported on Blackwell (compute capability 10.0). Use FP8 on Hopper.
OOM with BF16 on a single GPU: BF16 weights alone are ~140 GB. Use the FP8 variant (~70 GB of weights) or FP4 variant (~35 GB of weights) to fit on a single GPU.
Configuration Matrix
| Variant | Precision | Min VRAM | Notes |
|---|---|---|---|
| Default | BF16 | 170 GB | Full-precision BF16 weights |
| FP8 | FP8 | 84 GB | NVIDIA FP8 quantization for Hopper and Blackwell |
| NVFP4 | NVFP4 | 42 GB | NVIDIA NVFP4 quantized weights for Blackwell GPUs |
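As a sanity check on the matrix, weight memory scales with bytes per parameter. A rough sketch (the Min VRAM column in the table adds headroom beyond these weights-only figures for activations and KV cache):

```python
# Back-of-envelope weight memory for a 70B-parameter dense model.
PARAMS = 70e9
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}

for precision, nbytes in BYTES_PER_PARAM.items():
    gb = PARAMS * nbytes / 1e9
    print(f"{precision}: ~{gb:.0f} GB of weights")
# BF16: ~140 GB, FP8: ~70 GB, NVFP4: ~35 GB
```

These weights-only numbers line up with the table: each Min VRAM entry is roughly the weight footprint plus ~20% headroom.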