PaliGemma GPU Memory Calculator

An interactive tool to estimate the VRAM requirements for training and inference with PaliGemma models.

Configuration

Inputs: Scenario, Model Size, Precision, Optimizer, Batch Size, Sequence Length (512 shown)

Estimated Memory Usage

This section provides a visual and tabular breakdown of the VRAM required for your selected configuration. The bar chart offers a quick comparison of the memory consumed by each component, while the table details the precise calculations. Interact with the controls on the left to see how each parameter affects the total memory footprint.

Component	Memory (GB)	Calculation
Total Estimated Memory		(Subtotal + 15% Overhead)
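
The total row above can be sketched as a small function: sum the per-component figures and add the 15% overhead. The component values below are hypothetical placeholders chosen for illustration, not output from the calculator, and the function name is an assumption of this sketch.

```python
def total_estimated_memory_gb(components_gb, overhead_fraction=0.15):
    """Sum per-component memory (in GB) and add a fractional overhead."""
    subtotal = sum(components_gb.values())
    return subtotal * (1.0 + overhead_fraction)


# Hypothetical component values, for illustration only.
example_components = {
    "Model Weights": 6.0,
    "Gradients": 6.0,
    "Optimizer States": 24.0,
    "Activations": 4.0,
}

subtotal_gb = sum(example_components.values())
total_gb = total_estimated_memory_gb(example_components)
print(f"Subtotal: {subtotal_gb:.1f} GB")
print(f"Total with 15% overhead: {total_gb:.1f} GB")
```

Keeping the overhead as a parameter makes it easy to see how sensitive the total is to that margin.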

Precision Guide

Numerical precision determines the number of bytes used to store each parameter in the model. Lower precision reduces memory usage and can speed up computation, but may come with a trade-off in accuracy.

Data Type	Bytes per Parameter	Typical Use Case	Key Considerations
FP32	4 bytes	Baseline calculations	Highest memory cost, but most numerically stable.
BF16	2 bytes	Standard for training LLMs	Same exponent range as FP32 with fewer mantissa bits; generally stable for training.
FP16	2 bytes	Training & Inference	Narrow dynamic range can cause underflow/overflow; training usually needs loss scaling.
FP8	1 byte	Accelerated Training	Requires modern hardware (e.g., H100).
INT8	1 byte	Inference (Quantization)	4x memory reduction vs FP32. Small potential accuracy loss.
INT4	0.5 bytes	Inference (Quantization)	8x memory reduction vs FP32. Higher risk of accuracy degradation.
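
As a rough illustration of the table, weight memory is the parameter count multiplied by the bytes per parameter. The sketch below assumes a nominal 3 billion parameters for a "3b" checkpoint and 1 GB = 10^9 bytes; both are simplifying assumptions, and the real parameter count of a given checkpoint will differ somewhat.

```python
# Bytes per parameter, taken from the Precision Guide table.
BYTES_PER_PARAM = {
    "FP32": 4.0,
    "BF16": 2.0,
    "FP16": 2.0,
    "FP8": 1.0,
    "INT8": 1.0,
    "INT4": 0.5,
}


def weight_memory_gb(num_params, dtype):
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NOMINAL_PARAMS = 3e9  # assumption: nominal size of a "3b" model

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(NOMINAL_PARAMS, dtype):.1f} GB")
```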

Example Training Script

This script sketches a fine-tuning setup for PaliGemma using PyTorch, `transformers`, and the `datasets` library. The comments mark the parameters that control performance and memory, and tie each one back to an input of this calculator.

import torch
from datasets import load_dataset
from transformers import (
    PaliGemmaForConditionalGeneration,
    AutoProcessor,
    TrainingArguments,
    Trainer
)
from PIL import Image
import requests

# 1. --- MODEL, PRECISION, AND PERFORMANCE ---

# Parameter: Model Size
# Change this string to load different model sizes. E.g., "google/paligemma-3b-pt-224".
# This is the largest factor in STATIC memory usage (Weights, Gradients, Optimizer States).
model_id = "google/paligemma-3b-pt-224"

# Parameter: Precision
# `torch_dtype` controls the precision of model weights.
# torch.bfloat16 (BF16) is recommended for training on modern GPUs (Ampere+).
model_precision = torch.bfloat16

# Performance Factor: Flash Attention 2
# `attn_implementation="flash_attention_2"` drastically reduces activation memory.
# It is crucial for training with long sequences or large batches.
use_flash_attention = True

processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=model_precision,
    attn_implementation="flash_attention_2" if use_flash_attention else "eager",
    device_map="auto"
)

# 2. --- DATASET AND PREPROCESSING ---

# Load a small sample of a real dataset (e.g., GQA for visual question answering)
# For a real run, you would use the full dataset: `load_dataset("gqa", "all")`
ds = load_dataset('graphcore/gqa-tiny', split='train')

# We need to format the data into a prompt-response structure for the model.
# The prompt for PaliGemma is typically a task prefix like "answer" or "caption".
def preprocess_data(example):
    # Parameter: Sequence Length (controlled by max_length in the processor call)
    # This is a major factor in DYNAMIC memory usage.
    max_length = 128

    # Download and open the image. In a real use case, images would be local.
    try:
        image_url = example['image_url']
        image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
    except Exception:
        # Fall back to a blank image if the download or decode fails
        image = Image.new('RGB', (224, 224))

    prompt = "answer " + example['question']

    # Passing the answer as `suffix` lets the processor build 'labels' itself,
    # aligned with input_ids and with the non-answer positions masked out of the loss.
    # PaliGemmaProcessor requires `images`, so the answer cannot be tokenized
    # by calling the processor on text alone.
    inputs = processor(
        text=prompt,
        images=image,
        suffix=example['answer'],
        return_tensors="pt",
        padding='max_length',
        max_length=max_length,
        truncation=True
    )

    # The processor returns tensors with a leading batch dimension of 1;
    # drop it so the Trainer's collator can stack examples into a batch.
    return {key: value.squeeze(0) for key, value in inputs.items()}

processed_ds = ds.map(preprocess_data, remove_columns=ds.column_names)

# 3. --- TRAINING CONFIGURATION ---

training_args = TrainingArguments(
    output_dir="./paligemma-finetuned-gqa",
    
    # Parameter: Per-Device Batch Size (B)
    # Number of samples on one GPU. Directly impacts dynamic memory.
    per_device_train_batch_size=8,
    
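    # Performance Factor: Gradient Accumulation
    # Effective batch size = per_device_train_batch_size * gradient_accumulation_steps
    # (times the number of GPUs in a multi-GPU run).
    # Reaches a large effective batch size without the activation memory of a large
    # per-device batch, at the cost of more forward/backward passes per optimizer step.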
    gradient_accumulation_steps=4,

    # Parameter: Optimizer
    # 'adamw_torch' is the default (8 bytes/param). For memory savings, consider
    # `optim="paged_adamw_8bit"` which requires `bitsandbytes`.
    optim="adamw_torch",
    
    # Parameter: Precision (for training loop)
    # Use BF16 mixed-precision training. Requires Ampere+ GPU. Set to `fp16=True` for older GPUs.
    bf16=True,
    
    # Performance Factor: Gradient Checkpointing
    # Trades compute for memory by not storing all activations.
    # Dramatically reduces activation memory, allowing larger batches or longer sequences.
    gradient_checkpointing=True,

    # Performance Factor: Dataloader Workers
    # `dataloader_num_workers` uses multiple CPU processes to prepare data batches.
    # Prevents the GPU from waiting for data, improving throughput. Uses CPU RAM.
    dataloader_num_workers=4,
    
    # Other standard arguments
    num_train_epochs=1,
    logging_steps=10,
    save_strategy="epoch",
    learning_rate=2e-5,
    report_to="none", # Can be "wandb", "tensorboard"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_ds,
)

# Start the training process
trainer.train()
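
The effective batch size implied by the training arguments above can be checked with a few lines of arithmetic. This is a minimal sketch; the helper name and the `num_devices` parameter are assumptions added for illustration and are not part of `TrainingArguments`.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of samples contributing to each optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# Values from the script: per_device_train_batch_size=8, gradient_accumulation_steps=4
single_gpu = effective_batch_size(8, 4)
two_gpus = effective_batch_size(8, 4, num_devices=2)
print(f"Single GPU: {single_gpu} samples per optimizer step")
print(f"Two GPUs:   {two_gpus} samples per optimizer step")
```

Only the per-device batch size affects activation memory on each GPU; the accumulation steps and device count change the effective batch size without raising per-GPU memory.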

Core Memory Concepts

Key Memory Formulas

Training Memory

Inference Memory

Legend

Model Weights

Gradients

Optimizer States

Adam / AdamW (8 bytes/param)

AdamW 8-bit (2 bytes/param)

Adafactor / SGD with Momentum (~4 bytes/param)

Stateless SGD (0 bytes/param)
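
The per-parameter figures listed for each optimizer translate into optimizer-state memory as sketched below. The sketch again assumes a nominal 3 billion parameters and 1 GB = 10^9 bytes, and it treats the "~4 bytes/param" entry as exactly 4 for simplicity.

```python
# Optimizer-state bytes per parameter, as listed in the legend.
OPTIMIZER_STATE_BYTES_PER_PARAM = {
    "Adam / AdamW": 8.0,
    "AdamW 8-bit": 2.0,
    "Adafactor / SGD with Momentum": 4.0,  # approximate ("~4") in the legend
    "Stateless SGD": 0.0,
}


def optimizer_state_memory_gb(num_params, optimizer):
    """Memory held by optimizer states, in GB (1 GB = 1e9 bytes)."""
    return num_params * OPTIMIZER_STATE_BYTES_PER_PARAM[optimizer] / 1e9


NUM_PARAMS = 3e9  # assumption: nominal size of a "3b" model

for name in OPTIMIZER_STATE_BYTES_PER_PARAM:
    print(f"{name}: {optimizer_state_memory_gb(NUM_PARAMS, name):.1f} GB")
```

With these assumptions, switching from Adam / AdamW to AdamW 8-bit cuts optimizer-state memory to a quarter, which is often the largest single saving available during full fine-tuning.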

Activations & KV Cache

Sequence Length (Context Window)