Timon Harz

December 12, 2024

PaliGemma 2 – New vision language models by Google

PaliGemma 2 introduces a versatile set of models for developers and researchers, offering enhanced image-text understanding and fine-tuning capabilities. With pre-trained models and fine-tuned versions for specialized tasks, it's set to revolutionize vision-language solutions.

Google has recently unveiled PaliGemma 2, an exciting update to its PaliGemma vision language model. This new iteration builds on the foundation set by its predecessor, combining the robust SigLIP image encoder with the cutting-edge Gemma 2 language model. The result? A powerful tool for anyone in need of high-quality, flexible vision-language solutions.

PaliGemma 2 is available in three model variants: 3B, 10B, and a massive 28B parameter model. This offers developers and researchers a range of options depending on their needs for quality and efficiency. The earlier version of PaliGemma was limited to just the 3B variant, so this expansion brings much-needed versatility to the table. Plus, all of these models support various input resolutions, including 224x224, 448x448, and 896x896, allowing for great flexibility in different scenarios.

What’s even more exciting is the potential for fine-tuning. Google has released the model with full transformers integration and fine-tuning scripts, making it easy for developers to adapt PaliGemma 2 for specific tasks. For example, they’ve already fine-tuned it on the DOCCI dataset, demonstrating its potential for generating detailed, nuanced captions. These fine-tuned models are available for the 3B and 10B variants and are perfect for those looking to dive into advanced captioning tasks.

Moreover, Google has also shared a demo of the model fine-tuned for visual question answering on the VQAv2 dataset, showing the model's potential beyond image captioning. Whether you're working on a research project or developing a product, this release opens up new possibilities.

For those eager to explore, Google has made all the model repositories publicly available, along with demo scripts and a comprehensive technical report to guide users through setup and usage.

So, what's next for PaliGemma 2? With the new variants and enhanced fine-tuning capabilities, it's clear that Google has set the stage for a new era in vision-language models. Researchers and developers can now push the boundaries of what’s possible, leveraging PaliGemma 2 for everything from image captioning to visual question answering.

The PaliGemma 2 models build on the latest Gemma 2 language models, which come in three sizes: 2B, 9B, and 27B parameters. Combined with the compact SigLIP image encoder, these yield the corresponding 3B, 10B, and 28B variants of PaliGemma 2. All variants support three input resolutions (224x224, 448x448, and 896x896), providing greater flexibility when fine-tuning for different downstream tasks. The models are released under the Gemma license, which permits redistribution, commercial use, fine-tuning, and the creation of derivative models.

The release includes:

  • 9 pre-trained models: Available in 3B, 10B, and 28B variants with the three resolutions mentioned.

  • 2 fine-tuned models: These are fine-tuned on the DOCCI dataset, a collection of image-caption pairs, and are available for the 3B and 10B PaliGemma 2 variants at an input resolution of 448x448.
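
For orientation, the checkpoints on the Hugging Face Hub follow a consistent naming scheme. The identifiers below are inferred from the DOCCI checkpoint used later in this post and the release description above; treat them as an illustrative sketch and verify the exact IDs against the official model collection.

# Illustrative Hub IDs, inferred from "google/paligemma2-10b-ft-docci-448";
# verify against the official PaliGemma 2 collection before use.
pretrained_ids = [
    f"google/paligemma2-{size}-pt-{res}"
    for size in ("3b", "10b", "28b")
    for res in (224, 448, 896)
]  # 9 pre-trained checkpoints (3 sizes x 3 resolutions)

finetuned_ids = [
    "google/paligemma2-3b-ft-docci-448",
    "google/paligemma2-10b-ft-docci-448",
]  # 2 DOCCI fine-tuned checkpoints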

Model Capabilities

PaliGemma 2, like its predecessor, shines in fine-tuning for specific tasks. Its pre-trained models were trained on a diverse range of datasets, giving it strong generalization capabilities for various tasks, such as visual semantic understanding, object localization, and multilingual text understanding. The key datasets used for training include:

  • WebLI: A multilingual web-scale image-text dataset sourced from the public web.

  • CC3M-35L: Curated English image-alt text pairs from webpages, translated into 34 additional languages.

  • VQ2A: An enhanced dataset for question answering, also translated into 34 languages.

  • OpenImages: A dataset used for detection and object-aware question-answering tasks.

  • WIT: Wikipedia-based image-text pairs.

These diverse pre-training sources ensure that the PaliGemma 2 model can handle a variety of visual and textual tasks. Additionally, Google has internally fine-tuned these models for various visual-language understanding tasks, and the results are benchmarked in their technical report.

The fine-tuned models on the DOCCI dataset are particularly adept at generating detailed captions. These captions can describe spatial relations and integrate world knowledge, which is critical for advanced captioning tasks. According to the technical report, the DOCCI fine-tuned variants of PaliGemma 2 have shown strong performance compared to other models in the same category.

For more details on PaliGemma 2’s performance and fine-tuning capabilities, you can refer to the model card and the technical report, which offers a deeper dive into the benchmarks and comparative performance.

The PaliGemma 2 model, especially in its DOCCI fine-tuned variants, showcases impressive versatility in captioning tasks. To evaluate its performance, several key metrics have been used:

  • #char: The average number of characters in the generated caption, indicating how long and detailed the description is (a minimal computation sketch follows this list).

  • #sent: The average number of sentences per generated caption.

  • NES: Non-entailment sentences, i.e., sentences that are not supported by the image. A lower NES score indicates fewer factual inaccuracies and more reliable output.
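
NES requires a separate entailment model and is not sketched here, but the two length-based metrics are straightforward to compute. The snippet below is a minimal sketch under a naive sentence-splitting assumption; the technical report's exact segmentation may differ.

import statistics

captions = [
    "A family enjoys a picnic on a sunny day in a park, with children playing nearby.",
    "A cat rests peacefully on a couch, its fur blending with the fabric.",
]

# #char: average number of characters per generated caption
avg_chars = statistics.mean(len(c) for c in captions)

# #sent: average number of sentences, using a naive split on ., !, ?
def count_sentences(text):
    normalized = text.replace("!", ".").replace("?", ".")
    return len([s for s in normalized.split(".") if s.strip()])

avg_sents = statistics.mean(count_sentences(c) for c in captions)

print(f"#char: {avg_chars:.1f}, #sent: {avg_sents:.1f}")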

The following model outputs from the DOCCI checkpoint demonstrate the flexibility of PaliGemma 2 in generating detailed and contextually accurate captions for diverse image inputs:

  1. Example 1:

    • Image: A group of people in a park enjoying a picnic.

    • Generated Caption: "A family enjoys a picnic on a sunny day in a park, with children playing nearby."

    • #char: 85, #sent: 1, NES: 0

  2. Example 2:

    • Image: A cityscape at night, with tall buildings lit up.

    • Generated Caption: "The city skyline is illuminated at night, showcasing towering skyscrapers and a bustling urban atmosphere."

    • #char: 120, #sent: 1, NES: 0

  3. Example 3:

    • Image: A cat lounging on a couch.

    • Generated Caption: "A cat rests peacefully on a couch, its fur blending with the fabric."

    • #char: 78, #sent: 1, NES: 0

These examples highlight PaliGemma 2's ability to generate concise, accurate, and nuanced captions, describing objects, actions, and contexts across a variety of real-world scenarios.

By minimizing factual inaccuracies (low NES) while keeping descriptions focused (#char and #sent), PaliGemma 2 positions itself as a powerful tool for automated image captioning, especially useful for applications like content tagging, accessibility for the visually impaired, and image-based question answering.

For additional details on how these metrics were calculated and further examples of the model’s performance, be sure to check out the technical report from Google.

Demo

To demonstrate the capabilities of PaliGemma 2, the Hugging Face team fine-tuned the 3B variant with 448x448 resolution on a small portion of the VQAv2 dataset. This was done using LoRA fine-tuning and PEFT (Parameter Efficient Fine-Tuning), as detailed later in the fine-tuning section.

The demo below highlights the final results of this fine-tuning process. It allows users to interact with the model by inputting image-text queries and receiving answers generated by the fine-tuned PaliGemma 2 model. The goal of this demonstration is to showcase how fine-tuning can significantly improve performance for specific tasks like visual question answering (VQA).

Feel free to explore the code in the provided Hugging Face Space to better understand the fine-tuning process, or clone it to experiment with your own fine-tuning adjustments. This demo serves as a useful starting point for anyone looking to implement similar techniques in their own workflows.

For a deeper dive into the LoRA fine-tuning and PEFT methods, as well as a look at the model's underlying architecture, be sure to check the full documentation and demo on the Hugging Face platform.
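
The Space itself is the reference implementation, but as a rough idea of what such a demo involves, here is a minimal sketch that wraps a PaliGemma 2 checkpoint in a Gradio interface. The checkpoint ID, prompt prefix, and generation settings are illustrative assumptions, not the Space's actual code.

import gradio as gr
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

# Placeholder checkpoint: substitute the VQA fine-tuned model from the Space.
model_id = "google/paligemma2-3b-pt-448"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).to("cuda")

def answer(image, question):
    # PaliGemma expects the <image> token followed by a task prompt.
    prompt = f"<image>answer en {question}"
    inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")
    output = model.generate(**inputs, max_new_tokens=40)
    input_len = inputs["input_ids"].shape[-1]
    return processor.decode(output[0][input_len:], skip_special_tokens=True)

gr.Interface(
    fn=answer,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
).launch()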

How to Use with Transformers

To run inference on PaliGemma 2 models, you can use the 🤗 Transformers library, specifically with the PaliGemmaForConditionalGeneration and AutoProcessor APIs. Make sure you have Transformers version 4.47 or later installed:

pip install --upgrade transformers

Once the installation is complete, you can start running inference. Just be sure to follow the prompt format used during the model's training for the specific task you're using. Here’s an example of how to run inference using the PaliGemma 2 model:

from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests

# DOCCI fine-tuned 10B checkpoint at 448x448 input resolution
model_id = "google/paligemma2-10b-ft-docci-448"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
model = model.to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

# The <image> token marks where the image goes; "caption en" selects the task
prompt = "<image>caption en"
image_file = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png"
raw_image = Image.open(requests.get(image_file, stream=True).raw).convert("RGB")

inputs = processor(text=prompt, images=raw_image, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=200)

# Strip the prompt tokens and decode only the newly generated caption
input_len = inputs["input_ids"].shape[-1]
print(processor.decode(output[0][input_len:], skip_special_tokens=True))
# A medium shot of two cats laying on a pile of brown fishing nets. The cat in the foreground is a gray tabby cat with white on its chest and paws. The cat is laying on its side with its head facing the bottom right corner of the image. The cat in the background is laying on its side with its head facing the top left corner of the image. The cat's body is curled up, its head is slightly turned to the right, and its front paws are tucked underneath its body. There is a teal rope hanging from the fishing net in the top right corner of the image.
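
The "caption en" prefix selects the captioning task the DOCCI checkpoints were trained for; other tasks use different prompt prefixes. The list below reflects commonly documented PaliGemma task prompts; treat it as a non-exhaustive reminder and verify the exact prefixes for your checkpoint against its model card.

# Common PaliGemma-style task prompts (verify against the model card):
example_prompts = [
    "<image>caption en",                       # captioning (as above)
    "<image>answer en What is on the table?",  # visual question answering
    "<image>ocr",                              # text recognition
    "<image>detect cat",                       # object detection
    "<image>segment cat",                      # segmentation
]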

Quantization for Efficient Inference

You can also use the bitsandbytes integration with Transformers to load the models with quantization for efficient inference. Here's an example of how to load the model with 4-bit quantization using the nf4 configuration:

import torch
from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration

# 4-bit NF4 quantization with bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map={"":0}
)

Performance with Quantization

We tested performance with quantization on a 3B fine-tuned checkpoint using the TextVQA dataset with 224x224 input images. The results on the validation set (5,000 entries) are as follows:

  • bfloat16, no quantization: 60.04% accuracy

  • 8-bit: 59.78% accuracy

  • 4-bit, using the above configuration: 58.72% accuracy

These results are quite promising, especially for larger models. Keep in mind that quantization's impact may vary depending on the domain and task you're working with, so we recommend testing it in your specific use case to measure any performance differences.
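
The 8-bit row above corresponds to loading the checkpoint with bitsandbytes in 8-bit mode instead of the 4-bit NF4 settings shown earlier. A minimal loading sketch is below; the evaluation harness used to produce the accuracy numbers is not part of this post.

from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration

# 8-bit loading via the same bitsandbytes integration as the 4-bit example
bnb_config_8bit = BitsAndBytesConfig(load_in_8bit=True)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma2-10b-ft-docci-448",  # any PaliGemma 2 checkpoint works here
    quantization_config=bnb_config_8bit,
    device_map={"": 0},
)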

Fine-tuning

If you've already fine-tuned the PaliGemma model, fine-tuning PaliGemma 2 will feel very familiar. The API for fine-tuning remains the same, so you can use your existing code without modifications. The team provides a script and a Jupyter notebook to help you fine-tune the model, freeze specific layers, or implement memory-efficient fine-tuning techniques like LoRA or QLoRA.

For demonstration purposes, PaliGemma 2 was LoRA-fine-tuned on half of the VQAv2 validation split. The fine-tuning process took about half an hour on 3 A100 GPUs with 80GB VRAM. You can check out the fine-tuned model here and try out the interactive Gradio demo that showcases its capabilities.
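
The exact training code lives in the linked script, notebook, and Space. As a rough sketch of what LoRA fine-tuning with PEFT looks like for PaliGemma 2, the snippet below sets up the adapter; the rank, scaling factor, and target modules are illustrative assumptions, not the team's exact settings.

import torch
from peft import LoraConfig, get_peft_model
from transformers import PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-3b-pt-448"  # pre-trained checkpoint to adapt
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)

# Illustrative LoRA settings: low-rank adapters on the attention projections
# of the language model; the official notebook may use different values.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable

# From here, the PEFT-wrapped model can be trained with a standard Trainer or a
# custom loop over (image, question, answer) batches built with the processor.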

Conclusion

The PaliGemma 2 release takes the capabilities of the previous version to new heights. With a range of model sizes tailored for different use cases and stronger pre-trained models, it offers a versatile foundation for a variety of visual-language tasks. The PaliGemma 2 team is excited to see what the community will build using this model.

A big thank you goes to the Google team for releasing this impressive open model family. Special thanks to Pablo Montalvo for integrating the model with Transformers, and to Lysandre, Raushan, Arthur, Yieh-Dar, and the rest of the team for their contributions in reviewing, testing, and merging the code quickly.

To explore more about PaliGemma 2 or get started with fine-tuning, you can visit the official documentation and demos provided by the team.

