Building a Multimodal AI with LLaVA: Combining Vision and Language on a Budget

LLaVA pairs an image encoder with a language model to generate responses grounded in both visual and textual input. The lightweight setup shown here runs on free-tier platforms like Google Colab.

Published on: Jun 18, 2025

Introduction

Large language models (LLMs) focused on text have dominated recent AI developments, but they are just the starting point for generative AI. The next step is physical AI: systems that can see, hear, feel, and reason more like humans do. This article walks through a multimodal architecture called LLaVA that combines image and text understanding to generate responses informed by both.

We’ll explore a lightweight setup suitable for free-tier environments like Google Colab, using these components:

  • CLIP-ViT B/32 as the image encoder
  • TinyLlama-1.1B as the language model
  • A 2-layer MLP adapter bridging the two

This approach is based on the paper Visual Instruction Tuning (NeurIPS 2023).

Setup

Before writing any code, install (or upgrade) the Hugging Face datasets library:

!pip install -U datasets

Then import essential packages from Hugging Face and PyTorch to access pre-trained models and utilities for multimodal processing.
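The exact import list depends on the classes used in the following sections; a plausible set, assuming the transformers, datasets, and huggingface_hub packages, looks like this:

import torch
from datasets import load_dataset
from huggingface_hub import hf_hub_download
from transformers import (
    AutoTokenizer,
    CLIPImageProcessor,
    CLIPVisionConfig,
    LlamaConfig,
    LlavaConfig,
    LlavaForConditionalGeneration,
    LlavaProcessor,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)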

Downloading Pre-trained Model Components

The LLaVA model combines:

  • A pre-trained CLIP image encoder (openai/clip-vit-base-patch32)
  • A pre-trained TinyLlama language model (TinyLlama/TinyLlama-1.1B-Chat-v1.0)
  • A 2-layer MLP projector connecting the vision and language parts

Weights are downloaded using Hugging Face’s hf_hub_download utility.
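As a sketch, the calls could look like the following; the filenames are assumptions, so check each repository's file listing for the exact names:

# Filenames are assumptions; verify them on each model's Hugging Face page.
clip_weights_path = hf_hub_download(
    repo_id="openai/clip-vit-base-patch32",
    filename="pytorch_model.bin",
)
llm_weights_path = hf_hub_download(
    repo_id="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    filename="model.safetensors",
)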

Model Construction

Instantiate the LLaVA Model

We start by loading the configurations for both the vision and text backbones. The LLaVA configuration is then created by combining these, and the model instance is initialized and moved to GPU.
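A minimal sketch of this step, assuming the Llava classes from transformers:

# Pull the backbone configurations from their checkpoints.
vision_config = CLIPVisionConfig.from_pretrained("openai/clip-vit-base-patch32")
text_config = LlamaConfig.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Combine them; LlavaConfig's default projector is a 2-layer MLP.
config = LlavaConfig(vision_config=vision_config, text_config=text_config)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = LlavaForConditionalGeneration(config).to(device)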

Load Pre-trained Weights

Pre-trained weights come in different formats such as .safetensors and .bin. A helper function handles loading the correct format for each backbone.
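One way to write that helper, as a sketch:

from safetensors.torch import load_file

def load_weights(path):
    # .safetensors files need the safetensors loader; .bin files are torch pickles.
    if path.endswith(".safetensors"):
        return load_file(path)
    return torch.load(path, map_location="cpu")

clip_state_dict = load_weights(clip_weights_path)
llm_state_dict = load_weights(llm_weights_path)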

Inject Weights and Freeze Backbones

The vision and language backbones receive their respective weights, loaded non-strictly so that minor key mismatches are ignored rather than raising errors. After loading, both backbones are frozen to keep their weights fixed during training. Only the small MLP adapter between them will be trained, which makes the process lighter and faster.
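A sketch of the injection and freezing, using the state dicts loaded above; key names can differ between transformers versions, so it is worth inspecting the reported missing and unexpected keys:

# strict=False ignores extra keys (e.g. CLIP's text tower) and minor mismatches;
# inspect the returned missing/unexpected keys to confirm the load worked.
vision_report = model.vision_tower.load_state_dict(clip_state_dict, strict=False)
text_report = model.language_model.load_state_dict(llm_state_dict, strict=False)

# Freeze both backbones; only the multi_modal_projector should remain trainable.
for param in model.vision_tower.parameters():
    param.requires_grad = False
for param in model.language_model.parameters():
    param.requires_grad = False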

A helper function counts total and trainable parameters to confirm the setup.
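For example:

def count_parameters(model):
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"total: {total:,} | trainable: {trainable:,} ({100 * trainable / total:.2f}%)")

count_parameters(model)  # only the projector parameters should be trainable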

Processor and Tokenization

To prepare text input, a tokenizer converts words into token IDs. Special tokens like <image> and <pad> are added to handle images and padding properly. A chat template formats conversations between user and assistant, mixing text and image references for the model to understand context.
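A sketch of this setup, assuming a recent transformers release; the chat template below is a simple stand-in that expects each message's content to be a plain string which may already contain the <image> placeholder, not the template from the original LLaVA recipe:

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer.add_tokens(["<image>"], special_tokens=True)
tokenizer.add_special_tokens({"pad_token": "<pad>"})

# Give the new tokens rows in the embedding matrix and tell the config
# which ID marks the image placeholder.
model.resize_token_embeddings(len(tokenizer))
model.config.image_token_index = tokenizer.convert_tokens_to_ids("<image>")
model.config.pad_token_id = tokenizer.pad_token_id

# resize_token_embeddings rebuilds the embedding and output layers, so re-freeze
# everything except the projector afterwards.
for name, param in model.named_parameters():
    param.requires_grad = "multi_modal_projector" in name

# Minimal chat template: "USER: ..." / "ASSISTANT: ..." turns separated by newlines.
tokenizer.chat_template = (
    "{% for m in messages %}{{ m['role'].upper() + ': ' + m['content'] + '\n' }}{% endfor %}"
    "{% if add_generation_prompt %}{{ 'ASSISTANT: ' }}{% endif %}"
)

image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
processor = LlavaProcessor(
    image_processor=image_processor,
    tokenizer=tokenizer,
    patch_size=32,  # lets recent transformers versions expand <image> to the right length
)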

Dataset

We load a publicly available image-text instruction dataset from Hugging Face for training. Each example consists of paired images and messages formatted as conversations.
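The loading step itself is a one-liner; the repository id below is a placeholder, since any LLaVA-style instruction dataset with an image plus a conversation per example will do:

# Placeholder id; substitute the instruction dataset you actually train on.
train_dataset = load_dataset("your-username/llava-instruct-subset", split="train")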

A custom collator function prepares batches by applying the chat template, tokenizing text, processing images, and masking padding tokens in the labels.
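A sketch of such a collator, assuming each example carries an "image" field (a PIL image) and "messages" in the plain-string format expected by the chat template above; both field names are assumptions:

def collate_fn(examples):
    texts, images = [], []
    for example in examples:
        # Render the conversation with the chat template defined earlier.
        texts.append(tokenizer.apply_chat_template(example["messages"], tokenize=False))
        images.append(example["image"])

    # Tokenize the text and preprocess the images in one call.
    batch = processor(text=texts, images=images, padding=True, return_tensors="pt")

    # Use the input ids as labels, masking padding positions so the loss ignores them.
    labels = batch["input_ids"].clone()
    labels[labels == tokenizer.pad_token_id] = -100
    batch["labels"] = labels
    return batch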

Training

Training is configured with parameters tuned for lightweight environments: small batch sizes, gradient accumulation, a cosine learning rate schedule, and mixed precision for speed. Checkpoints are disabled to save space. The Seq2SeqTrainer runs the training loop with the model, dataset, and collator.
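Put together, the training setup might look like this; the hyperparameters are illustrative rather than tuned:

training_args = Seq2SeqTrainingArguments(
    output_dir="llava-tiny",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=1e-3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    num_train_epochs=1,
    fp16=True,                    # mixed precision for speed
    save_strategy="no",           # checkpoints disabled to save disk space
    logging_steps=10,
    remove_unused_columns=False,  # keep raw columns so the collator can read them
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=collate_fn,
)
trainer.train()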

Inference

For demonstration, an image of the Mona Lisa is loaded from a URL. The conversation prompt asks, “What is represented in the image?” The processor prepares inputs, which are passed to the model to generate a descriptive response. The output shows how the model combines visual and textual understanding.
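A sketch of that inference step; the URL points at the full-resolution Wikimedia Commons Mona Lisa image (a large file), and any publicly reachable image URL can be substituted:

import requests
from PIL import Image

url = "https://upload.wikimedia.org/wikipedia/commons/e/ec/Mona_Lisa%2C_by_Leonardo_da_Vinci%2C_from_C2RMF_retouched.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

conversation = [{"role": "user", "content": "<image>\nWhat is represented in the image?"}]
prompt = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)

inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))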

Extensions and Improvements

  • Use larger backbones like CLIP-ViT Large and LLaMA 3.1 8B for better performance
  • Train longer to improve instruction-following with multimodal inputs
  • Adopt a multi-stage training process:
    • Stage 1: Pre-train with frozen backbones on single-turn instructions for feature alignment
    • Stage 2: Fine-tune end-to-end on multi-turn instructions, freezing only the image encoder

A demo space is available at huggingface.co/spaces/badayvedat/LLaVA.

Conclusion

This project offers a straightforward look at how multimodal models like LLaVA combine vision and language in one system. Although smaller models and limited training result in modest outputs, the core idea remains: enabling AI to see and talk about images. Running such models efficiently on limited resources presents unique challenges but also opportunities for learning and experimentation.