Building a Multimodal AI with LLaVA: Combining Vision and Language on a Budget
LLaVA merges image and text models to generate responses that combine visual and language input. It runs efficiently on free-tier platforms like Google Colab using a lightweight setup.

Introduction
Large language models (LLMs) focused on text have dominated recent AI developments, but they represent just the starting point for generative AI. The next step is physical AI — systems that can see, hear, feel, and reason more like humans do. This article introduces a multimodal architecture called LLaVA that combines image and text understanding to generate responses that consider both.
We’ll explore a lightweight setup suitable for free-tier environments like Google Colab, using these components:
- CLIP-ViT B/32 as the image encoder
- TinyLlama-1.1B as the language model
- A 2-layer MLP adapter bridging the two (sketched below)
This approach is based on the paper Visual Instruction Tuning (NeurIPS 2023).
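To make the adapter concrete: in the Transformers LLaVA implementation the projector is a linear layer, a GELU activation, and a second linear layer. The snippet below is a minimal stand-alone sketch, assuming CLIP-ViT B/32's 768-dimensional image features and TinyLlama's 2048-dimensional hidden states; in practice the sizes come from the loaded configurations.

import torch.nn as nn

class MLPAdapter(nn.Module):
    # Maps CLIP patch embeddings into the language model's embedding space.
    def __init__(self, vision_hidden_size=768, text_hidden_size=2048):
        super().__init__()
        self.linear_1 = nn.Linear(vision_hidden_size, text_hidden_size)
        self.act = nn.GELU()
        self.linear_2 = nn.Linear(text_hidden_size, text_hidden_size)

    def forward(self, image_features):
        # image_features: (batch, num_patches, vision_hidden_size) from the CLIP encoder
        return self.linear_2(self.act(self.linear_1(image_features)))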
Setup
Before coding, install the Hugging Face datasets library:
!pip install -U datasets
Then import essential packages from Hugging Face and PyTorch to access pre-trained models and utilities for multimodal processing.
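As a rough sketch (the original notebook's import list may differ slightly), these are the pieces used in the rest of the walkthrough:

import torch
from huggingface_hub import hf_hub_download
from datasets import load_dataset
from transformers import (
    AutoConfig,
    AutoTokenizer,
    CLIPImageProcessor,
    LlavaConfig,
    LlavaForConditionalGeneration,
    LlavaProcessor,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)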
Downloading Pre-trained Model Components
The LLaVA model combines:
- A pre-trained CLIP image encoder (openai/clip-vit-base-patch32)
- A pre-trained TinyLlama language model (TinyLlama/TinyLlama-1.1B-Chat-v1.0)
- A 2-layer MLP projector connecting the vision and language parts
Weights are downloaded using Hugging Face’s hf_hub_download utility.
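For illustration, a hedged example of fetching the two checkpoints; the filenames below are assumptions, so check the Files tab of each repository on the Hub before copying them.

from huggingface_hub import hf_hub_download

# Filenames are assumptions; verify them on each model's Hub page.
clip_weights_path = hf_hub_download(
    repo_id="openai/clip-vit-base-patch32",
    filename="pytorch_model.bin",
)
llm_weights_path = hf_hub_download(
    repo_id="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    filename="model.safetensors",
)
print(clip_weights_path, llm_weights_path, sep="\n")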
Model Construction
Instantiate the LLaVA Model
We start by loading the configurations for both the vision and text backbones. The LLaVA configuration is then created by combining these, and the model instance is initialized and moved to GPU.
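A minimal sketch of this step using the LlavaConfig and LlavaForConditionalGeneration classes from Transformers; any extra config fields the original notebook sets (image token index, projector activation, and so on) are omitted here.

import torch
from transformers import AutoConfig, LlavaConfig, LlavaForConditionalGeneration

# Backbone configurations pulled from the Hub.
vision_config = AutoConfig.from_pretrained("openai/clip-vit-base-patch32").vision_config
text_config = AutoConfig.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Combine them into a LLaVA configuration and build the (randomly initialized) model.
llava_config = LlavaConfig(vision_config=vision_config, text_config=text_config)
model = LlavaForConditionalGeneration(llava_config)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")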
Load Pre-trained Weights
Pre-trained weights come in different formats such as .safetensors and .bin. A helper function handles loading the correct format for each backbone.
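One way to write such a helper, assuming the file extension is enough to tell the two formats apart:

import torch
from safetensors.torch import load_file

def load_state_dict_from_file(path):
    # .safetensors files use the safetensors loader; .bin files are pickled PyTorch checkpoints.
    if path.endswith(".safetensors"):
        return load_file(path)
    return torch.load(path, map_location="cpu")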
Inject Weights and Freeze Backbones
The vision and language backbones receive their respective weights with some flexibility to ignore minor mismatches. After loading, both backbones are frozen to keep their weights fixed during training. Only the small MLP adapter between them will be trained, making the process lighter and faster.
A helper function counts total and trainable parameters to confirm the setup.
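Continuing from the snippets above, a hedged sketch of this step; the vision_tower and language_model attribute names follow the Transformers LLaVA implementation and may vary between library versions.

# Inject the pre-trained weights; strict=False ignores minor key mismatches
# (for example, CLIP's text tower, which LLaVA does not use).
model.vision_tower.load_state_dict(load_state_dict_from_file(clip_weights_path), strict=False)
model.language_model.load_state_dict(load_state_dict_from_file(llm_weights_path), strict=False)

# Freeze both backbones; only the MLP projector remains trainable.
for backbone in (model.vision_tower, model.language_model):
    for param in backbone.parameters():
        param.requires_grad = False

def count_parameters(m):
    total = sum(p.numel() for p in m.parameters())
    trainable = sum(p.numel() for p in m.parameters() if p.requires_grad)
    print(f"total: {total:,} | trainable: {trainable:,}")

count_parameters(model)  # the trainable count should match the projector alone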
Processor and Tokenization
To prepare text input, a tokenizer converts words into token IDs. Special tokens like <image> and <pad> are added to handle images and padding properly. A chat template formats conversations between user and assistant, mixing text and image references for the model to understand context.
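A sketch of that setup, reusing TinyLlama's tokenizer and CLIP's image processor; the chat template below is an assumption written for messages whose content is a list of text and image items, and the article's actual template may differ.

from transformers import AutoTokenizer, CLIPImageProcessor, LlavaProcessor

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Register the extra tokens and resize the embedding table so they get embedding rows.
tokenizer.add_special_tokens({"additional_special_tokens": ["<image>"], "pad_token": "<pad>"})
model.resize_token_embeddings(len(tokenizer))

# Assumed chat template: renders each turn as "role: ...", replacing image items with <image>.
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{{ message['role'] }}: "
    "{% for item in message['content'] %}"
    "{% if item['type'] == 'image' %}<image>{% else %}{{ item['text'] }}{% endif %}"
    "{% endfor %}\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}assistant: {% endif %}"
)

# Bundle the tokenizer and image processor into one multimodal processor.
processor = LlavaProcessor(
    image_processor=CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32"),
    tokenizer=tokenizer,
)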
Dataset
We load a publicly available image-text instruction dataset from Hugging Face for training. Each example consists of paired images and messages formatted as conversations.
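For example (the dataset name here is an assumption; any image-text instruction set with paired images and conversation-style messages works the same way):

from datasets import load_dataset

# Assumed dataset: images paired with multi-turn "messages" in chat format.
dataset = load_dataset("HuggingFaceH4/llava-instruct-mix-vsft", split="train")
print(dataset[0]["messages"][0])  # first turn of the first conversation
print(dataset[0]["images"][0])    # the paired PIL image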
A custom collator function prepares batches by applying the chat template, tokenizing text, processing images, and masking padding tokens in the labels.
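A minimal collator sketch under the assumptions above (one image per example, the chat template from the previous section, and the combined processor); the original version likely handles edge cases this one skips.

def collate_fn(examples):
    # Render each conversation into a single prompt string via the chat template.
    texts = [tokenizer.apply_chat_template(ex["messages"], tokenize=False) for ex in examples]
    images = [ex["images"][0] for ex in examples]

    # Tokenize the text and preprocess the images in one call, padding to the longest sample.
    batch = processor(text=texts, images=images, padding=True, return_tensors="pt")

    # Labels are the input ids with padding positions masked out (-100 is ignored by the loss).
    labels = batch["input_ids"].clone()
    labels[labels == tokenizer.pad_token_id] = -100
    batch["labels"] = labels
    return batch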
Training
Training is configured with parameters tuned for lightweight environments: small batch sizes, gradient accumulation, a cosine learning rate schedule, and mixed precision for speed. Checkpoints are disabled to save space. The Seq2SeqTrainer runs the training loop with the model, dataset, and collator.
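A sketch of a matching configuration; the hyperparameter values are placeholders rather than the article's exact settings.

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="llava-tinyllama",
    per_device_train_batch_size=2,    # small batches for a free-tier GPU
    gradient_accumulation_steps=8,    # effective batch size of 16
    learning_rate=1e-3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    num_train_epochs=1,
    fp16=True,                        # mixed precision for speed
    save_strategy="no",               # checkpoints disabled to save disk space
    logging_steps=10,
    remove_unused_columns=False,      # keep the image column for the collator
    report_to="none",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=collate_fn,
)
trainer.train()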
Inference
For demonstration, an image of the Mona Lisa is loaded from a URL. The conversation prompt asks, “What is represented in the image?” The processor prepares inputs, which are passed to the model to generate a descriptive response. The output shows how the model combines visual and textual understanding.
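Putting the pieces together, a hedged version of the inference step; the image URL is a placeholder, and the generation settings are illustrative.

import requests
from PIL import Image

# Placeholder URL: substitute any publicly accessible image of the Mona Lisa.
url = "https://upload.wikimedia.org/wikipedia/commons/6/6a/Mona_Lisa.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Build the prompt with the same chat template used during training.
prompt = tokenizer.apply_chat_template(
    [{"role": "user",
      "content": [{"type": "image"},
                  {"type": "text", "text": "What is represented in the image?"}]}],
    tokenize=False,
    add_generation_prompt=True,
)

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output_ids[0], skip_special_tokens=True))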
Extensions and Improvements
- Use larger backbones like CLIP-ViT Large and LLaMA 3.1 8B for better performance
- Train longer to improve instruction-following with multimodal inputs
- Adopt a multi-stage training process:
  - Stage 1: Pre-train with frozen backbones on single-turn instructions for feature alignment
  - Stage 2: Fine-tune end-to-end on multi-turn instructions, freezing only the image encoder
A demo space is available at huggingface.co/spaces/badayvedat/LLaVA.
Conclusion
This project offers a straightforward look at how multimodal models like LLaVA combine vision and language in one system. Although smaller models and limited training result in modest outputs, the core idea remains: enabling AI to see and talk about images. Running such models efficiently on limited resources presents unique challenges but also opportunities for learning and experimentation.