Motivation
In the past, computer vision datasets were hand-crafted and manually labeled, which is expensive and rigid. CLIP instead uses natural language text from the internet as supervision, which lets models recognize far more concepts and generalize much better to new data, classes, and tasks.
TL;DR
CLIP jointly trains two neural networks:
- An image encoder to convert images into visual embeddings
- A text encoder to convert text descriptions into text embeddings
Purpose: to match corresponding (image, text) pairs from the internet and distinguish them from non-matching pairs. This is learned through a contrastive objective that pulls correct pairs together and pushes wrong pairs apart.
Methods
Data Collection
400 million (image, text) pairs were collected from the internet; the text descriptions are noisy and diverse rather than explicitly curated or hand-labeled.
Model Architecture
- Image Encoder
- Two main architectures, a modified ResNet or a Vision Transformer (ViT), each producing an image embedding
- Text Encoder
- Transformer language model with 12 layers, width 512, and 8 attention heads, operating on byte-pair encoded (BPE) text sequences
- The text embedding is taken from the Transformer’s representation at the [EOS] (end-of-sequence) token
Both embeddings are then linearly projected into a shared embedding space and L2-normalized.
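Below is a minimal PyTorch sketch of this wiring, assuming generic backbone modules; the class name CLIPSketch, the projection layer shapes, and embed_dim=512 are illustrative assumptions rather than the paper’s exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIPSketch(nn.Module):
    """Minimal sketch: two encoders, two projection heads, L2-normalized outputs."""

    def __init__(self, image_encoder, text_encoder, image_dim, text_dim, embed_dim=512):
        super().__init__()
        self.image_encoder = image_encoder          # e.g. a ResNet or ViT backbone
        self.text_encoder = text_encoder            # e.g. a 12-layer Transformer
        self.image_proj = nn.Linear(image_dim, embed_dim, bias=False)
        self.text_proj = nn.Linear(text_dim, embed_dim, bias=False)

    def forward(self, images, token_ids):
        img_feat = self.image_encoder(images)       # (N, image_dim)
        txt_feat = self.text_encoder(token_ids)     # (N, text_dim), e.g. the [EOS] token state
        # Project into the shared embedding space and L2-normalize.
        img_emb = F.normalize(self.image_proj(img_feat), dim=-1)
        txt_emb = F.normalize(self.text_proj(txt_feat), dim=-1)
        return img_emb, txt_emb
```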
Contrastive Learning Objective
Intuition: teach the model what belongs together by bringing the right pairs closer and pushing the wrong pairs apart.
With a batch of $N$ (image, text) pairs, we compute the cosine similarity $\mathrm{sim}(I_i, T_j)$ between every image embedding $I_i$ and every text embedding $T_j$, then apply the InfoNCE loss in both directions:
1st Direction: image to text
$$\mathcal{L}_{i \to t} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\!\big(\mathrm{sim}(I_i, T_i)/\tau\big)}{\sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(I_i, T_j)/\tau\big)}$$
Here $\mathrm{sim}(I_i, T_i)$ is the similarity of the correct pair, the denominator sums the similarities of the image with all texts in the batch, and $\tau$ is a learnable temperature parameter.
2nd Direction: text to image
$$\mathcal{L}_{t \to i} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\!\big(\mathrm{sim}(T_i, I_i)/\tau\big)}{\sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(T_i, I_j)/\tau\big)}$$
Final CLIP Loss (symmetric):
$$\mathcal{L}_{\mathrm{CLIP}} = \tfrac{1}{2}\big(\mathcal{L}_{i \to t} + \mathcal{L}_{t \to i}\big)$$
Averaging both directions ensures the model learns to align image-to-text and text-to-image equally well.
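A sketch of this symmetric loss in PyTorch, assuming the embeddings are already L2-normalized as above; the function name and the logit_scale argument (equal to $1/\tau$) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, logit_scale):
    """Symmetric InfoNCE loss over N matched (image, text) pairs.

    img_emb, txt_emb: (N, D) tensors, assumed L2-normalized.
    logit_scale:      scalar playing the role of 1 / tau (learnable in CLIP).
    """
    # Cosine similarities of every image with every text: (N, N).
    logits_per_image = logit_scale * img_emb @ txt_emb.t()
    logits_per_text = logits_per_image.t()

    # The correct pairing lies on the diagonal: image i matches text i.
    targets = torch.arange(img_emb.size(0), device=img_emb.device)

    # Cross-entropy with the diagonal as the target is exactly the InfoNCE sum.
    loss_i2t = F.cross_entropy(logits_per_image, targets)  # image -> text direction
    loss_t2i = F.cross_entropy(logits_per_text, targets)   # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Example with random embeddings (batch of 8, dimension 512):
img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
print(clip_loss(img, txt, logit_scale=1 / 0.07))
```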
Zero-Shot Classification with CLIP
During inference, CLIP can classify an image without any fine-tuning by comparing its embedding to a set of candidate text prompts, e.g.:
- “a photo of a cat”
- “a photo of a dog”
- “a photo of a pizza”
CLIP picks the prompt whose text embedding has the highest cosine similarity to the image embedding.
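A sketch of zero-shot classification under the same assumptions as the model sketch above; `tokenize` is a hypothetical helper that turns the prompts into token IDs for the text encoder.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, tokenize, image, class_names):
    """Pick the prompt whose text embedding is closest to the image embedding.

    `model` is assumed to behave like CLIPSketch above and return
    L2-normalized image and text embeddings.
    """
    prompts = [f"a photo of a {name}" for name in class_names]
    img_emb, txt_emb = model(image.unsqueeze(0), tokenize(prompts))  # (1, D), (C, D)

    # Cosine similarity between the image and every prompt (embeddings are normalized).
    sims = (img_emb @ txt_emb.t()).squeeze(0)        # (C,)
    probs = F.softmax(sims / 0.07, dim=-1)           # 0.07 is a stand-in for the learned temperature
    best = sims.argmax().item()
    return class_names[best], probs

# e.g. zero_shot_classify(model, tokenize, image, ["cat", "dog", "pizza"])
```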
Why Large Batch Size Matters
InfoNCE becomes more effective with larger batch sizes, as each image-text pair gets more negative examples (non-matching pairs).
CLIP contrasts each pair against every other pair in the batch, so bigger batches provide more (and harder) negatives and sharper discrimination.
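A tiny illustration of how the number of negatives scales with the batch size (32,768 is the batch size reported in the CLIP paper):

```python
# Each image in a batch of size N is contrasted against 1 positive text
# and N - 1 negative texts, so the task gets harder as the batch grows.
for batch_size in (64, 1024, 32_768):
    print(f"batch size {batch_size:>6} -> negatives per image: {batch_size - 1}")
```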