Motivation

Traditionally, computer vision datasets were hand-crafted and manually labeled, which is expensive and rigid. CLIP instead uses natural language text from the internet as supervision, which allows models to recognize far more concepts and generalize much better to new data, classes, and tasks.

TL;DR

CLIP jointly trains two neural networks:

  • An image encoder to convert images into visual embeddings
  • A text encoder to convert text descriptions into text embeddings

Purpose: to match corresponding (image, text) pairs from the internet and distinguish them from non-matching pairs. This is learned through a contrastive objective that pulls correct pairs together and pushes incorrect pairs apart.

Methods

Data Collection

400 million (image, text) pairs were scraped from the internet, with noisy, diverse text descriptions that were not explicitly curated.

Model Architecture

  1. Image Encoder
  • Two main architectures, a modified ResNet or a Vision Transformer, each producing an image embedding
  2. Text Encoder
  • Transformer language model with 12 layers, 512-dimensional width, and 8 attention heads, operating on BPE-encoded text sequences
  • The text embedding is taken from the Transformer's representation of the [EOS] (end-of-sequence) token

Both embeddings are then L2-normalized, so that their dot product equals cosine similarity.
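
As a rough sketch (using PyTorch and placeholder `image_encoder` / `text_encoder` modules, not CLIP's actual implementation), the joint embedding step looks like:

```python
import torch.nn.functional as F

def embed_batch(image_encoder, text_encoder, images, token_ids):
    """Encode a batch of images and texts into a shared embedding space.

    image_encoder and text_encoder are assumed to project into the same
    embedding dimension (e.g. 512); they are placeholders, not CLIP's
    actual modules.
    """
    image_features = image_encoder(images)    # (N, d)
    text_features = text_encoder(token_ids)   # (N, d)

    # L2-normalize so a dot product equals cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    return image_features, text_features
```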

Contrastive Learning Objective

Intuition: Teaching a model what belongs together (bringing the right pairs closer, wrong pairs apart).

With a batch of $N$ image-text pairs, we compute the pairwise similarity between every image embedding $I_i$ and text embedding $T_j$, scaled by a learnable temperature $\tau$:

$$s_{ij} = \frac{I_i \cdot T_j}{\tau}$$

Then the InfoNCE loss function is applied in both directions:

1st Direction: image to text

$$\mathcal{L}_{\text{image} \to \text{text}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ij})}$$

The numerator contains $s_{ii}$, the similarity of the correct pair; the denominator sums the similarity of the image with all texts in the batch; $\tau$ is the learnable temperature parameter.

2nd Direction: text to image

$$\mathcal{L}_{\text{text} \to \text{image}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ji})}$$

Final CLIP Loss (symmetric):

$$\mathcal{L}_{\text{CLIP}} = \frac{1}{2}\left(\mathcal{L}_{\text{image} \to \text{text}} + \mathcal{L}_{\text{text} \to \text{image}}\right)$$

Averaging the two directions means the model learns alignment both ways: matching images to their texts and texts to their images.
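
A minimal sketch of this symmetric loss, in the spirit of the pseudocode in the CLIP paper (assuming the L2-normalized embeddings from above and treating the temperature as a plain scalar rather than a learned parameter):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, logit_scale):
    """Symmetric InfoNCE loss over a batch of N matched (image, text) pairs.

    image_features, text_features: (N, d) tensors, already L2-normalized.
    logit_scale: scalar equal to 1/temperature (CLIP learns it as the exp
                 of a log-temperature, but here it is just a number).
    """
    # Pairwise cosine similarities scaled by temperature: an (N, N) matrix
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # The correct pairing lies on the diagonal: image i matches text i
    labels = torch.arange(image_features.size(0), device=image_features.device)

    # Cross-entropy over each row is exactly the per-example InfoNCE term
    loss_i2t = F.cross_entropy(logits_per_image, labels)
    loss_t2i = F.cross_entropy(logits_per_text, labels)
    return (loss_i2t + loss_t2i) / 2
```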

Zero-Shot Classification with CLIP

During inference, CLIP can classify an image without any fine-tuning by comparing its embedding to a list of possible text prompts:

e.g.,
“a photo of a cat”,
“a photo of a dog”,
“a photo of a pizza”

It picks the text with the highest cosine similarity to the image embedding.
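
A sketch of this procedure (assuming `model.encode_image`, `model.encode_text`, and `tokenize` as placeholder interfaces for whichever CLIP implementation is used):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, image_tensor, class_names, tokenize):
    """Return the class whose prompt embedding is closest to the image embedding."""
    prompts = [f"a photo of a {name}" for name in class_names]

    image_features = F.normalize(model.encode_image(image_tensor.unsqueeze(0)), dim=-1)
    text_features = F.normalize(model.encode_text(tokenize(prompts)), dim=-1)

    # Cosine similarity between the one image and every candidate prompt
    similarity = (image_features @ text_features.t()).squeeze(0)
    return class_names[similarity.argmax().item()]

# e.g. zero_shot_classify(model, preprocessed_image, ["cat", "dog", "pizza"], tokenize)
```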

Why Large Batch Size Matters

InfoNCE becomes more effective with larger batch sizes, since each image-text pair is contrasted against more negative examples (non-matching pairs).
CLIP contrasts every example against the rest of the batch, so bigger batches provide more, and harder, negatives and therefore sharper discrimination.