Motivation

Traditionally, computer vision datasets were hand-crafted and manually labeled, which is expensive and rigid. CLIP instead uses natural language text from the internet as supervision, which allows models to recognize far more concepts and generalize much better to new data, classes, and tasks.

TL;DR

CLIP jointly trains two neural networks:

  • An image encoder to convert images into visual embeddings
  • A text encoder to convert text descriptions into text embeddings

Purpose: to match corresponding (image, text) pairs from the internet and distinguish them from non-matching pairs. This is learned through a contrastive objective that pulls correct pairs together and pushes incorrect pairs apart.

Methods

Data Collection

400 million (image, text) pairs were scraped from the internet, with noisy, diverse text descriptions that were not explicitly curated.

Model Architecture

  1. Image Encoder
  • Two main architectures, a modified ResNet or a Vision Transformer, each producing an image embedding
  2. Text Encoder
  • Transformer language model with 12 layers, 512-dimensional width, and 8 attention heads, operating on BPE-encoded text sequences
  • The text embedding is taken from the Transformer's representation of the [EOS] (end-of-sequence) token

Both embeddings are then L2-normalized, so that their dot product equals cosine similarity.
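
As a rough sketch (using PyTorch and placeholder `image_encoder` / `text_encoder` modules, not CLIP's actual implementation), the joint embedding step looks like:

```python
import torch.nn.functional as F

def embed_batch(image_encoder, text_encoder, images, token_ids):
    """Encode a batch of images and texts into a shared embedding space.

    image_encoder and text_encoder are assumed to project into the same
    embedding dimension (e.g. 512); they are placeholders, not CLIP's
    actual modules.
    """
    image_features = image_encoder(images)    # (N, d)
    text_features = text_encoder(token_ids)   # (N, d)

    # L2-normalize so a dot product equals cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    return image_features, text_features
```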

Contrastive Learning Objective

Intuition: Teaching a model what belongs together (bringing the right pairs closer, wrong pairs apart).

With a batch of $N$ image-text pairs, we compute the pairwise similarity between every image embedding $I_i$ and text embedding $T_j$, scaled by a learnable temperature $\tau$:

$$s_{ij} = \frac{I_i \cdot T_j}{\tau}$$

Then the InfoNCE loss function is applied in both directions:

1st Direction: image to text

$$\mathcal{L}_{\text{image} \to \text{text}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ij})}$$

The numerator contains $s_{ii}$, the similarity of the correct pair; the denominator sums the similarity of the image with all texts in the batch; $\tau$ is the learnable temperature parameter.

2nd Direction: text to image

$$\mathcal{L}_{\text{text} \to \text{image}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ji})}$$

Final CLIP Loss (symmetric):

$$\mathcal{L}_{\text{CLIP}} = \frac{1}{2}\left(\mathcal{L}_{\text{image} \to \text{text}} + \mathcal{L}_{\text{text} \to \text{image}}\right)$$

Averaging the two directions means the model learns alignment both ways: matching images to their texts and texts to their images.
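
A minimal sketch of this symmetric loss, in the spirit of the pseudocode in the CLIP paper (assuming the L2-normalized embeddings from above and treating the temperature as a plain scalar rather than a learned parameter):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, logit_scale):
    """Symmetric InfoNCE loss over a batch of N matched (image, text) pairs.

    image_features, text_features: (N, d) tensors, already L2-normalized.
    logit_scale: scalar equal to 1/temperature (CLIP learns it as the exp
                 of a log-temperature, but here it is just a number).
    """
    # Pairwise cosine similarities scaled by temperature: an (N, N) matrix
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # The correct pairing lies on the diagonal: image i matches text i
    labels = torch.arange(image_features.size(0), device=image_features.device)

    # Cross-entropy over each row is exactly the per-example InfoNCE term
    loss_i2t = F.cross_entropy(logits_per_image, labels)
    loss_t2i = F.cross_entropy(logits_per_text, labels)
    return (loss_i2t + loss_t2i) / 2
```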

Zero-Shot Classification with CLIP

During inference, CLIP can classify an image without any fine-tuning by comparing its embedding to a list of possible text prompts:

e.g.,
“a photo of a cat”,
“a photo of a dog”,
“a photo of a pizza”

It picks the text with the highest cosine similarity to the image embedding.
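
A sketch of this procedure (assuming `model.encode_image`, `model.encode_text`, and `tokenize` as placeholder interfaces for whichever CLIP implementation is used):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, image_tensor, class_names, tokenize):
    """Return the class whose prompt embedding is closest to the image embedding."""
    prompts = [f"a photo of a {name}" for name in class_names]

    image_features = F.normalize(model.encode_image(image_tensor.unsqueeze(0)), dim=-1)
    text_features = F.normalize(model.encode_text(tokenize(prompts)), dim=-1)

    # Cosine similarity between the one image and every candidate prompt
    similarity = (image_features @ text_features.t()).squeeze(0)
    return class_names[similarity.argmax().item()]

# e.g. zero_shot_classify(model, preprocessed_image, ["cat", "dog", "pizza"], tokenize)
```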

Why Large Batch Size Matters

InfoNCE becomes more effective with larger batch sizes, since each image-text pair is contrasted against more negative examples (non-matching pairs).
CLIP contrasts every example against the rest of the batch, so bigger batches provide more, and harder, negatives and therefore sharper discrimination.