What this paper is about
State-of-the-art computer vision systems are typically trained to predict a fixed set of predetermined object categories, and the paper argues that this restricted form of supervision limits generality and usability, since specifying any other visual concept requires additional labeled data.[S1] As an alternative, the paper presents learning directly from raw text about images, which it states leverages a much broader source of supervision.[S1] The pre-training task is simple: predict which caption goes with which image.[S1] The paper states that this caption–image prediction task is an efficient and scalable way to learn state-of-the-art image representations from scratch, and reports pre-training on a dataset of 400 million (image, text) pairs collected from the internet.[S1] After pre-training, natural language is used to reference learned visual concepts or to describe new ones, which the paper states enables zero-shot transfer of the model to downstream tasks.[S1] The paper studies performance by benchmarking on over 30 different existing datasets.[S1]
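The caption–image prediction task described above is commonly implemented as a symmetric cross-entropy loss over a batch-level similarity matrix. The following is a minimal NumPy sketch of that idea, not the paper's actual code; the function names, the temperature value, and the toy setup are all illustrative assumptions.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an image-text similarity matrix.

    image_emb, text_emb: (N, D) arrays where row i of each is a matched pair.
    Illustrative sketch only; names and temperature are assumptions.
    """
    # L2-normalize so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) matrix: entry (i, j) scores image i against caption j.
    logits = image_emb @ text_emb.T / temperature

    # The correct caption for image i is caption i, so targets are the diagonal.
    labels = np.arange(len(logits))

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image->caption and caption->image classification losses.
    return (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2
```

With matched pairs on the diagonal, the loss drives each image embedding toward its own caption and away from the other captions in the batch.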
Core claims to remember
1. Training vision systems to predict a fixed set of predetermined object categories restricts supervision and limits generality and usability, because other visual concepts require more labeled data.[S1]
2. Learning directly from raw text about images is a promising alternative that leverages broader supervision than fixed category labels.[S1]
3. The core learning task is pre-training by predicting which caption goes with which image.[S1]
4. This pre-training task is an efficient and scalable way to learn state-of-the-art image representations from scratch.[S1]
5. Pre-training uses a dataset of 400 million (image, text) pairs collected from the internet.[S1]
6. After pre-training, natural language can reference learned visual concepts and describe new ones, enabling zero-shot transfer of the model to downstream tasks.[S1]
7. The approach is evaluated by benchmarking on over 30 different existing datasets.[S1]
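The zero-shot transfer claim can be pictured with a small sketch: class names are turned into captions, each caption is embedded with the text encoder, and the image is assigned to the class whose caption embedding is most similar. Everything here is a hypothetical stand-in, assuming the prompt template, the `encode_text` callable, and the function name; none of it is the paper's API.

```python
import numpy as np

def zero_shot_classify(image_emb, class_names, encode_text):
    """Pick the class whose caption embedding best matches the image.

    encode_text is an assumed stand-in for a pre-trained text encoder;
    the prompt template below is an illustrative choice.
    """
    captions = [f"a photo of a {name}" for name in class_names]
    text_embs = np.stack([encode_text(c) for c in captions])
    # Normalize both sides so the dot product is cosine similarity.
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    image_emb = image_emb / np.linalg.norm(image_emb)
    scores = text_embs @ image_emb  # one similarity score per class
    return class_names[int(np.argmax(scores))]
```

Because the classifier is built entirely from text, new classes can be added by describing them in natural language, with no extra labeled images.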
Limitations and caveats
The paper states that state-of-the-art computer vision systems trained on fixed predetermined object categories have restricted supervision that limits generality and usability, and that under this regime additional labeled data is needed to specify any other visual concept.[S1] The paper also reports that its pre-training data, described as 400 million (image, text) pairs, is collected from the internet.[S1]
How to apply this in study or projects
1. Read the parts of the paper that define the pre-training task of predicting which caption goes with which image, and track how this task is used to learn image representations from scratch.[S1]
2. Locate the description of the dataset's scale and provenance, and record the paper's statement that it uses 400 million (image, text) pairs collected from the internet.[S1]
3. Follow the paper's description of how natural language is used after pre-training to reference learned visual concepts or describe new ones, and connect it to the paper's claim of zero-shot transfer to downstream tasks.[S1]
4. List the downstream evaluations mentioned in the paper's benchmarking summary, using the statement that it benchmarks on over 30 different existing datasets as the organizing structure for that list.[S1]