Paper brief: Learning Transferable Visual Models From Natural Language Supervision (arXiv:2103.00020)

The paper reports a method for learning image representations from scratch by predicting which caption matches which image on 400 million internet-collected (image, text) pairs, and then using natural language to enable zero-shot transfer to downstream tasks.

February 25, 2026 • Mira Vale • ml foundations

What this paper is about

State-of-the-art computer vision systems are typically trained to predict a fixed set of predetermined object categories. [S1] The paper argues that this restricted form of supervision limits their generality and usability, because specifying any other visual concept requires additional labeled data. [S1] As an alternative, it proposes learning directly from raw text about images, which draws on a much broader source of supervision. [S1] The pre-training task is simple: predict which caption goes with which image. [S1] The paper reports that this task is an efficient and scalable way to learn state-of-the-art image representations from scratch, using a dataset of 400 million (image, text) pairs collected from the internet. [S1] After pre-training, natural language is used to reference learned visual concepts or to describe new ones, which enables zero-shot transfer of the model to downstream tasks. [S1] The approach is evaluated on over 30 existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and fine-grained object classification. [S1]
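The caption–image matching task can be illustrated with a small sketch. The version below is one common way to phrase such an objective, not a confirmed reproduction of the paper's implementation: it assumes a symmetric cross-entropy loss over temperature-scaled cosine similarities. The function names, the toy two-dimensional embeddings, and the temperature value of 0.07 are all assumptions made for illustration; the summary above does not specify them.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    top = max(row)
    log_sum = top + math.log(sum(math.exp(x - top) for x in row))
    return [x - log_sum for x in row]


def caption_matching_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an N x N image-caption similarity matrix.

    Pair i (image i, caption i) is treated as the only correct match in the
    batch. The temperature is an assumed illustrative value.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # logits[i][j]: cosine similarity of image i and caption j, divided by temperature.
    logits = [
        [sum(a * b for a, b in zip(images[i], texts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Image -> caption direction: each row should favor its diagonal entry.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Caption -> image direction: each column should favor its diagonal entry.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return (loss_i2t + loss_t2i) / 2


# Toy embeddings (hypothetical): two images and their captions.
toy_images = [[1.0, 0.0], [0.0, 1.0]]
matched_captions = [[0.9, 0.1], [0.1, 0.9]]
swapped_captions = [[0.1, 0.9], [0.9, 0.1]]

loss_matched = caption_matching_loss(toy_images, matched_captions)
loss_swapped = caption_matching_loss(toy_images, swapped_captions)
print(loss_matched < loss_swapped)
```

In this toy batch the loss is lower when each caption sits next to its own image than when the captions are swapped, which is the behavior a caption-matching objective is meant to reward.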

Core claims to remember

Fixed-category supervision restricts the generality and usability of vision systems, because every new visual concept requires additional labeled data. [S1] Raw text about images is presented as a promising alternative source of supervision. [S1] The pre-training objective is a contrastive-style task: predict which caption goes with which image. [S1] The paper characterizes this task as efficient and scalable, and reports that it learns state-of-the-art image representations from scratch. [S1] The training data consists of 400 million (image, text) pairs collected from the internet. [S1] After pre-training, natural language can reference learned visual concepts or describe new ones, which enables zero-shot transfer to downstream tasks. [S1] Performance is studied by benchmarking on over 30 existing computer vision datasets. [S1] The paper reports that the model transfers non-trivially to most of these tasks and is often competitive with a fully supervised baseline; for example, it matches the accuracy of the original ResNet-50 on ImageNet zero-shot, without using any of the 1.28 million training examples that model was trained on. [S1]

Limitations and caveats

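The limitation stated most directly in the cited abstract concerns existing systems rather than the proposed method: training on a fixed set of predetermined object categories limits generality and usability, because additional labeled data is needed to specify any other visual concept. [S1] The results reported for the method itself are qualified: the model is said to transfer non-trivially to most tasks, not all, and to be often, not always, competitive with a fully supervised baseline. [S1]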

How to apply this in study or projects

Start with the passages on the problem with “fixed set of predetermined object categories” supervision, including the statement that additional labeled data is needed to specify other visual concepts. [S1] Then read how learning directly from raw text about images is positioned as an alternative source of supervision. [S1] Next, study the definition of the pre-training task (predicting which caption goes with which image) and the argument for why it is efficient and scalable for learning image representations from scratch. [S1] Note the description of the dataset: 400 million (image, text) pairs collected from the internet. [S1] Then read how natural language is used after pre-training to reference learned visual concepts or describe new ones, and how this interface enables zero-shot transfer to downstream tasks. [S1] Finish with the evaluation on over 30 existing computer vision datasets, which is how the paper studies performance. [S1]
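For a project, the zero-shot idea can be prototyped before touching any real model. The sketch below is a minimal illustration, assuming that images and text prompts are embedded into a shared vector space and compared by cosine similarity. The prompt template, the function names, and the hard-coded toy embeddings are hypothetical stand-ins; a real setup would replace `toy_embed_text` and `toy_image_embedding` with the outputs of trained text and image encoders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, class_names, embed_text,
                       template="a photo of a {}"):
    """Pick the class whose text prompt is most similar to the image embedding.

    No labeled training images are used: the classes are specified purely
    through natural-language prompts. The template is an assumed example.
    """
    prompts = [template.format(name) for name in class_names]
    scores = [cosine(image_embedding, embed_text(p)) for p in prompts]
    best = max(range(len(class_names)), key=lambda i: scores[i])
    return class_names[best], scores


# Hypothetical stand-in for a trained text encoder (lookup table of toy vectors).
TOY_TEXT_EMBEDDINGS = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}


def toy_embed_text(prompt):
    return TOY_TEXT_EMBEDDINGS[prompt]


# Hypothetical stand-in for the image encoder's output on one image.
toy_image_embedding = [0.8, 0.2, 0.1]

label, scores = zero_shot_classify(
    toy_image_embedding, ["dog", "cat", "car"], toy_embed_text
)
print(label)  # prints: dog
```

Changing the list of class names changes the classifier without any retraining, which is the practical meaning of describing new visual concepts in natural language.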

Sources

[S1] arxiv.org
Learning Transferable Visual Models From Natural Language Supervision
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.

FAQ

What training signal does arXiv:2103.00020 use for pre-training?

The paper reports a simple pre-training task of predicting which caption goes with which image. [S1]

What does the paper claim enables zero-shot transfer to downstream tasks?

The paper states that, after pre-training, natural language is used to reference learned visual concepts or describe new ones, enabling zero-shot transfer to downstream tasks. [S1]