Paper brief: ADADELTA (arXiv:1212.5701)

The paper presents ADADELTA as a novel per-dimension learning rate method for gradient descent that dynamically adapts over time using only first order information. The paper also states that ADADELTA has minimal computational overhead beyond vanilla stochastic gradient descent, requires no manual tuning of a learning rate, and appears robust to several kinds of variation and noise. The snippet reports promising results compared to other methods on MNIST using a single machine and on a large scale voice dataset in a distributed cluster environment.

February 25, 2026 • Mira Vale • ml foundations

What this paper is about

The paper presents ADADELTA, which it describes as a novel per-dimension learning rate method for gradient descent.[S1] According to the paper, the method adapts dynamically over time, uses only first order information, adds minimal computational overhead beyond vanilla stochastic gradient descent, and requires no manual tuning of a learning rate.[S1]
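The snippet does not spell out the update rule, so the following is only a minimal sketch of the commonly cited form of the ADADELTA step: one decaying average of squared gradients, one decaying average of squared updates, and a per-dimension step equal to the ratio of their root mean squares times the gradient. The function names `adadelta_step` and `minimize_quadratic`, the parameter names `rho` and `eps`, their default values, and the toy quadratic objective are all assumptions chosen for illustration and are not taken from the snippet.

```python
import math


def adadelta_step(x, grad, avg_sq_grad, avg_sq_delta, rho=0.95, eps=1e-6):
    """One ADADELTA-style update on lists of floats.

    Returns the new (x, avg_sq_grad, avg_sq_delta). rho and eps are
    illustrative defaults (an assumption), not values from the snippet.
    """
    new_x, new_g2, new_d2 = [], [], []
    for xi, gi, g2, d2 in zip(x, grad, avg_sq_grad, avg_sq_delta):
        # Decaying average of squared gradients (first order information only).
        g2 = rho * g2 + (1.0 - rho) * gi * gi
        # Per-dimension step: RMS of past updates over RMS of gradients.
        # No global learning rate appears anywhere in this expression.
        delta = -math.sqrt(d2 + eps) / math.sqrt(g2 + eps) * gi
        # Decaying average of squared updates, consumed by the next step.
        d2 = rho * d2 + (1.0 - rho) * delta * delta
        new_x.append(xi + delta)
        new_g2.append(g2)
        new_d2.append(d2)
    return new_x, new_g2, new_d2


def minimize_quadratic(steps=2000):
    """Toy usage: f(x) = x0^2 + 100 * x1^2, a badly scaled quadratic."""
    coeffs = [1.0, 100.0]
    x = [1.0, 1.0]
    avg_sq_grad = [0.0, 0.0]
    avg_sq_delta = [0.0, 0.0]
    for _ in range(steps):
        grad = [2.0 * c * xi for c, xi in zip(coeffs, x)]
        x, avg_sq_grad, avg_sq_delta = adadelta_step(
            x, grad, avg_sq_grad, avg_sq_delta
        )
    return x
```

Each dimension keeps only two extra running averages, which is one way to read the "minimal computational overhead" claim. In this sketch the very first step has nearly the same size in both dimensions even though their gradients differ by a factor of 100, which illustrates what a per-dimension rate means here.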

The snippet characterizes ADADELTA as appearing robust to four kinds of variation: noisy gradient information, different model architecture choices, various data modalities, and selection of hyperparameters.[S1]

For empirical results, the snippet reports promising results compared to other methods in two settings: the MNIST digit classification task using a single machine, and a large scale voice dataset in a distributed cluster environment.[S1]

Core claims to remember

Method: ADADELTA is presented as a novel per-dimension learning rate method for gradient descent that dynamically adapts over time, uses only first order information, has minimal computational overhead beyond vanilla stochastic gradient descent, and requires no manual tuning of a learning rate.[S1] Robustness: ADADELTA is described as appearing robust to noisy gradient information, different model architecture choices, various data modalities, and selection of hyperparameters.[S1] Results: the snippet reports promising results compared to other methods on the MNIST digit classification task using a single machine and on a large scale voice dataset in a distributed cluster environment.[S1]

Limitations and caveats

The snippet says ADADELTA “appears robust,” so robustness is stated as an observation rather than a guarantee.[S1] The snippet summarizes the outcomes as “promising results,” which is a qualitative characterization; the snippet itself attaches no figures to it.[S1] The reported results cover two settings: MNIST digit classification using a single machine and a large scale voice dataset in a distributed cluster environment.[S1]

How to apply this in study or projects

Copy the paper’s one-sentence description of ADADELTA as “a novel per-dimension learning rate method for gradient descent,” and annotate each phrase directly in your notes.[S1] Separate the snippet’s method-description clauses into a short list that includes “dynamically adapts over time,” “using only first order information,” and “minimal computational overhead beyond vanilla stochastic gradient descent.”[S1] Extract the snippet’s statement “requires no manual tuning of a learning rate” as a standalone claim and place it next to the method-description list.[S1]

Create a small claim table with four rows for the snippet’s robustness statements and fill each row with the exact wording: “noisy gradient information,” “different model architecture choices,” “various data modalities,” and “selection of hyperparameters.”[S1] Add a note beside each row that the snippet uses the wording “appears robust” for each robustness statement.[S1]
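If you keep notes in code, the claim table could be sketched as below. This is only one possible layout: the helper names `build_claim_table` and `format_table` and the constant `ROBUSTNESS_CLAIMS` are hypothetical, while the row wording and the “appears robust” note are quoted from the snippet.

```python
# Hypothetical note-taking helper; the row wording is quoted from the snippet.
ROBUSTNESS_CLAIMS = [
    "noisy gradient information",
    "different model architecture choices",
    "various data modalities",
    "selection of hyperparameters",
]


def build_claim_table(claims, hedge="appears robust"):
    """Return one (claim, hedge) row per robustness statement."""
    return [(claim, hedge) for claim in claims]


def format_table(rows):
    """Render the rows as aligned plain text for pasting into notes."""
    width = max(len(claim) for claim, _ in rows)
    return "\n".join(f"{claim.ljust(width)} | {hedge}" for claim, hedge in rows)


table = build_claim_table(ROBUSTNESS_CLAIMS)
print(format_table(table))
```

Keeping the hedge in its own column makes it harder to drop the “appears” qualifier by accident when the claims are quoted elsewhere.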

Make a two-column summary of the empirical settings, with one column labeled “MNIST digit classification task using a single machine” and the other labeled “large scale voice dataset in a distributed cluster environment.”[S1] Under each column, copy the snippet’s evaluation phrasing that the paper shows “promising results compared to other methods.”[S1]

Sources

[S1] arxiv.org
ADADELTA: An Adaptive Learning Rate Method (arXiv:1212.5701)
We present a novel per-dimension learning rate method for gradient descent called ADADELTA. The method dynamically adapts over time using only first order information and has minimal computational overhead beyond vanilla stochastic gradient descent. The method requires no manual tuning of a learning rate and appears robust to noisy gradient information, different model architecture choices, various data modalities and selection of hyperparameters. We show promising results compared to other methods on the MNIST digit classification task using a single machine and on a large scale voice dataset in a distributed cluster environment.

FAQ

What does the paper claim ADADELTA is?

The paper presents ADADELTA as a novel per-dimension learning rate method for gradient descent that dynamically adapts over time using only first order information.[S1]

What evaluation settings are mentioned in the snippet?

The snippet reports results on the MNIST digit classification task using a single machine and on a large scale voice dataset in a distributed cluster environment.[S1]