What this paper is about
The paper studies the relationship between L2 regularization and weight decay regularization in gradient-based training.[S1] The paper states that L2 regularization and weight decay regularization are equivalent for standard stochastic gradient descent when weight decay is rescaled by the learning rate.[S1] The paper reports that this equivalence does not hold for adaptive gradient algorithms such as Adam.[S1]
The paper notes that common implementations of adaptive algorithms employ L2 regularization while often calling it “weight decay,” and the paper describes this naming as potentially misleading given the inequivalence it reports.[S1] The paper proposes a simple modification intended to “recover the original formulation of weight decay regularization” by decoupling weight decay from the optimization steps taken with respect to the loss function.[S1]
The paper reports empirical evidence about two practical outcomes of the proposed modification.[S1] The paper reports that the modification decouples the optimal choice of the weight decay factor from the learning rate setting for standard SGD and for Adam.[S1] The paper also reports that the modification substantially improves Adam’s generalization performance.[S1]
Core claims to remember
The paper states that L2 regularization and weight decay regularization are equivalent for standard SGD when weight decay is rescaled by the learning rate.[S1] The paper demonstrates that L2 regularization and weight decay regularization are not equivalent for adaptive gradient algorithms such as Adam.[S1]
The paper states that many common adaptive-optimizer implementations use L2 regularization while calling it “weight decay,” and the paper describes this as potentially misleading in light of the inequivalence it exposes.[S1] The paper proposes decoupling weight decay from the optimization steps taken with respect to the loss function as a way to recover the original formulation of weight decay regularization.[S1]
The paper reports empirical evidence that the proposed decoupled weight decay decouples the optimal choice of the weight decay factor from the learning rate setting for both standard SGD and Adam.[S1] The paper reports empirical evidence that the modification substantially improves Adam’s generalization performance.[S1] The paper reports that, with the modification, Adam can compete with SGD with momentum on image classification datasets, where Adam was previously typically outperformed by SGD with momentum.[S1]
The paper states that the proposed decoupled weight decay has already been adopted by many researchers.[S1] The paper states that the community has implemented the approach in TensorFlow and PyTorch.[S1]
Limitations and caveats
The snippet describes results “on image classification datasets” but does not name the specific datasets.[S1] The snippet reports improved generalization for Adam and competitiveness with SGD with momentum, but it does not provide the numeric metrics or experimental settings that support those claims.[S1]
The snippet states that equivalence holds for standard SGD when weight decay is rescaled by the learning rate, so the rescaling condition is part of the statement of equivalence in the paper.[S1] The snippet frames the key problem as an inequivalence for adaptive gradient algorithms such as Adam, so conclusions in the paper are explicitly tied to adaptive methods rather than only to SGD.[S1]
The snippet characterizes the proposed change as “a simple modification,” but the snippet does not spell out the full algorithmic procedure beyond stating that weight decay is decoupled from the optimization steps taken with respect to the loss function.[S1] The snippet states that common implementations may call L2 regularization “weight decay,” but the snippet does not list which specific implementations or versions are being referenced.[S1]
The snippet states that the community has implemented the method in TensorFlow and PyTorch, but the snippet does not describe the exact interfaces or default behaviors in those frameworks.[S1] The snippet states that the method has been adopted by many researchers, but the snippet does not quantify adoption or list specific follow-up papers.[S1]
How to apply this in study or projects
When you study regularization in optimization, you can treat this paper as making a conditional equivalence claim: L2 regularization and weight decay regularization are equivalent for standard SGD when the weight decay factor is rescaled by the learning rate.[S1] When you study or use adaptive optimizers, you can treat this paper as explicitly claiming that this equivalence does not hold for adaptive gradient algorithms such as Adam.[S1]
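The conditional equivalence for standard SGD can be checked numerically on a toy problem. The following is a minimal sketch, not the paper's code: the quadratic loss, the constants, and the function names are all illustrative assumptions, and the L2 penalty is assumed to contribute `l2_coeff * w` to the gradient.

```python
# Toy check (illustrative assumptions throughout): for plain SGD, L2
# regularization with coefficient lam / lr yields the same iterates as
# weight decay with factor lam.

def loss_grad(w):
    # Gradient of the assumed toy loss f(w) = 0.5 * (w - 3.0) ** 2.
    return w - 3.0

def sgd_l2_step(w, lr, l2_coeff):
    # L2 regularization: add l2_coeff * w to the loss gradient, then step.
    return w - lr * (loss_grad(w) + l2_coeff * w)

def sgd_weight_decay_step(w, lr, lam):
    # Weight decay: shrink the weight by (1 - lam), then apply the loss step.
    return (1.0 - lam) * w - lr * loss_grad(w)

lr, lam = 0.1, 0.01
w_l2 = w_wd = 5.0
max_gap = 0.0
for _ in range(50):
    # Rescale the decay factor by the learning rate for the L2 variant.
    w_l2 = sgd_l2_step(w_l2, lr, l2_coeff=lam / lr)
    w_wd = sgd_weight_decay_step(w_wd, lr, lam)
    max_gap = max(max_gap, abs(w_l2 - w_wd))

print("largest gap between the two trajectories:", max_gap)
```

The two trajectories agree up to floating-point rounding because `w - lr * (g + (lam / lr) * w)` expands to `(1 - lam) * w - lr * g`. Without the `lam / lr` rescaling, the two updates would generally differ.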
When you read code or documentation, check whether a parameter named “weight decay” actually applies L2 regularization, because the paper states that common implementations of adaptive algorithms do this and that the naming can be misleading given the inequivalence it reports.[S1] When you compare training recipes, you can use the paper’s proposed modification as the reference point for “weight decay regularization,” since the paper describes it as recovering the original formulation by decoupling weight decay from the optimization steps taken with respect to the loss function.[S1]
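To make the distinction concrete, here is a hedged sketch of a scalar Adam-style update with the regularizer placed in two different spots. The snippet does not spell out the full procedure, so the exact placement and scaling of the decay term, the toy loss, the `lam / lr` rescaling of the L2 coefficient, and the function `adam_run` are assumptions made for illustration only.

```python
import math

def loss_grad(w):
    # Gradient of the assumed toy loss f(w) = 0.5 * (w - 3.0) ** 2.
    return w - 3.0

def adam_run(w, steps, lr, lam, decoupled, beta1=0.9, beta2=0.999, eps=1e-8):
    # Hypothetical helper: scalar Adam with either L2 or decoupled decay.
    m = v = 0.0
    for t in range(1, steps + 1):
        g = loss_grad(w)
        if not decoupled:
            # L2 variant: the penalty gradient is folded into g, so it is
            # rescaled by Adam's adaptive denominator along with the loss.
            g = g + (lam / lr) * w
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
        if decoupled:
            # Decoupled variant: shrink the weight directly, outside the
            # adaptive step taken with respect to the loss.
            w_new = w_new - lam * w
        w = w_new
    return w

w_l2 = adam_run(5.0, steps=100, lr=0.1, lam=0.01, decoupled=False)
w_decoupled = adam_run(5.0, steps=100, lr=0.1, lam=0.01, decoupled=True)
print("Adam with L2:", w_l2)
print("Adam with decoupled decay:", w_decoupled)
```

In this sketch the two variants already separate after one step from `w = 5.0`: the L2 variant moves to about 4.9, while the decoupled variant moves to about 4.85, even though the same rescaling makes the two forms coincide for plain SGD.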
When you plan hyperparameter searches, you can use the paper’s empirical claim that decoupled weight decay decouples the optimal weight decay factor from the learning rate setting for both standard SGD and Adam.[S1] When you evaluate optimizer choices, you can treat the paper as reporting that the proposed modification substantially improves Adam’s generalization performance.[S1]
When you benchmark on image classification, you can treat the paper as reporting that Adam with the modification can compete with SGD with momentum on datasets where Adam was previously typically outperformed.[S1] When you look for implementation references, you can note that the paper states the community has implemented the approach in TensorFlow and PyTorch.[S1]