What this paper is about
Training deep neural networks is complicated because the distribution of each layer’s inputs changes during training as the parameters of previous layers change.[S1] The paper states that this changing input distribution slows down training by requiring lower learning rates and careful parameter initialization.[S1] The paper also reports that this phenomenon makes it notoriously hard to train models with saturating nonlinearities.[S1]
The paper names this phenomenon “internal covariate shift.”[S1] The paper addresses internal covariate shift by normalizing layer inputs.[S1] The method builds normalization into the model architecture and performs the normalization for each training mini-batch.[S1] The paper calls this approach Batch Normalization.[S1]
Core claims to remember
The paper states that the distribution of each layer’s inputs changes during training because the parameters of previous layers change.[S1] The paper reports that this distribution change slows training by requiring lower learning rates and careful parameter initialization.[S1] The paper describes internal covariate shift as the name for this training phenomenon.[S1]
The paper addresses internal covariate shift by normalizing layer inputs.[S1] The paper states that the method draws strength from making normalization a part of the model architecture.[S1] The paper states that the method performs the normalization for each training mini-batch.[S1]
The paper reports that Batch Normalization allows the use of much higher learning rates.[S1] The paper reports that Batch Normalization allows being less careful about initialization.[S1] The paper reports that internal covariate shift makes it notoriously hard to train models with saturating nonlinearities, and it presents normalizing layer inputs as the way to address that difficulty.[S1]
The paper reports that Batch Normalization also acts as a regularizer.[S1] The paper reports that Batch Normalization, in some cases, eliminates the need for Dropout.[S1]
Limitations and caveats
The paper reports that training deep neural networks is complicated because the distribution of each layer’s inputs changes during training as the parameters of previous layers change.[S1] The paper reports that this complication slows training by requiring lower learning rates and careful parameter initialization.[S1] The paper reports that it also makes models with saturating nonlinearities notoriously hard to train.[S1]
The paper presents Batch Normalization as the method that addresses the internal covariate shift phenomenon by normalizing layer inputs as part of the architecture and doing so per training mini-batch.[S1]
How to apply this in study or projects
Applying the paper’s method means normalizing layer inputs.[S1] The normalization is treated as part of the model architecture rather than as an external preprocessing step.[S1] The normalization is performed for each training mini-batch.[S1]
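The per-mini-batch normalization described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the summary above does not give the formula, so the sketch assumes that "normalizing" means shifting each feature to zero mean and scaling it to unit variance using statistics computed over the mini-batch. The function name `batch_normalize`, the constant `eps`, and the sample data are hypothetical choices made for this example.

```python
import math


def batch_normalize(batch, eps=1e-5):
    """Normalize each feature of a mini-batch to roughly zero mean, unit variance.

    batch: list of examples, each a list of floats of equal length.
    eps:   small constant to avoid division by zero (an assumption of this sketch).
    """
    n = len(batch)
    num_features = len(batch[0])
    normalized = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        # Variance over the mini-batch itself (divide by n, not n - 1).
        var = sum((v - mean) ** 2 for v in column) / n
        scale = 1.0 / math.sqrt(var + eps)
        for i in range(n):
            normalized[i][j] = (column[i] - mean) * scale
    return normalized


# Hypothetical mini-batch: three examples, two features on very different scales.
mini_batch = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
normalized_batch = batch_normalize(mini_batch)
```

Because the statistics come from the mini-batch being processed, the function can be applied to a layer's inputs inside the forward pass rather than once to the dataset beforehand, which is the sense in which the normalization belongs to the architecture.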
Applying the paper’s training claims means trying much higher learning rates when Batch Normalization is present.[S1] It also means that parameter initialization needs less care when Batch Normalization is used.[S1]
Applying the paper’s regularization claim means treating Batch Normalization as a regularizer.[S1] It also means considering the stated possibility that Batch Normalization, in some cases, eliminates the need for Dropout.[S1]
Three study activities follow from the paper. The first is writing down the paper’s definition of “internal covariate shift” as the phenomenon where the distribution of each layer’s inputs changes during training as the parameters of previous layers change.[S1] The second is tracing how the paper connects internal covariate shift to lower learning rates, careful initialization, and difficulty with saturating nonlinearities.[S1] The third is summarizing the paper’s stated mechanism for addressing internal covariate shift, namely normalizing layer inputs within the architecture on each mini-batch.[S1]
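As a companion to the definition, the following toy sketch shows how the distribution of a layer's inputs moves when the parameters of the previous layer change. It is an illustration under stated assumptions, not an experiment from the paper: the one-unit linear layer `layer_output`, the parameter values, and the data are all hypothetical.

```python
import statistics


def layer_output(inputs, weight, bias):
    """A hypothetical one-unit linear layer: y = weight * x + bias."""
    return [weight * x + bias for x in inputs]


data = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Outputs of the earlier layer before and after a hypothetical parameter update.
# These outputs are the inputs that the next layer sees.
before = layer_output(data, weight=1.0, bias=0.0)
after = layer_output(data, weight=3.0, bias=5.0)

mean_before = statistics.fmean(before)
std_before = statistics.pstdev(before)
mean_after = statistics.fmean(after)
std_after = statistics.pstdev(after)

print(f"before update: mean={mean_before:.3f}, std={std_before:.3f}")
print(f"after update:  mean={mean_after:.3f}, std={std_after:.3f}")
```

The data never changes, yet the mean and spread of what the next layer receives both move after the update, so that layer has to keep adapting to a shifting input distribution.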