What this paper is about
The paper states that dominant sequence transduction models use complex recurrent or convolutional neural networks in an encoder-decoder configuration, and that the best performing models also connect the encoder and decoder through an attention mechanism.[S1] It proposes a new network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely.[S1] Experiments on two machine translation tasks are reported to show that the proposed models are superior in quality while being more parallelizable and requiring significantly less time to train.[S1] The paper reports 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over existing best results, including ensembles, by over 2 BLEU.[S1] On the WMT 2014 English-to-French task, it reports a single-model BLEU score of 41.8, obtained after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.[S1] The paper also reports that the Transformer generalizes to other tasks, applying it successfully to English constituency parsing with both large and limited training data.[S1]
Core claims to remember
At the time of writing, sequence transduction models were dominantly recurrent or convolutional encoder-decoder systems, with top performers connecting encoder and decoder through attention mechanisms.[S1] The Transformer is proposed as a simple architecture built solely from attention mechanisms, eliminating both recurrence and convolutions from the model design.[S1] On two machine translation tasks, Transformer models are reported to be superior in quality, more parallelizable than the dominant recurrent or convolutional approaches the paper discusses, and significantly faster to train.[S1] The headline results are 28.4 BLEU on WMT 2014 English-to-German, an improvement of over 2 BLEU over existing best results including ensembles, and a single-model 41.8 BLEU on WMT 2014 English-to-French after 3.5 days of training on eight GPUs, a small fraction of the training costs of the best models from the literature.[S1] The Transformer is also reported to generalize beyond machine translation, with successful application to English constituency parsing under both large and limited training data.[S1]
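The snippet does not describe the attention mechanism itself beyond naming it. As background, the standard building block associated with the Transformer is scaled dot-product attention; the following is a minimal illustrative sketch in plain Python, not the paper's full formulation (the multi-head variant with learned projections and parallel heads is omitted, and the function name is mine):

```python
import math

def scaled_dot_product_attention(query, keys, values):
    """One query vector attending over lists of key/value vectors.

    Illustrative sketch only: real implementations operate on batched
    matrices and add learned projections and multiple heads.
    """
    d_k = len(query)
    # Dot-product similarity of the query with each key, scaled by sqrt(d_k).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    # Softmax over the scores (max-subtracted for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Output is the attention-weighted average of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]
```

Because the weights are a softmax, they sum to one, so the output always lies in the convex hull of the value vectors; the query simply decides how much each value contributes.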
Limitations and caveats
The reported experimental evidence in the snippet is limited to two machine translation tasks and English constituency parsing.[S1] Translation quality is reported only via BLEU, with headline numbers of 28.4 for WMT 2014 English-to-German and 41.8 for WMT 2014 English-to-French; the single-model state-of-the-art framing for English-to-French is tied to that task and evaluation context.[S1] The English-to-German result is explicitly compared against existing best results, including an improvement of over 2 BLEU over ensembles, and the English-to-French result comes with a specific training setup of 3.5 days on eight GPUs; however, while the paper states this cost is a small fraction of that of the best models from the literature, the snippet does not enumerate those costs.[S1] The snippet reports successful generalization to English constituency parsing without giving parsing metric values, and it emphasizes the removal of recurrence and convolutions without describing architectural components beyond attention mechanisms.[S1]
How to apply this in study or projects
Focus your reading on the paper's central design move: building a sequence transduction architecture solely from attention mechanisms while removing recurrence and convolutions.[S1] Compare this design against the baseline landscape the paper describes, namely recurrent or convolutional encoder-decoder models that also use attention to connect encoder and decoder.[S1] Use WMT 2014 English-to-German and English-to-French as reference tasks when you want to mirror the paper's translation evaluation context, and anchor comparisons to the exact reported outcomes of 28.4 and 41.8 BLEU respectively.[S1] Separate quality claims from efficiency claims in your notes, because the paper reports superior quality alongside improved parallelizability and significantly less training time.[S1] Treat the English-to-French training description, 3.5 days on eight GPUs for the 41.8 BLEU single-model result, as a concrete reproducibility target.[S1] If you explore transfer beyond translation, include English constituency parsing as a follow-up task, since the paper reports successful application there with both large and limited training data.[S1] Keep project conclusions scoped to the tasks and comparisons actually reported: the snippet names only two translation benchmarks and one parsing setting.[S1]
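One way to keep quality and efficiency claims separated, and comparisons anchored to the numbers the paper actually reports, is to record them in a small structured table in your project notes. This is an illustrative sketch: the names `ReportedResult` and `compare` and the abbreviated task labels are mine, while the numeric values come from the snippet.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ReportedResult:
    task: str
    bleu: float                         # quality claim
    train_days: Optional[float] = None  # efficiency claim, where reported
    gpus: Optional[int] = None

# Headline numbers as reported in the snippet ([S1]).
REPORTED = [
    ReportedResult("WMT 2014 En-De", bleu=28.4),
    ReportedResult("WMT 2014 En-Fr", bleu=41.8, train_days=3.5, gpus=8),
]

def compare(task: str, my_bleu: float) -> Optional[float]:
    """Gap between your BLEU score and the paper's reported anchor, if any."""
    for result in REPORTED:
        if result.task == task:
            return round(my_bleu - result.bleu, 2)
    return None
```

Keeping efficiency fields (`train_days`, `gpus`) distinct from the quality field (`bleu`) enforces the note-taking discipline above: a reproduction can match one claim while missing the other.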