Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
Extracted Hypotheses
The Transformer architecture, relying entirely on self-attention mechanisms without recurrence or convolution, can achieve superior performance on machine translation tasks compared to existing sequence-to-sequence models.
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions, leading to improved translation quality over single-head attention.
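The idea of attending in several representation subspaces at once can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `multi_head_attention`, the weight names `w_q`, `w_k`, `w_v`, `w_o`, the toy sizes, and the random weights are all assumptions made for the example, and masking, dropout, biases, and batching are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention over x (seq_len, d_model) with num_heads parallel heads."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    def split(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)

    # Each head computes its own attention weights in its own subspace.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (heads, seq, d_head)

    # Concatenate the heads and mix them with the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

# Toy usage with hypothetical sizes and random weights.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
```

Setting `num_heads=1` in the same function gives the single-head baseline that the hypothesis compares against. The sketch shows only the mechanics and says nothing about translation quality.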
Sinusoidal positional encodings let the model attend by relative position, because the encoding at a fixed offset is a linear function of the encoding at the current position, and may allow it to extrapolate to sequence lengths longer than those seen during training.
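The sinusoidal encoding, and the reason it relates to relative position, can be sketched in a few lines of NumPy. The function names `sinusoidal_positional_encoding` and `shift_encoding` are hypothetical helpers introduced for this example, and an even `d_model` is assumed. The second helper uses only the angle-addition identities to map the encoding at one position to the encoding at a fixed offset from it.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) table; assumes d_model is even."""
    positions = np.arange(max_len)[:, None]        # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                   # PE(pos, 2i + 1)
    return pe

def shift_encoding(pe_row, offset):
    """Map PE(pos) to PE(pos + offset) with a position-independent rotation."""
    d_model = pe_row.shape[0]
    freqs = 1.0 / np.power(10000.0, np.arange(0, d_model, 2) / d_model)
    sin_b, cos_b = np.sin(offset * freqs), np.cos(offset * freqs)
    sin_a, cos_a = pe_row[0::2], pe_row[1::2]
    shifted = np.empty_like(pe_row)
    shifted[0::2] = sin_a * cos_b + cos_a * sin_b  # sin(a + b)
    shifted[1::2] = cos_a * cos_b - sin_a * sin_b  # cos(a + b)
    return shifted

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
same = np.allclose(shift_encoding(pe[3], 4), pe[7])  # offset of 4 from position 3
```

Because the table is computed from a closed-form expression, an encoding exists for any position, including ones beyond the training length. Whether the model actually generalizes to those lengths is the empirical question the hypothesis raises.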
Scaled dot-product attention is faster and more space-efficient in practice than additive attention, because it can be computed with optimized matrix multiplication, while maintaining comparable or superior performance in sequence modeling tasks.
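The two scoring functions can be placed side by side to show where the cost difference comes from. This is a sketch under stated assumptions: the additive variant follows the common form of a one-hidden-layer feed-forward scorer, and its parameter names `w_q`, `w_k`, `v_a` and the hidden size are hypothetical choices for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, computed with two matrix products."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)          # (n_q, n_k)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def additive_attention(q, k, v, w_q, w_k, v_a):
    """Scores from a small feed-forward network over every query-key pair."""
    # Materializes an (n_q, n_k, d_hidden) tensor, unlike the dot-product form.
    hidden = np.tanh((q @ w_q)[:, None, :] + (k @ w_k)[None, :, :])
    scores = hidden @ v_a                    # (n_q, n_k)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Toy usage with hypothetical sizes.
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8))
k = rng.normal(size=(4, 8))
v = rng.normal(size=(4, 6))
w_q, w_k = rng.normal(size=(8, 5)), rng.normal(size=(8, 5))
v_a = rng.normal(size=5)

dot_out, dot_w = scaled_dot_product_attention(q, k, v)
add_out, add_w = additive_attention(q, k, v, w_q, w_k, v_a)
```

The `1 / sqrt(d_k)` factor keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with very small gradients.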
Layer normalization applied before each sub-layer (pre-norm) rather than after (post-norm) leads to more stable training and better convergence in deep transformer networks.
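The two placements differ only in where normalization sits relative to the residual connection. The sketch below is illustrative: `layer_norm` omits the learned gain and bias, and `toy_sublayer` is a hypothetical stand-in for an attention or feed-forward sub-layer.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean and unit variance (no learned gain or bias).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    # Post-norm: LayerNorm(x + Sublayer(x)); the residual sum is normalized.
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Pre-norm: x + Sublayer(LayerNorm(x)); the residual path stays untouched.
    return x + sublayer(layer_norm(x))

# Toy usage: a hypothetical sub-layer standing in for attention or a feed-forward net.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w = rng.normal(size=(8, 8)) * 0.1

def toy_sublayer(h):
    return np.tanh(h @ w)

post_out = post_norm_block(x, toy_sublayer)
pre_out = pre_norm_block(x, toy_sublayer)
```

In the pre-norm form the input passes through the block unchanged along the residual path, which is the property usually cited when arguing that deep stacks train more stably. The sketch shows the structural difference only and does not test the convergence claim.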