Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin

8 authors

2017

Neural Machine Translation

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

Extracted Hypotheses

5 hypotheses identified

92% confidence

ready

The Transformer architecture, relying entirely on self-attention mechanisms without recurrence or convolution, can achieve superior performance on machine translation tasks compared to existing sequence-to-sequence models.

Dataset: WMT 2014 English-German

BLEU ScoreTraining Time+1 more

87% confidence

ready

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions, leading to improved translation quality over single-head attention.

Dataset: WMT 2014 English-French

Translation QualityAttention Visualization+1 more

78% confidence

ready

Positional encoding using sinusoidal functions enables the model to learn relative positions effectively, allowing it to generalize to sequence lengths longer than those seen during training.

Dataset: Custom Position Dataset

Position AccuracySequence Length Generalization+1 more

85% confidence

processing

The scaled dot-product attention mechanism is computationally more efficient than additive attention while maintaining comparable or superior performance in sequence modeling tasks.

Dataset: Synthetic Attention Dataset

Computational ComplexityMemory Usage+1 more

73% confidence

ready

Layer normalization applied before each sub-layer (pre-norm) rather than after (post-norm) leads to more stable training and better convergence in deep transformer networks.

Dataset: Deep Network Training Dataset

Training StabilityConvergence Rate+1 more

AI Research Reproducibility Copilot

Workflow Progress

Attention Is All You Need

Extracted Hypotheses