Transformer Models
What is a Transformer?
The Transformer is a neural network architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google. It fundamentally changed AI by replacing recurrent and convolutional layers with a mechanism called self-attention, enabling models to process entire sequences in parallel rather than step-by-step.
Before Transformers, recurrent neural networks (RNNs) processed tokens one at a time, which made training slow and made long-range dependencies hard to capture. Transformers process all tokens of a sequence simultaneously during training, making them dramatically faster to train on HPC hardware (GPUs/TPUs) and better at using context across long documents.
The Core Idea: Self-Attention
Self-attention allows each token in a sequence to "attend" to every other token, computing a weighted representation based on relevance. For each token, three vectors are computed:
- Query (Q) — what the token is looking for
- Key (K) — what information each token offers
- Value (V) — the actual content to be aggregated
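These vectors are stacked into matrices Q, K, and V and combined as Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V, where dₖ is the dimension of the key vectors. The scaling factor √dₖ prevents the dot products from growing too large in high-dimensional spaces, which would push the softmax into regions with extremely small gradients.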
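As a minimal sketch of the computation described above, the following dependency-free Python runs scaled dot-product attention on a toy sequence: each query is scored against every key, the scores are divided by √dₖ and passed through a softmax, and the resulting weights average the value vectors. The function names and the toy Q, K, V values are illustrative assumptions, not code from the original paper.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q, K: one d_k-dim vector per token; V: one d_v-dim vector per token."""
    d_k = len(K[0])
    d_v = len(V[0])
    scale = math.sqrt(d_k)
    output, weights = [], []
    for q in Q:
        # Dot product of this query with every key, divided by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        output.append([sum(w[j] * V[j][c] for j in range(len(V))) for c in range(d_v)])
    return output, weights

# Toy input: 3 tokens, d_k = d_v = 2 (values chosen arbitrarily for illustration).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn = scaled_dot_product_attention(Q, K, V)
```

Each row of `attn` sums to 1, so every output vector is a convex combination of the value vectors.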
Multi-Head Attention
Rather than computing a single attention function, Transformers use multi-head attention — running attention in parallel across multiple "heads", each learning different types of relationships (syntactic, semantic, positional). The outputs are concatenated and linearly projected.
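The split, attend, concatenate, and project steps can be sketched roughly as follows. This is a toy, standard-library-only illustration: the helper names, the random weights, and the sizes (4 tokens, model width 8, 2 heads) are assumptions made for the example.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on matrices with one row per token."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) projections, one tuple per head."""
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the per-token outputs of all heads along the feature axis.
    concat = [sum((h[t] for h in head_outputs), []) for t in range(len(X))]
    # Final linear projection back to the model width.
    return matmul(concat, W_o)

rng = random.Random(0)
seq_len, d_model, num_heads = 4, 8, 2
d_head = d_model // num_heads
X = random_matrix(seq_len, d_model, rng)
heads = [
    tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
    for _ in range(num_heads)
]
W_o = random_matrix(num_heads * d_head, d_model, rng)
Y = multi_head_attention(X, heads, W_o)
```

Production implementations typically compute all heads in one batched matrix multiplication rather than looping over heads; the loop here is only for readability.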
Architecture Overview
The original Transformer has two main components, an encoder and a decoder, which later models also use on their own:
| Component | Function | Used in |
|---|---|---|
| Encoder | Processes input sequence into contextual representations | BERT, classification, embeddings |
| Decoder | Generates the output sequence token by token; in the original architecture it also attends to the encoder output | GPT (decoder-only, no encoder), text generation |
| Encoder-Decoder | Full architecture for sequence-to-sequence tasks | T5, BART, translation models |
Each encoder/decoder layer consists of:
- Multi-head self-attention sublayer
- Feed-forward network (two linear layers with ReLU)
- Layer normalization and residual connections
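To make the sublayer structure concrete, here is a small sketch of the feed-forward sublayer wrapped in a residual connection and layer normalization, applied to a single token vector. It assumes the post-norm ordering LayerNorm(x + Sublayer(x)) used in the original paper, omits the learned gain and bias of layer normalization for brevity, and uses made-up toy weights.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(x, W, b):
    """x: vector of length d_in; W: d_in x d_out matrix; b: vector of length d_out."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]

def feed_forward(x, W1, b1, W2, b2):
    """Two linear layers with a ReLU in between."""
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]
    return linear(hidden, W2, b2)

def ffn_sublayer(x, W1, b1, W2, b2):
    # Residual connection, then layer normalization (post-norm ordering).
    y = feed_forward(x, W1, b1, W2, b2)
    return layer_norm([xi + yi for xi, yi in zip(x, y)])

# Toy weights (arbitrary deterministic values, for illustration only).
d_model, d_ff = 4, 8
W1 = [[0.1 * ((i + j) % 3 - 1) for j in range(d_ff)] for i in range(d_model)]
b1 = [0.0] * d_ff
W2 = [[0.1 * ((i * j) % 3 - 1) for j in range(d_model)] for i in range(d_ff)]
b2 = [0.0] * d_model

x = [1.0, -2.0, 0.5, 3.0]
out = ffn_sublayer(x, W1, b1, W2, b2)
```

The attention sublayer is wrapped the same way; many later models instead normalize before each sublayer (pre-norm), which tends to train more stably at depth.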
Positional Encoding
Since Transformers have no inherent notion of sequence order (unlike RNNs), positional encodings are added to token embeddings. The original paper used fixed sinusoidal encodings; modern models often use learned positional embeddings or more advanced schemes like RoPE (Rotary Position Embedding).
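The fixed sinusoidal scheme can be sketched in a few lines. Even dimensions use a sine and odd dimensions a cosine, with wavelengths growing geometrically up to the base 10000 from the original paper; the function name and the example sizes are assumptions for illustration.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of fixed positional encodings."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # i is the even dimension index, so the exponent is i / d_model.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)

# The encoding is added element-wise to the token embeddings.
token_embeddings = [[0.0] * 8 for _ in range(6)]  # placeholder embeddings
inputs = [[e + p for e, p in zip(emb, enc)] for emb, enc in zip(token_embeddings, pe)]
```

Position 0 encodes to alternating 0 and 1 values, since sin(0) = 0 and cos(0) = 1.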
BERT vs GPT: Two Paradigms
The Transformer architecture split into two dominant pre-training paradigms:
| Property | BERT (2018, Google) | GPT (2018, OpenAI) |
|---|---|---|
| Architecture | Encoder-only | Decoder-only |
| Training Objective | Masked Language Modeling (MLM) | Causal Language Modeling (CLM) |
| Directionality | Bidirectional (sees full context) | Unidirectional (left-to-right) |
| Best for | Classification, NER, QA, embeddings | Text generation, dialogue, reasoning |
| Scaling trend | Modest (BERT-Large: 340M params) | Extreme (GPT-3: 175B params; GPT-4 size undisclosed) |
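In practice the directionality row comes down to which positions each token is allowed to attend to. The sketch below builds the two visibility masks, where a 1 at row i, column j means token i may attend to token j; the function names are illustrative.

```python
def bidirectional_mask(n):
    """Encoder-style (BERT): every token sees every token."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style (GPT): token i sees only positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

full = bidirectional_mask(4)
causal = causal_mask(4)
# causal is:
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Masked positions are usually set to a large negative score before the softmax, so they receive an attention weight of effectively zero.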
Major Transformer Models (2017–2026)
Why Transformers Dominate HPC Training
From an HPC perspective, Transformers have a critical advantage: massive parallelism. Whereas RNNs must process tokens sequentially, Transformers express attention over all token pairs as large matrix multiplications, which parallelize well across GPU tensor cores during training.
Training GPT-3 required approximately 3.14 × 10²³ FLOPs, which a widely cited estimate puts at roughly 355 GPU-years on a single V100. In practice, training is spread across thousands of accelerators: Meta reported training LLaMA-65B on 2,048 A100 GPUs in about 21 days. This is why HPC infrastructure is inseparable from modern AI development.
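The arithmetic behind these figures can be reproduced with a back-of-the-envelope calculation. The sketch below uses the common rule of thumb that training costs about 6 FLOPs per parameter per token, with GPT-3's published 175B parameters and 300B training tokens. The sustained throughput of 28 TFLOP/s per device is an assumption chosen because it reproduces the 355 GPU-year figure; real utilization varies widely.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
SECONDS_PER_DAY = 24 * 3600

def training_flops(n_params, n_tokens):
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

def gpu_years(total_flops, sustained_flops_per_sec):
    """Time on a single device at the given sustained throughput."""
    return total_flops / sustained_flops_per_sec / SECONDS_PER_YEAR

def wall_clock_days(total_flops, sustained_flops_per_sec, n_gpus):
    """Idealized wall-clock time, assuming perfect scaling across n_gpus."""
    return total_flops / (sustained_flops_per_sec * n_gpus) / SECONDS_PER_DAY

estimate = training_flops(175e9, 300e9)   # about 3.15e23, close to the quoted 3.14e23
years = gpu_years(3.14e23, 28e12)         # about 355.6 years on one device
days_on_1024 = wall_clock_days(3.14e23, 28e12, n_gpus=1024)
```

The perfect-scaling assumption in `wall_clock_days` is optimistic: communication overhead between devices means real clusters fall short of linear speedup.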
Key training techniques used at scale:
- Tensor Parallelism: splitting individual weight matrices across GPUs
- Pipeline Parallelism: distributing consecutive layers across GPU groups
- Data Parallelism: replicating the model and processing different batches on each replica, then synchronizing gradients
- Flash Attention: memory-efficient attention that avoids materializing the full attention matrix, reducing reads and writes to HBM
- Mixed Precision (BF16/FP16): roughly halving the memory used by 16-bit tensors with minimal accuracy loss
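As a toy illustration of the first technique in the list, the sketch below splits a weight matrix by columns into two shards, multiplies the same input by each shard (as two GPUs would do independently), and concatenates the partial results. Everything runs on one CPU here, and the helper names and matrix values are assumptions for the example.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def split_columns(W, parts):
    """Split W into `parts` column blocks of equal width."""
    step = len(W[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in W] for p in range(parts)]

X = [[1.0, 2.0], [3.0, 4.0]]                       # activations: 2 tokens x 2 features
W = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]   # weights: 2 x 4

shards = split_columns(W, parts=2)

# Each shard's product would run on its own device.
partials = [matmul(X, shard) for shard in shards]

# Gather step: concatenate the column blocks back together.
Y_parallel = [sum((p[r] for p in partials), []) for r in range(len(X))]
Y_full = matmul(X, W)
```

Because each output column depends on only one column of W, the sharded result matches the unsharded product exactly; the cost in a real system is the communication needed to gather the partial outputs.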
Limitations and Open Challenges
Despite their dominance, Transformers have notable weaknesses:
| Challenge | Description | Active Research |
|---|---|---|
| Quadratic complexity | Attention cost scales as O(n²) with sequence length | Flash Attention, Mamba, linear attention |
| Context length | Long documents strain memory and compute | Sliding window, sparse attention, RAG |
| Hallucination | Models generate confident but incorrect facts | RLHF, RAG, constitutional AI |
| Energy cost | Training large models can consume hundreds to thousands of megawatt-hours | Efficient architectures, MoE, distillation |
| Interpretability | Attention maps don't fully explain predictions | Mechanistic interpretability research |
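The quadratic row is easy to quantify. Assuming the full n × n score matrix is materialized in 16-bit precision (2 bytes per element) for a single head and a single sequence, the sketch below shows how its size grows with sequence length; the function name and defaults are illustrative.

```python
def attention_scores_bytes(seq_len, bytes_per_element=2, num_heads=1, batch_size=1):
    """Memory needed to hold the full attention score matrix."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element

for n in (1024, 2048, 4096, 8192):
    mib = attention_scores_bytes(n) / 2**20
    print(f"seq_len={n:5d}  scores={mib:6.1f} MiB per head")
# Prints 2.0, 8.0, 32.0 and 128.0 MiB respectively.
```

Doubling the sequence length quadruples the score matrix, which is why techniques that avoid storing it in full, or that restrict which pairs are scored, matter for long contexts.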
Key Takeaways
- Transformers replaced RNNs by enabling full parallelization of sequence processing
- Self-attention is the core mechanism — every token attends to every other token
- Two paradigms: encoder (BERT) for understanding, decoder (GPT) for generation
- Scaling Transformers requires serious HPC infrastructure — thousands of GPUs, specialized interconnects
- Active research continues on efficiency (quadratic bottleneck) and reliability (hallucination)