// Deep Dive · AI Architecture

Transformer Models

📅 Updated: June 2026 ⏱ 8 min read 🏷 Deep Learning · NLP · Architecture

What is a Transformer?

The Transformer is a neural network architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google. It fundamentally changed AI by replacing recurrent and convolutional layers with a mechanism called self-attention, enabling models to process entire sequences in parallel rather than step-by-step.

// Why it matters

Before Transformers, RNNs processed tokens sequentially, which made training slow and made long-range dependencies hard to capture. Transformers process all tokens in a sequence at once during training, making them dramatically faster to train on HPC hardware (GPUs/TPUs) and better at using context across long documents.

The Core Idea: Self-Attention

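Self-attention allows each token in a sequence to "attend" to every other token, computing a weighted representation based on relevance. For each token, three vectors are computed: a query (Q), a key (K), and a value (V), each obtained by a learned linear projection of the token's embedding. Attention is then computed as: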

Attention(Q, K, V) = softmax( Q·Kᵀ / √dₖ ) · V

The scaling factor √dₖ prevents the dot products from growing too large in high-dimensional spaces, which would push the softmax into regions with extremely small gradients.
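The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the projection matrices `W_q`, `W_k`, `W_v` are random stand-ins for learned weights, and the toy sizes are assumptions chosen only to keep the example small.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability; the result is unchanged.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v). Returns (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (n_q, n_k) scaled similarity scores
    if mask is not None:
        # Where mask is False the score becomes very negative,
        # so softmax assigns that position near-zero weight.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 8, 8           # toy sizes (assumed)
X = rng.normal(size=(n_tokens, d_model))   # toy token embeddings
# Hypothetical random projections standing in for learned weights.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape, weights.sum(axis=-1))
```

Each row of `weights` is a probability distribution over the tokens being attended to, and each output row is the corresponding weighted average of the value vectors.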

Multi-Head Attention

Rather than computing a single attention function, Transformers use multi-head attention — running attention in parallel across multiple "heads", each learning different types of relationships (syntactic, semantic, positional). The outputs are concatenated and linearly projected.
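As a rough sketch of the split, attend, concatenate, and project steps described above, the following NumPy function runs several heads in one batched operation. The weight matrices are random placeholders and the sizes are assumptions; real implementations also add biases, dropout, and masking.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (n, d_model); every W_*: (d_model, d_model). Returns (n, d_model)."""
    n, d_model = X.shape
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_head = d_model // n_heads

    def split_heads(M):
        # (n, d_model) -> (n_heads, n, d_head)
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                    # (n_heads, n, d_head)
    # Concatenate the heads back to (n, d_model), then apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
n, d_model, n_heads = 5, 16, 4             # toy sizes (assumed)
X = rng.normal(size=(n, d_model))
# Hypothetical random weights standing in for learned parameters.
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
Y = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(Y.shape)
```

Each head works in a smaller subspace of size `d_model // n_heads`, so the total cost stays close to that of a single full-width attention.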

Architecture Overview

The original Transformer has two main components, an encoder and a decoder, which later models also use on their own:

Component | Function | Used in
Encoder | Processes the input sequence into contextual representations | BERT, classification, embeddings
Decoder | Generates the output sequence token by token (attending to encoder output when an encoder is present) | GPT, text generation
Encoder-Decoder | Full architecture for sequence-to-sequence tasks | T5, BART, translation models

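Each encoder layer consists of a multi-head self-attention sublayer and a position-wise feed-forward network, each wrapped in a residual connection followed by layer normalization. Each decoder layer adds masking to its self-attention and a cross-attention sublayer over the encoder output.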

Positional Encoding

Since Transformers have no inherent notion of sequence order (unlike RNNs), positional encodings are added to token embeddings. The original paper used fixed sinusoidal encodings; modern models often use learned positional embeddings or more advanced schemes like RoPE (Rotary Position Embedding).
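The fixed sinusoidal scheme from the original paper can be sketched as follows, using sine on even dimensions and cosine on odd dimensions with wavelengths that grow geometrically up to 10000·2π. The function name and the toy sizes are illustrative choices, not taken from any particular library.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of fixed positional encodings."""
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    positions = np.arange(max_len)[:, None]          # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]    # the indices 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
# The encodings are simply added to the token embeddings:
# embeddings_with_position = token_embeddings + pe[:n_tokens]
print(pe.shape)
```

Because the encoding depends only on position and dimension, it can be precomputed once and reused for every sequence.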

BERT vs GPT: Two Paradigms

The Transformer architecture split into two dominant pre-training paradigms:

Property | BERT (2018, Google) | GPT (2018, OpenAI)
Architecture | Encoder-only | Decoder-only
Training Objective | Masked Language Modeling (MLM) | Causal Language Modeling (CLM)
Directionality | Bidirectional (sees full context) | Unidirectional (left-to-right)
Best for | Classification, NER, QA, embeddings | Text generation, dialogue, reasoning
Scaling trend | Modest (BERT-Large: 340M params) | Extreme (GPT-3: 175B params; GPT-4 size undisclosed)
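The directionality row comes down to which attention mask is applied. The sketch below contrasts the two cases on uniform toy scores, so the effect of the mask is easy to read off; the function name is a hypothetical helper, not part of either model's code.

```python
import numpy as np

def attention_weights(scores, causal=False):
    """Softmax over the last axis, optionally with a left-to-right causal mask."""
    n = scores.shape[0]
    if causal:
        # Lower-triangular mask: position i may attend only to positions j <= i.
        mask = np.tril(np.ones((n, n), dtype=bool))
        scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)                     # exp(-inf) is exactly 0
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                  # uniform toy scores
bidirectional = attention_weights(scores)           # BERT-style: full context
causal = attention_weights(scores, causal=True)     # GPT-style: past tokens only
print(bidirectional[1])
print(causal[1])
```

In the bidirectional case every token spreads its attention over all four positions, while in the causal case the second token can see only itself and the first token.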

Major Transformer Models (2017–2026)

BERT (Google · 2018): Bidirectional encoder. Dominated NLP benchmarks. 110M–340M parameters.

GPT-3 (OpenAI · 2020): 175B parameter decoder. Demonstrated emergent few-shot capabilities.

T5 (Google · 2020): Encoder-decoder framing all NLP tasks as text-to-text problems.

LLaMA 3 (Meta · 2024): Open-weight decoder models up to 405B parameters. Strong open-source baseline.

GPT-4 (OpenAI · 2023): Multimodal, likely Mixture-of-Experts. State-of-the-art reasoning.

Vision Transformer (ViT) (Google · 2020): Applies Transformers to image patches; replaced CNNs for many vision tasks.

Why Transformers Dominate HPC Training

From an HPC perspective, Transformers have a critical advantage: massive parallelism. An RNN must process tokens one after another, whereas attention over all token pairs reduces to large matrix multiplications that parallelize well across GPU tensor cores during training.
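The contrast can be made concrete with a toy comparison. The RNN below needs a Python loop because each hidden state depends on the previous one, while the attention-style scores for every token pair come from a single matrix multiplication. Both functions and all sizes are illustrative assumptions.

```python
import numpy as np

def rnn_forward(X, W_x, W_h):
    """Toy tanh RNN: step t cannot start until step t-1 has finished."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in X:                          # inherently sequential over tokens
        h = np.tanh(W_x @ x_t + W_h @ h)
        states.append(h)
    return np.stack(states)                # (n, d)

def all_pair_scores(X):
    """Attention-style scores for every token pair in one matrix multiply."""
    return X @ X.T / np.sqrt(X.shape[-1])  # (n, n)

rng = np.random.default_rng(0)
n, d = 6, 8                                # toy sizes (assumed)
X = rng.normal(size=(n, d))
W_x = rng.normal(size=(d, d)) * 0.1        # hypothetical random weights
W_h = rng.normal(size=(d, d)) * 0.1
rnn_states = rnn_forward(X, W_x, W_h)
pair_scores = all_pair_scores(X)
print(rnn_states.shape, pair_scores.shape)
```

On a GPU, the single matrix multiplication maps directly onto hardware built for dense linear algebra, whereas the loop forces `n` dependent steps regardless of how many cores are available.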

// HPC Perspective

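Training GPT-3 required approximately 3.14 × 10²³ FLOPs, roughly 355 GPU-years on a single V100 running at its theoretical 28 TFLOPS. In practice the work is spread across thousands of accelerators: Meta reported that its 65B-parameter LLaMA model trains on 2,048 A100 GPUs in about 21 days. This is why HPC infrastructure is inseparable from modern AI development.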
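A back-of-the-envelope check of these figures is sketched below. It relies on the common rule of thumb of about 6 FLOPs per parameter per training token, GPT-3's reported 175B parameters and roughly 300B training tokens, and an assumed sustained throughput of 28 TFLOP/s per GPU. The throughput value is an assumption for illustration, not a measurement.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def training_flops(n_params, n_tokens):
    # Rule of thumb: about 6 FLOPs per parameter per training token
    # (forward pass plus backward pass).
    return 6 * n_params * n_tokens

def gpu_years(total_flops, flops_per_second):
    # Time for one device to perform total_flops at a sustained rate.
    return total_flops / flops_per_second / SECONDS_PER_YEAR

gpt3_flops = training_flops(n_params=175e9, n_tokens=300e9)
years_single_gpu = gpu_years(3.14e23, flops_per_second=28e12)  # assumed rate
print(f"estimated training FLOPs: {gpt3_flops:.2e}")
print(f"single-GPU years at 28 TFLOP/s: {years_single_gpu:.0f}")
```

The estimate lands close to the quoted 3.14 × 10²³ FLOPs, and dividing by the assumed per-GPU rate gives a figure on the order of 355 GPU-years. A faster accelerator or a different utilization assumption changes the result proportionally.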

Key training techniques used at scale:

Limitations and Open Challenges

Despite their dominance, Transformers have notable weaknesses:

Challenge | Description | Active Research
Quadratic complexity | Attention cost scales as O(n²) with sequence length | Flash Attention, Mamba, linear attention
Context length | Long documents strain memory and compute | Sliding window, sparse attention, RAG
Hallucination | Models generate confident but incorrect facts | RLHF, RAG, constitutional AI
Energy cost | Training large models consumes megawatt-hours | Efficient architectures, MoE, distillation
Interpretability | Attention maps don't fully explain predictions | Mechanistic interpretability research
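To give a sense of the quadratic term, the short sketch below computes the memory needed to materialize one full n × n attention score matrix per head. The default of 2 bytes per value assumes fp16 or bf16 storage; the function name is a hypothetical helper.

```python
def attention_matrix_bytes(seq_len, n_heads=1, bytes_per_value=2):
    # One seq_len x seq_len score matrix per head;
    # 2 bytes per value assumes fp16/bf16 storage.
    return n_heads * seq_len * seq_len * bytes_per_value

for n in (1_024, 8_192, 65_536):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"n={n:>6}: {gib:.3f} GiB per head")
```

Doubling the sequence length quadruples this cost, which is why much of the research in the table's first two rows focuses on avoiding or restructuring the full score matrix.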

Key Takeaways