// Deep Dive · AI Architecture

Transformer Models

📅 Updated: June 2026 ⏱ 8 min read 🏷 Deep Learning · NLP · Architecture

What is a Transformer?

The Transformer is a neural network architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google. It fundamentally changed AI by replacing recurrent and convolutional layers with a mechanism called self-attention, enabling models to process entire sequences in parallel rather than step-by-step.

// Why it matters

Before Transformers, RNNs processed tokens sequentially, which made training slow and made long-range dependencies hard to capture. Transformers process all tokens in a sequence simultaneously, making them dramatically faster to train on HPC hardware (GPUs/TPUs) and better at using context across long documents.

The Core Idea: Self-Attention

Self-attention allows each token in a sequence to "attend" to every other token, computing a weighted representation based on relevance. For each token, three vectors are computed:

Query (Q) — what the token is looking for
Key (K) — what information each token offers
Value (V) — the actual content to be aggregated

Attention(Q, K, V) = softmax( Q·Kᵀ / √dₖ ) · V

The scaling factor √dₖ prevents the dot products from growing too large in high-dimensional spaces, which would push the softmax into regions with extremely small gradients.
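The formula above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation: the projection matrices w_q, w_k, w_v, the toy sizes, and the random inputs are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q, k: (seq_len, d_k); v: (seq_len, d_v). Returns (output, weights)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # (seq_len, seq_len) relevance scores
    weights = softmax(scores, axis=-1)   # each row is a distribution over tokens
    return weights @ v, weights

# Toy example: 4 tokens with 8-dimensional embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)
print(out.shape)             # one output vector per token
print(weights.sum(axis=-1))  # every row of attention weights sums to 1
```

Each row of weights says how much one token draws from every other token, and the output row is the corresponding weighted average of the value vectors.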

Multi-Head Attention

Rather than computing a single attention function, Transformers use multi-head attention — running attention in parallel across multiple "heads", each learning different types of relationships (syntactic, semantic, positional). The outputs are concatenated and linearly projected.
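As a rough sketch of the split, attend, concatenate, project sequence described above, the following NumPy code runs several heads at once. The weight names (w_q, w_k, w_v, w_o), the head count, and the sizes are illustrative assumptions, and each head works on a d_model / num_heads slice as in the original paper.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); every weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                  # (heads, seq, d_head)
    # Concatenate the heads back into (seq_len, d_model), then project.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2   # hypothetical toy sizes
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, weights.shape)
```

Because each head has its own slice of the projections, each one produces its own attention pattern over the same tokens.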

Architecture Overview

The original Transformer pairs two components, an encoder and a decoder. Later models use either one on its own or keep both:

Component	Function	Used in
Encoder	Processes input sequence into contextual representations	BERT, classification, embeddings
Decoder	Generates output sequence token by token; attends to encoder output only when an encoder is present	GPT, text generation
Encoder-Decoder	Full architecture for sequence-to-sequence tasks	T5, BART, translation models

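Each encoder/decoder layer consists of:

Multi-head self-attention sublayer (masked in the decoder so a token cannot attend to later positions)
Cross-attention sublayer over the encoder output (decoder layers of encoder-decoder models only)
Feed-forward network (two linear layers with ReLU)
Residual connection and layer normalization around each sublayer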
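To make the layer structure concrete, here is a minimal sketch of one encoder layer in NumPy. It makes several simplifying assumptions: a single attention head, layer normalization without learned gain and bias, normalization applied after each residual connection as in the original paper, and a hypothetical parameter dictionary p with random weights.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def feed_forward(x, w1, b1, w2, b2):
    # Two linear layers with a ReLU in between, applied to each token independently.
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

def encoder_layer(x, p):
    # Sublayer 1: self-attention, residual connection, layer norm.
    x = layer_norm(x + self_attention(x, p["w_q"], p["w_k"], p["w_v"]))
    # Sublayer 2: feed-forward network, residual connection, layer norm.
    return layer_norm(x + feed_forward(x, p["w1"], p["b1"], p["w2"], p["b2"]))

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 6, 8, 32   # hypothetical toy sizes
p = {
    "w_q": rng.normal(scale=0.1, size=(d_model, d_model)),
    "w_k": rng.normal(scale=0.1, size=(d_model, d_model)),
    "w_v": rng.normal(scale=0.1, size=(d_model, d_model)),
    "w1": rng.normal(scale=0.1, size=(d_model, d_ff)),
    "b1": np.zeros(d_ff),
    "w2": rng.normal(scale=0.1, size=(d_ff, d_model)),
    "b2": np.zeros(d_model),
}
x = rng.normal(size=(seq_len, d_model))
y = encoder_layer(x, p)
print(y.shape)
```

The output has the same shape as the input, which is what allows identical layers to be stacked.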

Positional Encoding

Since Transformers have no inherent notion of sequence order (unlike RNNs), positional encodings are added to token embeddings. The original paper used fixed sinusoidal encodings; modern models often use learned positional embeddings or more advanced schemes like RoPE (Rotary Position Embedding).
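The fixed sinusoidal scheme from the original paper uses sine on even dimensions and cosine on odd dimensions, with wavelengths that grow geometrically. A short NumPy sketch follows; the function name and the toy sizes are assumptions for illustration.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even for this sketch")
    positions = np.arange(max_len)[:, None]         # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]   # the values 2i, shape (1, d_model/2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
token_embeddings = np.zeros((50, 16))   # placeholder embeddings for the example
model_input = token_embeddings + pe     # encodings are added, not concatenated
print(pe.shape)
```

Every position gets a distinct vector, so two identical tokens at different positions no longer look the same to the attention layers.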

BERT vs GPT: Two Paradigms

The Transformer architecture split into two dominant pre-training paradigms:

Property	BERT (2018, Google)	GPT (2018, OpenAI)
Architecture	Encoder-only	Decoder-only
Training Objective	Masked Language Modeling (MLM)	Causal Language Modeling (CLM)
Directionality	Bidirectional (sees full context)	Unidirectional (left-to-right)
Best for	Classification, NER, QA, embeddings	Text generation, dialogue, reasoning
Scaling trend	Modest (BERT-Large: 340M params)	Extreme (GPT-3: 175B params; GPT-4 size undisclosed, reportedly ~1T+)
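The directionality row comes down to the attention mask. The sketch below, with made-up random scores and a hypothetical helper name, contrasts bidirectional attention (no mask) with causal attention, where positions after the current token are set to negative infinity before the softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention_weights(scores, causal):
    """scores: (seq_len, seq_len) raw attention scores."""
    if causal:
        seq_len = scores.shape[-1]
        # True above the diagonal marks future positions that must stay hidden.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))   # stand-in for Q·Kᵀ / √dₖ

bidirectional = attention_weights(scores, causal=False)  # encoder-style (BERT)
causal = attention_weights(scores, causal=True)          # decoder-style (GPT)
print(np.round(causal, 2))
```

In the causal case the first token can only attend to itself, and every later token sees only its own position and those before it.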

Major Transformer Models (2017–2026)

BERT

Google · 2018

Bidirectional encoder. Dominated NLP benchmarks. 110M–340M parameters.

GPT-3

OpenAI · 2020

175B parameter decoder. Demonstrated emergent few-shot capabilities.

T5

Google · 2020

Encoder-decoder framing all NLP tasks as text-to-text problems.

LLaMA 3

Meta · 2024

Open-weight decoder models up to 405B parameters (the 405B model shipped with Llama 3.1). Strong open-weight baseline.

GPT-4

OpenAI · 2023

Multimodal (text and image input), reportedly Mixture-of-Experts (unconfirmed by OpenAI). State-of-the-art reasoning at release.

Vision Transformer (ViT)

Google · 2020

Applies Transformers to sequences of image patches. Matches or beats CNNs on many vision tasks when pre-trained on large datasets.

Why Transformers Dominate HPC Training

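From an HPC perspective, Transformers have a critical advantage: massive parallelism during training. Unlike RNNs, which must process tokens sequentially, a Transformer computes attention for all token pairs as a few large matrix multiplications that map directly onto GPU tensor cores. Autoregressive generation at inference time still proceeds token by token.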

// HPC Perspective

Training GPT-3 required approximately 3.14 × 10²³ FLOPs, roughly 355 GPU-years on a single V100 at about 28 TFLOP/s sustained. For comparison, Meta reported training the 65B-parameter LLaMA on 2,048 A100 GPUs in about 21 days. This is why HPC infrastructure is inseparable from modern AI development.
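The arithmetic behind these figures is easy to reproduce. The sketch below uses two inputs that are not stated in this article and should be read as assumptions: the common rule of thumb of about 6 FLOPs per parameter per training token, with roughly 300 billion training tokens for GPT-3, and a sustained throughput of 28 TFLOP/s per GPU. The helper names and the 1,024-GPU cluster size are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def training_flops(num_params, num_tokens):
    # Rule-of-thumb estimate: ~6 FLOPs per parameter per token
    # (forward plus backward pass). An approximation, not an exact count.
    return 6 * num_params * num_tokens

def gpu_years(total_flops, sustained_flops_per_gpu):
    # Time one GPU would need at a constant sustained throughput.
    return total_flops / sustained_flops_per_gpu / SECONDS_PER_YEAR

def wall_clock_days(total_flops, sustained_flops_per_gpu, num_gpus):
    # Idealized: assumes perfect scaling with no communication overhead.
    return gpu_years(total_flops, sustained_flops_per_gpu) * 365 / num_gpus

estimated = training_flops(num_params=175e9, num_tokens=300e9)
years = gpu_years(3.14e23, sustained_flops_per_gpu=28e12)
days = wall_clock_days(3.14e23, sustained_flops_per_gpu=28e12, num_gpus=1024)

print(f"estimated FLOPs: {estimated:.3g}")
print(f"single-GPU years: {years:.0f}")
print(f"days on 1,024 GPUs (ideal scaling): {days:.0f}")
```

Real clusters fall short of the ideal-scaling number because of communication, pipeline bubbles, and restarts, so the last figure is a lower bound rather than a prediction.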

Key training techniques used at scale:

Tensor Parallelism — splitting weight matrices across GPUs
Pipeline Parallelism — distributing layers across GPU groups
Data Parallelism — running independent batches in parallel
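FlashAttention — exact attention computed in tiles that stay in on-chip memory, which cuts reads and writes to HBM
Mixed Precision (BF16/FP16) — storing activations and working weights in 16 bits, roughly halving their memory footprint with minimal accuracy loss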
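Tensor parallelism is the least intuitive item in this list, so here is a single-process NumPy sketch of the idea. The "devices" are just array shards, and the concatenation and sum stand in for the all-gather and all-reduce collectives a real multi-GPU setup would use; the function names are hypothetical.

```python
import numpy as np

def column_parallel_linear(x, weight, num_devices):
    # Each "device" holds a slice of the weight's columns and computes
    # its own slice of the output features.
    shards = np.split(weight, num_devices, axis=1)
    partial_outputs = [x @ shard for shard in shards]
    return np.concatenate(partial_outputs, axis=-1)   # stands in for all-gather

def row_parallel_linear(x, weight, num_devices):
    # Each "device" holds a slice of the weight's rows and the matching
    # slice of the input features; partial results are summed.
    x_shards = np.split(x, num_devices, axis=-1)
    w_shards = np.split(weight, num_devices, axis=0)
    return sum(xs @ ws for xs, ws in zip(x_shards, w_shards))  # stands in for all-reduce

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))         # 3 tokens, 8 input features (toy sizes)
weight = rng.normal(size=(8, 16))

reference = x @ weight
col_out = column_parallel_linear(x, weight, num_devices=4)
row_out = row_parallel_linear(x, weight, num_devices=4)
print(np.allclose(col_out, reference), np.allclose(row_out, reference))
```

Both splits give the same result as the unsplit matrix multiplication, which is why a large layer can be spread across GPUs without changing the model.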

Limitations and Open Challenges

Despite their dominance, Transformers have notable weaknesses:

Challenge	Description	Active Research
Quadratic complexity	Attention cost scales as O(n²) with sequence length	FlashAttention (cuts memory traffic, not FLOPs), linear attention, state-space alternatives such as Mamba
Context length	Long documents strain memory and compute	Sliding window, sparse attention, RAG
Hallucination	Models generate confident but incorrect facts	RLHF, RAG, constitutional AI
Energy cost	Training large models can consume hundreds of megawatt-hours or more	Efficient architectures, MoE, distillation
Interpretability	Attention maps don't fully explain predictions	Mechanistic interpretability research
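To give the quadratic row some scale, this small calculation estimates the memory needed to hold the full attention score matrix for one sequence in one layer. It assumes the scores are fully materialized in 16-bit precision, which tiled kernels deliberately avoid; the head count of 32 and the helper name are illustrative assumptions.

```python
def attention_score_bytes(seq_len, num_heads, bytes_per_element=2):
    # One (seq_len x seq_len) score matrix per head.
    return num_heads * seq_len * seq_len * bytes_per_element

for n in (1_024, 8_192, 65_536):
    gib = attention_score_bytes(n, num_heads=32) / 2**30
    print(f"n = {n:>6}: {gib:,.2f} GiB of attention scores per layer")

# Doubling the sequence length quadruples the cost.
ratio = attention_score_bytes(2_048, 32) / attention_score_bytes(1_024, 32)
print(ratio)
```

Going from 1,024 to 65,536 tokens multiplies the sequence length by 64 and the score memory by 4,096, which is the pressure behind the research directions in the table.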

Key Takeaways

Transformers largely replaced RNNs by letting training process every position of a sequence in parallel
Self-attention is the core mechanism — every token attends to every other token
Two paradigms: encoder (BERT) for understanding, decoder (GPT) for generation
Scaling Transformers requires serious HPC infrastructure — thousands of GPUs, specialized interconnects
Active research continues on efficiency (quadratic bottleneck) and reliability (hallucination)