Homebrew GPT-2

A ground-up implementation of the GPT-2 decoder-only transformer architecture, trained on the Fineweb dataset. No pretrained weights, no shortcuts: just the architecture and a lot of compute.

Why Build This

Using language model APIs is easy. Understanding what’s actually happening inside them is harder. This project was about building that understanding by implementing every component from scratch.

Architecture

The standard GPT-2 decoder-only transformer:

Input Tokens
            ↓
Token Embeddings + Positional Embeddings
            ↓
┌────────────────────────────────────┐
│  Transformer Block (×N layers)     │
│  ┌─────────────────────────────┐   │
│  │ Layer Norm                  │   │
│  │ Causal Self-Attention       │   │
│  │ Residual Connection         │   │
│  │ Layer Norm                  │   │
│  │ Feed-Forward Network        │   │
│  │ Residual Connection         │   │
│  └─────────────────────────────┘   │
└────────────────────────────────────┘
            ↓
Layer Norm (final)
            ↓
Linear → Vocabulary Logits
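
The block ordering above is the pre-norm layout: each sublayer sees a layer-normalized input, and its output is added back through a residual connection. A minimal NumPy sketch of one block's forward pass, with attention shown single-headed and the learned Q/K/V and output projections omitted for brevity (all names here are illustrative, not the repo's actual code):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's features to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def causal_attention(x):
    # Single-head scaled dot-product attention with a causal mask.
    # Q/K/V projections omitted; scores are computed on x directly.
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores[mask] = -np.inf                            # block attention to future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ x

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise MLP; W1 expands the hidden dimension 4x, W2 projects back.
    return gelu(x @ W1 + b1) @ W2 + b2

def block(x, W1, b1, W2, b2):
    # Pre-norm ordering: norm -> sublayer -> residual add, twice.
    x = x + causal_attention(layer_norm(x))
    x = x + feed_forward(layer_norm(x), W1, b1, W2, b2)
    return x
```

Because the mask zeroes out attention to future positions and the norm and MLP act per position, perturbing a later token leaves all earlier positions' outputs unchanged, which is what makes autoregressive training possible.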

Components Implemented

  1. Multi-Head Causal Self-Attention

    • Query, Key, Value projections
    • Scaled dot-product attention with causal masking
    • Multiple attention heads with concatenation
  2. Position-wise Feed-Forward Network

    • Two linear transformations with a GELU activation between them
    • 4× expansion of the hidden dimension
  3. Training Infrastructure

    • AdamW optimizer with weight decay
    • Learning rate warmup and cosine decay
    • Gradient clipping
    • Mixed precision training
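
The warmup-plus-cosine-decay schedule from item 3 can be sketched in a few lines of pure Python. The hyperparameter values below are hypothetical placeholders, not the repo's actual settings:

```python
import math

def lr_at(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=100, max_steps=1000):
    # Illustrative schedule: linear warmup, then cosine decay to a floor.
    if step < warmup_steps:
        # Linear warmup from ~0 up to max_lr over warmup_steps.
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        # After the decay horizon, hold at the floor.
        return min_lr
    # Cosine decay: coeff goes smoothly from 1 down to 0.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)
```

In a custom training loop this is typically applied per step by writing the value into each `param_group["lr"]` of the AdamW optimizer before `optimizer.step()`.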

Training Data

Trained on Fineweb, a large-scale web-text dataset that provides diverse, high-quality text for language modeling.

Key Learnings

  • Attention is expensive: compute and memory for the attention scores scale as O(n²) with sequence length, and that quadratic cost is very real in practice
  • Initialization matters: Bad initialization can make training unstable or prevent convergence entirely
  • The devil is in the details: Small implementation bugs (wrong dimension, missing normalization) cause silent failures
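
The first learning above can be made concrete with a little arithmetic: each head builds an n×n score matrix, so doubling the context length quadruples that cost. A quick check (12 heads matches GPT-2 small; purely illustrative):

```python
def attn_score_entries(n_ctx, n_heads=12):
    # Entries in the attention weight matrices per layer:
    # one (n_ctx x n_ctx) matrix per head.
    return n_heads * n_ctx * n_ctx

ratio = attn_score_entries(2048) / attn_score_entries(1024)
# Doubling the context length quadruples the score-matrix cost.
```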

Tech Stack

  • Python, PyTorch
  • Fineweb dataset
  • Custom training loop (no Trainer abstractions)

https://github.com/Mkrolick/Homebrew-GPT-2