🚀 Building GPT in Pure Python — The 243-Line MicroGPT Explained

“The most atomic way to train and inference a GPT in pure, dependency-free Python. Everything else is just efficiency.” — Andrej Karpathy

In this article, we will break down Andrej Karpathy’s legendary 243-line MicroGPT into simple, understandable sections.

No PyTorch. No GPUs. No deep learning libraries.

Just mathematics, Python, and probability.

Let’s go step by step.


1️⃣ The Dataset — What Are We Training On?

import os
import random

# download the names dataset on first run
if not os.path.exists('input.txt'):
    import urllib.request
    names_url = 'https://raw.githubusercontent.com/karpathy/makemore/refs/heads/master/names.txt'
    urllib.request.urlretrieve(names_url, 'input.txt')

# one name per line -> one training document per name, shuffled
docs = [l.strip() for l in open('input.txt').read().strip().split('\n') if l.strip()]
random.shuffle(docs)

What’s happening? If input.txt is not already on disk, the script downloads names.txt from Karpathy’s makemore repository: a plain-text list of roughly 32,000 first names, one per line. Each non-empty line becomes one training document, and the documents are shuffled.

Example names:

emma
liam
olivia
noah

This GPT does character-level learning.


2️⃣ Tokenization — Turning Characters into Numbers

uchars = sorted(set(''.join(docs)))
BOS = len(uchars)
vocab_size = len(uchars) + 1

What’s happening? Every unique character across all the names gets an integer ID, and one extra ID is reserved for a special BOS (beginning-of-sequence) token, so vocab_size is the number of distinct characters plus one.

Example:

a → 0
b → 1
c → 2
...

This is our vocabulary.

Unlike big GPTs, this model has no byte-pair-encoding tokenizer and no merge rules: every token is exactly one character.
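As a rough sketch of how the mapping and encoding work in practice (char_to_id and id_to_char are illustrative names, not necessarily the identifiers used in the file):

# build the character-level vocabulary from the shuffled documents
uchars = sorted(set(''.join(docs)))          # e.g. ['a', 'b', ..., 'z']
char_to_id = {ch: i for i, ch in enumerate(uchars)}
id_to_char = {i: ch for ch, i in char_to_id.items()}
BOS = len(uchars)                            # one extra id for beginning-of-sequence
vocab_size = len(uchars) + 1

# encode a name as a list of token ids, wrapped in BOS markers
tokens = [BOS] + [char_to_id[ch] for ch in 'emma'] + [BOS]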


3️⃣ Autograd — Building Our Own Backpropagation Engine

The Value class is the heart of this file.

class Value:

This implements reverse-mode automatic differentiation: each Value stores a scalar, its gradient, and a record of the operation that produced it, so gradients can flow backward through the computation graph.

Every number in the model is wrapped inside a Value.

Example:

a = Value(2.0)
b = Value(3.0)
c = a * b

Now c.data is 6.0, and c remembers that it was produced by multiplying a and b, so gradients can later flow back to both.

When we call:

loss.backward()

The chain rule flows backward through every operation.

This is a mini PyTorch, written from scratch.
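Here is a minimal sketch of such an autograd node, supporting only addition and multiplication; the real Value class in the file covers more operations (powers, exp, log, and so on):

class Value:
    """A scalar that remembers how it was computed, so we can backpropagate."""
    def __init__(self, data, children=()):
        self.data = data           # the scalar value
        self.grad = 0.0            # d(loss)/d(this value), filled in by backward()
        self._children = children  # Values this one was computed from
        self._backward = lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad      # d(a+b)/da = 1
            other.grad += out.grad     # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # topologically sort the graph, then apply the chain rule in reverse
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
c = a * b + a
c.backward()
print(a.grad, b.grad)   # 4.0 2.0  (dc/da = b + 1, dc/db = a)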


4️⃣ Model Parameters — What Does GPT Learn?

n_embd = 16
n_head = 4
n_layer = 1
block_size = 16

We define:

Parameter     Meaning
n_embd        Embedding size (16)
n_head        Attention heads (4)
n_layer       Transformer layers (1)
block_size    Max sequence length (16)

Parameter Matrices

state_dict['wte']  # token embeddings
state_dict['wpe']  # positional embeddings
state_dict['lm_head']  # output layer

And for each Transformer layer: the attention projection matrices (query, key, value, and output) plus the two MLP weight matrices.

All initialized randomly:

random.gauss(0, std)

At this stage, the model knows nothing.
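Roughly, building those matrices amounts to the sketch below. The dict keys follow the snippet above; the sizes and the std of 0.02 are illustrative GPT-2-style choices, not necessarily the values used in the file:

import random

n_embd, block_size, vocab_size = 16, 16, 27   # illustrative sizes

def rand_matrix(rows, cols, std=0.02):
    # a rows x cols matrix of small Gaussian numbers, stored as plain Python lists
    return [[random.gauss(0, std) for _ in range(cols)] for _ in range(rows)]

state_dict = {
    'wte': rand_matrix(vocab_size, n_embd),      # token embeddings
    'wpe': rand_matrix(block_size, n_embd),      # positional embeddings
    'lm_head': rand_matrix(vocab_size, n_embd),  # output projection
}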


5️⃣ Core Architecture — The GPT Function

def gpt(token_id, pos_id, keys, values):

This is the Transformer.

It follows GPT-2 style architecture:

Step 1: Token + Position Embedding

x = token_embedding + position_embedding

This gives meaning + position awareness.
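In plain Python, that step boils down to something like this, reusing the state_dict sketch from the previous section:

def embed(token_id, pos_id):
    # look up one row of token embeddings and one row of position embeddings, add them
    tok_emb = state_dict['wte'][token_id]
    pos_emb = state_dict['wpe'][pos_id]
    return [t + p for t, p in zip(tok_emb, pos_emb)]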


Step 2: RMSNorm

Instead of LayerNorm, Karpathy uses RMSNorm:

x = rmsnorm(x)

This stabilizes training.
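RMSNorm simply divides the vector by its root-mean-square. A minimal sketch (without the learned gain that some RMSNorm variants include):

def rmsnorm(x, eps=1e-5):
    # scale the vector so its root-mean-square is ~1
    ms = sum(v * v for v in x) / len(x)
    return [v / (ms + eps) ** 0.5 for v in x]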


Step 3: Multi-Head Attention

For each head:

attn_logits = q ⋅ k / sqrt(head_dim)
attn_weights = softmax(attn_logits)
head_out = weighted sum of values

This allows the current character to look back over every earlier character in the name and decide which ones matter most for predicting the next one.

Even at character level!
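A sketch of one attention head, with vectors as plain Python lists (softmax and dot are small helpers written here for illustration):

import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_head(q, keys, values):
    # score the query against every cached key, normalize, then mix the cached values
    head_dim = len(q)
    logits = [dot(q, k) / math.sqrt(head_dim) for k in keys]
    weights = softmax(logits)
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(head_dim)]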


Step 4: MLP Block

x = linear → ReLU → linear

This allows the model to transform the gathered information non-linearly: attention moves information between positions, while the MLP processes it at each position.
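A sketch, with matvec standing in for a matrix-vector product helper:

def matvec(W, x):
    # multiply a matrix (list of rows) by a vector
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def mlp(x, w_fc, w_proj):
    h = [max(0.0, v) for v in matvec(w_fc, x)]   # linear -> ReLU
    return matvec(w_proj, h)                     # -> linear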


Step 5: Output Layer

logits = linear(x, lm_head)

This produces one score (logit) per character in the vocabulary.

Not probabilities yet — just logits.


6️⃣ Training — Teaching GPT to Predict the Next Character

Each name becomes:

[BOS] a n n a [BOS]

At each position, the model must predict the next character given all the characters before it.

Loss:

loss_t = -log(prob[target])

This is cross-entropy loss.

The model learns by minimizing:

Average negative log likelihood
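Concretely, the per-position loss can be computed like this (plain floats here instead of Value objects):

import math

def cross_entropy(logits, target):
    # softmax to get probabilities, then -log of the probability of the correct next char
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[target])

# the training loss is the average of this over every position in every name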

7️⃣ Backpropagation

loss.backward()

This computes gradients for every parameter using the chain rule, applied in reverse order through the computation graph that the Value objects recorded during the forward pass.

This is pure math.


8️⃣ Optimizer — Adam

m[i] = beta1 * m[i] + (1 - beta1) * p.grad
v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2

Adam uses two running averages per parameter: m, an estimate of the mean gradient (momentum), and v, an estimate of the mean squared gradient, which together adapt each parameter's step size.

Parameters update:

p.data -= learning_rate * adjusted_gradient
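Putting it together, a textbook Adam step looks roughly like the sketch below. It assumes params is the flat list of parameter Values and m, v are zero-initialized lists; the hyperparameter defaults and bias-correction details are typical choices and may differ slightly from the file:

def adam_step(params, m, v, step, learning_rate=1e-3,
              beta1=0.9, beta2=0.95, eps=1e-8):
    # params: list of Value objects; m, v: running moment estimates (same length)
    for i, p in enumerate(params):
        m[i] = beta1 * m[i] + (1 - beta1) * p.grad
        v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
        m_hat = m[i] / (1 - beta1 ** (step + 1))   # bias correction
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        p.data -= learning_rate * m_hat / (v_hat ** 0.5 + eps)
        p.grad = 0.0                               # reset for the next step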

Training runs for:

1000 steps

9️⃣ Inference — Generating New Names

temperature = 0.5

We start with BOS and repeatedly:

  1. Get logits
  2. Apply temperature scaling
  3. Softmax
  4. Sample next token
  5. Repeat

If BOS is generated → stop.
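A sketch of that loop, assuming gpt returns a list of Value logits and that keys/values are per-layer caches, as in the signature shown earlier:

import math
import random

def sample_name(temperature=0.5):
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    token_id, out = BOS, []
    for pos_id in range(block_size):
        logits = gpt(token_id, pos_id, keys, values)       # 1. get logits
        scaled = [l.data / temperature for l in logits]    # 2. temperature scaling
        m = max(scaled)
        exps = [math.exp(l - m) for l in scaled]
        total = sum(exps)
        probs = [e / total for e in exps]                  # 3. softmax
        token_id = random.choices(range(vocab_size), weights=probs)[0]  # 4. sample
        if token_id == BOS:                                # model says the name is done
            break
        out.append(uchars[token_id])
    return ''.join(out)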

Example output:

anita
kane
torian
kalana

Many of these names never appear in the training data; the model invents them from the character patterns it has learned.

This is generative behavior.


🔥 What This MicroGPT Proves

1️⃣ GPT does not store words
2️⃣ It learns probability distributions
3️⃣ Creativity emerges from sampling
4️⃣ Temperature controls randomness
5️⃣ Transformers are just math + matrices


🧠 Why This Is Special

Because there are no frameworks, no hidden abstractions, and no GPUs: every matrix multiply, every gradient, and every sampling step is visible in a few hundred lines of plain Python.

This is transformer architecture in its most atomic form.


🏁 Final Thought

Large GPTs have billions of parameters.

This one has only thousands.

Yet the core idea is identical:

Attention + MLP + Backpropagation + Probability Sampling

Everything else is scale and efficiency.

Thanks to Andrej Karpathy, we can witness generative intelligence emerge from 243 lines of pure Python.