“The most atomic way to train and inference a GPT in pure, dependency-free Python. Everything else is just efficiency.” — Andrej Karpathy
In this article, we will break down Andrej Karpathy’s legendary 243-line MicroGPT into simple, understandable sections.
No PyTorch. No GPUs. No deep learning libraries.
Just mathematics, Python, and probability.
Let’s go step by step.
import os
import random

if not os.path.exists('input.txt'):
    import urllib.request
    names_url = 'https://raw.githubusercontent.com/karpathy/makemore/refs/heads/master/names.txt'
    urllib.request.urlretrieve(names_url, 'input.txt')

docs = [l.strip() for l in open('input.txt').read().strip().split('\n') if l.strip()]
random.shuffle(docs)
Example names:
emma
liam
olivia
noah
This GPT does character-level learning.
uchars = sorted(set(''.join(docs)))   # every unique character in the data
BOS = len(uchars)                     # a special beginning-of-sequence token
vocab_size = len(uchars) + 1
Example:
a → 0
b → 1
c → 2
...
This is our vocabulary.
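To make the mapping concrete, here is a small sketch of how characters could be encoded and decoded with the uchars list defined above; the stoi/itos helper names are mine, not from Karpathy's file.

```python
stoi = {ch: i for i, ch in enumerate(uchars)}     # character -> integer id
itos = {i: ch for ch, i in stoi.items()}          # integer id -> character

tokens = [stoi[ch] for ch in "emma"]              # encode a name
print(tokens, ''.join(itos[t] for t in tokens))   # decode it back to "emma"
```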
Unlike big GPTs, this model has no BPE tokenizer: its vocabulary is just the individual characters plus a single BOS marker.
The Value class is the heart of this file.
class Value:
This implements reverse-mode automatic differentiation (autograd): every arithmetic operation records how to pass gradients back through it.
Every number in the model is wrapped inside a Value.
Example:
a = Value(2.0)
b = Value(3.0)
c = a * b
Now:
c.data = 6.0
When we call:
loss.backward()
the chain rule flows backward through every operation, filling in a gradient for every Value involved.
This is a mini PyTorch, written from scratch.
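To see what that means in code, here is a heavily stripped-down sketch of such an autograd scalar, supporting only + and *. Karpathy's actual Value class covers more operations, but the chain-rule machinery is the same idea.

```python
class Value:
    """A tiny autograd scalar: stores data, grad, and how to backpropagate."""
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children         # Values this one was computed from
        self._local_grads = local_grads   # d(self)/d(child) for each child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Topologically order the graph, then apply the chain rule in reverse.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            for child, local in zip(node._children, node._local_grads):
                child.grad += local * node.grad

a = Value(2.0)
b = Value(3.0)
c = a * b
c.backward()
print(c.data, a.grad, b.grad)   # 6.0 3.0 2.0
```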
n_embd = 16
n_head = 4
n_layer = 1
block_size = 16
We define:
| Parameter | Meaning |
|---|---|
| n_embd | Embedding size |
| n_head | Attention heads |
| n_layer | Transformer layers |
| block_size | Max sequence length |
state_dict['wte'] # token embeddings
state_dict['wpe'] # positional embeddings
state_dict['lm_head'] # output layer
And for each Transformer layer there are attention weights (query, key, value, and output projections) and MLP weights.
All initialized randomly:
random.gauss(0, std)
At this stage, the model knows nothing.
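As a sketch of what that initialization could look like in list-of-lists form; the matrix helper, the 0.02 standard deviation, and the exact shapes are my assumptions, not taken from the file.

```python
import random

def matrix(rows, cols, std=0.02):
    """rows x cols of Value scalars drawn from a Gaussian N(0, std)."""
    return [[Value(random.gauss(0, std)) for _ in range(cols)] for _ in range(rows)]

# Hypothetical layout mirroring the names used above:
state_dict = {
    'wte': matrix(vocab_size, n_embd),      # one embedding row per token id
    'wpe': matrix(block_size, n_embd),      # one embedding row per position
    'lm_head': matrix(vocab_size, n_embd),  # output projection back to the vocabulary
}
```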
def gpt(token_id, pos_id, keys, values):
This is the Transformer forward pass for one token; keys and values act as a cache of the tokens seen so far.
It follows a GPT-2-style architecture:
x = token_embedding + position_embedding
This gives each token both meaning (which character it is) and position awareness (where it sits in the sequence).
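In list-of-lists form, that sum might look like this (assuming the state_dict sketch above):

```python
token_id, pos_id = BOS, 0   # e.g. the start of a sequence at position 0
tok_emb = state_dict['wte'][token_id]
pos_emb = state_dict['wpe'][pos_id]
x = [tok_emb[i] + pos_emb[i] for i in range(n_embd)]   # elementwise sum of the two rows
```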
Instead of LayerNorm, Karpathy uses RMSNorm:
x = rmsnorm(x)
This stabilizes training.
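Here is a minimal RMSNorm sketch using plain floats; the real file performs the same arithmetic on Value objects, and the epsilon value here is an assumption.

```python
def rmsnorm(x, eps=1e-5):
    """Scale x so its root-mean-square is ~1 (no mean subtraction, unlike LayerNorm)."""
    ms = sum(xi * xi for xi in x) / len(x)   # mean of the squared elements
    scale = (ms + eps) ** -0.5
    return [xi * scale for xi in x]

print(rmsnorm([1.0, 2.0, 3.0]))
```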
For each head:
attn_logits = q ⋅ k / sqrt(head_dim)
attn_weights = softmax(attn_logits)
head_out = weighted sum of values
This allows each token to look back at every earlier token in the sequence and pull in the most relevant context.
Even at character level!
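Here is a single-head causal attention sketch in plain Python floats; the helper names are mine, but the score / softmax / weighted-sum steps mirror the formulas above.

```python
import math

def softmax(logits):
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def attend(q, keys, values):
    """One head: score the current query against every cached key,
    then blend the cached values with the softmax weights."""
    head_dim = len(q)
    attn_logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(head_dim) for k in keys]
    attn_weights = softmax(attn_logits)
    return [sum(w * v[i] for w, v in zip(attn_weights, values)) for i in range(head_dim)]

# Toy usage: a query attends over two cached (key, value) pairs.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
print(attend(q, keys, values))
```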
x = linear → ReLU → linear
This lets the model apply a non-linear transformation to what attention gathered, mixing features at each position.
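A plain-float sketch of that two-layer MLP, where W1 and W2 stand in for the layer's weight matrices (names and sizes are mine):

```python
def linear(x, W):
    """Matrix-vector product: one output value per row of W."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def mlp(x, W1, W2):
    hidden = [max(0.0, h) for h in linear(x, W1)]   # ReLU keeps only positive activations
    return linear(hidden, W2)

# Toy usage: 2 inputs -> 3 hidden units -> 2 outputs
print(mlp([1.0, -1.0],
          W1=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
          W2=[[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]))
```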
logits = linear(x, lm_head)
This produces one raw score (a logit) for every entry in the vocabulary.
Not probabilities yet — just logits.
Each name becomes:
[BOS] a n n a [BOS]
At each position, the model must predict the next character.
Loss:
loss_t = -log(prob[target])
This is cross-entropy loss.
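In code, the per-position loss could be computed like this (a plain-float sketch, not the file's exact implementation):

```python
import math

def cross_entropy(logits, target):
    """-log of the softmax probability assigned to the correct next character."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    probs = [e / sum(exps) for e in exps]
    return -math.log(probs[target])

print(cross_entropy([2.0, 0.5, -1.0], target=0))   # small loss: the model favored class 0
```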
The model learns by minimizing:
the average negative log likelihood over all positions.
loss.backward()
This computes gradients for every parameter using the chain rule.
This is pure math.
m[i] = beta1 * m[i] + (1 - beta1) * p.grad
v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
Adam keeps two running statistics per parameter: a momentum of the gradient (m) and a running average of the squared gradient (v).
Parameters update:
p.data -= learning_rate * adjusted_gradient
Training runs for 1000 steps.
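Here is a self-contained sketch of one Adam update for a single scalar parameter; bias correction is omitted for brevity, and the hyperparameter values are common defaults, not necessarily the ones in the file.

```python
beta1, beta2, eps, learning_rate = 0.9, 0.999, 1e-8, 1e-2

p_data, p_grad = 0.5, 0.2     # a toy parameter and its gradient from loss.backward()
m_i, v_i = 0.0, 0.0           # per-parameter moment buffers, start at zero

m_i = beta1 * m_i + (1 - beta1) * p_grad          # momentum of the gradient
v_i = beta2 * v_i + (1 - beta2) * p_grad ** 2     # running average of the squared gradient
adjusted_gradient = m_i / (v_i ** 0.5 + eps)      # per-parameter scaled step
p_data -= learning_rate * adjusted_gradient
print(p_data)
```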
temperature = 0.5
We start with BOS and repeatedly run the model, scale the logits by the temperature, convert them to probabilities, and sample the next character.
If BOS is generated → stop.
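A minimal sketch of that sampling step, assuming we already have the logits for the next character:

```python
import math, random

def sample_next(logits, temperature=0.5):
    """Scale logits by temperature, softmax them, and sample one token id."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    probs = [e / sum(exps) for e in exps]
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

# Lower temperature sharpens the distribution toward the largest logit.
print(sample_next([2.0, 1.0, 0.1], temperature=0.5))
```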
Example output:
anita
kane
torian
kalana
Many of these names never appear anywhere in the training data.
This is generative behavior.
1️⃣ GPT does not store words
2️⃣ It learns probability distributions
3️⃣ Creativity emerges from sampling
4️⃣ Temperature controls randomness
5️⃣ Transformers are just math + matrices
Because every step is written out by hand, with no library hiding the math, this is transformer architecture in its most atomic form.
Large GPTs have billions of parameters.
This one has only thousands.
Yet the core idea is identical:
Attention + MLP + Backpropagation + Probability Sampling
Everything else is scale and efficiency.
Thanks to Andrej Karpathy, we can witness generative intelligence emerge from 243 lines of pure Python.