Library | Purpose |
---|---|
torch | Core deep learning library (PyTorch) used for building and training neural networks |
transformers | Hugging Face library that provides utilities to build, train, and load state-of-the-art transformer models |
datasets | Hugging Face library to easily load, preprocess, and manage NLP datasets |
TinyTransformerModel

We're building a very small custom transformer from scratch using PyTorch and wrapping it with Hugging Face's PreTrainedModel interface so we can use it with the Trainer. We'll use the model with dummy data to show what each stage is doing.
```python
self.embedding = nn.Embedding(config.vocab_size, config.hidden_size)
```

For input_ids = [0, 2, 4] (3 tokens), the embedding output has shape (seq_len, hidden_size).

Example:

```python
input_ids = torch.tensor([[1, 3, 5]])  # batch_size=1, seq_len=3
embedded = model.embedding(input_ids)
print(embedded.shape)  # torch.Size([1, 3, 64])
```

This converts token IDs to dense vectors (learnable word representations).
```python
encoder_layer = nn.TransformerEncoderLayer(config.hidden_size, config.num_attention_heads)
self.encoder = nn.TransformerEncoder(encoder_layer, config.num_hidden_layers)
```

The encoder preserves the (seq_len, hidden_size) shape of each sequence.

⚠️ PyTorch's Transformer expects input shaped (seq_len, batch_size, hidden_size) by default, so we need:

```python
x = embedded.transpose(0, 1)  # Make shape [seq_len, batch, hidden]
encoded = self.encoder(x)
```
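To make the shape handling concrete, here is a minimal standalone sketch (the sizes mirror TinyConfig's defaults, and `embedded` is a random stand-in for the embedding output):

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=64, nhead=2)  # hidden_size=64, 2 attention heads
encoder = nn.TransformerEncoder(layer, num_layers=2)

embedded = torch.randn(1, 3, 64)   # (batch, seq_len, hidden)
x = embedded.transpose(0, 1)       # (seq_len, batch, hidden)
encoded = encoder(x)
print(encoded.shape)               # torch.Size([3, 1, 64])
```

Passing batch_first=True to nn.TransformerEncoderLayer would avoid the transpose entirely; this guide keeps PyTorch's default seq-first layout to match the code above.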
```python
self.fc = nn.Linear(config.hidden_size, config.vocab_size)
```

Shape:

```python
logits = model.fc(encoded)
print(logits.shape)  # (seq_len, batch, vocab_size)
```

Then we transpose it back to (batch, seq_len, vocab_size) for loss computation:

```python
logits = logits.transpose(0, 1)
```
We use:

```python
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits.reshape(-1, vocab_size), labels.view(-1))
```

The logits are flattened to (batch * seq_len, vocab_size) and the labels to (batch * seq_len) before the comparison.
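As a toy illustration of that flattening (sizes here are hypothetical: B=1, S=3, V=10):

```python
import torch
import torch.nn as nn

logits = torch.randn(1, 3, 10)      # (batch, seq_len, vocab_size)
labels = torch.tensor([[2, 7, 4]])  # (batch, seq_len)

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits.reshape(-1, 10), labels.view(-1))  # 3 predictions vs. 3 targets
print(loss.item())                  # a single scalar
```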
```python
def forward(self, input_ids, labels=None):
    x = self.embedding(input_ids)   # (B, S, H)
    x = x.transpose(0, 1)           # (S, B, H)
    x = self.encoder(x)             # (S, B, H)
    x = self.fc(x)                  # (S, B, V)
    x = x.transpose(0, 1)           # (B, S, V)
    loss = None
    if labels is not None:
        loss_fn = nn.CrossEntropyLoss()
        # reshape (not view): the transpose above makes x non-contiguous
        loss = loss_fn(x.reshape(-1, self.config.vocab_size), labels.view(-1))
    return {"loss": loss, "logits": x}
```
Layer | Input Shape | Output Shape |
---|---|---|
input_ids | (B, S) | |
embedding | (B, S) | (B, S, H) |
transpose | (B, S, H) | (S, B, H) |
encoder | (S, B, H) | (S, B, H) |
fc | (S, B, H) | (S, B, V) |
transpose back | (S, B, V) | (B, S, V) |
loss | (B * S, V) + (B * S) | scalar |

Where:
- B = Batch Size
- S = Sequence Length
- H = Hidden Size
- V = Vocab Size
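Putting it together, a quick smoke test might look like this (a sketch; it assumes the TinyConfig and TinyTransformerModel classes defined in model.py later in this guide):

```python
import torch
from model import TinyConfig, TinyTransformerModel

config = TinyConfig()                  # vocab_size=100, hidden_size=64 by default
model = TinyTransformerModel(config)

input_ids = torch.tensor([[1, 3, 5]])  # B=1, S=3
out = model(input_ids, labels=input_ids)
print(out["logits"].shape)             # torch.Size([1, 3, 100]) -> (B, S, V)
print(out["loss"])                     # a scalar loss tensor
```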
tokenizer("hello world", padding="max_length", max_length=10)
Gives:
- input_ids
: list of token ids (padded to length 10)
- labels
: same as input_ids
These go into the model for causal learning, where the model learns to predict the next token.
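One caveat: with labels identical to input_ids and no shift, the cross-entropy loss actually scores the model on reproducing each token in place. For genuine next-token prediction you would shift the labels by one position; a rough sketch (not part of the files below, which keep labels == input_ids for simplicity):

```python
import torch

input_ids = torch.tensor([[2, 3, 4, 0]])  # last position is padding
labels = input_ids.clone()
labels[:, :-1] = input_ids[:, 1:]         # each position is trained to predict the next token
labels[:, -1] = -100                      # -100 is ignored by nn.CrossEntropyLoss by default
```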
tokenizer

🔍 Converts text into numbers (token IDs)

- "hello AI" → [0, 1]
- Adds padding (<pad>) so all inputs are the same length (e.g. 10)
- Returns input_ids, attention_mask, etc.

✅ This transforms human language into structured numerical inputs for the model.
self.embedding = nn.Embedding(vocab_size, hidden_size)

🔍 Maps token IDs into dense vectors

- input_ids = [0, 2, 5] → 3 tokens
- Output shape: [3, 64] (if hidden_size = 64)

✅ It creates the first interpretable numeric representation of the words for the model.
self.encoder = nn.TransformerEncoder(...)

🔍 Applies self-attention and layer stacking

✅ Captures relationships between tokens regardless of position (great for understanding context).
self.fc = nn.Linear(hidden_size, vocab_size)

🔍 Converts each hidden vector into logits over the vocabulary

- With hidden_size = 64 and vocab_size = 10: each 64-dim hidden vector is mapped to 10 logits

✅ This layer generates the final predictions before applying softmax (in the loss).
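A tiny demonstration of that projection, using the hypothetical sizes from the example above:

```python
import torch
import torch.nn as nn

fc = nn.Linear(64, 10)       # hidden_size=64 -> vocab_size=10
hidden = torch.randn(3, 64)  # 3 tokens, each a 64-dim hidden vector
logits = fc(hidden)
print(logits.shape)          # torch.Size([3, 10]) -- one score per vocabulary token
```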
loss_fn = nn.CrossEntropyLoss()

🔍 Compares the model's predictions with the correct next word

- Prediction for "bot" → [0.2, 0.1, ..., 0.6] (logits)
- Correct label → token ID 9
- Cross entropy compares the two and outputs a single loss number

✅ The loss tells how "wrong" the model is; backpropagation will fix the weights accordingly.
Trainer(model=model, args=..., train_dataset=...)

🔍 Manages the training loop: forward pass, backprop, optimizer, logging

✅ Hugging Face Trainer makes model training clean, scalable, and customizable.
Docker

🔍 Ensures everything runs the same on any machine

✅ Reproducibility, sharability, and isolation.
Step | Analogy |
---|---|
Tokenization | Assigning IDs to words |
Embedding | Giving meaning to words |
Transformer Encoder | Understanding sentence context |
Linear Layer | Choosing the next word |
Loss | Grading the guess |
Trainer | The classroom session |
Docker | Your school bag (same everywhere) |
Here's a step-by-step guide to create, train, and fine-tune a tiny Transformer model from scratch using a fully Dockerized setup. It uses PyTorch + Hugging Face Transformers and is perfect for learning the full lifecycle of model building and training.

We'll:

1. Create a tiny Transformer model from scratch.
2. Prepare a toy dataset.
3. Write training logic using Hugging Face's Trainer.
4. Run everything inside a Docker container.
```text
tiny-transformer/
├── Dockerfile
├── requirements.txt
├── train.py
├── model.py
├── dataset.py
└── data/
    └── sample.txt
```
Dockerfile

```dockerfile
FROM python:3.10-slim

WORKDIR /app

# Install basic tools
RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*

# Install dependencies
COPY requirements.txt .
RUN pip install --upgrade pip && pip install -r requirements.txt

COPY . .

CMD ["python", "train.py"]
```
requirements.txt

```text
torch
transformers
datasets
```
data/sample.txt

```text
hello world
hello AI
hi machine
greetings bot
```
model.py

```python
import torch.nn as nn
from transformers import PreTrainedModel, PretrainedConfig


class TinyConfig(PretrainedConfig):
    def __init__(self, vocab_size=100, hidden_size=64, num_hidden_layers=2, num_attention_heads=2, **kwargs):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads


class TinyTransformerModel(PreTrainedModel):
    config_class = TinyConfig

    def __init__(self, config):
        super().__init__(config)
        self.embedding = nn.Embedding(config.vocab_size, config.hidden_size)
        encoder_layer = nn.TransformerEncoderLayer(config.hidden_size, config.num_attention_heads)
        self.encoder = nn.TransformerEncoder(encoder_layer, config.num_hidden_layers)
        self.fc = nn.Linear(config.hidden_size, config.vocab_size)

    def forward(self, input_ids, labels=None):
        x = self.embedding(input_ids)    # (B, S, H)
        x = x.transpose(0, 1)            # (S, B, H) -- PyTorch's encoder is seq-first by default
        x = self.encoder(x)              # (S, B, H)
        logits = self.fc(x)              # (S, B, V)
        logits = logits.transpose(0, 1)  # (B, S, V)
        loss = None
        if labels is not None:
            loss_fn = nn.CrossEntropyLoss()
            # reshape (not view): the transpose above makes logits non-contiguous
            loss = loss_fn(logits.reshape(-1, self.config.vocab_size), labels.view(-1))
        return {"loss": loss, "logits": logits}
```
dataset.py

```python
from datasets import Dataset
from transformers import PreTrainedTokenizerFast


def load_dataset(tokenizer):
    with open("data/sample.txt") as f:
        lines = f.read().splitlines()
    examples = [{"text": line} for line in lines]

    def tokenize(ex):
        tokens = tokenizer(ex["text"], padding="max_length", max_length=10, truncation=True, return_tensors="pt")
        ex["input_ids"] = tokens["input_ids"][0]
        ex["labels"] = tokens["input_ids"][0].clone()
        return ex

    ds = Dataset.from_list(examples)
    ds = ds.map(tokenize)
    return ds
```
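To sanity-check the data pipeline outside of training, you can peek at one tokenized example (a sketch, not a standalone script; it assumes the `tokenizer` object built in train.py below):

```python
# Sketch: assumes `tokenizer` is already constructed as in train.py
from dataset import load_dataset

ds = load_dataset(tokenizer)
print(ds[0]["input_ids"])  # 10 token IDs, padded with the <pad> ID
print(ds[0]["labels"])     # identical to input_ids in this toy setup
```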
train.py

```python
from transformers import Trainer, TrainingArguments, PreTrainedTokenizerFast
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

from model import TinyTransformerModel, TinyConfig
from dataset import load_dataset

# Build a tiny word-level tokenizer for the toy vocabulary.
# (PreTrainedTokenizerFast needs a backend tokenizer object or file; it cannot be created empty.)
vocab = {"<pad>": 0, "<unk>": 1, "hello": 2, "world": 3, "AI": 4, "hi": 5, "machine": 6, "greetings": 7, "bot": 8}
backend = Tokenizer(WordLevel(vocab, unk_token="<unk>"))
backend.pre_tokenizer = Whitespace()
tokenizer = PreTrainedTokenizerFast(tokenizer_object=backend, pad_token="<pad>", unk_token="<unk>")

config = TinyConfig(vocab_size=len(tokenizer))
model = TinyTransformerModel(config)

dataset = load_dataset(tokenizer)

args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    num_train_epochs=10,
    logging_dir="./logs",
    logging_steps=1,
)

trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()
```
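If you want to see the model in action once training finishes, a small optional check could be appended to train.py (a sketch; greedy argmax over the logits):

```python
# Optional sketch: inspect a prediction after training (append to train.py)
import torch

sample = tokenizer("hello", padding="max_length", max_length=10, return_tensors="pt")
with torch.no_grad():
    out = model(input_ids=sample["input_ids"].to(model.device))
predicted_ids = out["logits"].argmax(dim=-1)
print(predicted_ids)  # the model's highest-scoring token ID at each position
```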
```bash
# Navigate to the project folder
cd tiny-transformer

# Build the Docker image
docker build -t tiny-transformer .

# Run the container
docker run --rm -it tiny-transformer
```
You'll see your toy transformer training on simple text and logging the loss per epoch. You can later expand this to:

- Use real datasets (e.g. from the Hugging Face Hub).
- Load a pre-trained tokenizer.
- Add an eval loop and save checkpoints.
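For example, swapping in a pre-trained tokenizer could look roughly like this (a sketch; it assumes internet access to download the GPT-2 tokenizer and reuses TinyConfig from model.py):

```python
from transformers import AutoTokenizer
from model import TinyConfig

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token       # GPT-2 ships without a pad token
config = TinyConfig(vocab_size=len(tokenizer))  # resize the toy model's vocabulary accordingly
```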