Understanding the Transformer Architecture in Depth


🧩 Library Overview

| Library | Purpose |
| --- | --- |
| torch | Core deep learning library (PyTorch) used for building and training neural networks |
| transformers | Hugging Face library that provides utilities to build, train, and load state-of-the-art transformer models |
| datasets | Hugging Face library to easily load, preprocess, and manage NLP datasets |

🧠 Model Overview: TinyTransformerModel

We're building a very small custom transformer from scratch using PyTorch and wrapping it with Hugging Face's PreTrainedModel interface so we can use it with the Trainer.

🔍 The Architecture Has These Parts:

  1. Embedding Layer
  2. TransformerEncoder (multi-layer)
  3. Final Linear Output Layer
  4. Loss Function (CrossEntropyLoss)

🧪 Let's Understand Each Component with I/O and Examples

We'll use the model with dummy data to show what each stage is doing.


✨ Step 1: Embedding Layer

self.embedding = nn.Embedding(config.vocab_size, config.hidden_size)

Example:

```python
input_ids = torch.tensor([[1, 3, 5]])   # batch_size=1, seq_len=3
embedded = model.embedding(input_ids)

print(embedded.shape)  # torch.Size([1, 3, 64])
```

This converts token IDs to dense vectors (learnable word representations).


✨ Step 2: Transformer Encoder

```python
encoder_layer = nn.TransformerEncoderLayer(config.hidden_size, config.num_attention_heads)
self.encoder = nn.TransformerEncoder(encoder_layer, config.num_hidden_layers)
```

⚠️ By default (batch_first=False), PyTorch's TransformerEncoder expects input of shape (seq_len, batch_size, hidden_size)

So we need:

```python
x = embedded.transpose(0, 1)   # make shape (seq_len, batch, hidden)
encoded = self.encoder(x)
```
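Alternatively, PyTorch 1.9+ lets you construct the encoder layer with batch_first=True, which keeps the batch dimension first and removes the need for the transposes. A minimal standalone sketch (sizes are illustrative):

```python
import torch
import torch.nn as nn

hidden_size, num_heads, num_layers = 64, 2, 2

# With batch_first=True the encoder accepts (batch, seq_len, hidden) directly.
encoder_layer = nn.TransformerEncoderLayer(hidden_size, num_heads, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers)

embedded = torch.randn(1, 3, hidden_size)  # (batch=1, seq_len=3, hidden=64)
encoded = encoder(embedded)                # same shape out: (1, 3, 64)
print(encoded.shape)                       # torch.Size([1, 3, 64])
```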


✨ Step 3: Output Linear Layer

self.fc = nn.Linear(config.hidden_size, config.vocab_size)

Shape:

```python
logits = model.fc(encoded)
print(logits.shape)  # (seq_len, batch, vocab_size)
```

Then we transpose it back to (batch, seq_len, vocab_size) for loss computation:

```python
logits = logits.transpose(0, 1)
```


✨ Step 4: Loss Function

We use:

```python
loss_fn = nn.CrossEntropyLoss()
```
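To make the shapes concrete, here is a small self-contained example with dummy values (B=1, S=3, V=8 are illustrative). CrossEntropyLoss expects (N, C) logits and (N,) class indices, so both tensors are flattened over batch and sequence:

```python
import torch
import torch.nn as nn

B, S, V = 1, 3, 8                      # batch, sequence length, vocab size
logits = torch.randn(B, S, V)          # model output: (B, S, V)
labels = torch.tensor([[1, 3, 5]])     # target token IDs: (B, S)

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits.reshape(-1, V), labels.view(-1))  # flatten to (B*S, V) and (B*S,)
print(loss)                            # a scalar tensor
```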


🔁 Complete Forward Pass Flow

```python
def forward(self, input_ids, labels=None):
    x = self.embedding(input_ids)                        # (B, S, H)
    x = x.transpose(0, 1)                                # (S, B, H)
    x = self.encoder(x)                                  # (S, B, H)
    x = self.fc(x)                                       # (S, B, V)
    x = x.transpose(0, 1)                                # (B, S, V)

    loss = None
    if labels is not None:
        loss_fn = nn.CrossEntropyLoss()
        # reshape (not view): x is non-contiguous after the transposes
        loss = loss_fn(x.reshape(-1, self.config.vocab_size), labels.view(-1))

    return {"loss": loss, "logits": x}
```
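As a quick sanity check, you can run dummy data through the model (using the TinyConfig and TinyTransformerModel classes defined in model.py further below):

```python
import torch
from model import TinyConfig, TinyTransformerModel  # defined in Step 4 of the setup below

config = TinyConfig(vocab_size=100, hidden_size=64)
model = TinyTransformerModel(config)

input_ids = torch.tensor([[1, 3, 5]])                # (B=1, S=3)
out = model(input_ids=input_ids, labels=input_ids)   # labels are optional
print(out["logits"].shape)                           # torch.Size([1, 3, 100])
print(out["loss"])                                   # scalar tensor when labels are given
```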

🔬 Summary: Shapes at Each Stage

| Layer | Input Shape | Output Shape |
| --- | --- | --- |
| input_ids | – | (B, S) |
| embedding | (B, S) | (B, S, H) |
| transpose | (B, S, H) | (S, B, H) |
| encoder | (S, B, H) | (S, B, H) |
| fc | (S, B, H) | (S, B, V) |
| transpose back | (S, B, V) | (B, S, V) |
| loss | (B * S, V) + (B * S) | scalar |

Where:
- B = Batch Size
- S = Sequence Length
- H = Hidden Size
- V = Vocab Size


🔁 Dataset Flow

tokenizer("hello world", padding="max_length", max_length=10)

Gives:
- input_ids: list of token IDs, padded to length 10
- attention_mask: 1 for real tokens, 0 for padding

In dataset.py we copy input_ids into labels, and both go into the model. Because the labels are not shifted, the model learns to reproduce each token in place; for true causal (next-token) learning you would shift the labels by one position and add a causal attention mask.
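A sketch of what this step produces (the exact IDs depend on the tokenizer built in train.py below; the values shown are illustrative):

```python
# Illustrative only: actual IDs depend on the tokenizer's vocabulary.
enc = tokenizer("hello world", padding="max_length", max_length=10, truncation=True)
print(enc["input_ids"])       # e.g. [2, 3, 0, 0, 0, 0, 0, 0, 0, 0]  (padded to length 10)
print(enc["attention_mask"])  # e.g. [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

labels = list(enc["input_ids"])  # dataset.py copies input_ids into labels
```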



🧠 STEP-BY-STEP EXPLANATION (Purpose & Logic)


🔹 1. Tokenization (tokenizer)

🔄 Converts text into numbers (token IDs)

✅ This transforms human language into structured numerical inputs for the model.


🔹 2. Embedding Layer

self.embedding = nn.Embedding(vocab_size, hidden_size)

🔄 Maps token IDs into dense vectors

✅ It creates the first interpretable numeric representation of the words for the model.


🔹 3. Transformer Encoder

self.encoder = nn.TransformerEncoder(...)

🔄 Applies self-attention and layer stacking

✅ Captures relationships between tokens regardless of position (great for understanding context).
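Under the hood, each encoder layer computes scaled dot-product attention, softmax(QKᵀ/√d)·V. A minimal hand-rolled, single-head sketch with random tensors, just to show the mechanics (the real layer uses learned projections and multiple heads):

```python
import math
import torch
import torch.nn.functional as F

S, H = 3, 64                          # sequence length, hidden size
x = torch.randn(S, H)                 # token representations for one sequence

# In a real layer, Q, K, V come from learned linear projections of x.
Q, K, V = x, x, x
scores = Q @ K.T / math.sqrt(H)       # (S, S): pairwise similarity between tokens
weights = F.softmax(scores, dim=-1)   # each row sums to 1: how much each token attends to the others
attended = weights @ V                # (S, H): context-mixed representations
print(weights.shape, attended.shape)  # torch.Size([3, 3]) torch.Size([3, 64])
```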


🔹 4. Feedforward Output Layer

self.fc = nn.Linear(hidden_size, vocab_size)

🔄 Converts each hidden vector into logits over the vocabulary

✅ This layer generates the final predictions before softmax is applied (inside the loss).
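For intuition, here is how logits relate to probabilities and to concrete predictions. Softmax is applied implicitly inside CrossEntropyLoss during training, but you can apply it explicitly at inference (dummy values below):

```python
import torch
import torch.nn.functional as F

V = 8
logits = torch.randn(1, 3, V)      # (B, S, V) coming out of the linear layer
probs = F.softmax(logits, dim=-1)  # probability distribution over the vocabulary per position
pred_ids = probs.argmax(dim=-1)    # most likely token ID at each position
print(probs.sum(dim=-1))           # each position sums to ~1.0
print(pred_ids.shape)              # torch.Size([1, 3])
```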


🔹 5. Loss Function (Cross Entropy)

loss_fn = nn.CrossEntropyLoss()

🔄 Compares the model's predictions with the target tokens (labels)

✅ The loss tells how "wrong" the model is; backpropagation then adjusts the weights accordingly.


🔹 6. Trainer

Trainer(model=model, args=..., train_dataset=...)

🔄 Manages the training loop: forward pass, backprop, optimizer, logging

✅ The Hugging Face Trainer makes model training clean, scalable, and customizable.


🔹 7. Dockerization

🔄 Ensures everything runs the same on any machine

✅ Reproducibility, shareability, and isolation.


🔍 Analogy: Model as a Student Learning a Language

| Step | Analogy |
| --- | --- |
| Tokenization | Assigning IDs to words |
| Embedding | Giving meaning to words |
| Transformer Encoder | Understanding sentence context |
| Linear Layer | Choosing the next word |
| Loss | Grading the guess |
| Trainer | The classroom session |
| Docker | Your school bag (same everywhere) |

RUNNING

Here's a step-by-step guide to create, train, and fine-tune a tiny Transformer model from scratch using a fully Dockerized setup. It uses PyTorch + Hugging Face Transformers and covers the full lifecycle of model building and training.


🛠️ Overview

We'll:
1. Create a tiny Transformer model from scratch.
2. Prepare a toy dataset.
3. Write training logic using Hugging Face's Trainer.
4. Run everything inside a Docker container.


📁 Project Structure

```
tiny-transformer/
├── Dockerfile
├── requirements.txt
├── train.py
├── model.py
├── dataset.py
└── data/
    └── sample.txt
```

🐳 Step 1: Create Dockerfile

```dockerfile
FROM python:3.10-slim

WORKDIR /app

# Install basic tools
RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*

# Install dependencies
COPY requirements.txt .
RUN pip install --upgrade pip && pip install -r requirements.txt

COPY . .

CMD ["python", "train.py"]
```

📦 Step 2: requirements.txt

```
torch
transformers
datasets
```

📊 Step 3: Create a Tiny Dataset

data/sample.txt

```
hello world
hello AI
hi machine
greetings bot
```

🧠 Step 4: Define a Tiny Transformer

model.py

```python
import torch.nn as nn
from transformers import PreTrainedModel, PretrainedConfig

class TinyConfig(PretrainedConfig):
    def __init__(self, vocab_size=100, hidden_size=64, num_hidden_layers=2, num_attention_heads=2, **kwargs):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads

class TinyTransformerModel(PreTrainedModel):
    config_class = TinyConfig

    def __init__(self, config):
        super().__init__(config)
        self.embedding = nn.Embedding(config.vocab_size, config.hidden_size)
        encoder_layer = nn.TransformerEncoderLayer(config.hidden_size, config.num_attention_heads)
        self.encoder = nn.TransformerEncoder(encoder_layer, config.num_hidden_layers)
        self.fc = nn.Linear(config.hidden_size, config.vocab_size)

    def forward(self, input_ids, labels=None):
        x = self.embedding(input_ids)        # (B, S, H)
        x = x.transpose(0, 1)                # (S, B, H): the encoder expects seq-first by default
        x = self.encoder(x)                  # (S, B, H)
        logits = self.fc(x)                  # (S, B, V)
        logits = logits.transpose(0, 1)      # (B, S, V)
        loss = None
        if labels is not None:
            loss_fn = nn.CrossEntropyLoss()
            loss = loss_fn(logits.reshape(-1, self.config.vocab_size), labels.view(-1))
        return {"loss": loss, "logits": logits}
```

📚 Step 5: Load the Dataset

dataset.py

```python
from datasets import Dataset

def load_dataset(tokenizer):
    with open("data/sample.txt") as f:
        lines = f.read().splitlines()

    examples = [{"text": line} for line in lines]

    def tokenize(ex):
        tokens = tokenizer(ex["text"], padding="max_length", max_length=10, truncation=True, return_tensors="pt")
        ex["input_ids"] = tokens["input_ids"][0]
        ex["labels"] = tokens["input_ids"][0].clone()
        return ex

    ds = Dataset.from_list(examples)
    ds = ds.map(tokenize)
    return ds
```
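A quick way to sanity-check this step (assuming the tokenizer object built in train.py below is in scope) is to load the dataset and inspect one example:

```python
# Quick check, run from the project root with the tokenizer from train.py in scope.
from dataset import load_dataset

ds = load_dataset(tokenizer)
print(ds[0]["input_ids"])  # a list of 10 token IDs (padded)
print(ds[0]["labels"])     # identical to input_ids in this toy setup
```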

πŸ‹οΈ Step 6: Training Script

train.py

```python
from transformers import Trainer, TrainingArguments, PreTrainedTokenizerFast
from tokenizers import Tokenizer, models, pre_tokenizers
from model import TinyTransformerModel, TinyConfig
from dataset import load_dataset

# Build a tiny word-level tokenizer over the toy vocabulary.
# (PreTrainedTokenizerFast needs a backing `tokenizers` object; it cannot be created empty.)
vocab = {"<pad>": 0, "<unk>": 1, "hello": 2, "world": 3, "AI": 4,
         "hi": 5, "machine": 6, "greetings": 7, "bot": 8}
backend = Tokenizer(models.WordLevel(vocab, unk_token="<unk>"))
backend.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer = PreTrainedTokenizerFast(tokenizer_object=backend, pad_token="<pad>", unk_token="<unk>")

config = TinyConfig(vocab_size=len(tokenizer))
model = TinyTransformerModel(config)

dataset = load_dataset(tokenizer)

args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    num_train_epochs=10,
    logging_dir="./logs",
    logging_steps=1
)

trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()
```

🚀 Step 7: Build & Run with Docker

```bash
# Navigate to the project folder
cd tiny-transformer

# Build the Docker image
docker build -t tiny-transformer .

# Run the container
docker run --rm -it tiny-transformer
```

✅ Output

You'll see your toy transformer training on simple text and logging the loss as it trains. You can later expand this to:
- Use real datasets (e.g. from the Hugging Face Hub).
- Load a pre-trained tokenizer.
- Add an eval loop and save checkpoints.
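As a rough sketch of those extensions (extending train.py; the dataset name and paths here are just examples):

```python
# Sketches of the suggested extensions; names and paths are placeholders.
from datasets import load_dataset as hf_load_dataset
from transformers import AutoTokenizer

# 1) A real dataset from the Hugging Face Hub:
wiki = hf_load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# 2) A pre-trained tokenizer instead of the hand-built one:
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# 3) Saving the trained model and tokenizer after trainer.train():
trainer.save_model("./results/final")
tokenizer.save_pretrained("./results/final")
```

If you swap in a pre-trained tokenizer, remember to rebuild TinyConfig with the new vocabulary size (len(tokenizer)).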