A step-by-step guide to designing, training, and evaluating a small transformer-based language model using entirely synthetic data — covering tokenization, attention mechanisms, training loops, and inference, explained from first principles.
Every major language model — GPT, LLaMA, Gemini — is a scaled-up version of the same ideas. By building something small enough to run on a laptop, you get to hold the whole system in your head at once. This article walks through every line of logic behind MeowLM — a transformer trained exclusively on synthetic conversational data to respond like a cat. The goal is not to build something production-ready. The goal is to understand, at an engineering level, what actually happens when you train a language model: how text becomes numbers, how those numbers flow through a network, how the network learns, and how you eventually get coherent text out the other end.
Fig. 01 — Language models are networks of learned connections — each layer refining meaning from the one before
The Complete Pipeline
Before touching any code, here is the full system and how each stage feeds into the next:
| Step | Phase | Details |
|---|---|---|
| 1 | Synthetic Data Generation | Template prompts × random responses |
| 2 | Tokenizer Training (BPE) | Text ➔ integer IDs (vocab size: 4096) |
| 3 | Dataset Loading | Batched tensors with causal shift (x, y) |
| 4 | Model Architecture (Transformer) | Embeddings ➔ Attention ➔ FFN ➔ Logits |
| 5 | Training Loop | Forward ➔ Loss ➔ Backprop ➔ Weight Update |
| 6 | Inference | Prompt ➔ Token-by-token generation |
| 7 | Evaluation | 15 hand-authored keyword coverage tests |
Each step depends on the one above it. A mistake in the tokenizer breaks the model. A bug in the data format breaks the tokenizer. Understanding each stage in isolation — and how they connect — is the whole point.
Part 1: Configuration — Deciding the Shape of Your Model
Before writing a single layer of the neural network, we need to decide its shape. This is what config.py handles. Think of it as a blueprint that every other part of the system reads from.
from dataclasses import dataclass

@dataclass
class MeowConfig:
    vocab_size: int = 4096
    max_seq_len: int = 128
    d_model: int = 256
    n_layers: int = 4
    n_heads: int = 4
    ffn_hidden: int = 512
    dropout: float = 0.1
    pad_id: int = 0
    bos_id: int = 1  # <|im_start|>
    eos_id: int = 2  # <|im_end|>
vocab_size = 4096 — The number of distinct tokens the model knows. Every possible output must be one of these 4096 tokens. Larger vocabularies cover more language with fewer tokens per sentence but require more memory. For a small experimental model, 4096 is sufficient.
max_seq_len = 128: The model can only see 128 tokens at a time. This is its working memory, the context window. Real-world models use 4096, 8192, or even 1 million token windows. Longer windows are far more expensive because the memory and time that attention needs grow with the square of the sequence length.
d_model = 256 — Every token gets represented as a list of 256 numbers. This is called an embedding or hidden state. Think of it as a coordinate in a 256-dimensional space where position encodes meaning. Real models use 4096 or more. Larger means more expressive, but more computation.
n_layers = 4 — The model is 4 stacked transformer blocks. Each block reads the current token representations and updates them based on context. After 4 passes, each token's representation has been refined 4 times. Deeper networks can learn more complex patterns.
n_heads = 4 — Inside each block, multi-head attention runs 4 independent computations in parallel, each looking at the sequence from a different learned perspective. One head might specialize in tracking subjects, another verbs, another conversational tone. The results are then combined.
ffn_hidden = 512 — After attention, each token passes through a small two-layer feed-forward network with an intermediate size of 512. This is where per-token computation happens — transforming each position independently of the others.
dropout = 0.1 — During training only, 10% of neuron activations are randomly zeroed at each step. This prevents over-reliance on any single path through the network and helps the model generalize to inputs it hasn't seen before. At inference time, dropout is turned off.
The training configuration lives separately and controls the process of learning rather than the shape of the model:
@dataclass
class TrainConfig:
    batch_size: int = 32
    learning_rate: float = 3e-4
    min_lr: float = 3e-5
    weight_decay: float = 0.1
    warmup_steps: int = 200
    max_steps: int = 10000
    eval_interval: int = 200
    save_interval: int = 500
    grad_clip: float = 1.0
learning_rate = 3e-4 controls how large each weight adjustment is. Too large and the model bounces around without converging; too small and training takes forever. 3e-4 is a reliable starting point for the AdamW optimizer on small models. The warmup_steps and min_lr settings describe a learning rate schedule — explained fully in the training section.
Part 2: Synthetic Data Generation — Teaching the Cat to Speak
Most real-world language models train on billions of words scraped from the internet. We have a more constrained and intentional setup: we generate exactly the data we want the model to learn from, with a known structure.
This approach is called synthetic data generation, and it is increasingly used in real AI development to control the distribution of training examples, avoid harmful content, or teach a model a specific behavior pattern.
The Template System
The data is built from vocabulary pools and topic generator functions:
FOODS = ["tuna", "chicken wet food", "gravy lovers beef", "salmon pate", ...]
GHOSTS = ["greebles", "a dust mote", "a shadow", "the invisible bug", ...]
gen_food_demand = _topic(
    ["are you hungry?", "do you want food?", "i just fed you!"],
    [
        "i am starving. feed me tuna.",
        "the bowl is empty in the middle. this is an emergency.",
        "i will perish if not fed immediately.",
    ],
    "food_demand",
)
The _topic function returns a callable. When called, it picks a random human prompt and pairs it with a random cat response. This callable is invoked thousands of times, producing a dataset with high lexical diversity (many distinct prompt and response pairings, although repeats are possible when the pools are small) but a controlled distribution (every topic appears roughly equally often).
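The article does not show `_topic` itself. A minimal sketch of such a factory, assuming each sample is a dict with `input` and `output` keys (the keys `format_sample` reads later) plus a hypothetical `topic` field, might look like this:

```python
import random

def _topic(prompts, responses, name):
    """Return a generator function that pairs a random prompt with a random response."""
    def generate(rng=random):
        return {
            "input": rng.choice(prompts),
            "output": rng.choice(responses),
            "topic": name,  # hypothetical field, handy for checking topic balance
        }
    return generate

gen_food_demand = _topic(
    ["are you hungry?", "do you want food?", "i just fed you!"],
    ["i am starving. feed me tuna.", "i will perish if not fed immediately."],
    "food_demand",
)

rng = random.Random(42)  # a seeded generator keeps the dataset reproducible
samples = [gen_food_demand(rng) for _ in range(5)]
for s in samples:
    print(s["input"], "->", s["output"])
```

With fixed lists like these, only len(prompts) × len(responses) distinct pairs exist, so real diversity has to come from larger pools or from filling slots with words such as those in FOODS.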
Why Synthetic Data?
Control. We know exactly what the model will see. No ambiguity about contradictory examples, offensive content, or confusing patterns.
Coverage. Every topic (food, zoomies, bath time, vet visits) appears roughly the same number of times. If we scraped real cat text from the internet, some topics would dominate and others would be rare, leading to uneven performance.
Reproducibility. random.seed(42) at the top means the same dataset is generated every run. This is critical for debugging and comparing training runs.
The Conversation Format
Each generated sample is formatted into a structured template before being written to disk:
def format_sample(s):
    return (
        f"<|im_start|>user\n{s['input']}<|im_end|>\n"
        f"<|im_start|>assistant\n{s['output']}<|im_end|>"
    )
This produces text like:
<|im_start|>user
are you hungry?<|im_end|>
<|im_start|>assistant
i am starving. feed me tuna.<|im_end|>
The special tokens <|im_start|> and <|im_end|> are structural markers that tell the model where turns begin and end. After thousands of examples formatted this way, the model learns that <|im_start|>assistant is the signal to begin generating a response.
The format is not metadata. It is part of the language the model learns. It teaches the model the difference between a prompt and a response.
This is the same technique used in ChatGPT-style instruction tuning — the structure of the data teaches the model its role.
Train / Eval Split
n_eval = int(len(samples) * eval_ratio) # 5% of total
eval_samples, train_samples = samples[:n_eval], samples[n_eval:]
5% of samples go to a held-out evaluation set that is never shown to the model during training. This gives an honest measure of whether the model learned general patterns or just memorized specific examples — the distinction between training loss and generalization.
Part 3: Tokenization — Turning Text Into Numbers
Neural networks operate on numbers, not strings. Before any text can be processed by the model, it must be converted into a sequence of integers by a tokenizer.
Fig. 02 — Tokenization transforms human-readable text into numerical sequences a model can process
What is a Tokenizer?
A tokenizer is a bidirectional mapping between text and integer IDs:
"i am starving" → [42, 891, 1203]
[42, 891, 1203] → "i am starving"
The tokenizer has a fixed vocabulary — a dictionary of all known tokens and their integer IDs. The model only knows about tokens in this dictionary.
Byte-Pair Encoding (BPE)
We use Byte-Pair Encoding, the same algorithm underlying GPT-2 and GPT-4's tokenizers. Here is how it builds the vocabulary:
Step 0: Start with individual characters as vocabulary
"meow" → ['m', 'e', 'o', 'w']
Step 1: Find the most frequent adjacent pair across all training text
Suppose 'e'+'o' appears most often → merge into 'eo'
"meow" → ['m', 'eo', 'w']
Step 2: Repeat until vocab reaches target size (4096)
Common words become single tokens:
"meow" → [1847] (appeared thousands of times)
"<|im_start|>" → [1] (reserved)
"asdfghjkl" → [42, 18, 3, ...] (rare — stays split)
Frequent sequences get compressed into single tokens, making the model's job easier. Rare character combinations stay broken into smaller pieces that are still in the vocabulary. The model never encounters a completely unknown input.
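The merge loop in Steps 0 to 2 can be sketched in a few lines of plain Python. This is a toy illustration on a three-word corpus with made-up frequencies, not the byte-level implementation the library uses; the function names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: word (as a tuple of characters) -> frequency
words = {tuple("meow"): 5, tuple("mew"): 2, tuple("now"): 3}
merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)  # [('o', 'w'), ('m', 'e'), ('me', 'ow')]
print(words)   # 'meow' is now a single symbol
```

After three merges the frequent word "meow" has collapsed into one symbol while the rarer words stay split. In the project this loop is handled by the tokenizers library, configured as follows.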
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=4096,
    special_tokens=["<pad>", "<|im_start|>", "<|im_end|>"],  # listed in ID order: 0, 1, 2
    min_frequency=2,
)
tokenizer.train_from_iterator(texts, trainer)
ByteLevel pre-tokenization maps every byte (including spaces and special characters) to a printable unicode character before BPE runs. This guarantees the tokenizer can handle any input without errors — every possible byte has a known representation.
Special Tokens and Their Reserved IDs
ID 0 → <pad> Padding — fills batches to equal length
ID 1 → <|im_start|> Start of a conversation turn
ID 2 → <|im_end|> End of a turn / generation stop signal
These IDs are reserved regardless of how often the tokens appear. During training, loss is not computed at padding positions (ignore_index=0). During generation, sampling ID 2 tells the generation loop to stop. The model learns when a response is complete because every training example ends with <|im_end|>.
Part 4: The Dataset Loader — Feeding Batches to the Model
The Core Insight: Next-Token Prediction
Language models train by predicting the next token given all previous tokens. This is called causal language modeling. Every position in the sequence is simultaneously an input and a supervision target:
Full sequence: [<im_start>, user, \n, hungry, ?, <im_end>, \n, <im_start>, assistant, \n, starving, <im_end>]
Input (x): [<im_start>, user, \n, hungry, ?, <im_end>, \n, <im_start>, assistant, \n, starving ]
Target (y): [user, \n, hungry, ?, <im_end>, \n, <im_start>, assistant, \n, starving, <im_end> ]
x = ids[:-1] # Everything but the last token
y = ids[1:] # Everything but the first token (x shifted one position left)
The model sees x and must predict y at every position simultaneously. Every token — including structural tokens like <|im_end|> — is something the model trains to predict. This is why the model learns when to stop: it sees thousands of examples where <|im_end|> follows the last word of the response.
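To make the shift concrete, here is a small sketch that prints what the model sees and what it must predict at each position. It uses readable strings in place of integer IDs purely for illustration.

```python
tokens = ["<|im_start|>", "user", "\n", "hungry", "?", "<|im_end|>"]

x = tokens[:-1]  # inputs: everything but the last token
y = tokens[1:]   # targets: the token that follows each input position

for i, target in enumerate(y):
    context = x[: i + 1]  # the causal mask limits position i to this prefix
    print(f"position {i}: sees {context!r} -> predicts {target!r}")
```

A sequence of N tokens therefore yields N - 1 training examples in a single forward pass.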
Padding and Batching
Training processes 32 sequences simultaneously. Sequences have different lengths. To process them as a single rectangular matrix (which vectorized operations require), shorter sequences are padded with zeros:
import torch

def collate_fn(batch, pad_id=0):
    # batch is a list of (x, y) pairs, assumed to be 1-D LongTensors
    max_len = max(len(x) for x, _ in batch)
    padded_x = torch.full((len(batch), max_len), pad_id, dtype=torch.long)
    padded_y = torch.full((len(batch), max_len), pad_id, dtype=torch.long)
    for i, (x, y) in enumerate(batch):
        padded_x[i, :len(x)] = x  # real tokens fill the front
        padded_y[i, :len(y)] = y  # pad_id (zeros) fills the remainder
    return padded_x, padded_y
Batch of 3 sequences padded to max length 5:
[42, 891, 1203, 77, 2] ← length 5, no padding needed
[42, 14, 0, 0, 0] ← length 2, padded with 3 zeros
[ 1, 88, 203, 2, 0] ← length 4, padded with 1 zero
Padding zeros in the target tensor are ignored during loss calculation — the model is not penalized for its predictions at meaningless padding positions, and those positions contribute nothing to the gradient.
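The effect of ignoring padding can be shown without PyTorch. The sketch below pads the same three sequences and averages a made-up per-token loss only over real targets; the helper names and loss values are hypothetical.

```python
PAD_ID = 0

def pad_batch(seqs, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]

def masked_mean(per_token_loss, targets, pad_id=PAD_ID):
    """Average the loss over positions whose target is not padding."""
    kept = [
        loss
        for loss_row, target_row in zip(per_token_loss, targets)
        for loss, t in zip(loss_row, target_row)
        if t != pad_id
    ]
    return sum(kept) / len(kept)

targets = pad_batch([[42, 891, 1203, 77, 2], [42, 14], [1, 88, 203, 2]])

# Pretend the model scores 2.0 on real tokens and 9.0 on meaningless pad slots.
losses = [[2.0 if t != PAD_ID else 9.0 for t in row] for row in targets]

naive = sum(sum(row) for row in losses) / sum(len(row) for row in losses)
masked = masked_mean(losses, targets)
print(f"naive mean: {naive:.3f}, masked mean: {masked:.3f}")
```

This only works because ID 0 is never a legitimate target; if the pad ID doubled as a real token, those positions would be silently dropped from training.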
Part 5: The Model — How a Transformer Works
This is the heart of the system. The model in model.py is a decoder-only transformer — the same architecture used in the GPT family. Here is the full data flow:
Input Token IDs: [1, 42, 891, 1203, 2]
          │
┌─────────▼──────────┐
│  Token Embedding   │  4096 × 256 lookup table
│ + Positional Embed │  128 × 256 lookup table
└─────────┬──────────┘
          │  shape: [Batch, Seq_len, 256]
          │
┌─────────▼──────────┐ ┐
│     LayerNorm      │ │
│     Multi-Head     │ │
│   Attention (×4)   │ │  × 4 Transformer
│     + Residual     │ │    Blocks
├────────────────────┤ │
│     LayerNorm      │ │
│  FFN 256→512→256   │ │
│     + Residual     │ │
└─────────┬──────────┘ ┘
          │
┌─────────▼──────────┐
│  Final LayerNorm   │
│   LM Head Linear   │  256 → 4096
└─────────┬──────────┘
          │
Logits: [Batch, Seq_len, 4096]
One score per vocab token, per position
Embeddings — Giving Tokens a Location in Space
self.tok_emb = nn.Embedding(config.vocab_size, config.d_model)
self.pos_emb = nn.Embedding(config.max_seq_len, config.d_model)
pos = torch.arange(T, device=idx.device)
x = self.drop(self.tok_emb(idx) + self.pos_emb(pos))
nn.Embedding is a lookup table. Given an integer ID, it returns a row from a learned matrix. The token embedding (4096 × 256) encodes what a token means. The positional embedding (128 × 256) encodes where it appears in the sequence. Both are summed element-wise.
The positional embedding is critical because attention — the next step — operates on sets and has no inherent sense of order. Without positional embeddings, "the cat sat" and "sat the cat" would look identical to the model.
The Causal Mask — Preventing the Model from Cheating
mask = torch.tril(torch.ones(T, T, device=idx.device)).unsqueeze(0).unsqueeze(0)
torch.tril creates a lower-triangular matrix of ones:
Positions:  0  1  2  3
        0  [1, 0, 0, 0]  ← token 0 can only see itself
        1  [1, 1, 0, 0]  ← token 1 can see positions 0–1
        2  [1, 1, 1, 0]  ← token 2 can see positions 0–2
        3  [1, 1, 1, 1]  ← token 3 can see positions 0–3
Zeros become -inf before the softmax step, which drives their attention weight to exactly zero. The model cannot peek at future tokens. This is critical: at inference time, future tokens don't exist yet — the model generates one token at a time. Training with the causal mask faithfully simulates that left-to-right process.
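A plain-Python sketch of that masking step, using a hypothetical 3 × 3 score matrix, shows why -inf produces an attention weight of exactly zero:

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    T = len(scores)
    weights = []
    for i in range(T):
        row = [scores[i][j] if j <= i else float("-inf") for j in range(T)]
        m = max(row)                           # finite: the diagonal is always visible
        exps = [math.exp(s - m) for s in row]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [
    [0.5, 2.0, -1.0],
    [1.0, 1.0, 3.0],
    [0.0, 0.0, 0.0],
]
weights = causal_softmax(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The large score of 3.0 in the second row sits on a future position, so it has no influence: that row splits its weight evenly between positions 0 and 1.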
Multi-Head Attention — How Tokens Talk to Each Other
Attention is the mechanism that allows every token to gather information from other tokens. This is the defining operation of the transformer — every token can directly communicate with every other visible token in a single step.
For each of 4 attention heads (each with 64 dimensions):
Token i asks: "What do I need?" → Query Q_i [64-dim]
Token j says: "Here's what I contain" → Key K_j [64-dim]
Token j offers: "Here's what I give you" → Value V_j [64-dim]
Score(i→j) = dot(Q_i, K_j) / sqrt(64) ← how relevant is j to i?
Weight(i→j) = softmax( Score(i→j) ) ← normalized attention weight
Output_i = Σⱼ Weight(i→j) × V_j ← weighted sum of all values
def forward(self, x, mask=None):
    B, T, C = x.shape  # batch, sequence length, d_model
    # One fused projection yields Q, K and V, which are then split per head
    qkv = self.qkv(x).reshape(B, T, 3, self.n_heads, self.head_dim)
    qkv = qkv.permute(2, 0, 3, 1, 4)  # -> [3, B, n_heads, T, head_dim]
    q, k, v = qkv[0], qkv[1], qkv[2]
    attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)  # [B, n_heads, T, T]
    if mask is not None:
        attn = attn.masked_fill(mask == 0, float("-inf"))
    attn = self.dropout(F.softmax(attn, dim=-1))
    return self.out((attn @ v).transpose(1, 2).contiguous().view(B, T, C))
The division by sqrt(64) = 8 is not cosmetic. The variance of a dot product grows with the dimension, so unscaled scores become large, push softmax toward near one-hot distributions, and leave vanishingly small gradients, which stalls learning. Dividing by the square root of the head dimension keeps the scores in a well-behaved range.
All 4 heads run in parallel, each with its own learned Q, K, V projections. Each can specialize in different aspects of the sequence. The outputs are concatenated and projected back to d_model=256.
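The scaling argument can be checked empirically. The sketch below assumes query and key entries are independent standard normal values (roughly what random initialization gives) and measures the spread of their dot products before and after dividing by sqrt(head_dim).

```python
import math
import random
import statistics

random.seed(0)
head_dim = 64

def random_dot(d):
    """Dot product of two random d-dimensional vectors with N(0, 1) entries."""
    q = [random.gauss(0, 1) for _ in range(d)]
    k = [random.gauss(0, 1) for _ in range(d)]
    return sum(a * b for a, b in zip(q, k))

raw = [random_dot(head_dim) for _ in range(2000)]
scaled = [s / math.sqrt(head_dim) for s in raw]

raw_std = statistics.pstdev(raw)
scaled_std = statistics.pstdev(scaled)
print(f"unscaled std: {raw_std:.2f}, scaled std: {scaled_std:.2f}")
```

Under these assumptions the unscaled scores have a standard deviation near sqrt(64) = 8, while the scaled ones sit near 1, which keeps softmax away from saturation.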
Feed-Forward Network — Per-Token Processing
class FFN(nn.Module):
    def forward(self, x):
        return self.dropout(self.down(F.relu(self.up(x))))
After attention, each token passes independently through a two-layer network: 256 → 512 → 256. F.relu applies a non-linearity (zeroes out negative values), giving the network the ability to learn non-linear patterns.
Attention and FFN are complementary. Attention decides which tokens are relevant to which — routing information across positions. The FFN decides what to do with that information — transforming it at each position independently. Both together are why each transformer block is more expressive than either mechanism alone.
Residual Connections and Layer Normalization
class Block(nn.Module):
    def forward(self, x, mask=None):
        x = x + self.attn(self.norm1(x), mask)  # Residual
        x = x + self.ffn(self.norm2(x))  # Residual
        return x
Residual connections (x = x + ...) create a shortcut path through the network. Gradients can flow directly from the output back to the earliest layers without passing through every intermediate computation. Without residuals, training becomes increasingly difficult as depth grows, because the gradient signal weakens with each layer it passes through (the vanishing gradient problem). Residuals largely mitigate this and make networks with 100 or more layers trainable.
Layer Normalization (applied before each sub-layer — Pre-LN) normalizes each token's representation to have mean 0 and standard deviation 1 across the feature dimension. This keeps activations in a numerically stable range throughout training, preventing them from exploding to very large numbers or collapsing toward zero.
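As a rough illustration, here is the normalization and the Pre-LN residual pattern for a single token vector in plain Python. It omits the learned scale and shift parameters that nn.LayerNorm also applies, and the helper names are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token's feature vector to mean 0 and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_ln_residual(x, sublayer):
    """Pre-LN block step: normalize, apply the sub-layer, add the input back."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [2.0, 4.0, 6.0, 8.0]
normed = layer_norm(x)
print([round(v, 3) for v in normed])

# A sub-layer that outputs zeros leaves x untouched: the residual path is an identity.
unchanged = pre_ln_residual(x, lambda h: [0.0] * len(h))
print(unchanged)
```

The identity behavior in the last line is the shortcut that lets gradients reach early layers directly.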
The Language Model Head and Weight Tying
self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)
self.lm_head.weight = self.tok_emb.weight # tie weights
The final linear layer projects each token's 256-dimensional representation to 4096 numbers — one for each vocabulary entry. These are logits. Higher logits mean more probability assigned to that token being next.
The weight matrix of lm_head is set to be identical to the token embedding matrix. The reasoning: the same geometry that encodes "what does token 42 mean as input" should encode "how much does the current state resemble token 42 as output." This reduces parameter count and often improves language modeling performance on small models, because every gradient update that refines how a token is represented as input also refines how it is recognized as output.
Computing the Loss
loss = F.cross_entropy(
    logits.view(-1, self.config.vocab_size),
    targets.view(-1),
    ignore_index=0,
)
Cross-entropy loss measures how well the predicted probability distribution matches the actual next token. If the correct next token is ID 891 and the model assigns it 90% probability, the loss is low (~0.1). If the model assigns it 0.01% probability, the loss is high (~9.2). Early in training, when the weights are random and the output is close to uniform over 4096 tokens, the expected loss is about ln(4096) ≈ 8.3. ignore_index=0 ensures padding positions contribute nothing to the loss.
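For a single position, cross-entropy reduces to the negative log of the probability assigned to the correct token, so the figures above can be reproduced directly:

```python
import math

def token_loss(p_correct):
    """Cross-entropy for one position: -ln(probability of the true token)."""
    return -math.log(p_correct)

confident = token_loss(0.90)      # model is 90% sure of the right token
wrong = token_loss(0.0001)        # model gives the right token 0.01%
uniform = token_loss(1 / 4096)    # a uniform guess over the whole vocabulary

print(f"{confident:.3f} {wrong:.3f} {uniform:.3f}")  # 0.105 9.210 8.318
```

The uniform case is a useful sanity check: a freshly initialized model with a 4096-token vocabulary should start near a loss of 8.3, and a first logged loss far from that value often points to a bug.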
Text Generation
@torch.no_grad()
def generate(self, idx, max_new_tokens=64, temperature=0.7, top_k=50):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -self.config.max_seq_len:]  # crop to the context window
        logits, _ = self(idx_cond)
        logits = logits[:, -1, :] / temperature  # last position only
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[:, [-1]]] = float("-inf")  # keep only the top-k logits
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
        if next_id.item() == self.config.eos_id:  # .item() assumes batch size 1
            break
    return idx, []
Generation is a loop where each iteration adds one token:
- Feed the current sequence into the model and read logits from the last position only. That is the prediction for the next token.
- Temperature (0.7): Divide all logits by 0.7 before softmax. Values below 1.0 sharpen the distribution, so the most probable tokens get relatively more weight. Values above 1.0 flatten it, producing more random and creative output. Temperature 1.0 leaves the distribution unchanged.
- Top-K (k=50): Zero out the probability of all tokens outside the top 50. This prevents the model from ever generating very low-probability gibberish. We then sample from the remaining distribution rather than always taking the single most probable token, which would produce deterministic, repetitive output.
- One token is sampled and appended to the sequence. If the model generates <|im_end|> (ID 2), the loop stops.
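The sampling step can be isolated from the model. This sketch applies temperature and top-k to a hypothetical list of four logits in plain Python; it mirrors the logic of generate but is not the project's code.

```python
import math
import random

def sample_next(logits, temperature=0.7, top_k=50, rng=random):
    """Temperature-scale, keep the top-k logits, softmax, then sample one index."""
    scaled = [l / temperature for l in logits]
    k = min(top_k, len(scaled))
    cutoff = sorted(scaled, reverse=True)[k - 1]  # k-th largest scaled logit
    kept = [s if s >= cutoff else float("-inf") for s in scaled]
    m = max(kept)
    exps = [math.exp(s - m) for s in kept]        # filtered entries become 0.0
    total = sum(exps)
    probs = [e / total for e in exps]
    next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return next_id, probs

logits = [2.0, 1.0, 0.5, -1.0]
rng = random.Random(0)
next_id, probs = sample_next(logits, temperature=0.7, top_k=2, rng=rng)
_, warm_probs = sample_next(logits, temperature=1.0, top_k=2, rng=rng)

print([round(p, 3) for p in probs])       # sharper: more mass on the top token
print([round(p, 3) for p in warm_probs])  # temperature 1.0 is flatter
```

With top_k=2 the last two tokens can never be sampled, and lowering the temperature from 1.0 to 0.7 shifts more probability onto the top token.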
Part 6: The Training Loop — Learning From Data
Fig. 03 — Training is an iterative process of measuring error and adjusting weights — thousands of small corrections accumulating into capability
The Optimizer: AdamW
optimizer = torch.optim.AdamW(
    model.parameters(), lr=tc.learning_rate,
    weight_decay=tc.weight_decay, betas=(0.9, 0.95),
)
AdamW is the optimizer of choice for nearly all modern language models. For each parameter, it tracks two running statistics:
- Momentum (beta1=0.9): A running average of the gradient direction. This smooths out noisy gradient estimates and accelerates movement in consistent directions.
- Variance (beta2=0.95): A running average of the squared gradient magnitude. Parameters with large, consistent gradients get smaller steps; parameters with small or noisy gradients get larger steps.
The update is proportional to the smoothed gradient divided by sqrt(variance), so each parameter effectively gets its own adaptive learning rate. weight_decay=0.1 applies a decoupled penalty: at every step each weight is shrunk slightly toward zero, independently of the gradient, so a weight stays large only if the gradients keep pushing it there. This regularization discourages very large weights, which are often a symptom of overfitting.
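A single AdamW update for one scalar parameter can be written out by hand. This is a sketch of the standard update rule with bias correction and decoupled weight decay, using the hyperparameters above; `adamw_step` is a hypothetical helper, not a PyTorch function.

```python
import math

def adamw_step(p, g, m, v, t, lr=3e-4, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update for a scalar parameter p with gradient g at step t (1-based)."""
    p = p * (1 - lr * weight_decay)        # decoupled decay: shrink toward zero
    m = beta1 * m + (1 - beta1) * g        # running average of the gradient
    v = beta2 * v + (1 - beta2) * g * g    # running average of the squared gradient
    m_hat = m / (1 - beta1 ** t)           # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

p_small, _, _ = adamw_step(p=1.0, g=0.5, m=0.0, v=0.0, t=1)
p_large, _, _ = adamw_step(p=1.0, g=50.0, m=0.0, v=0.0, t=1)
print(f"{p_small:.5f} {p_large:.5f}")  # 0.99967 0.99967
```

A gradient 100 times larger produces almost the same first step, because the update is normalized by the gradient's own magnitude. That is what "adaptive learning rate" means in practice.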
The Learning Rate Schedule
def get_lr(step, config):
    if step < config.warmup_steps:
        return config.learning_rate * step / config.warmup_steps
    progress = (step - config.warmup_steps) / max(1, config.max_steps - config.warmup_steps)
    coeff = 0.5 * (1 + math.cos(math.pi * progress))
    return config.min_lr + (config.learning_rate - config.min_lr) * coeff
We don't use a constant learning rate throughout training. Instead:
xychart-beta
title "Learning Rate Schedule (x 10⁻⁵)"
x-axis ["0", "200", "1k", "2k", "3k", "4k", "5k", "6k", "7k", "8k", "9k", "10k"]
y-axis 0 --> 30
line [0, 30, 30, 28, 25, 21, 17, 13, 9, 6, 4, 3]
Warmup (steps 0–200): The learning rate starts at zero and ramps linearly to its maximum. At the very start, model weights are random and gradients are noisy and large. Taking large steps before the optimizer has built accurate momentum estimates can send the model to a bad region it never escapes. Warming up gradually gives the optimizer time to calibrate.
Cosine decay (steps 200–10,000): The learning rate follows a cosine curve from maximum down to min_lr = 3e-5. The cosine shape decays slowly at first, steeply in the middle, and slowly at the end — allowing fine adjustments as the model converges to a good solution.
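Evaluating the schedule at a few steps makes the shape tangible. The sketch below repeats get_lr with a trimmed-down TrainConfig holding only the four fields the function reads.

```python
import math
from dataclasses import dataclass

@dataclass
class TrainConfig:
    learning_rate: float = 3e-4
    min_lr: float = 3e-5
    warmup_steps: int = 200
    max_steps: int = 10000

def get_lr(step, config):
    if step < config.warmup_steps:
        return config.learning_rate * step / config.warmup_steps  # linear warmup
    progress = (step - config.warmup_steps) / max(1, config.max_steps - config.warmup_steps)
    coeff = 0.5 * (1 + math.cos(math.pi * progress))              # 1 -> 0 over the decay
    return config.min_lr + (config.learning_rate - config.min_lr) * coeff

tc = TrainConfig()
schedule = {step: get_lr(step, tc) for step in (0, 100, 200, 5100, 10000)}
for step, lr in schedule.items():
    print(f"step {step:>5}: lr = {lr:.6f}")
# step     0: lr = 0.000000
# step   100: lr = 0.000150
# step   200: lr = 0.000300
# step  5100: lr = 0.000165
# step 10000: lr = 0.000030
```

Step 5100 is the midpoint of the decay phase, so the rate there is exactly halfway between the maximum and min_lr.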
Gradient Clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), tc.grad_clip)
Before each optimizer step, we compute the total magnitude of all gradients across all parameters. If it exceeds grad_clip = 1.0, all gradients are scaled down proportionally so the total magnitude is exactly 1.0.
This prevents gradient explosions — occasional training steps where the loss surface is very steep and computed gradients are enormous, causing parameters to jump to a completely different and worse region. Clipping acts as a hard safety cap on the maximum size of any single update.
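Clipping by global norm is simple enough to write out. This sketch treats all gradients as one flat list; `clip_grad_norm` here is a hypothetical stand-in that follows the same idea as the PyTorch utility, with a small epsilon to guard against division by zero.

```python
import math

def clip_grad_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients by one factor so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_grad_norm([3.0, 4.0])  # norm 5.0, gets scaled down
small, _ = clip_grad_norm([0.3, 0.4])              # norm 0.5, left unchanged

print(norm_before, [round(g, 3) for g in clipped], small)
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its length changes.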
Mixed Precision Training
use_amp = device.type == "cuda"
scaler = torch.amp.GradScaler("cuda") if use_amp else None
On NVIDIA GPUs, training uses automatic mixed precision (AMP). Modern GPU hardware runs 16-bit float arithmetic substantially faster than 32-bit, often by a factor of two or more. AMP automatically runs the forward pass in 16-bit where safe, keeping master weights and optimizer states in 32-bit for numerical precision.
The GradScaler solves a subtle problem: 16-bit floats can underflow (round to zero) for very small gradients. The scaler multiplies the loss by a large factor before backpropagation, computes gradients in that scaled domain, then unscales them before the optimizer step — preventing underflow without affecting the final parameter updates.
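The underflow problem can be reproduced with the standard library, which can round a number to IEEE half precision via the struct module. The gradient value and the scale factor of 65536 are hypothetical choices for illustration.

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

tiny_grad = 1e-8
scale = 65536.0  # hypothetical loss scale (a power of two)

lost = to_fp16(tiny_grad)             # too small for fp16: rounds to zero
scaled = to_fp16(tiny_grad * scale)   # comfortably representable after scaling
recovered = scaled / scale            # unscale in higher precision

print(lost, scaled, recovered)
```

Without scaling the gradient vanishes entirely; with scaling it survives the 16-bit round trip and is recovered to within fp16 rounding error.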
Checkpointing
if el < best_eval:
    best_eval = el  # remember the new best so later evals must beat it
    torch.save({
        "step": step,
        "model_state_dict": model.state_dict(),
        "config": vars(mc),
        "eval_loss": el,
    }, os.path.join(tc.output_dir, "best_model.pt"))
    print(f" -> Best model saved (eval={el:.4f})")
Every 200 steps, the model is evaluated on the held-out set. We save a checkpoint whenever it achieves a new best evaluation loss. This matters because the model at the final training step is not necessarily the best one — eval loss can increase near the end as the model begins to overfit. By tracking the best checkpoint, we always retain the most generalizable version.
The checkpoint saves the configuration alongside the weights. This is essential: you cannot reconstruct the correct architecture without knowing the hyperparameters (d_model, n_layers, etc.) used to build it.
Part 7: Inference — Talking to the Trained Model
MeowInference wraps loading and generation into a single clean interface that mimics the OpenAI chat completions API:
engine = MeowInference("checkpoints/best_model.pt", "data/tokenizer.json")
result = engine.chat_completion([{"role": "user", "content": "are you hungry?"}])
print(result["choices"][0]["message"]["content"])
# → "i am starving. the bowl is almost empty. this is an emergency."
Loading a Checkpoint
ckpt = torch.load(checkpoint_path, map_location=self.device, weights_only=False)
# (elided here: rebuild self.config and the `filtered` state dict from ckpt)
self.model = MeowLM(self.config).to(self.device)
self.model.load_state_dict(filtered)
self.model.eval()
map_location=self.device ensures that even if the model was trained on a GPU, weights load onto whatever device is available at inference time — CPU, GPU, or Apple Silicon. A fresh MeowLM (all random weights) is constructed, then load_state_dict copies the saved parameter values in. After this, the model has exactly the weights it had at the best training checkpoint. eval() disables dropout, which should only run during training.
Formatting the Prompt
def _format_prompt(self, messages):
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>")
    parts.append("<|im_start|>assistant\n")  # ← intentionally left open
    return "\n".join(parts)
The prompt must be formatted exactly the way training data was formatted. The model learned to associate <|im_start|>assistant\n with the beginning of a response. The prompt ends with that partial sequence — an open invitation for the model to continue from. The final <|im_start|>assistant\n has no closing <|im_end|>. We give the model the start of the assistant turn and let it complete it.
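One way to verify that the two formats agree is to check that a training example starts with the prompt built for the same user message. The sketch below copies format_sample and a standalone version of _format_prompt (renamed `format_prompt` and stripped of `self`) for that purpose.

```python
def format_sample(s):
    """Training-time format: a full user turn followed by a full assistant turn."""
    return (
        f"<|im_start|>user\n{s['input']}<|im_end|>\n"
        f"<|im_start|>assistant\n{s['output']}<|im_end|>"
    )

def format_prompt(messages):
    """Inference-time format: closed turns, then an open assistant turn."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

sample = {"input": "are you hungry?", "output": "i am starving. feed me tuna."}
training_text = format_sample(sample)
prompt = format_prompt([{"role": "user", "content": sample["input"]}])

# What the model is expected to produce is whatever follows the prompt.
remainder = training_text[len(prompt):]
print(training_text.startswith(prompt), repr(remainder))
```

If this check ever fails, for example because of a stray space or newline, the model is being prompted with a sequence it never saw in training.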
Post-Processing
if "<|im_end|>" in output_text:
    output_text = output_text.split("<|im_end|>")[0]
if "<|im_start|>" in output_text:
    output_text = output_text.split("<|im_start|>")[0]
resp_text = output_text.strip()
The raw generation includes the stop token and possibly the beginning of a new turn. We strip everything from <|im_end|> onward and return only the clean response.
Part 8: Evaluation — Measuring What the Model Learned
The Evaluation Strategy
EVAL_CASES = [
    {
        "id": "food_demands",
        "prompt": "are you hungry?",
        "expect_keywords": ["food", "starving", "bowl", "empty", "now", "meow", ...],
        "expect_style": "dramatic, always starving",
    },
    # ... 14 more cases
]
For each case, we generate a response and check whether at least one expected keyword appears:
found_keywords = [kw for kw in expect_keywords if kw.lower() in response]
success = len(found_keywords) > 0
This is a keyword coverage test, and the choice is deliberate. A cat responding to "are you hungry?" could say "yes, feed me" or "the bowl is empty" or "i am wasting away" — all are correct. Demanding a specific phrasing would make the evaluation too brittle. Keyword presence checks whether the model understood the concept of the prompt without prescribing the exact wording.
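A self-contained version of the check is sketched below. It differs slightly from the snippet above in that it lowercases the response as well as the keywords; the function name and example strings are hypothetical.

```python
def keyword_coverage(response, expect_keywords):
    """Pass if at least one expected keyword occurs in the response (case-insensitive)."""
    text = response.lower()
    found = [kw for kw in expect_keywords if kw.lower() in text]
    return len(found) > 0, found

keywords = ["food", "starving", "bowl", "empty"]

ok, found = keyword_coverage("The bowl is EMPTY. feed me.", keywords)
miss, _ = keyword_coverage("i do not care about this human.", keywords)

# Substring matching is lenient: "now" also matches inside "know".
false_positive, _ = keyword_coverage("i do not know.", ["now"])

print(ok, found, miss, false_positive)
```

The last case shows the main weakness of substring checks: short keywords can match inside unrelated words, so a pass is evidence of topical overlap rather than proof of a good answer.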
The 15 test categories map exactly to the training data topics:
| Category | Prompt | Expected Style |
|---|---|---|
| greeting | "psst psst psst" | indifferent or demanding |
| food_demand | "are you hungry?" | dramatic, always starving |
| food_disappoint | "i filled your bowl" | picky, ungrateful |
| bath | "time for a bath" | horrified, defensive |
| zoomies | "why running at 3am?" | unhinged, fast-paced |
| box | "i bought you a new bed" | prefers the cardboard box |
| petting | "can i rub your belly?" | warning of a trap |
| staring | "what are you staring at?" | mysterious, intense |
| keyboard | "i'm trying to type" | entitled, attention-seeking |
| door | "do you want to go out?" | indecisive, controlling |
| sleeping | "wake up!" | grumpy, demands more sleep |
| plants | "stop eating my monstera!" | unapologetic, destructive |
| vet | "time to go to the vet" | terrified, evasive |
| gifts | "what is that in your mouth?" | proud, condescending |
| laser | "look at the red dot" | obsessed, hyper-focused |
If the model passes all 15, that is good evidence it has learned each behavioral pattern programmed into the training data generator, although a single keyword hit is a lenient bar.
Part 9: Orchestration — Running the Whole Pipeline
run_all.py ties everything together with a single command and supports resuming at any stage:
# Full pipeline: generate data → train → evaluate
python run_all.py
# Skip data generation and training, evaluate only
python run_all.py --skip-data --skip-train
# Use a smaller dataset for quick experiments
python run_all.py --samples 60000
# Run on GPU
python run_all.py --device cuda
The --skip-* flags reflect a practical reality: data generation and tokenizer training are expensive operations you don't want to repeat every time you tweak a training hyperparameter. The pipeline is designed to be resumable at any stage.
The fallback logic in evaluation is a defensive pattern for interrupted training runs:
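if not os.path.exists(checkpoint_path):  # assumed guard: fall back only when the best checkpoint is missing
    fallback_path = os.path.join(os.path.dirname(checkpoint_path), "final_model.pt")
    if os.path.exists(fallback_path):
        checkpoint_path = fallback_path
        logger.info(f"Using fallback checkpoint: {checkpoint_path}")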
If training was interrupted before the best checkpoint was saved, evaluation falls back to the most recent final_model.pt. This prevents the entire pipeline from failing on a missing file when a usable model exists nearby.
The Complete Picture
Looking at the system as a whole, several design choices work together to make learning possible:
Consistent formatting creates learnable structure. The <|im_start|>...<|im_end|> template appears in every training example. The model sees this structure tens of thousands of times and learns to associate <|im_start|>assistant\n with the cue to generate a response. This is pattern matching at scale — not magic.
The tokenizer and model are trained on the same distribution. The BPE vocabulary is well suited to the actual inputs: common phrases in cat conversations become single tokens, which shortens sequences and makes the model's job easier.
The causal mask enforces the generation contract. Training and inference are consistent: the model learns to predict the next token without seeing the future, and at inference time it generates one token at a time. There is no gap between how the model trains and how it generates.
Weight tying improves sample efficiency. Every gradient update that changes how a token is represented as input also changes how it is recognized as output, so the shared matrix receives training signal from both roles.
Held-out evaluation prevents false confidence. The 5% eval split gives an honest diagnostic of whether the model is generalizing or memorizing. The gap between training loss and eval loss tells you which problem you have.
Scaling Up
MeowLM has roughly 4 million parameters. Here is how it compares to real-world models:
| Model | d_model | Layers | Heads | Parameters |
|---|---|---|---|---|
| MeowLM | 256 | 4 | 4 | ~4M |
| GPT-2 Small | 768 | 12 | 12 | 117M |
| LLaMA-3 8B | 4,096 | 32 | 32 | 8B |
| GPT-4 (est.) | ~12,288 | 96+ | 96+ | ~1.8T |
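The MeowLM row can be sanity-checked by counting parameters from the configuration. The sketch below assumes a bias on every linear layer, a fused QKV projection, and LayerNorms with scale and shift; none of these details are confirmed by the article, so treat the totals as estimates.

```python
def count_params(vocab=4096, seq_len=128, d=256, layers=4, ffn=512,
                 bias=True, tied=True):
    """Parameter count for a GPT-style decoder under the assumptions above."""
    embeddings = vocab * d + seq_len * d   # token + positional tables
    attention = d * 3 * d + d * d          # fused QKV + output projection
    feed_forward = d * ffn + ffn * d       # up + down projections
    norms = 2 * 2 * d                      # two LayerNorms, scale + shift each
    biases = (3 * d + d + ffn + d) if bias else 0
    block = attention + feed_forward + norms + biases
    final_norm = 2 * d
    head = 0 if tied else vocab * d        # a tied head adds no new weights
    return embeddings + layers * block + final_norm + head

tied_total = count_params(tied=True)
untied_total = count_params(tied=False)
print(f"tied: {tied_total:,}  untied: {untied_total:,}")
# tied: 3,190,272  untied: 4,238,848
```

Under these assumptions the model has about 3.2M unique parameters with the tied head, or about 4.2M if the output projection is counted separately, which matches the table's ~4M as an order-of-magnitude figure. Roughly a third of the tied total sits in the token embedding table alone.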
The core architecture — embeddings, multi-head attention, FFN, residuals, layer norm, next-token prediction — is essentially identical at every scale. What changes is the size of each component and the volume of training data. The engineering challenges at scale are immense: distributed training across thousands of GPUs, gradient synchronization, memory optimization to fit large models in GPU VRAM, numerical stability over billions of steps. But the conceptual machinery is exactly what we built here.
Every transformer-based language model in existence is, at its core, the same system: text becomes tokens, tokens become vectors, vectors attend to each other, feed-forward layers transform them, and the model learns to predict what comes next. MeowLM does all of this — just for a very specific kind of text, at a scale that fits in your head.
Running the Code
# Install dependencies
pip install torch tokenizers
# Run the full pipeline
python run_all.py
# Or step by step
python prepare_data.py # Generate data + train tokenizer (~5 min)
python train.py # Train the model (~35 min on CPU)
python inference.py # Chat interactively with the trained cat
Expected training output (abridged; exact values vary from run to run):
Step | LR | Train | Eval | Time
------------------------------------------------------
200 | 0.000275 | 3.2144 | 3.1892 | 45.2s
400 | 0.000295 | 2.8733 | 2.7541 | 91.4s
2000 | 0.000285 | 1.8921 | 1.9102 | 430.1s
5000 | 0.000182 | 1.4233 | 1.4891 | 1075.0s
10000 | 0.000031 | 1.2441 | 1.3102 | 2100.0s
Done! 2100s, best eval: 1.2983
A training loss around 1.2–1.5 indicates the model has learned meaningful patterns. Below 1.0 often signals near-memorization of the training set. The slight gap between train and eval loss is expected and healthy — the model has generalized but not overfit.