How to build an LLM from scratch (and what it teaches you)

Build a tiny LLM from scratch in Python and learn how tokenizers, attention, and inference work — knowledge that makes you better at AI API integration.

Ashley Innocent

7 April 2026

TL;DR

Building a minimal language model from scratch takes fewer than 300 lines of Python. The process reveals exactly how tokenization, attention, and inference work, which makes you a far better API consumer when you're integrating production LLMs into your applications.

Introduction

Most developers treat language models as black boxes. You send text in, tokens come out, and somewhere in between, magic happens. That mental model works fine until you need to debug a broken API integration, tune sampling parameters, or figure out why your model keeps hallucinating structured data.

GuppyLM, a project that recently hit the Hacker News front page with 842 points, makes the internals visible. It's an 8.7M-parameter transformer written from scratch in Python. It trains in under an hour on a consumer GPU. The code fits in a single file. The goal isn't to compete with GPT-4; it's to demystify what LLMs actually do.

This article walks through how to build a tiny LLM, what each component does, and what understanding the internals teaches you when you're working with AI APIs professionally.

💡
If you're testing AI API integrations, Apidog's Test Scenarios let you verify streaming responses, assert on token structure, and simulate edge-case completions without burning production credits. More on that later.

What makes a language model "tiny"?

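A production LLM like GPT-4 is widely reported to have hundreds of billions of parameters or more. A "tiny" LLM sits in the range of 1M to 25M parameters. GuppyLM (8.7M) and MicroLM (1-2M) fall into this category. Karpathy's nanoGPT, whose default GPT-2 reproduction has 124M parameters, sits above that range but shares the same build-it-yourself goal.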

Tiny LLMs can:
- Train on a laptop or Google Colab
- Fit entirely in CPU memory
- Be inspected, modified, and debugged at the weight level

They can't:
- Handle complex reasoning
- Generate coherent long-form text reliably
- Match the factual depth of production models

The value isn't the output. It's the understanding you get from building one.

Core components: how an LLM actually works

Before writing any code, you need to know what the four main pieces do.

Tokenizer

The tokenizer converts raw text into integer IDs. "Hello, world!" becomes something like [15496, 11, 995, 0]. Each integer maps to a subword unit from a fixed vocabulary.

Why this matters for API work: token counts directly affect latency and cost. Understanding how tokenizers split text helps you write prompts that fit within context windows and avoid unexpected truncation.

GuppyLM uses a simple character-level tokenizer. Production models like GPT-4 use BPE (byte-pair encoding) with vocabularies of 50K-100K tokens.
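To make the text-to-integers step concrete, here is a minimal sketch of about the simplest tokenizer possible. It assumes a byte-level vocabulary of 256 IDs, which matches the VOCAB_SIZE used in the model later in this article. The function names encode and decode are illustrative and are not taken from GuppyLM.

```python
# Byte-level tokenizer sketch: every UTF-8 byte is its own token ID (0-255).
def encode(text: str) -> list[int]:
    """Turn text into a list of integer token IDs, one per UTF-8 byte."""
    return list(text.encode("utf-8"))


def decode(ids: list[int]) -> str:
    """Turn token IDs back into text; invalid byte sequences are replaced."""
    return bytes(ids).decode("utf-8", errors="replace")


ids = encode("Hello, world!")
print(ids)                    # one ID per character for plain ASCII text
print(decode(ids))            # round-trips to the original string
print(len(encode("héllo")))   # non-ASCII characters cost more than one token
```

The last line is the part that matters for API work: the same visible string can cost a different number of tokens depending on the tokenizer, so character counts are only a rough proxy for token counts.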

Embedding layer

The embedding layer converts token IDs into dense vectors. Each token gets a learned vector (e.g. 384 dimensions in GuppyLM). These vectors carry semantic meaning: similar tokens end up close together in vector space.

Position embeddings are added on top, so the model knows token order.
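The effect of adding position embeddings can be shown with a toy lookup table. This is a sketch under simplifying assumptions: the tables are random rather than learned, the dimension is 4 instead of 384, and the names DIM, tok_table, pos_table, and embed are made up for illustration.

```python
import random

random.seed(0)
DIM = 4        # tiny embedding dimension, for readability only
VOCAB = 8      # toy vocabulary size
MAX_LEN = 5    # toy context length

# Lookup tables; in a real model these are learned, here they are random.
tok_table = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]
pos_table = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(MAX_LEN)]


def embed(ids):
    """Return one vector per token: token vector plus position vector."""
    return [
        [t + p for t, p in zip(tok_table[tok], pos_table[pos])]
        for pos, tok in enumerate(ids)
    ]


vectors = embed([3, 5, 3])
# Token 3 appears at positions 0 and 2; the position vectors make the
# two resulting embeddings different even though the token is the same.
print(vectors[0] != vectors[2])
```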

Transformer blocks

This is the core computation. Each block has two parts:

Self-attention: lets each token look at itself and the tokens before it in the sequence and decide which ones matter for predicting the next token. GuppyLM uses 6 attention heads in each of its 6 layers.

Feed-forward network: a two-layer MLP applied to each token's representation after attention. GuppyLM uses ReLU activation, which is simpler than the SwiGLU used in newer architectures.
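The "only look backwards" rule in self-attention comes from a causal mask applied before the softmax. The following is a plain-Python sketch of that one step, assuming the attention scores have already been computed. The function name causal_attention_weights is hypothetical.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix, hiding future positions.

    scores[i][j] is how strongly token i matches token j. Entries with j > i
    are set to -inf before the softmax, so they receive zero weight.
    """
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        peak = max(masked)                          # subtract max for stability
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


scores = [[0.0, 0.0, 0.0] for _ in range(3)]        # all tokens match equally
for row in causal_attention_weights(scores):
    print([round(w, 2) for w in row])
```

Each printed row sums to 1, and every entry to the right of the diagonal is zero: the first token can only attend to itself, the second splits its attention over two positions, and so on.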

Output head

After the final transformer block, a linear layer projects each token's representation to a vector of size equal to the vocabulary. Apply softmax to get probabilities, pick the most likely next token (or sample), and repeat.

Building a minimal LLM in Python

Here's a working minimal LLM based on the GuppyLM approach. This runs in standard PyTorch.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hyperparameters
VOCAB_SIZE = 256     # character-level: one slot per byte value (ASCII uses the first 128)
D_MODEL = 128        # embedding dimension
N_HEADS = 4          # attention heads
N_LAYERS = 3         # transformer blocks
SEQ_LEN = 64         # context window
DROPOUT = 0.1

class SelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.dropout = nn.Dropout(DROPOUT)

    def forward(self, x):
        B, T, C = x.shape
        qkv = self.qkv(x).reshape(B, T, 3, self.n_heads, self.head_dim)
        q, k, v = qkv.unbind(dim=2)
        q = q.transpose(1, 2)
        k = k.transpose(1, 2)
        v = v.transpose(1, 2)
        # Scaled dot-product scores between every query and every key
        scale = self.head_dim ** -0.5
        attn = (q @ k.transpose(-2, -1)) * scale
        # Causal mask: each token can attend only to itself and earlier tokens
        mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
        attn = attn.masked_fill(mask, float('-inf'))
        attn = F.softmax(attn, dim=-1)
        attn = self.dropout(attn)
        out = (attn @ v).transpose(1, 2).reshape(B, T, C)
        return self.proj(out)

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = SelfAttention(d_model, n_heads)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
            nn.Dropout(DROPOUT),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.ff(self.ln2(x))
        return x

class TinyLLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos_embed = nn.Embedding(SEQ_LEN, D_MODEL)
        self.blocks = nn.ModuleList([
            TransformerBlock(D_MODEL, N_HEADS) for _ in range(N_LAYERS)
        ])
        self.ln_f = nn.LayerNorm(D_MODEL)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE, bias=False)

    def forward(self, idx):
        B, T = idx.shape
        tok_emb = self.embed(idx)
        pos = torch.arange(T, device=idx.device)
        pos_emb = self.pos_embed(pos)
        x = tok_emb + pos_emb
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.head(x)
        return logits

# Initialize and count parameters
model = TinyLLM()
total_params = sum(p.numel() for p in model.parameters())
print(f"Model size: {total_params:,} parameters")  # 667,264 (~0.67M)

Training loop

import torch.optim as optim

def train(model, data, epochs=100, lr=3e-4):
    optimizer = optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for epoch in range(epochs):
        # data: tensor of token IDs, shape [batch, seq_len+1]
        # (this sketch reuses the same batch on every step)
        x = data[:, :-1]   # input: all tokens except last
        y = data[:, 1:]    # target: all tokens shifted by 1
        logits = model(x)
        loss = F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), y.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if epoch % 10 == 0:
            print(f"Epoch {epoch}, loss: {loss.item():.4f}")
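The train function expects data shaped [batch, seq_len+1]. One way to build it, sketched here in plain Python, is to slide a window over a long token list. The helper make_windows and the byte-level encoding are assumptions for illustration.

```python
def make_windows(token_ids, seq_len):
    """Slice a long token list into overlapping windows of seq_len + 1 tokens."""
    return [
        token_ids[i : i + seq_len + 1]
        for i in range(len(token_ids) - seq_len)
    ]


text = "hello world"
token_ids = list(text.encode("utf-8"))       # byte-level IDs (an assumption)
windows = make_windows(token_ids, seq_len=4)

first = windows[0]
x, y = first[:-1], first[1:]                 # same split as in train()
print(bytes(x).decode(), "->", bytes(y).decode())
```

Every target is the input shifted by one position, which is the entire training signal: predict the next token. Passing torch.tensor(windows) to train would give it a batch in the expected shape.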

Inference (text generation)

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=50, temperature=1.0, top_k=10):
    model.eval()
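    # Create the prompt tensor on the same device as the model's weights
    ids = torch.tensor([prompt_ids], device=next(model.parameters()).device)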
    for _ in range(max_new_tokens):
        idx_cond = ids[:, -SEQ_LEN:]  # crop to context window
        logits = model(idx_cond)
        logits = logits[:, -1, :] / temperature  # last token only
        # top-k sampling
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[:, [-1]]] = float('-inf')
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
    return ids[0].tolist()

What this teaches you about AI API behavior

Building this reveals several things that make you a better API consumer.

Temperature and sampling are mechanical, not magical

Temperature divides logits before softmax. Higher temperature = flatter distribution = more random output. Lower temperature = sharper distribution = more deterministic output. A temperature of exactly 0 can't be used as a divisor, so implementations either switch to greedy argmax or clamp it to a small positive value. If your production API still returns slightly different results at temperature=0.0, that isn't necessarily a bug in your integration: floating-point and batching effects on the serving side can change which token wins a near-tie.
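The effect of temperature is easy to see on a three-token toy distribution. This is a minimal plain-Python sketch. The function name softmax_with_temperature is illustrative, and it rejects non-positive temperatures rather than guessing how a given API treats them.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises, the probability of the top token shrinks and the tail grows, which is all "more creative" output means mechanically.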

Context windows are hard limits, not soft suggestions

The idx_cond = ids[:, -SEQ_LEN:] line in the inference loop shows exactly what happens at the context limit: our loop silently drops the oldest tokens. Hosted APIs tend to either reject a request that exceeds the limit or leave truncation to your client, so if your integration assumes the model remembers the full conversation history, it won't after a certain point. See [internal: how-ai-agent-memory-works] for how agents handle this problem.
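On the client side, the same idea usually shows up as trimming conversation history to a token budget before each request. The sketch below is one possible approach, not a provider's method. trim_history and rough_count are hypothetical helpers, and the word-based count is only a stand-in for the provider's real tokenizer.

```python
def trim_history(messages, max_tokens, count_tokens):
    """Keep the most recent messages whose combined token count fits the budget.

    count_tokens is a caller-supplied function; real counts depend on the
    provider's tokenizer, so treat any stand-in as an estimate.
    """
    kept, used = [], 0
    for message in reversed(messages):           # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))                  # restore chronological order


def rough_count(text):
    """Stand-in counter: one token per whitespace-separated word (an assumption)."""
    return len(text.split())


history = ["my name is Ada", "what is BPE", "explain attention briefly"]
print(trim_history(history, max_tokens=6, count_tokens=rough_count))
```

The oldest message, the one containing the user's name, is the one that gets dropped. That is the same failure you see when a long conversation "forgets" early details.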

Streaming tokens are just inference steps made visible

Streaming APIs don't do anything architecturally different. They run the inference loop and flush each token to the response stream as it's generated. Understanding this helps when you're writing retry logic: a stream that drops mid-generation generally can't be resumed, so the client has to restart the request.
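A restart-on-drop retry loop might look like the sketch below. Everything here is simulated: open_stream is a hypothetical zero-argument callable standing in for whatever your HTTP client uses to start a streaming request, and fake_stream imitates an endpoint that fails once.

```python
def collect_stream(open_stream, max_attempts=3):
    """Consume a token stream; on a dropped connection, restart from scratch.

    open_stream is a hypothetical callable that starts a new request and
    returns an iterator of text chunks.
    """
    for attempt in range(1, max_attempts + 1):
        chunks = []                       # discard partial output from failed tries
        try:
            for chunk in open_stream():
                chunks.append(chunk)
            return "".join(chunks)
        except ConnectionError:
            if attempt == max_attempts:
                raise


# Simulated endpoint: the first call drops mid-stream, the second succeeds.
calls = {"count": 0}


def fake_stream():
    calls["count"] += 1
    yield "Hel"
    if calls["count"] == 1:
        raise ConnectionError("stream dropped")
    yield "lo"


print(collect_stream(fake_stream))
```

Note that the partial output from the failed attempt is thrown away. With sampling enabled, a restarted generation is not guaranteed to begin with the same tokens, so stitching two attempts together would be unsafe.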

Logits explain why structured output is hard

The model assigns probability to every token in the vocabulary at each step. Generating valid JSON requires the right token to win at every position. Libraries like Outlines and Guidance constrain the logit distribution to enforce grammar at inference time. When you see AI APIs offering "structured output" modes, this is what they're doing internally.
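The core trick behind constrained decoding can be sketched in a few lines: before picking a token, set the logits of every disallowed token to negative infinity. This is a toy illustration of the idea, not how Outlines or Guidance are implemented. The vocabulary, logits, and function names are made up.

```python
def constrain(logits, allowed_ids):
    """Set every logit outside allowed_ids to -inf so it can never be chosen."""
    return [
        score if token_id in allowed_ids else float("-inf")
        for token_id, score in enumerate(logits)
    ]


def greedy_pick(logits):
    """Return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy vocabulary (an assumption for illustration).
vocab = ["{", "}", '"', "Sure", ":"]
logits = [1.0, 0.2, 0.5, 3.0, 0.1]       # the model "wants" to say "Sure"

# A JSON object must start with "{", so only that token is allowed here.
allowed = {vocab.index("{")}
print(vocab[greedy_pick(logits)])                        # unconstrained choice
print(vocab[greedy_pick(constrain(logits, allowed))])    # constrained choice
```

A real grammar-constrained decoder recomputes the allowed set after every token, based on what the grammar permits next.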

How to test AI API integrations with Apidog

Once you understand how LLM inference works, you can write much better API tests. Apidog's Test Scenarios let you chain API calls and assert on the structure of AI responses.

For example, when testing a streaming chat API:

  1. Create a Test Scenario in Apidog with your /v1/chat/completions endpoint
  2. Set assertions to verify the response structure: response.choices[0].finish_reason == "stop", response.usage.total_tokens < 4096
  3. Add a follow-up step that sends the response as context to the next turn, simulating a multi-turn conversation
  4. Use Apidog's Smart Mock to stub the AI endpoint and test your app's error handling: simulate finish_reason: "length" (truncated output), finish_reason: "content_filter", and network timeout mid-stream

This is how you test AI integrations without burning API credits on every CI run. See [internal: api-testing-tutorial] for a broader look at API testing approaches.

Testing token count assertions

{
  "assertions": [
    {
      "field": "response.usage.completion_tokens",
      "operator": "less_than",
      "value": 512
    },
    {
      "field": "response.choices[0].finish_reason",
      "operator": "equals",
      "value": "stop"
    },
    {
      "field": "response.choices[0].message.content",
      "operator": "not_empty"
    }
  ]
}

Run this across multiple models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) in a single Test Scenario to catch API schema differences before they hit production.
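To show what assertions like these check, here is a small Python sketch that applies the same three rules to a response dictionary. It is a hypothetical evaluator written for this article, not Apidog's engine or file format. The helper names resolve and check and the sample response are assumptions.

```python
import re


def resolve(path, response):
    """Follow a dotted path such as 'response.choices[0].message.content'."""
    value = response
    for part in path.split(".")[1:]:             # skip the leading "response"
        match = re.fullmatch(r"(\w+)(?:\[(\d+)\])?", part)
        value = value[match.group(1)]
        if match.group(2) is not None:
            value = value[int(match.group(2))]
    return value


OPERATORS = {
    "less_than": lambda actual, expected: actual < expected,
    "equals": lambda actual, expected: actual == expected,
    "not_empty": lambda actual, expected: bool(actual),
}


def check(assertions, response):
    """Return the field paths of all assertions that fail."""
    return [
        a["field"]
        for a in assertions
        if not OPERATORS[a["operator"]](resolve(a["field"], response), a.get("value"))
    ]


assertions = [
    {"field": "response.usage.completion_tokens", "operator": "less_than", "value": 512},
    {"field": "response.choices[0].finish_reason", "operator": "equals", "value": "stop"},
    {"field": "response.choices[0].message.content", "operator": "not_empty"},
]
truncated = {
    "usage": {"completion_tokens": 512},
    "choices": [{"finish_reason": "length", "message": {"content": "partial..."}}],
}
print(check(assertions, truncated))
```

The sample response simulates a completion cut off at the token limit, so the token-count and finish_reason assertions fail while the content check passes.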

Advanced: quantization and inference optimization

Once you have a working tiny LLM, two techniques are worth understanding because they apply directly to how production models are served.

Quantization

The weights in our model are 32-bit floats by default. Quantization reduces them to 8-bit integers (INT8) or even 4-bit (INT4). This cuts memory usage by 4-8x with modest accuracy loss.

# Example: dynamic INT8 quantization in PyTorch
import torch.quantization
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

Many production deployments serve quantized models. When the same model shows different output quality across providers or "versions", quantization may be one of the causes.
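The arithmetic behind INT8 quantization fits in a few lines. This sketch uses symmetric per-tensor quantization on a plain list of floats. It is a simplified illustration and not what quantize_dynamic does internally. The function names and sample weights are made up.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0    # avoid a zero scale
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in q_weights]


weights = [0.82, -0.41, 0.003, -1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)
print([round(abs(a - b), 4) for a, b in zip(weights, restored)])  # rounding error
```

Each weight is stored as one byte plus a shared scale, which is where the 4x saving over 32-bit floats comes from. The cost is a rounding error of at most half the scale per weight, which is why very small weights can collapse to zero.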

KV cache

In our inference loop, we recompute attention across the full sequence at every step. Production systems cache the keys and values of previous tokens (the KV cache), so each new token only needs its own query, key, and value computed before attending over the cached entries. This is also why the first token in a streaming response takes longer than subsequent ones: the whole prompt has to be processed before the cache is filled.
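The saving from a KV cache can be demonstrated with a toy single-head attention in plain Python. This is a sketch under heavy simplifications: one layer, and "projections" that just multiply by a constant. The names project, attend, step_without_cache, and step_with_cache are made up. It counts how many projections each approach performs and checks that both give the same output.

```python
import math


def project(x, w):
    """Toy 'linear layer': scale each component of x by w, and count the call."""
    project.calls += 1
    return [xi * w for xi in x]


project.calls = 0


def attend(query, keys, values):
    """Softmax-weighted average of values, weighted by query-key dot products."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [
        sum(e / total * v[d] for e, v in zip(exps, values))
        for d in range(len(values[0]))
    ]


def step_without_cache(tokens):
    """Recompute keys and values for every token, then attend from the last one."""
    keys = [project(t, 0.5) for t in tokens]
    values = [project(t, 2.0) for t in tokens]
    return attend(project(tokens[-1], 1.5), keys, values)


def step_with_cache(new_token, cache):
    """Compute key and value for the new token only; reuse the cached rest."""
    cache["keys"].append(project(new_token, 0.5))
    cache["values"].append(project(new_token, 2.0))
    return attend(project(new_token, 1.5), cache["keys"], cache["values"])


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

project.calls = 0
slow = [step_without_cache(tokens[: i + 1]) for i in range(len(tokens))]
slow_calls = project.calls

project.calls = 0
cache = {"keys": [], "values": []}
fast = [step_with_cache(t, cache) for t in tokens]
fast_calls = project.calls

print(slow == fast, slow_calls, fast_calls)
```

Without the cache, the number of projections grows with the square of the sequence length. With it, every step costs the same three projections, at the price of keeping all past keys and values in memory.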

Tiny LLM vs. production API: when to use each

| Use case | Tiny LLM | Production API |
|---|---|---|
| Learning model internals | Best for | Overkill |
| Prototyping a new app | Insufficient quality | Best for |
| Private/sensitive data | Good option | Depends on provider |
| Offline/edge deployment | Viable | Not possible |
| Cost-sensitive, high volume | Possible with tradeoffs | Expensive at scale |
| Reasoning-heavy tasks | Not viable | Required |

The real answer for most developers: use the production API for your application, but run a tiny model to understand what's happening under the hood. The two aren't competing. The [internal: open-source-coding-assistants-2026] article covers tools that blur this line with bring-your-own-model setups.

Conclusion

Building a tiny LLM from scratch takes a weekend. What you get isn't a production system; it's a working mental model of how every language model, from GuppyLM to GPT-4o, actually works. That understanding pays off every time you debug a streaming integration, tune sampling parameters, or design assertions for your AI API tests.

The GuppyLM project is a good starting point. Clone it, train it on any text dataset, and spend an afternoon reading the inference loop. Then go back to your production API integrations and you'll see them differently.

Try Apidog's Test Scenarios to bring the same rigor to your AI API testing that you'd apply to any other backend system.


FAQ

How many parameters does a "tiny" LLM need to generate coherent text?
Around 10M-50M parameters with a decent training dataset can produce locally coherent sentences. Below 1M, you get gibberish on most tasks. GuppyLM at 8.7M works for short conversations on its training domain (60 topics).

Can I run a tiny LLM without a GPU?
Yes. Models under 100M parameters run fine on CPU, though inference is slower. The model above (about 0.67M parameters) generates tokens in milliseconds on a laptop CPU.

What dataset should I train on?
Character-level models work well with Project Gutenberg texts, Wikipedia subsets, or any plain text corpus. GuppyLM uses a 60K-entry conversation dataset on HuggingFace (arman-bd/guppylm-60k-generic). For code generation, use The Stack or CodeParrot.

What's the difference between temperature and top-k sampling?
Temperature scales the logits (controls overall randomness). Top-k restricts the sampling pool to the k most likely tokens. They're applied together: the generate function above scales by temperature first and then filters to the top k, but because dividing by a positive temperature doesn't change the ranking of the logits, either order yields the same candidates and the same probabilities within that set.

Why does my LLM sometimes repeat itself?
Repetition is a failure mode where the model assigns high probability to tokens it just generated because they appeared in the context. Many inference stacks offer a repetition penalty (a logit adjustment that discounts recently generated tokens). Parameter names vary: Hugging Face generation uses repetition_penalty (values slightly above 1.0, such as 1.1, are common), while OpenAI-style APIs expose frequency_penalty and presence_penalty instead.

How long does it take to train a tiny LLM?
The model above trains to coherent output in under 2 hours on a single GPU (RTX 3060 or equivalent). GuppyLM trains in Colab in roughly the same time. Larger models (100M+) need multi-GPU setups and days of training.

What's the fastest way to go from tiny LLM to a real API endpoint?
If your model uses an architecture llama.cpp supports, export it to GGUF format using llama.cpp's conversion script, then serve it with llama-server. This gives you an OpenAI-compatible API endpoint running locally. You can then point Apidog at it for testing; see [internal: rest-api-best-practices].

How do production LLMs handle context longer than their training window?
Techniques like RoPE (Rotary Position Embedding) with extended scaling, sliding window attention, and retrieval-augmented generation all extend effective context. The core transformer architecture doesn't change; these are modifications to how position information is encoded and how the attention window is applied.
