How to Train Your Own ChatGPT for $50?

Train your own GPT-2 level chatbot for $50 in 2 hours. Complete guide to nanochat with code examples, benchmarks, and step-by-step instructions.

Ashley Innocent

19 March 2026

TL;DR

nanochat is Andrej Karpathy’s open-source LLM training framework that lets you train a GPT-2 level chatbot for under $50 in about 2 hours. The project uses a single 8xH100 GPU node, minimal code (~500 lines for the core model), and one configuration dial (--depth) that sets all other hyperparameters automatically. The current record run finishes training in 1.65 hours with a CORE score of 0.2626, beating OpenAI’s 2019 GPT-2, which cost $43,000 and took 168 hours.

Introduction

Training a large language model used to require millions of dollars and a team of PhD researchers. Those days are over.

Andrej Karpathy just released nanochat, an open-source project that trains a capable conversational AI for less than the cost of a nice dinner. The entire pipeline runs on a single 8xH100 GPU node and completes in under 2 hours.

Why This Matters Now

The AI landscape shifted dramatically in early 2026. What took OpenAI 168 hours and $43,000 in 2019 now takes 1.65 hours and $48. That’s a 100x speedup driven by algorithmic improvements, better hardware, and community optimization.

For API developers and teams building AI-powered applications, this changes everything. You can now experiment with custom model training, test architectural changes, and understand LLM internals without massive infrastructure budgets.

Pair this with API development platforms like Apidog for testing and documenting your AI services, and you have a complete stack for building production AI applications.

What You’ll Learn

By the end of this article, you’ll understand:

- What nanochat is and which stages its pipeline covers
- How the “one dial” --depth philosophy works
- How the core components (model, optimizer, dataloader, inference engine) fit together
- How to run the full speedrun yourself, step by step
- Where nanochat’s practical limits are

What Is nanochat?

nanochat is a minimal LLM training harness that covers the entire development pipeline: tokenization, pretraining, finetuning, evaluation, inference, and a ChatGPT-like web UI.

The codebase fits in a single repository with no configuration monsters or framework complexity. Karpathy designed it as a “strong baseline” that’s readable, hackable, and forkable.

The Core Claim

Train a GPT-2 capability model (1.6B parameters) for:

- Under $50 in compute (~$24/hour for an 8xH100 node)
- About 2 hours of wall-clock training time
- A CORE score that beats the 2019 original

For context, OpenAI’s original GPT-2 training in 2019 cost approximately $43,000 and took 7 days on 32 TPU v3 chips.

What nanochat Covers

| Stage | Script | Description |
|---|---|---|
| Tokenization | scripts.tok_train | Train BPE tokenizer (vocab 32,768) |
| Pretraining | scripts.base_train | Train base GPT model |
| Finetuning | scripts.chat_sft | Supervised finetuning for chat |
| Evaluation | scripts.base_eval | CORE metric, bits-per-byte |
| Inference | scripts.chat_cli | CLI chat interface |
| Web UI | scripts.chat_web | ChatGPT-like web interface |

The Philosophy: One Dial to Control Everything

Most LLM frameworks drown you in configuration files. nanochat takes the opposite approach.

The entire system revolves around one parameter: --depth (the number of transformer layers).

# GPT-1 size model
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=12

# GPT-2 capability model
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=24

# Pushing the boundaries
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26

Set the depth, and nanochat calculates everything else automatically:

- Model width (embedding dimension) and number of attention heads
- Learning rates and batch size
- Training duration (number of tokens)

This “one dial” philosophy enables what Karpathy calls the nanochat miniseries: a family of compute-optimal models at different sizes, all trained with the same principled approach.

Why This Works

The team measured scaling laws across dozens of training runs. They found predictable relationships between depth, width, batch size, and training duration. Instead of exposing all these knobs, nanochat encodes these relationships directly into the training script.

You get compute-optimal training without needing a PhD in deep learning.
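As a sketch of what “encoding the relationships” looks like in practice, deriving a model shape from the single depth dial might resemble the following. The aspect ratio and head dimension here are illustrative assumptions, not nanochat's exact constants:

```python
def derive_config(depth, aspect_ratio=64, head_dim=128):
    """Derive the model shape from the single --depth dial.

    Illustrative sketch: the aspect ratio (width per layer) and head
    dimension are assumptions, not nanochat's exact formulas.
    """
    model_dim = depth * aspect_ratio         # width grows linearly with depth
    n_heads = max(1, model_dim // head_dim)  # keep the per-head dimension fixed
    return {"n_layer": depth, "n_embd": model_dim, "n_head": n_heads}

# A d24 model under these assumptions:
print(derive_config(24))  # {'n_layer': 24, 'n_embd': 1536, 'n_head': 12}
```

The point is not the specific numbers but the shape of the idea: every derived quantity is a deterministic function of depth, so one flag pins down the whole configuration.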

The Leaderboard: Racing to Beat GPT-2

nanochat maintains a public leaderboard tracking “time to GPT-2” capability. The target is beating OpenAI’s original CORE score of 0.256525 on 22 evaluation tasks (ARC, MMLU, and others from the DCLM benchmark suite).

Current Records

| Run | Model | Time | CORE Score | Key Innovation |
|---|---|---|---|---|
| Original GPT-2 | 1.6B | 168 hours | 0.2565 | OpenAI 2019 baseline |
| Run 1 | d24 | 3.04 hrs | 0.2585 | Initial baseline |
| Run 2 | d26 | 2.91 hrs | 0.2578 | FP8 training |
| Run 3 | d26 | 2.76 hrs | 0.2602 | 1M token batch size |
| Run 4 | d24 | 2.02 hrs | 0.2571 | ClimbMix dataset |
| Run 5 | d24 | 1.80 hrs | 0.2690 | AI-discovered optimizations |
| Run 6 | d24 | 1.65 hrs | 0.2626 | Improved smear/backout |

How AI Discovered Optimizations

Runs 5 and 6 incorporated changes from Karpathy’s “autoresearch” system. An AI agent explored architectural modifications on small d12 models (5-minute training runs), then translated winning changes to the full d24 setup.

The system found improvements to several architectural details, including the smear and backout mechanisms described in the model internals below.

These changes reduced training time from 2.02 hours to 1.65 hours, a 19% improvement discovered through autonomous experimentation.

How nanochat Works

The codebase contains roughly 3,000 lines across core modules. Let’s examine each component.

1. The GPT Model (nanochat/gpt.py)

The transformer follows modern best practices, with several optimizations layered on top.

Value Embeddings (ResFormer): Alternating layers include learnable value embeddings, mixed in via an input-dependent gate:

# Value residual: mix in value embedding with per-head gate
if ve is not None:
    ve = ve.view(B, T, self.n_kv_head, self.head_dim)
    gate = 3 * torch.sigmoid(self.ve_gate(x[..., :self.ve_gate_channels]))
    v = v + gate.unsqueeze(-1) * ve

This adds capacity without significant compute overhead.

Efficiency Tricks:

The model includes three learned mechanisms that improve training dynamics:

# 1. Per-layer residual scaling
x = self.resid_lambdas[i] * x + self.x0_lambdas[i] * x0

# 2. Smear: mix previous token embedding for bigram info
gate = self.smear_lambda * torch.sigmoid(self.smear_gate(x[:, :, :24]))
x = x + gate * x_pre_smear

# 3. Backout: subtract mid-layer residual
x = x - self.backout_lambda * x_backout

2. The Muon Optimizer (nanochat/optim.py)

nanochat uses a mixed optimizer strategy:

| Parameter Type | Optimizer | Purpose |
|---|---|---|
| Embeddings, lm_head | AdamW | Standard adaptive optimization |
| Scalar parameters | AdamW | Learned scaling factors |
| 2D matrices | Muon | Orthogonalized updates |

Muon (MomentUm Orthogonalized by Newton-Schulz):

The Muon optimizer orthogonalizes weight updates using a quintic Newton-Schulz iteration called “Polar Express”:

# Polar Express coefficients (5 iterations)
polar_express_coeffs = [
    (8.156, -22.483, 15.879),
    (4.043, -2.809, 0.500),
    # ... more coefficients
]

# Orthogonalization loop
for a, b, c in polar_express_coeffs[:ns_steps]:
    A = X.mT @ X
    B = b * A + c * (A @ A)
    X = a * X + X @ B
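To build intuition for what orthogonalization does, here is a dependency-free sketch using the textbook cubic Newton-Schulz map. nanochat's tuned quintic “Polar Express” coefficients converge faster, but both iterations drive X toward the nearest orthogonal matrix:

```python
def matmul(A, B):
    # Plain-Python matrix product (small matrices only)
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def newton_schulz(G, steps=20):
    """Orthogonalize G via the classic cubic Newton-Schulz iteration.

    The cubic map X <- 1.5*X - 0.5*X X^T X converges to the orthogonal
    polar factor of G when all singular values lie in (0, sqrt(3)).
    """
    # Normalize by the Frobenius norm so singular values are <= 1
    norm = sum(x * x for row in G for x in row) ** 0.5
    X = [[x / norm for x in row] for row in G]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(X, XXtX)]
    return X

U = newton_schulz([[3.0, 1.0], [1.0, 2.0]])
gram = matmul(transpose(U), U)
err = max(abs(gram[i][j] - (1.0 if i == j else 0.0)) for i in range(2) for j in range(2))
print(err < 1e-6)  # True: U^T U is numerically the identity
```

The payoff for an optimizer is that the orthogonalized update spreads energy evenly across all directions of the weight matrix instead of concentrating it in a few dominant singular directions.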

NorMuon Variance Reduction:

After orthogonalization, updates get normalized per-neuron to prevent scale collapse:

# Per-neuron second-moment estimate of the update along the reduction dim
v_mean = g.float().square().mean(dim=red_dim, keepdim=True)
v_norm = v_mean.sum(dim=(-2, -1), keepdim=True).sqrt()
# Rescale so the per-neuron-normalized update keeps its original overall norm
final_scale = step_size * (v_norm / v_norm_new.clamp_min(1e-10))
g = g * final_scale.to(g.dtype)

Distributed Training:

For multi-GPU setups, the optimizer implements ZeRO-2 style sharding with three-phase async communication:

Phase 1: Launch all async reduce_scatter operations
Phase 2: Wait for reduces, compute updates, launch all_gathers
Phase 3: Wait for gathers, copy back updated params

This overlaps communication with computation, maximizing GPU utilization.

3. Precision Management (nanochat/common.py)

nanochat manages precision explicitly instead of using torch.amp.autocast:

| Hardware | Default dtype | Reason |
|---|---|---|
| CUDA SM 80+ (A100, H100) | bfloat16 | Native BF16 tensor cores |
| CUDA SM < 80 (V100, T4) | float32 | No BF16 support |
| CPU / MPS | float32 | No reduced-precision cores |

The custom Linear layer casts weights to match compute dtype during forward pass:

class Linear(nn.Linear):
    def forward(self, x):
        # Cast FP32 master weights down to the activation dtype at use time
        return F.linear(x, self.weight.to(dtype=x.dtype))

Master weights stay in FP32 for optimizer precision. For H100 and Blackwell GPUs, FP8 training is available via --fp8, converting most layers to Float8Linear with tensorwise scaling.
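The dtype table above reduces to a small decision rule. A hedged sketch (function name and signature are illustrative, not nanochat's API):

```python
def default_dtype(device_type, sm_version=None):
    """Pick the training dtype from the hardware (illustrative sketch)."""
    if device_type == "cuda" and sm_version is not None and sm_version >= 80:
        return "bfloat16"  # A100/H100: native BF16 tensor cores
    return "float32"       # pre-Ampere CUDA, CPU, MPS: no reduced-precision cores

print(default_dtype("cuda", sm_version=90))  # H100 -> bfloat16
print(default_dtype("mps"))                  # Apple Silicon -> float32
```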

4. Data Loading (nanochat/dataloader.py)

The dataloader uses BOS-aligned best-fit packing: each training row is filled with whole documents wherever possible, so every token can attend back to a BOS token and sees its full document context.

# Find largest document that fits entirely
best_idx = -1
best_len = 0
for i, doc in enumerate(doc_buffer):
    doc_len = len(doc)
    if doc_len <= remaining and doc_len > best_len:
        best_idx = i
        best_len = doc_len

if best_idx >= 0:
    doc = doc_buffer.pop(best_idx)
    # Add full document
else:
    # Crop shortest doc to fill remaining space
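Put together, a simplified standalone version of this packing step (names and details are illustrative, not the repo's exact code) looks like:

```python
def pack_row(doc_buffer, seq_len):
    """Fill one training row with whole documents via best-fit packing.

    Simplified sketch: each doc is a list of token ids that already
    begins with a BOS token, so packed docs stay BOS-aligned.
    """
    row = []
    while len(row) < seq_len and doc_buffer:
        remaining = seq_len - len(row)
        # Find the largest buffered document that fits entirely
        best_idx, best_len = -1, 0
        for i, doc in enumerate(doc_buffer):
            if best_len < len(doc) <= remaining:
                best_idx, best_len = i, len(doc)
        if best_idx >= 0:
            row.extend(doc_buffer.pop(best_idx))
        else:
            # Nothing fits whole: crop the shortest doc to fill the gap
            i = min(range(len(doc_buffer)), key=lambda j: len(doc_buffer[j]))
            row.extend(doc_buffer.pop(i)[:remaining])
    return row

# Three docs (BOS = 1); the 5-token doc is packed whole, then a crop fills the row
print(pack_row([[1, 5, 6], [1, 7], [1, 8, 9, 10, 11]], seq_len=6))
# [1, 8, 9, 10, 11, 1]
```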

5. Flash Attention Unification (nanochat/flash_attention.py)

The project provides a unified interface that auto-switches between FA3 and PyTorch SDPA:

from nanochat.flash_attention import flash_attn

# Works on any hardware - auto-selects best backend
y = flash_attn.flash_attn_func(q, k, v, causal=True, window_size=window_size)

On Hopper GPUs with bfloat16, it uses Flash Attention 3; everywhere else, it falls back to PyTorch’s scaled dot-product attention.

6. Inference Engine (nanochat/engine.py)

The Engine class handles efficient generation and coordinates conversation flow, including forcing tool-output tokens when the model invokes the built-in calculator.

Step-by-Step: Train Your Own Model

The entire pipeline lives in runs/speedrun.sh. Here’s how to run it.

Prerequisites

- Access to a machine with NVIDIA GPUs (ideally an 8xH100 node; a single GPU works but is ~8× slower)
- Roughly $50-100 of cloud budget for a full run
- ~20 GB of free disk for training data
- The uv package manager (installed in Step 1 below)

Step 1: Environment Setup

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and activate virtual environment
uv venv
source .venv/bin/activate

# Install dependencies
uv sync --extra gpu

Step 2: Download Training Data

# Download 170 shards of the ClimbMix pretraining dataset
python -m nanochat.dataset -n 170

# ~170 shards at ~100 MB each
# Total: ~17 GB compressed

The script downloads pretraining data shards with file locking to handle multi-rank coordination.

Step 3: Train the Tokenizer

# Train BPE tokenizer with 32,768 vocab
python -m scripts.tok_train

# Evaluate compression ratio
python -m scripts.tok_eval

The tokenizer uses a GPT-4 style split pattern with byte-fallback BPE. Training completes in ~10 minutes on 2B characters.
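The compression ratio reported by tok_eval is simply the average number of UTF-8 bytes of raw text represented by each token. A minimal sketch of that measurement (hypothetical helper, not the script's actual code):

```python
def compression_ratio(text, token_ids):
    """Average bytes of UTF-8 text represented by each token."""
    return len(text.encode("utf-8")) / len(token_ids)

# If a tokenizer encodes an 11-byte string into 3 tokens:
ratio = compression_ratio("hello world", [312, 7, 1045])
print(round(ratio, 2))  # 3.67 bytes per token
```

Higher is better: a stronger tokenizer packs more text into each token, so the same training budget covers more raw characters.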

Step 4: Pretrain the Base Model

# Train d24 model (GPT-2 capability)
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=24 \
    --target-param-data-ratio=8 \
    --device-batch-size=16 \
    --fp8 \
    --run=my-first-model

Key parameters:

- --depth=24: the single model-size dial (24 transformer layers, GPT-2 capability)
- --target-param-data-ratio=8: ratio of training tokens to model parameters
- --device-batch-size=16: per-GPU micro-batch size
- --fp8: enable FP8 training (H100/Blackwell only)
- --run: a name for logging and checkpoints

Expected runtime: ~2 hours.

Step 5: Supervised Finetuning

# Download identity conversations
curl -L -o ~/.cache/nanochat/identity_conversations.jsonl \
    https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl

# Run SFT for chat capability
torchrun --standalone --nproc_per_node=8 -m scripts.chat_sft -- \
    --device-batch-size=16 \
    --run=my-sft

This teaches the model conversation format, special tokens, and tool use.

Step 6: Chat With Your Model

# CLI chat
python -m scripts.chat_cli -p "Why is the sky blue?"

# Or launch web UI
python -m scripts.chat_web

The web UI runs on port 8000 and provides a ChatGPT-like interface.

Research Workflow: Rapid Experimentation

For testing new ideas, use smaller models for faster iteration.

Quick Experiments (~5 minutes)

OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=12 \
    --run="d12-test" \
    --core-metric-every=999999 \
    --sample-every=-1 \
    --save-every=-1

This trains a d12 (GPT-1 size) model with minimal logging. Perfect for testing architectural changes.

Metrics to Monitor

Track these in Weights & Biases:

  1. val_bpb: Validation bits-per-byte (vocab-size-independent loss)
  2. core_metric: DCLM CORE evaluation score
  3. train/mfu: Model FLOPS utilization (hardware efficiency)
  4. train/tok_per_sec: Training throughput
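val_bpb deserves a note: dividing by raw text bytes rather than tokens makes the number comparable across tokenizers with different vocab sizes. A sketch of the conversion, assuming a mean cross-entropy loss measured in nats:

```python
import math

def bits_per_byte(mean_loss_nats, num_tokens, num_bytes):
    """Convert mean per-token cross-entropy (in nats) to bits per byte."""
    total_bits = mean_loss_nats * num_tokens / math.log(2)  # nats -> bits
    return total_bits / num_bytes

# A loss of ~2.77 nats/token at ~4 bytes/token is about 1 bit per byte:
print(round(bits_per_byte(2.773, 1000, 4000), 2))  # 1.0
```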

Testing Requirements

Any improvement must work across all depths (d12 through d26). This prevents overfitting to a single model size and ensures principled advances.

Why nanochat Matters

Cost Accessibility

| Approach | Cost | Time | Hardware |
|---|---|---|---|
| OpenAI GPT-2 (2019) | $43,000 | 168 hours | 32 TPU v3 |
| nanochat (2026) | $48 | 2 hours | 8xH100 |
| nanochat spot | ~$15 | 2 hours | 8xH100 spot |

This brings LLM training within reach of:

- Students and self-taught learners
- Independent researchers and hobbyists
- Small teams prototyping custom models

Educational Value

The codebase serves as a learning resource: roughly 3,000 readable lines cover the entire pipeline, so students can read, modify, and experiment with a complete LLM system end to end.

Research Velocity

Reducing training from days to hours enables:

- Same-day iteration on architectural ideas
- Cheap ablations across model sizes
- Autonomous experimentation loops like the autoresearch runs

Transparency

Every design choice is documented in the code and on the public leaderboard: the scaling-law measurements, the optimizer choices, and each record run’s key innovation are all on the record.

Limitations and Reality Check

nanochat is impressive but has clear boundaries.

Hardware Requirements

The $48 figure assumes access to an 8xH100 node. Cloud rental runs roughly $24/hour on demand, with spot instances substantially cheaper, so budget ~$50-100 for a full run depending on provider.

Capability Ceiling

nanochat achieves GPT-2 level performance (2019 technology). This means:

What it can do:

- Hold basic conversations and answer simple factual questions
- Write short stories and simple summaries
- Use its built-in calculator tool

What it cannot do:

- Complex multi-step reasoning
- Reliable factual recall (it hallucinates freely)
- Competent code generation or expert-level analysis

Think of it as a kindergartener: capable of basic conversation but not expert-level work.

Data Requirements

The full speedrun downloads:

- ~170 pretraining shards (~100 MB each, ~17 GB compressed)
- SFT data, including the identity conversations file

You’ll need adequate storage and bandwidth.

Metric Limitations

The CORE score measures 22 tasks but doesn’t capture:

- Conversation quality and instruction following
- Long-form generation coherence
- Safety and robustness

Different random seeds produce ~0.016 CORE variance. Your results may vary.

FAQ

How much does it cost to train a model with nanochat?

Approximately $48 on demand ($24/hour × 2 hours) or ~$15 on spot instances. This covers pretraining only. Add ~30 minutes for SFT.

What GPU do I need?

Minimum: Single GPU (any modern datacenter GPU). Optimal: 8xH100 or 8xA100 for fastest training. The code scales from 1 GPU to 8 GPUs with automatic gradient accumulation.

How long does training take?

1.65 to 3 hours depending on configuration and hardware. The current leaderboard record is 1.65 hours for a d24 model.

What is the CORE metric?

The DCLM CORE score evaluates models on 22 tasks including ARC (science questions), MMLU (multi-task language understanding), and other benchmarks. GPT-2 scored 0.256525. nanochat regularly exceeds 0.26.

Can I train on a single GPU?

Yes. Omit torchrun and the code automatically uses gradient accumulation. Training will take 8× longer but produces nearly identical results.
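The scaling arithmetic behind that answer: the total batch size is held fixed, so gradient-accumulation steps scale inversely with GPU count. A sketch with illustrative numbers:

```python
def grad_accum_steps(total_batch, device_batch, world_size):
    """Micro-batches each GPU accumulates before one optimizer step."""
    per_step = device_batch * world_size
    assert total_batch % per_step == 0, "total batch must divide evenly"
    return total_batch // per_step

print(grad_accum_steps(512, 16, world_size=8))  # 4 on an 8-GPU node
print(grad_accum_steps(512, 16, world_size=1))  # 32 on a single GPU (8x more steps)
```

Because the effective batch per optimizer step is identical either way, the single-GPU run produces a nearly identical model, just ~8× slower.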

What dataset does nanochat use?

The current best uses ClimbMix (NVIDIA’s curated web dataset). Previous versions used FineWeb-EDU. The tokenizer trains on ~2B characters from the first ~8 shards.

Does nanochat work on Apple Silicon?

Yes. The code runs on MPS (Metal Performance Shaders) with float32 precision. Training is slower than CUDA but functional for experimentation.

Can I resume training from a checkpoint?

Yes. Use --resume-from-step=<step> to continue from a saved checkpoint. The dataloader state is also saved for exact resumption.

What’s the difference between nanochat and nanoGPT?

nanoGPT covered pretraining only. nanochat extends to the full pipeline: tokenization, pretraining, SFT, RLHF, evaluation, inference, and web UI.

Conclusion

nanochat proves that LLM training no longer requires massive budgets or specialized infrastructure. What cost $43,000 in 2019 now costs under $50.

The project’s impact extends beyond raw cost reduction. By providing a minimal, readable codebase with a “one dial” interface, Karpathy has created both a research tool and an educational resource.

Key Takeaways

- A GPT-2 capability chatbot now costs under $50 and about 2 hours on an 8xH100 node
- One dial (--depth) controls the entire configuration
- The public leaderboard’s current record: 1.65 hours at a CORE score of 0.2626
- The ~3,000-line codebase covers everything from tokenization to web UI
- Expect GPT-2-era capability, not a modern assistant

Next Steps

Ready to train your own model? Start with the nanochat repository and the runs/speedrun.sh script.

For API developers building AI-powered applications, understanding LLM training internals has never been more accessible. The barrier to entry has dropped from “venture-funded startup” to “weekend project.”
