TL;DR
nanochat is Andrej Karpathy’s open-source LLM training framework: it trains a GPT-2 level chatbot for under $50 in about 2 hours on a single 8xH100 GPU node, using minimal code (~500 lines for the core model) and a single configuration dial (--depth) from which all other hyperparameters are derived. The current leaderboard record is a 1.65-hour run with a CORE score of 0.2626, beating the 0.2565 of OpenAI’s 2019 GPT-2, which cost $43,000 and took 168 hours.
Introduction
Training a large language model used to require millions of dollars and a team of PhD researchers. Those days are over.
Andrej Karpathy just released nanochat, an open-source project that trains a capable conversational AI for less than the cost of a nice dinner. The entire pipeline runs on a single 8xH100 GPU node and completes in under 2 hours.
Why This Matters Now
The AI landscape shifted dramatically in early 2026. What took OpenAI 168 hours and $43,000 in 2019 now takes 1.65 hours and $48. That’s a 100x speedup driven by algorithmic improvements, better hardware, and community optimization.
For API developers and teams building AI-powered applications, this changes everything. You can now experiment with custom model training, test architectural changes, and understand LLM internals without massive infrastructure budgets.
What You’ll Learn
By the end of this article, you’ll understand:
- How nanochat achieves a ~900x cost reduction vs the original GPT-2 training
- The complete architecture (GPT model, Muon optimizer, data loading)
- Step-by-step instructions to train your own model
- How to use nanochat for rapid LLM research and experimentation
- Real limitations and what GPT-2 capability actually means
What Is nanochat?
nanochat is a minimal LLM training harness that covers the entire development pipeline: tokenization, pretraining, finetuning, evaluation, inference, and a ChatGPT-like web UI.

The codebase fits in a single repository with no configuration monsters or framework complexity. Karpathy designed it as a “strong baseline” that’s readable, hackable, and forkable.
The Core Claim
Train a GPT-2 capability model (1.6B parameters) for:
- $48 on demand (2 hours at ~$24/hour for 8xH100)
- ~$15 on spot instances
For context, OpenAI’s original GPT-2 training in 2019 cost approximately $43,000 and took 7 days on 32 TPU v3 chips.
What nanochat Covers
| Stage | Script | Description |
|---|---|---|
| Tokenization | scripts.tok_train | Train BPE tokenizer (vocab 32,768) |
| Pretraining | scripts.base_train | Train base GPT model |
| Finetuning | scripts.chat_sft | Supervised finetuning for chat |
| Evaluation | scripts.base_eval | CORE metric, bits-per-byte |
| Inference | scripts.chat_cli | CLI chat interface |
| Web UI | scripts.chat_web | ChatGPT-like web interface |
The Philosophy: One Dial to Control Everything
Most LLM frameworks drown you in configuration files. nanochat takes the opposite approach.
The entire system revolves around one parameter: --depth (the number of transformer layers).
```bash
# GPT-1 size model
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=12

# GPT-2 capability model
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=24

# Pushing the boundaries
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26
```
Set the depth, and nanochat calculates everything else automatically:
- Transformer width (embedding dimension)
- Number of attention heads
- Learning rates for each parameter group
- Training horizon (total steps)
- Weight decay schedules
- Batch sizes
This “one dial” philosophy enables what Karpathy calls the nanochat miniseries: a family of compute-optimal models at different sizes, all trained with the same principled approach.
Why This Works
The team measured scaling laws across dozens of training runs. They found predictable relationships between depth, width, batch size, and training duration. Instead of exposing all these knobs, nanochat encodes these relationships directly into the training script.

You get compute-optimal training without needing a PhD in deep learning.
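To make the idea concrete, here is an illustrative sketch of deriving a config from the dial. The specific relationships (width = 64 × depth, fixed head size of 128) are assumptions for demonstration and may not match the repo exactly:

```python
def config_from_depth(depth: int) -> dict:
    """Derive the remaining model shape from the single --depth dial.

    Illustrative only: the linear width rule and head size below are
    assumed for this sketch, not taken from nanochat's source.
    """
    n_embd = 64 * depth          # width grows linearly with depth
    head_dim = 128               # fixed per-head dimension
    n_head = n_embd // head_dim  # number of attention heads
    return {"n_layer": depth, "n_embd": n_embd, "n_head": n_head}

print(config_from_depth(24))  # prints: {'n_layer': 24, 'n_embd': 1536, 'n_head': 12}
```

Learning rates, batch size, and training horizon would be derived from the same dial by similar fitted rules.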
The Leaderboard: Racing to Beat GPT-2
nanochat maintains a public leaderboard tracking “time to GPT-2” capability. The target is beating OpenAI’s original CORE score of 0.256525 on 22 evaluation tasks (ARC, MMLU, and others from the DCLM benchmark suite).
Current Records
| Run | Model | Time | CORE Score | Key Innovation |
|---|---|---|---|---|
| Original GPT-2 | 1.6B | 168 hours | 0.2565 | OpenAI 2019 baseline |
| Run 1 | d24 | 3.04 hrs | 0.2585 | Initial baseline |
| Run 2 | d26 | 2.91 hrs | 0.2578 | FP8 training |
| Run 3 | d26 | 2.76 hrs | 0.2602 | 1M token batch size |
| Run 4 | d24 | 2.02 hrs | 0.2571 | ClimbMix dataset |
| Run 5 | d24 | 1.80 hrs | 0.2690 | AI-discovered optimizations |
| Run 6 | d24 | 1.65 hrs | 0.2626 | Improved smear/backout |
How AI Discovered Optimizations
Runs 5 and 6 incorporated changes from Karpathy’s “autoresearch” system. An AI agent explored architectural modifications on small d12 models (5-minute training runs), then translated winning changes to the full d24 setup.
The system found improvements to:
- Backout mechanism: Better mid-layer residual subtraction
- Smear implementation: More efficient bigram mixing from previous tokens
These changes reduced training time from 2.02 hours to 1.65 hours, a 19% improvement discovered through autonomous experimentation.
How nanochat Works
The codebase contains roughly 3,000 lines across core modules. Let’s examine each component.
1. The GPT Model (nanochat/gpt.py)
The transformer follows modern best practices with several optimizations:
Architecture Features:
- Rotary embeddings (RoPE): Relative positional encoding without learned position embeddings
- QK normalization: Stabilizes training at scale
- Untied weights: Separate token embedding and output projection layers
- ReLU² activation: Squared ReLU in MLP instead of GeLU
- Grouped Query Attention (GQA): Fewer KV heads than query heads for faster inference
- Sliding window attention: Configurable pattern (e.g., “SSSL” alternates short/long context)
- Flash Attention 3: Hopper GPU optimization with SDPA fallback
Value Embeddings (ResFormer): Alternating layers include learnable value embeddings mixed in via input-dependent gating:

```python
# Value residual: mix in value embedding with per-head gate
if ve is not None:
    ve = ve.view(B, T, self.n_kv_head, self.head_dim)
    gate = 3 * torch.sigmoid(self.ve_gate(x[..., :self.ve_gate_channels]))
    v = v + gate.unsqueeze(-1) * ve
```
This adds capacity without significant compute overhead.
Efficiency Tricks:
The model includes three learned mechanisms that improve training dynamics:
```python
# 1. Per-layer residual scaling
x = self.resid_lambdas[i] * x + self.x0_lambdas[i] * x0

# 2. Smear: mix previous token embedding for bigram info
gate = self.smear_lambda * torch.sigmoid(self.smear_gate(x[:, :, :24]))
x = x + gate * x_pre_smear

# 3. Backout: subtract mid-layer residual
x = x - self.backout_lambda * x_backout
```
2. The Muon Optimizer (nanochat/optim.py)
nanochat uses a mixed optimizer strategy:
| Parameter Type | Optimizer | Purpose |
|---|---|---|
| Embeddings, lm_head | AdamW | Standard adaptive optimization |
| Scalar parameters | AdamW | Learned scaling factors |
| 2D matrices | Muon | Orthogonalized updates |
Muon (MomentUm Orthogonalized by Newton-Schulz):
The Muon optimizer orthogonalizes weight updates using a quintic Newton-Schulz iteration called “Polar Express”:
```python
# Polar Express coefficients (5 iterations)
polar_express_coeffs = [
    (8.156, -22.483, 15.879),
    (4.043, -2.809, 0.500),
    # ... more coefficients
]

# Orthogonalization loop
for a, b, c in polar_express_coeffs[:ns_steps]:
    A = X.mT @ X
    B = b * A + c * (A @ A)
    X = a * X + X @ B
```
NorMuon Variance Reduction:
After orthogonalization, updates get normalized per-neuron to prevent scale collapse:

```python
# v_norm_new (computed in code elided here) is the same aggregate
# norm taken after the per-neuron normalization step
v_mean = g.float().square().mean(dim=red_dim, keepdim=True)
v_norm = v_mean.sum(dim=(-2, -1), keepdim=True).sqrt()
final_scale = step_size * (v_norm / v_norm_new.clamp_min(1e-10))
g = g * final_scale.to(g.dtype)
```
Distributed Training:
For multi-GPU setups, the optimizer implements ZeRO-2 style sharding with three-phase async communication:
Phase 1: Launch all async reduce_scatter operations
Phase 2: Wait for reduces, compute updates, launch all_gathers
Phase 3: Wait for gathers, copy back updated params
This overlaps communication with computation, maximizing GPU utilization.
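The three phases can be sketched with thread futures standing in for NCCL's async collective handles. This shows structure only (identity ops instead of real reduce_scatter/all_gather); the actual implementation uses torch.distributed collectives:

```python
from concurrent.futures import ThreadPoolExecutor

def step_sharded(grads, update_fn, pool):
    """Three-phase update in the style of ZeRO-2 sharding (sketch).

    Thread futures stand in for NCCL's async handles; update_fn stands
    in for the per-shard optimizer update.
    """
    # Phase 1: launch all async "reduce_scatter"s (identity here)
    reduce_handles = [pool.submit(lambda g: g, g) for g in grads]
    # Phase 2: wait for each reduce, compute the update, launch "all_gather"s
    gather_handles = [pool.submit(update_fn, h.result()) for h in reduce_handles]
    # Phase 3: wait for gathers and copy back the updated shards
    return [h.result() for h in gather_handles]

with ThreadPoolExecutor(max_workers=4) as pool:
    updated = step_sharded([1.0, 2.0, 3.0], lambda g: g + 1.0, pool)
print(updated)  # prints: [2.0, 3.0, 4.0]
```

The key property is that later shards' communication runs while earlier shards' updates are still being computed.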
3. Precision Management (nanochat/common.py)
nanochat manages precision explicitly instead of using torch.amp.autocast:
| Hardware | Default dtype | Reason |
|---|---|---|
| CUDA SM 80+ (A100, H100) | bfloat16 | Native BF16 tensor cores |
| CUDA SM < 80 (V100, T4) | float32 | No BF16 support |
| CPU / MPS | float32 | No reduced-precision cores |
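A minimal sketch of this policy as a pure function (illustrative; the real code would consult torch.cuda.get_device_capability() and the device type at startup):

```python
def pick_dtype(device_type: str, sm_major: int = 0) -> str:
    """Select the compute dtype following the policy in the table above.

    Sketch: returns dtype names as strings rather than torch dtypes.
    """
    if device_type == "cuda" and sm_major >= 8:
        return "bfloat16"  # A100/H100: native BF16 tensor cores
    return "float32"       # older CUDA GPUs, CPU, and MPS

print(pick_dtype("cuda", sm_major=9))  # H100 -> prints: bfloat16
print(pick_dtype("mps"))               # Apple Silicon -> prints: float32
```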
The custom Linear layer casts weights to match compute dtype during forward pass:
```python
class Linear(nn.Linear):
    def forward(self, x):
        # cast weights to the activation dtype at compute time
        return F.linear(x, self.weight.to(dtype=x.dtype))
```
Master weights stay in FP32 for optimizer precision. For H100 and Blackwell GPUs, FP8 training is available via --fp8, converting most layers to Float8Linear with tensorwise scaling.
4. Data Loading (nanochat/dataloader.py)
The dataloader uses BOS-aligned best-fit packing:
- Every row starts with BOS (Beginning of Sequence) token
- Documents packed using best-fit algorithm to minimize waste
- When no document fits, one gets cropped to fill exactly
- 100% utilization with ~35% token cropping at 2048 sequence length
This ensures every token can attend back to BOS and see full document context.
```python
# Find the largest document that fits entirely in the remaining space
best_idx = -1
best_len = 0
for i, doc in enumerate(doc_buffer):
    doc_len = len(doc)
    if doc_len <= remaining and doc_len > best_len:
        best_idx = i
        best_len = doc_len

if best_idx >= 0:
    # Pack the full document
    doc = doc_buffer.pop(best_idx)
else:
    # No document fits: crop the shortest doc to fill the remaining space
    ...
```
5. Flash Attention Unification (nanochat/flash_attention.py)
The project provides a unified interface that auto-switches between FA3 and PyTorch SDPA:
```python
from nanochat.flash_attention import flash_attn

# Works on any hardware - auto-selects the best backend
y = flash_attn.flash_attn_func(q, k, v, causal=True, window_size=window_size)
```
On Hopper GPUs with bfloat16, it uses Flash Attention 3. Everywhere else falls back to PyTorch’s scaled dot-product attention.
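The dispatch can be sketched as a pure selection function. The function name and arguments here are illustrative, and the conditions (Hopper, i.e. sm90, plus bfloat16 for FA3) follow the article's description rather than the repo's exact checks:

```python
def pick_attention_backend(device_arch: str, dtype: str, fa3_available: bool) -> str:
    """Choose between Flash Attention 3 and PyTorch SDPA (sketch)."""
    if fa3_available and device_arch == "sm90" and dtype == "bfloat16":
        return "flash_attention_3"
    return "torch_sdpa"  # works everywhere, including CPU and MPS

print(pick_attention_backend("sm90", "bfloat16", True))  # prints: flash_attention_3
print(pick_attention_backend("sm80", "bfloat16", True))  # prints: torch_sdpa
```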
6. Inference Engine (nanochat/engine.py)
The Engine class handles efficient generation with:
- KV Cache: Pre-filled prompt cache with FA3’s flash_attn_with_kvcache
- Tool Use: Special tokens trigger a Python calculator via eval()
- Batch Generation: Clone the KV cache for parallel sampling
The engine coordinates conversation flow, including forcing tool output tokens when the model invokes the calculator.
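The calculator path is ordinary Python eval() on the expression the model emits between tool tokens. A minimal sketch (the empty-builtins environment and error handling are additions for this sketch, not necessarily what nanochat does):

```python
def run_tool_call(expr: str) -> str:
    """Evaluate a calculator expression emitted between tool tokens.

    Sketch: eval() with no builtins as a light guard; a real
    deployment would sandbox this more carefully.
    """
    try:
        return str(eval(expr, {"__builtins__": {}}, {}))
    except Exception:
        return "error"

print(run_tool_call("(1 + 2) * 3"))  # prints: 9
print(run_tool_call("import os"))    # prints: error
```

The engine would then force the returned string back into the token stream as tool output.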
Step-by-Step: Train Your Own Model
The entire pipeline lives in runs/speedrun.sh. Here’s how to run it.
Prerequisites
- 8xH100 GPU node (or similar)
- ~20 GB disk space for dataset
- Python 3.10+
- uv package manager
Step 1: Environment Setup
```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and activate virtual environment
uv venv
source .venv/bin/activate

# Install dependencies
uv sync --extra gpu
```
Step 2: Download Training Data
```bash
# Download ~2B characters from ClimbMix dataset
python -m nanochat.dataset -n 170

# This downloads ~170 shards at ~100MB each
# Total: ~17 GB compressed
```
The script downloads pretraining data shards with file locking to handle multi-rank coordination.
Step 3: Train the Tokenizer
```bash
# Train BPE tokenizer with 32,768 vocab
python -m scripts.tok_train

# Evaluate compression ratio
python -m scripts.tok_eval
```
The tokenizer uses a GPT-4 style split pattern with byte-fallback BPE. Training completes in ~10 minutes on 2B characters.
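As a toy illustration of the core of BPE training, the following finds the most frequent adjacent byte pair, which BPE would merge into a new token. scripts.tok_train repeats this at scale (with the regex pre-split and byte fallback) until the vocabulary reaches 32,768 (2^15) tokens:

```python
from collections import Counter

def most_frequent_pair(ids):
    """One step of BPE training: count adjacent pairs, return the winner.

    Toy sketch; the real tokenizer pre-splits text with a regex first
    and merges pairs repeatedly to grow the vocabulary.
    """
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)

ids = list(b"banana")
print(most_frequent_pair(ids))  # prints: (97, 110) -- the bytes of "an"
```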
Step 4: Pretrain the Base Model
```bash
# Train d24 model (GPT-2 capability)
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
  --depth=24 \
  --target-param-data-ratio=8 \
  --device-batch-size=16 \
  --fp8 \
  --run=my-first-model
```
Key parameters:
- --depth=24: GPT-2 size model
- --target-param-data-ratio=8: Slightly undertrained for speed
- --device-batch-size=16: Per-GPU batch size
- --fp8: Enable FP8 training (H100+ only)
Expected runtime: ~2 hours.
Step 5: Supervised Finetuning
```bash
# Download identity conversations
curl -L -o ~/.cache/nanochat/identity_conversations.jsonl \
  https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl

# Run SFT for chat capability
torchrun --standalone --nproc_per_node=8 -m scripts.chat_sft -- \
  --device-batch-size=16 \
  --run=my-sft
```
This teaches the model conversation format, special tokens, and tool use.
Step 6: Chat With Your Model
```bash
# CLI chat
python -m scripts.chat_cli -p "Why is the sky blue?"

# Or launch web UI
python -m scripts.chat_web
```
The web UI runs on port 8000 and provides a ChatGPT-like interface.
Research Workflow: Rapid Experimentation
For testing new ideas, use smaller models for faster iteration.
Quick Experiments (~5 minutes)
```bash
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
  --depth=12 \
  --run="d12-test" \
  --core-metric-every=999999 \
  --sample-every=-1 \
  --save-every=-1
```
This trains a d12 (GPT-1 size) model with minimal logging. Perfect for testing architectural changes.
Metrics to Monitor
Track these in Weights & Biases:
- val_bpb: Validation bits-per-byte (vocab-size-independent loss)
- core_metric: DCLM CORE evaluation score
- train/mfu: Model FLOPS utilization (hardware efficiency)
- train/tok_per_sec: Training throughput
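The conversion behind val_bpb is simple and worth seeing once. This sketch turns a mean per-token cross-entropy (in nats) into bits per byte, which is why the metric is comparable across tokenizers with different vocabulary sizes:

```python
import math

def bits_per_byte(mean_loss_nats: float, num_tokens: int, num_bytes: int) -> float:
    """Convert mean per-token cross-entropy (nats) to bits per byte."""
    total_bits = mean_loss_nats * num_tokens / math.log(2)  # nats -> bits
    return total_bits / num_bytes                           # normalize by raw bytes

# e.g. a loss of 2.0 nats/token at ~4.6 bytes per token
print(round(bits_per_byte(2.0, num_tokens=1000, num_bytes=4600), 3))  # prints: 0.627
```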
Testing Requirements
Any improvement must work across all depths (d12 through d26). This prevents overfitting to a single model size and ensures principled advances.
Why nanochat Matters
Cost Accessibility
| Approach | Cost | Time | Hardware |
|---|---|---|---|
| OpenAI GPT-2 (2019) | $43,000 | 168 hours | 32 TPU v3 |
| nanochat (2026) | $48 | 2 hours | 8xH100 |
| nanochat spot | ~$15 | 2 hours | 8xH100 spot |
This brings LLM training within reach of:
- Individual researchers
- Small startups
- University courses
- Hobbyists
Educational Value
The codebase serves as a learning resource:
- ~500 lines for GPT model
- ~530 lines for optimizer
- Clear comments on every design decision
- No hidden configuration
Students can read, modify, and experiment with a complete LLM pipeline.
Research Velocity
Reducing training from days to hours enables:
- Faster hypothesis testing
- More experiments per week
- Lower cost of failure
- Community collaboration via leaderboard
Transparency
Every design choice is documented:
- Scaling laws in dev/LOG.md
- Ablation studies in GitHub Discussions
- Full reproduction details for leaderboard entries
- Clear AI contribution disclosure
Limitations and Reality Check
nanochat is impressive but has clear boundaries.
Hardware Requirements
The $48 figure assumes access to an 8xH100 node. Cloud rental costs vary:
- Lambda Labs: ~$25/hour for 8xH100
- RunPod: ~$15/hour spot pricing
- Total runtime: ~2 hours pretraining + SFT
You’ll need ~$50-100 for a full run depending on provider.
Capability Ceiling
nanochat achieves GPT-2 level performance (2019 technology). This means:
What it can do:
- Basic conversation
- Simple reasoning
- Elementary math
- Factual recall (limited)
What it cannot do:
- Complex multi-step reasoning
- Code generation beyond simple functions
- Nuanced instruction following
- Competitive with GPT-4, Claude, or Gemini
Think of it as a kindergartener: capable of basic conversation but not expert-level work.
Data Requirements
The full speedrun downloads:
- ~170 data shards
- ~17 GB compressed
- ~2B characters total
You’ll need adequate storage and bandwidth.
Metric Limitations
The CORE score measures 22 tasks but doesn’t capture:
- Real-world conversation quality
- Domain-specific knowledge
- Instruction following nuance
- Safety and alignment
Different random seeds produce ~0.016 CORE variance. Your results may vary.
FAQ
How much does it cost to train a model with nanochat?
Approximately $48 on demand ($24/hour × 2 hours) or ~$15 on spot instances. This covers pretraining only. Add ~30 minutes for SFT.
What GPU do I need?
Minimum: Single GPU (any modern datacenter GPU). Optimal: 8xH100 or 8xA100 for fastest training. The code scales from 1 GPU to 8 GPUs with automatic gradient accumulation.
How long does training take?
1.65 to 3 hours depending on configuration and hardware. The current leaderboard record is 1.65 hours for a d24 model.
What is the CORE metric?
The DCLM CORE score evaluates models on 22 tasks including ARC (science questions), MMLU (multi-task language understanding), and other benchmarks. GPT-2 scored 0.256525. nanochat regularly exceeds 0.26.
Can I train on a single GPU?
Yes. Omit torchrun and the code automatically uses gradient accumulation. Training will take 8× longer but produces nearly identical results.
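The accumulation idea can be sketched with scalar "gradients" (grad_fn stands in for a forward/backward pass on one micro-batch):

```python
def accumulated_grad(grad_fn, micro_batches):
    """Average per-micro-batch gradients before a single optimizer step.

    Scalar sketch of gradient accumulation: one GPU processes the
    micro-batches sequentially instead of eight GPUs in parallel.
    """
    total = 0.0
    for mb in micro_batches:
        total += grad_fn(mb)           # accumulate, don't step yet
    return total / len(micro_batches)  # average, then step once

# 8 micro-batches on one GPU match one step of an 8-GPU run
g = accumulated_grad(lambda mb: sum(mb) / len(mb), [[1.0, 3.0]] * 8)
print(g)  # prints: 2.0
```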
What dataset does nanochat use?
The current best uses ClimbMix (NVIDIA’s curated web dataset). Previous versions used FineWeb-EDU. The tokenizer trains on ~2B characters from the first ~8 shards.
Does nanochat work on Apple Silicon?
Yes. The code runs on MPS (Metal Performance Shaders) with float32 precision. Training is slower than CUDA but functional for experimentation.
Can I resume training from a checkpoint?
Yes. Use --resume-from-step=<step> to continue from a saved checkpoint. The dataloader state is also saved for exact resumption.
What’s the difference between nanochat and nanoGPT?
nanoGPT covered pretraining only. nanochat extends to the full pipeline: tokenization, pretraining, SFT, RLHF, evaluation, inference, and web UI.
Conclusion
nanochat proves that LLM training no longer requires massive budgets or specialized infrastructure. What cost $43,000 in 2019 now costs under $50.
The project’s impact extends beyond raw cost reduction. By providing a minimal, readable codebase with a “one dial” interface, Karpathy has created both a research tool and an educational resource.
Key Takeaways
- ~900x cost reduction: From $43,000 to $48 for GPT-2 capability
- ~100x speedup: From 168 hours to 1.65 hours
- Single configuration dial: --depth controls everything
- Full pipeline: Tokenization through web UI
- Community driven: Public leaderboard with continuous improvements
Next Steps
Ready to train your own model? Start with the nanochat repository and the runs/speedrun.sh script.
For API developers building AI-powered applications, understanding LLM training internals has never been more accessible. The barrier to entry has dropped from “venture-funded startup” to “weekend project.”



