Chrome shipped an AI model directly inside the browser. The Prompt API is the JavaScript surface you call to use it. No API key, no network round trip, no per-token bill. The model is Gemini Nano, it runs on the user’s device, and as of Chrome 138 it is generally available for extensions and behind a flag for web pages. For API developers, this changes what is reasonable to do on the client.
This guide covers what the Chrome Prompt API is, how it differs from the cloud Gemini API, when it actually fits an API workflow, hands-on code for both extensions and web pages, and the limits you will hit faster than the docs admit. We pair it with Apidog at the end so the same tasks have a fallback path when the model is not available.
TL;DR
- The Chrome Prompt API exposes Gemini Nano through LanguageModel, available on window.LanguageModel for web pages and chrome.languageModel for extensions.
- The model runs entirely on-device. No network call, no key, no token cost.
- Stable for Chrome extensions in Chrome 138+. For web pages, it ships behind the chrome://flags/#prompt-api-for-gemini-nano flag or a registered Origin Trial.
- Best uses for API developers: client-side input parsing, JSON shape repair, summarizing API responses for UI, and stub generation during development.
- Always wire a cloud fallback. The on-device model fails open; your code should not.
What the Prompt API actually exposes
The Prompt API is one of a small group of “Built-in AI” APIs Chrome started shipping last year. The others are narrower: Summarizer, Writer, Rewriter, Translator, and Language Detector. The Prompt API is the general-purpose surface; the others wrap it with task-specific defaults.

Three primitives matter:
- LanguageModel.availability(). Returns available, downloadable, downloading, or unavailable. The model is around 2 GB and downloads in the background the first time a site requests it.
- LanguageModel.create(options). Spins up a session. The session holds turn state, system prompt, and a few sampling knobs.
- session.prompt(text) and session.promptStreaming(text). The two ways to actually call the model.
The shape is intentionally close to the Gemini cloud SDK but trimmed. There is no tool calling yet, no image input on the stable channel (it is in Origin Trial), and the context window is small (4K tokens of input, 1K of output, with a soft expand to 8K total).
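With a window that small, most real inputs need trimming before they reach the model. The following is a minimal sketch of one way to do that, not part of the Prompt API: the helper names estimateTokens and truncateToTokenBudget are hypothetical, and the figure of roughly 4 characters per token is a common heuristic for English text, not a documented property of Gemini Nano.

```javascript
// Rough token estimate. ~4 characters per token is an assumed heuristic
// for English text, not a Gemini Nano guarantee.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Keeps the head and tail of an oversized input and drops the middle, so the
// start of a payload and its final records or error keys both survive.
function truncateToTokenBudget(text, maxTokens) {
  const maxChars = maxTokens * CHARS_PER_TOKEN;
  if (text.length <= maxChars) return text;
  const marker = '\n...[truncated]...\n';
  const keep = Math.max(0, maxChars - marker.length);
  const head = Math.ceil(keep / 2);
  const tail = keep - head;
  return text.slice(0, head) + marker + text.slice(text.length - tail);
}
```

If the 4K input figure holds on your target devices, a call such as truncateToTokenBudget(body, 3500) would leave some room for the system prompt. Measure on real payloads before settling on a budget.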
A first call from a web page looks like this:
if (!('LanguageModel' in window)) {
  console.warn('Prompt API not available. Falling back to cloud.');
} else {
  const status = await LanguageModel.availability();
  if (status === 'unavailable') {
    console.warn('Device does not support Gemini Nano.');
  } else {
    // If the model is not yet 'available', create() triggers the background
    // download. The monitor reports progress so you can show a UI.
    const session = await LanguageModel.create({
      systemPrompt: 'You answer in three concise bullets.',
      monitor(m) {
        m.addEventListener('downloadprogress', e => {
          console.log(`downloaded ${(e.loaded * 100).toFixed(0)}%`);
        });
      },
    });
    // `changelog` is a string you supply.
    const reply = await session.prompt(
      'Summarize this changelog in three bullets.\n\n' + changelog
    );
    console.log(reply);
  }
}
Every meaningful piece is on display in the snippet: feature detection, availability check, optional download, session creation, system prompt, prompt call.
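The availability branch can also be factored into a small pure function, which keeps the decision testable without a browser. This is a sketch only: the function name planForStatus and the action names it returns are hypothetical, and the four status strings are the ones listed earlier in this guide.

```javascript
// Maps an availability() status string to what the caller should do next.
// The returned action names are illustrative, not part of the Prompt API.
function planForStatus(status) {
  switch (status) {
    case 'available':
      return 'use-local';
    case 'downloadable':
    case 'downloading':
      return 'download-then-use';
    default:
      // 'unavailable' and any status this code does not recognize.
      return 'fallback';
  }
}
```

Treating unknown statuses as a fallback is a deliberate choice here: if Chrome adds a new status string later, the page degrades to the cloud path instead of throwing.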
How it differs from the cloud Gemini API
Same family, different deployment. The differences shape what you can and cannot build on it.
| Property | Chrome Prompt API | Gemini API (cloud) |
|---|---|---|
| Model | Gemini Nano (on-device) | gemini-3-flash, gemini-3-flash-preview, gemini-3-pro |
| Cost per call | Zero | Per-token billing |
| Latency | 50 to 300 ms typical first token | 200 to 800 ms first token |
| Network | None required after model download | Required every call |
| Privacy | Input never leaves the device | Sent to Google servers |
| Context window | 4K input / 1K output (8K combined) | Up to 1M tokens |
| Tool calling | No (planned) | Yes |
| Multimodal | Image input in Origin Trial | Yes |
| JSON mode | Best-effort via system prompt | First-class with schema |
| Availability | Chrome only, capable hardware only | Any client with network |
The on-device model is roughly two orders of magnitude smaller than gemini-3-flash. Use it for short tasks where you would have shipped a regex or a hand-tuned prompt classifier. Do not use it as a drop-in for cloud Gemini.
Where it actually fits in an API developer’s workflow
Four use cases pay for the integration cost. Outside these, the cloud API is still the right call.
1. Parsing and reshaping user input client-side. Take a free-form query and turn it into a structured filter for your API. The user types “stripe charges over $100 last week”; the Prompt API turns it into { "amount_gt": 100, "since": "2026-04-22", "provider": "stripe" } before you call your search endpoint. Saves a round trip and protects user privacy.
2. Summarizing API responses for UI. You hit your own API, get back 40 records, and need a one-line summary to show in a card. Sending the records to a cloud model adds latency and cost. The Prompt API runs locally and returns in under 200 ms.
3. JSON shape repair. LLM responses arrive malformed often enough to matter. Run a one-shot repair pass through Gemini Nano: “Here is invalid JSON. Return only valid JSON with the same fields.” Cheap, fast, no cost.
4. Local stubbing during development. While you are wiring a new endpoint and the backend is half-built, generate plausible response bodies on the fly. The shapes will not be production-correct, but they unblock front-end work. Combine with Apidog’s mock server for a hybrid setup where critical endpoints come from saved examples and exploratory ones come from the Prompt API.
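Use cases 1 and 3 both come down to asking for JSON and then refusing to trust the reply. The sketch below assumes session is any object with an async prompt(text) method, such as one returned by LanguageModel.create. The helper names extractJson, parseQueryToFilter, and repairJson, and the list of allowed filter keys, are hypothetical.

```javascript
// Pulls the first {...} span out of a model reply and parses it.
// Small models often wrap JSON in prose, so do not parse the raw reply.
// Handles top-level objects only, not top-level arrays.
function extractJson(reply) {
  const start = reply.indexOf('{');
  const end = reply.lastIndexOf('}');
  if (start === -1 || end <= start) throw new Error('No JSON object in reply');
  return JSON.parse(reply.slice(start, end + 1));
}

// Use case 1: free-form query -> structured filter, keeping only known keys.
const ALLOWED_KEYS = ['amount_gt', 'since', 'provider']; // hypothetical schema

async function parseQueryToFilter(session, query) {
  const reply = await session.prompt(
    'Convert this search query to a JSON object using only the keys ' +
      ALLOWED_KEYS.join(', ') + '. Return JSON only.\n\n' + query
  );
  const parsed = extractJson(reply);
  return Object.fromEntries(
    Object.entries(parsed).filter(([key]) => ALLOWED_KEYS.includes(key))
  );
}

// Use case 3: only call the model when JSON.parse fails on its own.
async function repairJson(session, raw) {
  try {
    return JSON.parse(raw);
  } catch {
    const reply = await session.prompt(
      'Here is invalid JSON. Return only valid JSON with the same fields.\n\n' + raw
    );
    return extractJson(reply);
  }
}
```

Both helpers throw when the reply still contains no parseable object, which gives the caller a clear point to switch to a cloud path or show an error.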
Building it into an extension
Extensions get the Prompt API on the stable channel from Chrome 138 onwards. You declare the permission and call chrome.languageModel.
manifest.json:
{
  "manifest_version": 3,
  "name": "Endpoint Summarizer",
  "version": "1.0.0",
  "permissions": ["languageModel", "activeTab", "scripting"],
  "action": { "default_popup": "popup.html" }
}
popup.js:
// Wrapped in a function because a bare top-level return is a syntax error.
async function main() {
  const out = document.getElementById('out');
  const status = await chrome.languageModel.availability();
  if (status === 'unavailable') {
    out.textContent = 'Device does not support on-device AI.';
    return;
  }
  const session = await chrome.languageModel.create({
    systemPrompt: [
      'You summarize HTTP responses in three short bullets.',
      'Mention status, the most-changed field, and any error keys.',
    ].join(' '),
    temperature: 0.3,
    topK: 3,
  });
  document.getElementById('go').addEventListener('click', async () => {
    // Read the active tab's visible text, capped to fit the context window.
    const [tab] = await chrome.tabs.query({ active: true, currentWindow: true });
    const [{ result }] = await chrome.scripting.executeScript({
      target: { tabId: tab.id },
      func: () => document.body.innerText.slice(0, 4000),
    });
    out.textContent = '';
    for await (const chunk of session.promptStreaming(result)) {
      out.textContent += chunk;
    }
  });
}
main();
Two things worth calling out. temperature and topK are the only sampling knobs the API exposes; topP is not supported on the stable channel. Streaming is an async iterator, not server-sent events, so the consumption pattern is for await rather than the SSE reader you would write for cloud Gemini.
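The for await pattern works on any async iterable, so the consumption loop can be pulled into a helper and tested without the model. A minimal sketch: the name collectStream is hypothetical, and it assumes each chunk carries only new text (a delta), which is how the popup code above treats it.

```javascript
// Consumes an async iterable of text chunks, calls onChunk for each one,
// and returns the concatenated text. Assumes chunks are deltas.
async function collectStream(stream, onChunk = () => {}) {
  let full = '';
  for await (const chunk of stream) {
    full += chunk;
    onChunk(chunk, full);
  }
  return full;
}
```

In the popup, the loop could then become a call such as collectStream(session.promptStreaming(result), (_, full) => { out.textContent = full; }).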
Building it into a web page
Web pages need the user to flip a flag or your origin to be enrolled in the Origin Trial. The trial token goes in a meta tag.
<!DOCTYPE html>
<html>
  <head>
    <meta http-equiv="origin-trial" content="YOUR_TRIAL_TOKEN_HERE" />
  </head>
  <body>
    <textarea id="in" placeholder="Paste an API response..."></textarea>
    <button id="go">Summarize</button>
    <pre id="out"></pre>
    <script type="module">
      if (!('LanguageModel' in window)) {
        document.getElementById('out').textContent =
          'Prompt API not available in this browser.';
      } else {
        const session = await LanguageModel.create({
          systemPrompt: 'Reply in JSON: { "summary": "...", "tags": [...] }',
          temperature: 0.2,
        });
        document.getElementById('go').onclick = async () => {
          const text = document.getElementById('in').value;
          const reply = await session.prompt(text);
          try {
            document.getElementById('out').textContent =
              JSON.stringify(JSON.parse(reply), null, 2);
          } catch {
            document.getElementById('out').textContent = reply;
          }
        };
      }
    </script>
  </body>
</html>
If you want to test the page without an Origin Trial token, open chrome://flags/#prompt-api-for-gemini-nano, set it to Enabled, and restart Chrome. The flag has been stable across the last six versions but is not promised to stay forever; ship the Origin Trial path if you want predictable behavior.
Limits and gotchas the docs do not stress enough
Six things that will trip you up.
- Context is small. 4K input, 1K output. Truncate aggressively. Do not paste a 50K-token JSON document and expect a useful answer.
- Hardware support is uneven. The model needs roughly 4 GB of VRAM or unified memory and runs only on Chrome 138+ on Windows, macOS, Linux, and recent ChromeOS. Mobile Chrome is not supported as of this writing.
- First load is slow. The 2 GB download happens in the background but blocks the first session. Always show a download progress UI.
- No tool calling. If your task needs the model to call your API, do that on the client yourself; the model only decides what to call.
- System prompt drift. The on-device model follows system prompts less rigidly than the cloud variants. Pin the format with one-shot examples in the system prompt.
- Permissions matter. Extensions need "languageModel" in permissions. Forget it and the API silently returns unavailable.
Wire a cloud fallback before you ship
Your app ships to users who do not have the model. Always wire a fallback. The pattern is short:
async function summarize(text) {
  if ('LanguageModel' in window) {
    const status = await LanguageModel.availability();
    if (status === 'available') {
      const session = await LanguageModel.create({
        systemPrompt: 'Reply with one bullet summary, max 12 words.',
      });
      return session.prompt(text);
    }
  }
  // Fallback: call your server, which calls cloud Gemini or your own model.
  const r = await fetch('/api/summarize', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text }),
  });
  return (await r.json()).summary;
}
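The pattern above only falls back when the model is missing. It does not cover a session that throws or stalls mid-call. One possible extension is a wrapper that races the on-device path against a timeout. This is a sketch under stated assumptions: the name withFallback and the 3000 ms default are hypothetical, and local and cloud are async functions you supply.

```javascript
// Runs `local` first. If it throws, rejects, or takes longer than timeoutMs,
// runs `cloud` instead. Both are caller-supplied async functions.
async function withFallback(local, cloud, timeoutMs = 3000) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error('local model timed out')),
      timeoutMs
    );
  });
  try {
    return await Promise.race([local(), timeout]);
  } catch (err) {
    console.warn('On-device path failed, using fallback:', err.message);
    return cloud();
  } finally {
    clearTimeout(timer);
  }
}
```

Note that the timeout only stops waiting; it does not cancel the on-device call, which may still finish in the background.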
Privacy and what to tell users
The selling point of the Prompt API is that input never leaves the device. That is true today and is the explicit design intent of the Built-in AI initiative. Two qualifiers worth knowing:
- The model itself was trained by Google on Google’s data; running it locally does not change that. Chrome ships the weights with the browser update.
- Telemetry about usage may still be reported by Chrome under the user’s normal Chrome telemetry settings. The prompt content is not part of that telemetry.
For most consumer apps this is a strong privacy story you can put in your UI without legal review. For regulated workloads (HIPAA, PCI), confirm with counsel before you depend on it.
When to skip the Prompt API
Pick the cloud Gemini API instead when:
- Your task needs more than 4K tokens of input.
- You need tool calling, structured output with schema enforcement, or multimodal inputs beyond Origin Trial.
- You serve users on Safari, Firefox, or mobile Chrome. Browser support is Chrome-only today and Apple has not signaled a shipping date.
- Output quality matters more than latency. Nano is small; Pro is not.
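The mechanical parts of this checklist can be expressed as a routing function; the quality judgment in the last bullet stays a human call. This is an illustrative sketch: the function name shouldUseCloud and the field names on task are hypothetical, and the 4096-token default stands in for the 4K input figure quoted in this guide.

```javascript
// Decides whether a task should skip the on-device model.
// `task` fields are hypothetical; maxInputTokens is a parameter so the
// limit is easy to change if Chrome raises it.
function shouldUseCloud(task, maxInputTokens = 4096) {
  return (
    task.estimatedInputTokens > maxInputTokens ||
    Boolean(task.needsToolCalling) ||
    Boolean(task.needsSchemaEnforcement) ||
    !task.promptApiDetected
  );
}
```

Here promptApiDetected would come from a feature check such as 'LanguageModel' in window, which covers the Safari, Firefox, and mobile Chrome cases.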
For the open-weight angle, How to run DeepSeek V4 locally covers running a meaningfully larger model on a developer machine without leaving the local network.
FAQ
Is the Prompt API in the official web standards process? It is in the W3C WebML community group as a proposal. Treat it as Chrome-specific until other engines ship.
Can I use it from a service worker? In Chrome 138+, yes for extensions. Web pages currently restrict it to the document context. Check the docs before you ship to a service worker.
What model size am I actually getting? Gemini Nano is in the 2-4B parameter range, quantized to fit. Google has not committed to a specific size; expect it to grow.
Does it support function calling? No on the stable channel. The Origin Trial branch has experimental tool support; do not depend on it for production.
How do I test it in an automated CI? You cannot run the on-device model in headless Chromium yet. Mock the LanguageModel global in tests and run the cloud fallback path in CI.
Is it free for commercial use? Yes. There is no per-call billing. You bear the storage cost on the user’s device (around 2 GB) and Chrome handles updates.
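For the CI question above, a test double can be quite small. The sketch below is a fake, not Chrome's implementation: the name installMockLanguageModel is hypothetical, and it mirrors only the calls used in this guide (availability, create, prompt, promptStreaming).

```javascript
// Installs a minimal fake LanguageModel on `target` (for example globalThis
// in a test runner) and returns a log of the calls made against it.
function installMockLanguageModel(
  target,
  { status = 'available', reply = 'mock reply' } = {}
) {
  const calls = [];
  target.LanguageModel = {
    async availability() {
      return status;
    },
    async create(options = {}) {
      calls.push({ type: 'create', options });
      return {
        async prompt(text) {
          calls.push({ type: 'prompt', text });
          return reply;
        },
        // Yields the reply word by word to exercise streaming consumers.
        async *promptStreaming(text) {
          calls.push({ type: 'promptStreaming', text });
          for (const word of reply.split(' ')) yield word + ' ';
        },
      };
    },
  };
  return calls;
}
```

Passing status: 'unavailable' lets the same test suite drive the cloud fallback branch, and the returned call log makes it easy to assert on the system prompt your code sent.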
For teams already running cloud-side LLM workflows alongside this, the What is GPT-5.5 post covers the cloud-side trade-offs in more depth, and Apidog handles the mock and fallback wiring without a separate testing tool.
