TL;DR / Quick Answer
GPT-5.4 is OpenAI's most advanced frontier model for professional work, released March 5, 2026. It combines industry-leading coding capabilities from GPT-5.3-Codex with enhanced reasoning, computer use, and tool integration. The model achieves 83% win rate on knowledge work tasks, 75% on computer use benchmarks, and uses significantly fewer tokens than GPT-5.2. Available via API at $2.50/M input tokens and $15/M output tokens, with Pro version ($30/$180) for complex tasks.
Introduction
OpenAI just raised the bar for AI-powered professional work. On March 5, 2026, they released GPT-5.4, a model that delivers 83% win rates against industry professionals on real-world knowledge work tasks while using significantly fewer tokens than its predecessor.
If you've worked with AI models that hallucinate facts, struggle with complex workflows, or burn through tokens on simple tasks, GPT-5.4 addresses these pain points directly. It's 33% less likely to make factual errors and completes computer-use tasks 3x faster than earlier models.
This guide breaks down what GPT-5.4 actually does, how it compares to previous versions, and whether the performance gains justify the higher token costs. You'll get specific benchmark data, real performance comparisons, and clear guidance on which GPT-5.4 variant fits your use case.
What you'll learn:
- Exact performance improvements over GPT-5.2 and GPT-5.3-Codex
- Benchmark scores across coding, computer use, and knowledge work
- New computer use and vision capabilities with real examples
- Pricing breakdown and when to use Pro vs standard
- Integration considerations for API developers
What Is GPT-5.4?
GPT-5.4 represents OpenAI's first general-purpose model with native computer use capabilities. It merges the coding excellence of GPT-5.3-Codex with enhanced reasoning, visual perception, and tool integration into a single frontier model.

The model targets three core professional scenarios:
Knowledge work - Creating spreadsheets, presentations, documents, and analysis across 44 occupations. GPT-5.4 matches or exceeds industry professionals in 83% of comparisons on GDPval, up from 70.9% for GPT-5.2.
Computer use and agents - Operating computers through mouse/keyboard commands, browser automation, and multi-step workflows across applications. Achieves 75% success rate on OSWorld-Verified, surpassing human performance at 72.4%.
Coding and development - Writing, debugging, and iterating on code with state-of-the-art performance on SWE-Bench Pro (57.7%) while supporting up to 1M token context windows for complex codebases.
GPT-5.4 comes in two variants:
- GPT-5.4 - Standard model for most professional tasks
- GPT-5.4 Pro - Maximum performance on complex reasoning tasks ($30/M input, $180/M output)
Key Improvements Over GPT-5.2
GPT-5.4 isn't an incremental update. OpenAI made substantial gains across four critical areas.
1. Factual Accuracy and Hallucination Reduction
False claims dropped 33% at the individual claim level. Full responses contain 18% fewer errors overall. This matters when you're generating legal documents, financial models, or technical documentation where a single hallucinated fact can derail an entire project.
2. Token Efficiency
GPT-5.4 uses significantly fewer tokens to solve problems compared to GPT-5.2. In tool-heavy workflows with MCP Atlas benchmarks, token usage dropped 47% while maintaining accuracy. For high-volume API users, this efficiency gain offsets the higher per-token pricing.
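Whether the efficiency gain actually offsets the higher rate depends on your token mix. The sketch below uses the published per-million prices and treats the 47% MCP Atlas reduction as if it applied uniformly to a hypothetical workload; real savings will vary by task.

```python
# Illustrative break-even estimate: does GPT-5.4's token efficiency
# offset its higher per-token price? Prices (dollars per million tokens)
# are from the article; the 47% reduction is the MCP Atlas tool-heavy
# figure and is assumed, not guaranteed, for other workloads.

def workflow_cost(input_tokens: int, output_tokens: int,
                  input_price: float, output_price: float) -> float:
    """Cost in dollars for one workflow run."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# A hypothetical tool-heavy run on GPT-5.2.
gpt52 = workflow_cost(100_000, 20_000, input_price=1.75, output_price=14.0)

# Same task on GPT-5.4, assuming the ~47% token reduction applies uniformly.
gpt54 = workflow_cost(int(100_000 * 0.53), int(20_000 * 0.53),
                      input_price=2.50, output_price=15.0)

print(f"GPT-5.2: ${gpt52:.3f}  GPT-5.4: ${gpt54:.3f}")
```

Under these assumptions GPT-5.4 comes out cheaper per run despite the higher rates; for workloads without heavy tool use, the gap narrows or reverses.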
3. Computer Use Capabilities
Previous models required separate specialized models for computer use. GPT-5.4 handles this natively:
- Issues mouse and keyboard commands from screenshots
- Automates browsers via Playwright
- Navigates desktop environments through coordinate-based interactions
- Supports custom safety policies and confirmation requirements
4. Tool Search and Integration
Tool search eliminates the need to load thousands of tool definitions into every request. The model looks up tool definitions on-demand, reducing upfront token costs and enabling work with ecosystems containing tens of thousands of tools.
On the Toolathlon benchmark, GPT-5.4 achieves 54.6% accuracy compared to 45.7% for GPT-5.2, while requiring fewer tool yields (a proxy for latency).
GPT-5.4 Performance Benchmarks
Benchmark data shows where GPT-5.4 excels and where earlier models remain competitive.
Knowledge Work (GDPval)
| Model | Win Rate vs Professionals |
|---|---|
| GPT-5.4 | 83.0% |
| GPT-5.4 Pro | 82.0% |
| GPT-5.2 Pro | 74.1% |
| GPT-5.2 | 70.9% |
GDPval tests well-specified knowledge work across 44 occupations from the top 9 industries contributing to US GDP. Tasks include sales presentations, accounting spreadsheets, urgent care schedules, manufacturing diagrams, and short videos.
Spreadsheet and Document Creation
On internal investment banking modeling tasks:
- GPT-5.4: 87.3% mean score
- GPT-5.2: 68.4% mean score
For presentation evaluation, human raters preferred GPT-5.4 outputs 68% of the time due to stronger aesthetics, greater visual variety, and more effective image generation use.
Coding Performance (SWE-Bench Pro)
| Model | Accuracy | Estimated Latency |
|---|---|---|
| GPT-5.4 | 57.7% | ~1000s |
| GPT-5.3-Codex | 56.8% | ~1200s |
| GPT-5.2 | 55.6% | ~1500s |

GPT-5.4 matches or exceeds GPT-5.3-Codex on SWE-Bench Pro while delivering lower latency across reasoning efforts. The /fast mode in Codex delivers up to 1.5x faster token velocity with GPT-5.4.
Computer Use (OSWorld-Verified)
OSWorld-Verified measures success at navigating desktop environments through screenshots and keyboard/mouse actions:
- GPT-5.4: 75.0%
- GPT-5.3-Codex: 74.0% (with API parameter preserving original image resolution)
- GPT-5.2: 47.3%
- Human performance: 72.4%
This benchmark tests real desktop workflows: email and calendar management, bulk data entry, file operations, and cross-application tasks.
Web Browsing (BrowseComp)
BrowseComp tests persistent web research to find hard-to-locate information:
- GPT-5.4 Pro: 89.3%
- GPT-5.4: 82.7%
- GPT-5.2 Pro: 77.9%
- GPT-5.2: 65.8%
The roughly 17-point absolute improvement over GPT-5.2 (65.8% to 82.7%) reflects better synthesis of multi-source information and more persistent search strategies.
Visual Understanding
MMMU Pro (no tools) - Tests visual understanding and reasoning:
- GPT-5.4: 81.2%
- GPT-5.2: 79.5%
OmniDocBench - Document parsing accuracy (lower error = better):
- GPT-5.4: 0.109 normalized edit distance
- GPT-5.2: 0.140 normalized edit distance
Computer Use and Vision Capabilities
GPT-5.4's computer use capabilities warrant detailed examination. This is the first general-purpose OpenAI model that can operate computers natively.
How Computer Use Works
The model interprets screenshots of browser or desktop interfaces and responds with:
- Coordinate-based clicking on UI elements
- Keyboard input for text entry
- Playwright commands for browser automation
- Mouse movements and drag operations
Developers configure behavior through system messages, adjusting safety policies and confirmation requirements based on risk tolerance.
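OpenAI's actual action schema isn't reproduced here; the sketch below uses hypothetical dataclasses purely to illustrate the screenshot-in, action-out loop and how a custom confirmation policy might sit on top of it.

```python
# Hypothetical sketch of a screenshot-in / action-out computer-use loop.
# The dataclass names and fields are illustrative, not OpenAI's schema.
from dataclasses import dataclass
from typing import Union

@dataclass
class Click:
    x: int  # pixel coordinates on the screenshot
    y: int

@dataclass
class TypeText:
    text: str

@dataclass
class Done:
    summary: str

Action = Union[Click, TypeText, Done]

def requires_confirmation(action: Action,
                          risky_words=("delete", "pay")) -> bool:
    """Example of a custom safety policy: pause before risky text entry."""
    return isinstance(action, TypeText) and any(
        w in action.text.lower() for w in risky_words)

# A model step might decode to a sequence like this:
steps: list[Action] = [Click(412, 88),
                       TypeText("quarterly report"),
                       Done("search submitted")]
print([requires_confirmation(s) for s in steps])
```

The point of the policy hook is that risk tolerance lives in your code, not the model: the same action stream can be auto-executed in a sandbox or gated behind user confirmation in production.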
Real-World Computer Use Example
Mainstay tested GPT-5.4 across approximately 30,000 HOA and property tax portals:
- GPT-5.4: 95% first-attempt success, 100% within three attempts
- Previous CUA models: 73-79% success rate
- Session completion: 3x faster with GPT-5.4
- Token usage: 70% fewer tokens per session
The model navigates portal interfaces, extracts data from varied UI layouts, handles authentication flows, and manages edge cases like captchas or multi-step forms.
Enhanced Visual Perception
GPT-5.4 introduced original image input detail level supporting:
- Up to 10.24M total pixels
- 6000-pixel maximum dimension
- Full-fidelity perception for dense, high-resolution images
The high detail level supports up to 2.56M total pixels or 2048-pixel maximum dimension. Early API user testing showed strong gains in localization ability, image understanding, and click accuracy with original or high detail settings.
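As a quick pre-upload sanity check, the stated limits can be encoded directly; the limit values come from the figures above, while the helper function itself is our own.

```python
# Check whether an image fits the stated limits for each detail level.
# Pixel and dimension limits are from the article; the helper is ours.

LIMITS = {
    "original": {"max_pixels": 10_240_000, "max_dim": 6000},
    "high":     {"max_pixels": 2_560_000,  "max_dim": 2048},
}

def fits_detail_level(width: int, height: int, level: str) -> bool:
    lim = LIMITS[level]
    return (width * height <= lim["max_pixels"]
            and max(width, height) <= lim["max_dim"])

# A 4000x2500 screenshot (10.0M pixels) fits "original" but not "high".
print(fits_detail_level(4000, 2500, "original"))
print(fits_detail_level(4000, 2500, "high"))
```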
Document Parsing Improvements
Better visual perception translates to document handling. GPT-5.4 parses:
- Multi-page PDFs with tables and figures
- Scanned documents with varied layouts
- Screenshots containing text and UI elements
- Technical diagrams and charts
The 22% improvement on OmniDocBench (0.140 to 0.109 error rate) reflects this capability.
Coding and Development Features
GPT-5.4 inherits GPT-5.3-Codex's coding excellence while adding computer use for integrated development workflows.
Frontend Development
Internal evaluations found GPT-5.4 excels at complex frontend tasks with noticeably more aesthetic and functional results than previous models. The experimental Playwright Interactive skill in Codex demonstrates this:
Example: Theme Park Simulation
A single prompt generated an isometric theme park simulation with:
- Tile-based path placement
- Ride and scenery construction
- Guest pathfinding and queueing
- Park metrics (money, guests, happiness, cleanliness)
- Browser playtesting via Playwright automation
- Image generation for isometric assets
The model built the game, then used Playwright to automate playtests, verifying placement, navigation, guest reactions, and UI stability across multiple rounds.
Fast Mode for Developers
GPT-5.4 in Codex supports /fast mode delivering up to 1.5x faster token velocity. API developers access equivalent speeds through priority processing. This maintains the same intelligence while reducing iteration time during debugging and development.
Context Window Support
GPT-5.4 Codex includes experimental 1M token context window support. Configure via:
- model_context_window parameter
- model_auto_compact_token_limit parameter
Requests exceeding the standard 272K context count against usage limits at 2x the normal rate. This enables analysis of entire codebases, large documentation sets, or multi-file projects in a single request.
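The exact billing mechanics aren't spelled out beyond "2x the normal rate"; one plausible reading, that a request exceeding 272K tokens counts double in its entirety, can be sketched as follows. Treat this as an interpretation, not the documented rule.

```python
# Sketch of usage accounting for extended-context requests, based on the
# stated rule that requests exceeding the standard 272K context count
# against usage limits at 2x the normal rate. The whole-request-doubled
# interpretation is ours; check billing docs for the exact mechanics.

STANDARD_CONTEXT = 272_000

def billed_usage(tokens: int) -> int:
    """Tokens counted against usage limits for a single request."""
    return tokens * 2 if tokens > STANDARD_CONTEXT else tokens

print(billed_usage(200_000))  # within the standard window
print(billed_usage(800_000))  # extended context: counted double
```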
Apidog for API Documentation: When working with large codebases and API integrations, keep your API documentation synchronized with implementation. Apidog can import OpenAPI/Swagger specs, generate interactive documentation, and sync with your codebase to ensure API docs stay current as you integrate GPT-5.4 features.

Tool Integration and Search
Tool search represents a fundamental shift in how models interact with external tools and MCP servers.
How Tool Search Works
Previous approach: All tool definitions loaded into every request upfront. For systems with many tools, this added thousands to tens of thousands of tokens, increasing costs and slowing responses.
Tool search approach: Model receives a lightweight list of available tools. When needed, it looks up specific tool definitions and appends them to the conversation at that moment.
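The on-demand pattern amounts to a two-phase lookup: a lightweight listing upfront, full definitions fetched only when referenced. The registry contents and function names below are illustrative, not the actual API surface.

```python
# Illustrative two-phase tool lookup: ship lightweight names first,
# fetch full definitions only when the model asks for one.
# Registry contents and function names are hypothetical.

TOOL_REGISTRY = {
    "search_crm": {"description": "Search CRM records",
                   "parameters": {"query": "string"}},
    "send_email": {"description": "Send an email",
                   "parameters": {"to": "string", "body": "string"}},
    # ...imagine tens of thousands more entries...
}

def list_tools() -> list[str]:
    """Lightweight listing sent upfront instead of full definitions."""
    return sorted(TOOL_REGISTRY)

def lookup_tool(name: str) -> dict:
    """Full definition appended to the conversation only when needed."""
    return TOOL_REGISTRY[name]

# Upfront cost is just the names; the full schema arrives on demand.
print(list_tools())
print(lookup_tool("send_email")["parameters"])
```

The upfront payload scales with the number of tool *names* rather than the size of every schema, which is what makes ecosystems of tens of thousands of tools tractable.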
Token Savings Example
Scale's MCP Atlas benchmark tested 250 tasks with all 36 MCP servers enabled:

Token breakdown without tool search:
- 65,320 upfront input tokens (tool definitions)
- Additional tokens from tool outputs
- Output tokens
Tool search eliminates the upfront cost while preserving cache efficiency.
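At standard input pricing, those 65,320 upfront tokens translate into a concrete per-request cost. The rough estimate below ignores caching discounts and the much smaller cost of the lightweight tool-name list.

```python
# Rough per-request cost of the upfront tool definitions in the MCP Atlas
# setup, at GPT-5.4's standard input rate. Ignores caching discounts and
# the (much smaller) cost of the lightweight tool-name list.
UPFRONT_TOOL_TOKENS = 65_320
INPUT_PRICE_PER_M = 2.50  # dollars per million input tokens

per_request = UPFRONT_TOOL_TOKENS * INPUT_PRICE_PER_M / 1_000_000
print(f"${per_request:.4f} per request")
print(f"${per_request * 10_000:.0f} per 10k requests")
```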
MCP Atlas Performance
On MCP Atlas benchmark (250 tasks, 36 MCP servers):
- GPT-5.4: 67.2% accuracy
- GPT-5.2: 60.6% accuracy
The model works with larger tool ecosystems without sacrificing accuracy or overwhelming context windows.
Agentic Tool Calling
Toolathlon benchmark tests multi-step tool workflows (reading emails, extracting attachments, uploading files, grading, recording results):

Tool yields (waiting for tool responses) better reflect latency than tool call counts because they capture parallelization benefits. GPT-5.4 completes tasks in fewer rounds.
GPT-5.4 vs GPT-5.3-Codex vs GPT-5.2
Choosing between models depends on your specific requirements.
When to Use GPT-5.4
- Computer use required - Native computer operation, browser automation
- Knowledge work - Spreadsheets, presentations, documents
- Tool-heavy workflows - MCP servers, external APIs, multi-step automation
- Cost-sensitive at scale - Token efficiency reduces total costs despite higher per-token pricing
- Long-context needs - Up to 1M tokens for complex codebases
When GPT-5.3-Codex Remains Competitive
- Pure coding tasks - Similar SWE-Bench Pro performance (56.8% vs 57.7%)
- Established Codex workflows - Existing integrations may not need computer use
- Cost optimization - If GPT-5.3-Codex pricing remains lower
When GPT-5.2 Suffices
- Simple queries - Basic Q&A, summarization, straightforward generation
- Budget constraints - Lower per-token costs ($1.75/$14 vs $2.50/$15)
- Non-agentic workflows - Single-turn requests without tool use
Pricing Comparison
| Model | Input Price | Cached Input | Output Price |
|---|---|---|---|
| GPT-5.2 | $1.75/M | $0.175/M | $14/M |
| GPT-5.4 | $2.50/M | $0.25/M | $15/M |
| GPT-5.2 Pro | $21/M | - | $168/M |
| GPT-5.4 Pro | $30/M | - | $180/M |
Batch and Flex pricing available at 50% of standard rates. Priority processing at 200% of standard rates.
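The table plus the tier multipliers can be wrapped into a small estimator. Prices are taken from the table above; the function and the model-key spellings are our own.

```python
# Cost estimator built from the pricing table above (dollars per million
# tokens). Batch/Flex is stated at 50% of standard, priority at 200%.

PRICES = {  # model: (input, output)
    "gpt-5.2":     (1.75, 14.0),
    "gpt-5.4":     (2.50, 15.0),
    "gpt-5.2-pro": (21.0, 168.0),
    "gpt-5.4-pro": (30.0, 180.0),
}
TIER_MULTIPLIER = {"batch": 0.5, "flex": 0.5, "standard": 1.0, "priority": 2.0}

def estimate_cost(model: str, input_tokens: int, output_tokens: int,
                  tier: str = "standard") -> float:
    inp, out = PRICES[model]
    mult = TIER_MULTIPLIER[tier]
    return mult * (input_tokens * inp + output_tokens * out) / 1_000_000

print(f"${estimate_cost('gpt-5.4', 1_000_000, 200_000):.2f}")           # standard
print(f"${estimate_cost('gpt-5.4', 1_000_000, 200_000, 'batch'):.2f}")  # half price
```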
Availability and Access Options
GPT-5.4 rolled out gradually starting March 5, 2026 across ChatGPT, Codex, and API.
ChatGPT Access
GPT-5.4 Thinking available to:
- ChatGPT Plus subscribers
- ChatGPT Team subscribers
- ChatGPT Pro subscribers
GPT-5.4 Pro available to:
- ChatGPT Pro subscribers
- ChatGPT Enterprise subscribers
Legacy access: GPT-5.2 Thinking remains available for three months under Legacy Models section, retiring June 5, 2026.
Enterprise and Education: Early access available via admin settings.
Codex Access
GPT-5.4 is the default model in Codex with:
- Experimental 1M context window support
- Playwright Interactive skill for browser playtesting
- /fast mode for 1.5x token velocity
API Access
Model names:
- gpt-5.4 - Standard model
- gpt-5.4-pro - Pro model for complex tasks
Context windows:
- Standard: 272K tokens
- Extended: Up to 1M tokens (experimental, 2x usage rate)
Pricing:
- Standard: $2.50/M input, $0.25/M cached input, $15/M output
- Pro: $30/M input, $180/M output
- Batch/Flex: 50% discount
- Priority: 2x standard rate
Deprecation Timeline
GPT-5.2 Thinking retires June 5, 2026. Migrate workflows before this date to avoid disruption.
Conclusion
GPT-5.4 delivers measurable improvements across knowledge work, computer use, and coding tasks. The 83% GDPval win rate, 75% OSWorld-Verified score, and 57.7% SWE-Bench Pro accuracy establish it as the new state of the art for professional AI workflows.
For developers integrating GPT-5.4 into applications, having robust API testing and debugging tools becomes essential. Apidog streamlines the integration process with unified API design, debugging, testing, and documentation capabilities. Whether you're building AI agents, automating workflows, or creating customer-facing features powered by GPT-5.4, Apidog helps ensure your API integrations work correctly from day one.
Key takeaways:
- 33% reduction in false claims and 18% fewer response errors
- 47% token reduction in tool-heavy workflows
- 75% computer use success rate, surpassing human baseline
- Native computer operation via mouse/keyboard commands
- Tool search enables work with tens of thousands of tools
- 1M token context window for complex codebases
- Available at $2.50/$15 per million tokens (standard variant)
When to adopt:
- You need computer use or browser automation
- Token efficiency matters for high-volume workflows
- Factual accuracy is critical (legal, financial, technical)
- You work with large tool ecosystems or MCP servers
- Long-context analysis of codebases or documents
When to wait:
- Simple Q&A workflows don't benefit from new capabilities
- Budget constraints prioritize lowest per-token costs
- Existing GPT-5.2 or GPT-5.3-Codex workflows perform adequately
GPT-5.4 represents OpenAI's most efficient reasoning model to date. The combination of reduced hallucinations, improved token efficiency, and native computer use capabilities justifies the higher per-token pricing for professional applications.
FAQ
What is the difference between GPT-5.4 and GPT-5.2?
GPT-5.4 achieves 83% win rate on knowledge work vs 70.9% for GPT-5.2, uses significantly fewer tokens, has native computer use capabilities, and reduces factual errors by 33%. Pricing is higher ($2.50/$15 vs $1.75/$14) but total costs may be lower due to efficiency gains.
How much does GPT-5.4 API cost?
GPT-5.4 costs $2.50 per million input tokens, $0.25 per million cached input tokens, and $15 per million output tokens. GPT-5.4 Pro costs $30/M input and $180/M output. Batch and Flex pricing offer 50% discounts.
Does GPT-5.4 have a context window limit?
Standard context window is 272K tokens. Experimental 1M token context window support is available in Codex by configuring model_context_window and model_auto_compact_token_limit parameters. Requests exceeding 272K count at 2x usage rate.
What is GPT-5.4 Pro used for?
GPT-5.4 Pro targets maximum performance on complex reasoning tasks. It scores higher on benchmarks like BrowseComp (89.3% vs 82.7%), though it trails the standard model slightly on GDPval (82.0% vs 83.0%), and costs 12x more ($30/$180 vs $2.50/$15).
When did GPT-5.4 release?
GPT-5.4 released March 5, 2026, rolling out gradually across ChatGPT, Codex, and API. GPT-5.2 Thinking remains available until June 5, 2026 for migration.
Can GPT-5.4 use computers and browsers?
Yes. GPT-5.4 is OpenAI's first general-purpose model with native computer use capabilities. It issues mouse/keyboard commands, automates browsers via Playwright, and navigates desktop environments through screenshot interpretation.
What is tool search in GPT-5.4?
Tool search allows the model to look up tool definitions on-demand instead of loading all definitions upfront. This reduces token usage by 47% in tool-heavy workflows and enables work with ecosystems containing tens of thousands of tools.
How does GPT-5.4 compare to GPT-5.3-Codex for coding?
GPT-5.4 matches or exceeds GPT-5.3-Codex on SWE-Bench Pro (57.7% vs 56.8%) while offering lower latency and adding computer use capabilities. It's the recommended choice for new development workflows.
Is GPT-5.4 available in ChatGPT?
Yes. GPT-5.4 Thinking is available to Plus, Team, and Pro subscribers. GPT-5.4 Pro is available to Pro and Enterprise plans. GPT-5.2 Thinking remains available under Legacy Models until June 5, 2026.
What are the safety considerations for GPT-5.4?
GPT-5.4 is treated as High cyber capability under OpenAI's Preparedness Framework. Protections include expanded cyber safety stack, monitoring systems, trusted access controls, and asynchronous blocking for higher-risk requests on Zero Data Retention surfaces. Some false positives may occur as classifiers improve.



