Hey, AI enthusiasts! Buckle up, because OpenAI just dropped a bombshell with their new open-weight model, GPT-OSS-120B, and it’s turning heads in the AI community. Released under the Apache 2.0 license, this powerhouse is designed for reasoning, coding, and agentic tasks, all while running on a single GPU. In this guide, we’ll dive into what makes GPT-OSS-120B special, its stellar benchmarks, affordable pricing, and how you can use it via the OpenRouter API. Let’s explore this open-weight gem and get you coding with it in no time!
Want an integrated, all-in-one platform for your developer team to work together with maximum productivity?
Apidog meets all your demands and replaces Postman at a much more affordable price!
What Is GPT-OSS-120B?
OpenAI’s GPT-OSS-120B is a 117-billion-parameter language model (with 5.1 billion active parameters per token) that’s part of their new open-weight GPT-OSS series, alongside the smaller GPT-OSS-20B. Released on August 5, 2025, it’s a Mixture-of-Experts (MoE) model optimized for efficiency, running on a single NVIDIA H100 GPU or even consumer hardware with MXFP4 quantization. It’s built for tasks like complex reasoning, code generation, and tool use, with a massive 128K-token context window (think 300–400 pages of text!). Under the Apache 2.0 license, you can customize, deploy, or even commercialize it, making it a dream for developers and businesses craving control and privacy.
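Quick sanity check on that “300–400 pages” figure: token counts convert to pages with simple arithmetic. A rough sketch (the ~0.75 words-per-token and ~300 words-per-page ratios are common rules of thumb, not official OpenAI numbers):

```javascript
// Back-of-the-envelope: how many pages fit in a 128K-token context window?
const contextTokens = 128_000;
const wordsPerToken = 0.75; // rule-of-thumb ratio for English text (assumption)
const wordsPerPage = 300;   // typical manuscript page (assumption)

const pages = (contextTokens * wordsPerToken) / wordsPerPage;
console.log(`~${Math.round(pages)} pages`); // ~320 pages
```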

Benchmarks: How Does GPT-OSS-120B Stack Up?
GPT-OSS-120B is no slouch when it comes to performance. OpenAI’s benchmarks show it’s a serious contender against proprietary models like their own o4-mini and even Claude 3.5 Sonnet. Here’s the lowdown:
- Reasoning Power: It scores 94.2% on MMLU (Massive Multitask Language Understanding), just shy of GPT-4’s 95.1%, and nails 96.6% on AIME math competitions, outperforming many closed models.
- Coding Prowess: On Codeforces, it boasts a 2622 Elo rating, and it achieves an 87.3% pass rate on HumanEval for code generation, making it a coder’s best friend.
- Health and Tool Use: It surpasses o4-mini on HealthBench for health-related queries and excels in agentic tasks like TauBench, thanks to its chain-of-thought (CoT) reasoning and tool-calling capabilities.

- Speed: On an H100 GPU, it processes 45 tokens per second, with providers like Cerebras hitting up to 3,000 tokens/sec for high-volume needs. OpenRouter delivers ~500 tokens/sec, outpacing many closed models.
These stats put GPT-OSS-120B at near-parity with top-tier proprietary models while remaining open and customizable. It’s a beast for math, coding, and general problem-solving, with safety baked in through adversarial fine-tuning to keep risks low.
Pricing: Affordable and Transparent
One of the best parts about GPT-OSS-120B? It’s cost-effective, especially compared to proprietary models. Here’s how it breaks down across major providers, based on recent data for a 131K context window:
- Local Deployment: Run it on your own hardware (e.g., an H100 GPU or 80GB VRAM setup) for zero API costs. A GMKTEC EVO-X2 setup costs ~€2000 and uses less than 200W, perfect for small companies prioritizing privacy.
- Baseten: $0.10/M input tokens, $0.50/M output tokens. Latency: 0.20s, Throughput: 491.1 tokens/sec. Max output: 131K tokens.
- Fireworks: $0.15/M input, $0.60/M output. Latency: 0.56s, Throughput: 258.9 tokens/sec. Max output: 33K tokens.
- Together: $0.15/M input, $0.60/M output. Latency: 0.28s, Throughput: 131.1 tokens/sec. Max output: 131K tokens.
- Parasail: $0.15/M input, $0.60/M output (FP4 quantization). Latency: 0.40s, Throughput: 94.3 tokens/sec. Max output: 131K tokens.
- Groq: $0.15/M input, $0.75/M output. Latency: 0.24s, Throughput: 1,065 tokens/sec. Max output: 33K tokens.
- Cerebras: $0.25/M input, $0.69/M output. Latency: 0.42s, Throughput: 1,515 tokens/sec. Max output: 33K tokens. Ideal for high-speed needs, hitting up to 3,000 tokens/sec in some setups.
With GPT-OSS-120B, you get high performance at a fraction of GPT-4’s cost (~$20.00/M tokens), with providers like Groq and Cerebras offering blazing-fast throughput for real-time applications.
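To see what those rates mean for a real workload, here’s a quick back-of-the-envelope calculator (a minimal sketch; the per-token rates are the figures quoted above, while the monthly volumes are made-up assumptions you should swap for your own):

```javascript
// Rough monthly cost estimate for GPT-OSS-120B at the per-million-token rates above.
const providers = {
  baseten: { input: 0.10, output: 0.50 },
  groq: { input: 0.15, output: 0.75 },
  cerebras: { input: 0.25, output: 0.69 },
};

const monthlyInputTokens = 50_000_000;  // assumed workload: 50M input tokens
const monthlyOutputTokens = 10_000_000; // assumed workload: 10M output tokens

for (const [name, rate] of Object.entries(providers)) {
  const cost =
    (monthlyInputTokens / 1e6) * rate.input +
    (monthlyOutputTokens / 1e6) * rate.output;
  console.log(`${name}: $${cost.toFixed(2)}/month`);
}
// baseten: $10.00/month, groq: $15.00/month, cerebras: $19.40/month
```

Even at the priciest rates above, that’s a rounding error next to the same workload at GPT-4-class API pricing.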
How to Use GPT-OSS-120B with Cline via OpenRouter
Want to harness the power of GPT-OSS-120B for your coding projects? Claude Desktop and Claude Code are locked to Anthropic’s ecosystem and don’t support OpenAI models like GPT-OSS-120B, but you can easily use this model with Cline, a free, open-source VS Code extension, via the OpenRouter API. Cursor, meanwhile, has recently restricted its Bring Your Own Key (BYOK) option for non-Pro users, locking features like Agent and Edit modes behind a $20/month subscription, which makes Cline the more flexible choice for BYOK users. Here’s how to set up GPT-OSS-120B with Cline and OpenRouter, step by step.
Step 1: Get an OpenRouter API Key
1. Sign up with OpenRouter:
- Visit openrouter.ai and create a free account using Google or GitHub.
2. Find GPT-OSS-120B:
- In the Models tab, search for “gpt-oss-120b” and select it.
3. Generate an API key:
- Go to the Keys section, click Create API Key, name it (e.g., “GPT-OSS-Cline”), and copy it. Save it securely.
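Before wiring the key into an editor, you can sanity-check it with a direct call to OpenRouter’s OpenAI-compatible chat completions endpoint. A minimal Node 18+ sketch (save as test.mjs, run with node test.mjs, and paste in your own key):

```javascript
// Smoke test: send one prompt to GPT-OSS-120B through OpenRouter.
const OPENROUTER_API_KEY = "sk-or-..."; // the key you just created

const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${OPENROUTER_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "openai/gpt-oss-120b",
    messages: [{ role: "user", content: "Say hello in five words." }],
  }),
});

const data = await res.json();
console.log(data.choices?.[0]?.message?.content ?? data);
```

If you get a short greeting back, the key and model are good to go.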

Step 2: Use Cline in VS Code with BYOK
For unrestricted BYOK access, Cline (an open-source VS Code extension) is a fantastic Cursor alternative. It supports GPT-OSS-120B via OpenRouter without feature lockouts. Here’s how to set it up:
1. Install Cline:
- Open VS Code (code.visualstudio.com).
- Go to the Extensions panel (Ctrl+Shift+X or Cmd+Shift+X).
- Search for “Cline” and install it (by nickbaumann98, github.com/cline/cline).

2. Configure OpenRouter:
- Open the Cline panel (click the Cline icon in the Activity Bar).
- Click the gear icon in the Cline panel.
- Select OpenRouter as the provider.
- Paste your OpenRouter API key.
- Choose `openai/gpt-oss-120b` as the model.

3. Save and Test:
- Save settings. In the Cline chat panel, try: “Generate a JavaScript function to parse JSON data.”
- Expect a response like:
```javascript
function parseJSON(data) {
  try {
    return JSON.parse(data);
  } catch (e) {
    console.error("Invalid JSON:", e.message);
    return null;
  }
}
```
- Test codebase queries: “Summarize src/api/server.js”. Cline will analyze your project and return a summary, leveraging GPT-OSS-120B’s 128K context window.
Why Cline Over Cursor or Claude?
- No Claude Integration: Claude Desktop and Claude Code are locked to Anthropic’s models (e.g., Claude 3.5 Sonnet) and don’t support OpenAI models like GPT-OSS-120B due to ecosystem restrictions.
- Cursor’s BYOK Restrictions: Cursor’s recent ban on BYOK for non-Pro users means you can’t access Agent or Edit modes without a $20/month subscription, even with a valid OpenRouter API key. Cline has no such limits, offering full feature access for free with your API key.
- Privacy and Control: Cline sends requests directly to OpenRouter, bypassing third-party servers (unlike Cursor’s AWS routing), enhancing privacy.
Troubleshooting Tips
- Invalid API Key? Verify your key in OpenRouter’s dashboard and ensure it’s active.
- Model Not Available? Check OpenRouter’s model list for `openai/gpt-oss-120b` (see the snippet after this list). If missing, try providers like Fireworks AI or contact OpenRouter support.
- Slow Responses? Ensure your internet is stable. For faster performance, consider lighter models like GPT-OSS-20B.
- Cline Errors? Update Cline via the Extensions panel and check logs in VS Code’s Output panel.
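For the “model not available” case, you can also query OpenRouter’s public model list directly and check whether `openai/gpt-oss-120b` is currently served (a small sketch against the GET /api/v1/models endpoint, which needs no API key):

```javascript
// Check whether any gpt-oss model is listed on OpenRouter right now.
const res = await fetch("https://openrouter.ai/api/v1/models");
const { data } = await res.json();

const matches = data.filter((m) => m.id.includes("gpt-oss")).map((m) => m.id);
console.log(matches.length ? matches : "No gpt-oss models listed.");
```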
Why Use GPT-OSS-120B?
The GPT-OSS-120B model is a game-changer for developers and businesses, offering a compelling mix of performance, flexibility, and cost-efficiency. Here’s why it stands out:
- Open-Source Freedom: Licensed under Apache 2.0, you can fine-tune, deploy, or commercialize GPT-OSS-120B without restrictions, giving you full control over your AI workflows.
- Cost Savings: Run it locally on a single H100 GPU or consumer hardware (80GB VRAM) for zero API costs. Through OpenRouter’s providers, pricing is highly competitive, roughly $0.10–$0.25/M input tokens and $0.50–$0.75/M output tokens (see the breakdown above), a fraction of GPT-4’s ~$20.00/M tokens and up to 90% savings for heavy users.
- Performance: It achieves near-parity with OpenAI’s o4-mini, scoring 94.2% on MMLU, 96.6% on AIME math, and 87.3% on HumanEval for coding. Its 128K token context window (300–400 pages) handles massive codebases or documents with ease.

- Chain-of-Thought (CoT) Reasoning: The model’s full CoT transparency lets you see its step-by-step reasoning, making it easier to debug outputs and detect biases or errors. You can adjust reasoning effort (low, medium, high) via system prompts (e.g., “Reasoning: high”) for tasks like complex math or coding, balancing speed and depth; see the first sketch after this list. This unsupervised CoT design helps researchers monitor model behavior without direct supervision, enhancing trust and safety.

- Agentic Capabilities: Native support for tool use, like web browsing and Python code execution, makes it ideal for agentic workflows. It can chain multiple tool calls (e.g., 28 consecutive web searches in a demo) for complex tasks like data aggregation or automation; a tool-calling sketch follows this list.
- Privacy: Host it on-premises (e.g., via Dell Enterprise Hub) for complete data control, perfect for enterprises or privacy-conscious users.
- Flexibility: Compatible with OpenRouter, Fireworks AI, Cerebras, and local setups like Ollama or LM Studio, it runs on diverse hardware, from RTX GPUs to Apple Silicon.
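To make the CoT and agentic bullets concrete, here are two minimal sketches. First, reasoning effort: the level is set through the system prompt as described above; the exact “Reasoning: high” phrasing follows OpenAI’s model card convention, but treat it as an assumption to verify against the current docs:

```javascript
// Ask for deeper step-by-step reasoning by raising the effort level in the system prompt.
const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "openai/gpt-oss-120b",
    messages: [
      { role: "system", content: "Reasoning: high" }, // low | medium | high
      { role: "user", content: "Prove that the sum of two odd integers is even." },
    ],
  }),
});
console.log((await res.json()).choices[0].message.content);
```

Second, tool calling: the model speaks the standard OpenAI-style tools format, so an agentic loop starts with a function declaration like this (getWeather is a hypothetical function you’d implement and execute yourself):

```javascript
// Declare a tool; the model replies with a tool_call that your code runs and feeds back.
const body = {
  model: "openai/gpt-oss-120b",
  messages: [{ role: "user", content: "What's the weather in Berlin?" }],
  tools: [
    {
      type: "function",
      function: {
        name: "getWeather", // hypothetical: you implement and run this yourself
        description: "Get the current weather for a city",
        parameters: {
          type: "object",
          properties: { city: { type: "string" } },
          required: ["city"],
        },
      },
    },
  ],
};
// POST `body` to https://openrouter.ai/api/v1/chat/completions as in the earlier sketch;
// the response's choices[0].message.tool_calls contains the call to execute.
```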
Community buzz on X highlights its speed (up to 1,515 tokens/sec on Cerebras) and coding prowess, with developers loving its ability to handle multi-file projects and its open-weight nature for customization. Whether you’re building AI agents or fine-tuning for niche tasks, GPT-OSS-120B delivers unmatched value.
Conclusion
OpenAI’s GPT-OSS-120B is a revolutionary open-weight model, blending top-tier performance with cost-effective deployment. Its benchmarks rival proprietary models, its pricing is wallet-friendly, and it’s easy to integrate with Cline via OpenRouter’s API. Whether you’re coding, debugging, or reasoning through complex problems, this model delivers. Try it out, experiment with its 128K context window, and let us know your cool use cases in the comments. I’m all ears!
For more details, check out the repo at github.com/openai/gpt-oss or OpenAI’s announcement at openai.com.