Qwen 3.7 Plus: Alibaba's multimodal agent model, benchmarks and pricing

Alibaba shipped Qwen 3.7 Plus just a few days after Qwen3.7-Max. The short version: Plus is Max with eyes. It keeps the same 1M-token context and agentic backbone, adds image and video input, and lands at roughly a sixth of Max’s input price. If you’ve been following the family, our guide to what Qwen 3.7 is covers the text flagship; this post is about what the new Plus variant adds.

One thing to flag up front, because it changes who should care: Qwen 3.7 Plus is API-only and proprietary. There are no open weights, which breaks from Qwen’s open-source habit. We’ll get to what that means below. Since Plus ships only as an API, you’ll spend your time calling and debugging it; that’s where Apidog comes in, covered at the end.

The short answer

Qwen 3.7 Plus is the multimodal, budget-priced sibling of Qwen3.7-Max. Hand it a screenshot, a design mockup, or a video, and it reasons over them as a first-class input. It’s built for agents that drive graphical interfaces: it can look at an app screenshot and return exact pixel coordinates to click.

On pure text, Max still edges it slightly. On anything with a visual signal, Plus is the one you want, and it costs a fraction of Max either way. The only real downside is the closed weights.

What’s new versus Qwen 3.7 Max

Three changes matter.

It sees. Max is text-only. Plus accepts text, images, and video. That unlocks screenshot perception, document and PDF reading, and video understanding from a single model.

It grounds GUIs. Plus is positioned as a multimodal interactive agent that handles browser automation, GUI navigation, and hybrid GUI-plus-CLI workflows. It produces structured action plans like “click at (x=487, y=232),” which is what makes computer-use agents actually work.

It’s cheap. Plus runs at a budget tier well below Max.

Spec	Qwen 3.7 Plus	Qwen 3.7 Max
Input modalities	Text, image, video	Text only
Context window	1M tokens (shared with vision)	1M tokens
Input / output price per 1M tokens	$0.40 / $1.60	$2.50 / $7.50
Cached input price per 1M tokens	$0.08	$0.25
GUI grounding (ScreenSpot Pro)	79.0	N/A (text only)
Terminal-Bench	70.3	69.7
Autonomous run ceiling	35 hours	35 hours
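An agent harness has to turn an action plan like “click at (x=487, y=232)” into something it can execute safely. The sketch below is one way to do that, assuming the model's reply contains text in exactly that shape; the real output format may differ, so treat the regular expression, the `ClickAction` type, and `parse_click` as hypothetical names for illustration. It also rejects coordinates that fall outside the screenshot, which is a cheap guard against a bad grounding result.

```python
import re
from typing import NamedTuple, Optional


class ClickAction(NamedTuple):
    x: int
    y: int


# Assumed format: text such as "click at (x=487, y=232)".
# This pattern is illustrative, not a documented output schema.
_CLICK_RE = re.compile(
    r"click\s+at\s+\(\s*x\s*=\s*(\d+)\s*,\s*y\s*=\s*(\d+)\s*\)",
    re.IGNORECASE,
)


def parse_click(text: str, width: int, height: int) -> Optional[ClickAction]:
    """Return the first click found in text, or None if absent or off-screen."""
    match = _CLICK_RE.search(text)
    if match is None:
        return None
    x, y = int(match.group(1)), int(match.group(2))
    # Reject coordinates outside the screenshot bounds.
    if not (0 <= x < width and 0 <= y < height):
        return None
    return ClickAction(x, y)


action = parse_click("Plan: click at (x=487, y=232) to submit.", width=1280, height=800)
print(action)  # ClickAction(x=487, y=232)
```

Returning `None` rather than raising keeps the agent loop simple: the caller can re-prompt the model when no valid action comes back.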

Benchmarks

The launch numbers, backed up by early hands-on reviews, tell a consistent story: Plus matches or slightly trails Max on text, then pulls ahead the moment vision enters the picture.

ScreenSpot Pro: 79.0. This is the GUI-grounding test, the model’s ability to look at a screenshot and produce exact pixel coordinates. 79.0 is frontier-tier, and Max can’t run it at all.
Terminal-Bench: 70.3. Slightly ahead of Max’s 69.7, even with the added vision parameters.
SWE-Bench Pro: about 60%, essentially level with Max’s 60.6%.
MCP-Atlas: 76.4, a tie with Max on tool-use orchestration.
LM Arena: Plus sits a little behind Max on text (#15 vs #13) and coding (#12 vs #10). For pure-text work, Max keeps a small edge.

The pattern is clear. Pick Plus when the task carries a visual signal: a screenshot, a mockup, a chart. For a head-to-head on the text side, our Qwen 3.7 vs GPT-5.5 vs Opus 4.7 comparison covers where the family lands against the Western flagships. As always, benchmark numbers come from the vendor and early reviewers, so treat them as direction rather than gospel.

Pricing: the budget multimodal tier

Here’s where Plus gets interesting. At $0.40 input and $1.60 output per million tokens, it’s roughly six times cheaper than Max on input and nearly five times cheaper on output. Cached input drops to $0.08. You get vision and a 1M context for less than most text-only models charge.
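To see what those rates mean per request, here is a rough cost estimator using the prices quoted in this article. The billing model is an assumption: it treats cached tokens as a subset of input tokens billed at the cached rate, which may not match how Model Studio actually meters usage. The `PRICES` table and `request_cost` function are illustrative names.

```python
# Per-million-token prices in USD, taken from the table in this article.
PRICES = {
    "plus": {"input": 0.40, "cached": 0.08, "output": 1.60},
    "max": {"input": 2.50, "cached": 0.25, "output": 7.50},
}


def request_cost(model: str, input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Estimate the USD cost of one request.

    Assumption: cached_tokens is the part of input_tokens billed at the
    cached rate; the rest is billed at the normal input rate.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    p = PRICES[model]
    fresh_tokens = input_tokens - cached_tokens
    total = (fresh_tokens * p["input"]
             + cached_tokens * p["cached"]
             + output_tokens * p["output"])
    return total / 1_000_000


plus_cost = request_cost("plus", input_tokens=200_000, output_tokens=4_000)
max_cost = request_cost("max", input_tokens=200_000, output_tokens=4_000)
print(f"Plus ${plus_cost:.4f}  Max ${max_cost:.4f}  ratio {max_cost / plus_cost:.1f}x")
```

For an input-heavy request like this one, the ratio sits close to the six-times input gap; output-heavy workloads drift toward the smaller output gap.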

One caveat worth building into your cost model: images and video share that 1M-token budget. A high-resolution screenshot can burn thousands of tokens, and video frames add up fast, so your effective text headroom shrinks as the visual payload grows. Budget for it. For the wider context on why Chinese labs keep undercutting on price, see our breakdown of the 2026 Chinese LLM price war.
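A back-of-the-envelope budget check makes the trade-off concrete. The per-image and per-frame token counts below are placeholders, not published figures; measure real values from the token usage your API responses report. Reserving room for output is also an assumption about how the window is shared.

```python
CONTEXT_LIMIT = 1_000_000  # 1M-token window stated in this article


def text_headroom(image_count: int, tokens_per_image: int,
                  video_frames: int = 0, tokens_per_frame: int = 0,
                  reserved_output: int = 0) -> int:
    """Tokens left for text after visual inputs and reserved output space."""
    visual_tokens = image_count * tokens_per_image + video_frames * tokens_per_frame
    return max(CONTEXT_LIMIT - visual_tokens - reserved_output, 0)


# Placeholder token costs per image and per frame; replace with measured values.
remaining = text_headroom(
    image_count=40, tokens_per_image=3_000,
    video_frames=600, tokens_per_frame=500,
    reserved_output=16_000,
)
print(remaining)  # 564000
```

With these made-up numbers, visual input alone uses 420,000 tokens, so a long agent run that keeps appending screenshots can hit the ceiling well before the text history does.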

The catch: proprietary and API-only

Qwen built its enterprise traction on open weights. Much of the earlier Qwen line shipped under Apache 2.0 or open-use licenses, so teams could download, fine-tune, and run models inside air-gapped data centers. Qwen 3.7 Plus does not do that.

Plus is delivered strictly as a managed commercial API through Alibaba Cloud Model Studio. You can’t download the weights, you can’t self-host, and you can’t run it offline. For regulated or air-gapped environments, that’s a hard stop. An open-weight Plus variant has been floated for Q3 2026, but it isn’t confirmed, and the proprietary tier may stay closed. If open weights are a requirement, this model isn’t your pick today; rivals like Step 3.7 Flash ship under Apache 2.0 and undercut it on price.

How to access Qwen 3.7 Plus

Two paths:

API: call it through Alibaba Cloud Model Studio. The endpoint is OpenAI-compatible, so the request patterns from the base model carry over; our how to use the Qwen 3.7 API guide walks through auth and the first call, and you add image or video parts to the message payload for multimodal requests.
Chat: try it in the browser at chat.qwen.ai before you write any code. If you want to test the family without a bill, our Qwen 3.7 for free guide shows the free routes.

A minimal multimodal call uses the standard OpenAI message format, with an image part added alongside the text:

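from openai import OpenAI

# The endpoint is OpenAI-compatible, so the standard OpenAI client works
# once base_url points at Model Studio.
client = OpenAI(
    api_key="YOUR_MODEL_STUDIO_KEY",
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

resp = client.chat.completions.create(
    model="qwen3.7-plus",
    messages=[{
        "role": "user",
        # Multimodal content is a list of parts: one text part, one image part.
        "content": [
            {"type": "text", "text": "Which button submits this form? Give pixel coordinates."},
            {"type": "image_url", "image_url": {"url": "https://example.com/screenshot.png"}},
        ],
    }],
)
# The reply text lives on the first choice's message.
print(resp.choices[0].message.content)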

Check the Model Studio docs for the exact model identifier and the regional base URL, since those differ between the international and China endpoints.
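Screenshots captured by an agent usually live on disk rather than at a public URL. In the OpenAI message format, a local image is commonly sent as a base64 data URL in the same `image_url` part. The sketch below builds that payload; whether Model Studio accepts data URLs for this model, and any size limits, are assumptions to confirm in its docs. `image_part_from_file` and `build_messages` are hypothetical helper names.

```python
import base64
import mimetypes
from pathlib import Path


def image_part_from_file(path: str) -> dict:
    """Build an OpenAI-style image_url content part from a local image file."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    # Assumption: the endpoint accepts data URLs in place of http(s) URLs.
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}


def build_messages(prompt: str, image_path: str) -> list:
    """Pair a text prompt with one local image in a single user message."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            image_part_from_file(image_path),
        ],
    }]
```

The result can be passed as `messages=build_messages("Which button submits this form?", "screenshot.png")` in the call shown earlier. Base64 inflates the payload by about a third, so large captures are worth downscaling first.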

Who should use it

Reach for Qwen 3.7 Plus when your work looks like this:

Computer-use and GUI agents that click through real interfaces from screenshots.
Screenshot-to-code and mockup-to-UI, where the model reads a design and writes the front end.
Document, PDF, and video understanding at a low per-token cost.
Long agentic runs, up to the 35-hour ceiling with thousands of sequential tool calls.

Stick with Max if you’re optimizing purely for SWE-Bench Pro text scores or need the fastest text-only latency, where it runs a bit quicker on cold paths. For most mixed workloads, the cheaper multimodal option is the sensible default. If you’re weighing Plus against other open and budget models, our MiniMax M3 vs DeepSeek V4 vs Qwen 3.7 comparison is a useful map.

Testing Qwen 3.7 Plus with Apidog

Because Plus is API-only, you live in the API. Multimodal requests are fiddly: you’re encoding images, attaching video, and reading back structured action plans, often inside a tool-calling loop that runs for minutes or hours. You need to see exactly what each request sends and what comes back.

Apidog is built for that. Send Qwen 3.7 Plus requests with image and video payloads, inspect the raw responses, manage your Model Studio keys across environments, and mock the endpoint so your app keeps building while you tune prompts. For the agentic side, where Plus chains tool calls across a GUI-and-CLI workflow, Apidog’s AI agent debugger shows the full call sequence so you can find where a run went wrong.

Download Apidog to test, debug, and mock the Qwen 3.7 Plus API before it reaches production.

FAQ

Is Qwen 3.7 Plus open source? No. It’s proprietary and available only as a managed API through Alibaba Cloud Model Studio. You can’t download or self-host the weights. An open-weight variant has been suggested for Q3 2026 but isn’t confirmed.

Qwen 3.7 Plus or Max, which should I use? Use Plus if you need vision (screenshots, PDFs, video) or want the lower price, which covers most workloads. Use Max if you’re tuning for pure-text SWE-Bench Pro scores or need the fastest text-only latency.

How much does Qwen 3.7 Plus cost? $0.40 per million input tokens, $1.60 per million output tokens, and $0.08 per million cached input tokens. That’s roughly six times cheaper than Qwen3.7-Max on input and nearly five times cheaper on output.

Does Qwen 3.7 Plus handle video? Yes. It accepts text, images, and video as input. Remember that visual tokens share the 1M-token context budget, so large media payloads reduce your text headroom.

What’s the context window? 1M tokens, inherited from the Max backbone, shared across text, image, and video tokens.

How do I access Qwen 3.7 Plus? Through the Alibaba Cloud Model Studio API, or try it in the browser at chat.qwen.ai.

Another Chinese multimodal release worth benchmarking against Qwen is Zhipu AI's GLM-4.6V, which exposes a similar vision-language API and competes in the same quality tier.

The bottom line

Qwen 3.7 Plus takes Alibaba’s agentic flagship, bolts on vision, and cuts the price to a budget tier. For builders shipping computer-use agents, screenshot-driven coding, or video understanding, it’s one of the cheapest frontier-tier multimodal options available. The trade you accept is closed weights and a hard dependency on Alibaba’s cloud.

If that trade works for you, the next step is the API itself. Test it, debug the multimodal calls, and mock the responses in Apidog so what you ship holds up under real traffic.
