How to build a computer-use agent with Qwen 3.7 Plus

Qwen 3.7 Plus scores 79.0 on ScreenSpot Pro, the benchmark for looking at a screenshot and returning the exact pixel coordinates to click. That single skill is what turns a chat model into a computer-use agent: software that sees a screen, decides what to do, and does it. This guide builds a working one in Python, end to end.

We’ll cover the agent loop, the prompt that gets reliable actions out of the model, a runnable browser example with Playwright, and the cost and safety guardrails you need before pointing it at anything real. If you want the model background first, see our Qwen 3.7 Plus overview; for the raw request format, the Qwen 3.7 Plus API guide covers multimodal payloads. You’ll test the agent’s calls in Apidog as you go.

TL;DR

A computer-use agent runs a loop: capture a screenshot, send it to Qwen 3.7 Plus with a goal, get back a structured action like click (x, y), execute that action with a driver such as Playwright, then repeat until the goal is met. Qwen 3.7 Plus is a strong fit because of its GUI grounding and low multimodal price. The hard parts aren’t the model; they’re capping the loop, scaling coordinates, controlling token cost, and sandboxing actions so a wrong click can’t do damage.

What a computer-use agent actually does

Strip away the hype and it’s four steps on repeat:

Perceive: capture a screenshot of the current screen or page.
Decide: send the screenshot and the goal to the model, and get the next action.
Act: execute that action (click, type, scroll) through an automation driver.
Check: take a new screenshot and decide whether the goal is done.

The model is the “decide” step. Everything else is plumbing you control.
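The four steps can be sketched as a driver-agnostic loop. This is a minimal sketch, not a finished implementation: `run_agent`, its callable parameters, and the `{"action": "done"}` stop signal are names assumed here for illustration.

```python
from typing import Callable


def run_agent(goal: str,
              perceive: Callable[[], bytes],
              decide: Callable[[str, bytes], dict],
              act: Callable[[dict], None],
              max_steps: int = 15) -> list:
    """Run perceive -> decide -> act until the model says done or the cap hits."""
    history = []
    for _ in range(max_steps):
        screenshot = perceive()               # Perceive: grab the current screen
        action = decide(goal, screenshot)     # Decide: ask the model for one action
        history.append(action)
        if action.get("action") == "done":    # Check: the model reports the goal is met
            break
        act(action)                           # Act: hand the action to the driver
    return history


# Stand-in callables so the loop runs without a model or a browser.
script = iter([
    {"action": "click", "x": 10, "y": 20},
    {"action": "done", "reason": "goal reached"},
])
performed = []
history = run_agent(
    "demo goal",
    perceive=lambda: b"fake-png",
    decide=lambda goal, shot: next(script),
    act=performed.append,
)
print(history)
```

Passing the three steps in as callables keeps the plumbing swappable: the same loop can run against a scripted fake while you develop and against a real driver later.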

Why Qwen 3.7 Plus fits

Three reasons. First, its GUI grounding is frontier-tier, so it returns usable coordinates instead of vague descriptions. Second, it handles hybrid GUI-and-CLI workflows, so the same agent can click a button and run a shell command. Third, at $0.40 per million input tokens, it’s cheap enough to run the many vision calls an agent loop needs. For a comparison with the text-only flagship, see our Qwen 3.7 Plus vs Max comparison.

The decide step: getting a clean action

The trick is to constrain the model to a small action vocabulary and force JSON output. Loose prose is hard to execute; a strict schema is not.

import base64
import json
import os

from openai import OpenAI

# The endpoint is OpenAI-compatible, so the standard client works.
client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# A small, closed action vocabulary keeps the reply machine-executable.
SYSTEM = """You are a GUI agent. You see a screenshot and a goal.
Reply with ONE JSON action and nothing else:
{"action": "click", "x": <int>, "y": <int>}
{"action": "type", "text": "<string>"}
{"action": "scroll", "dy": <int>}
{"action": "done", "reason": "<string>"}
Coordinates are pixels in the screenshot you were given."""

def next_action(goal, png_bytes):
    """Send the goal and a PNG screenshot; return the model's action as a dict."""
    b64 = base64.b64encode(png_bytes).decode()  # inline the image as a data URL
    resp = client.chat.completions.create(
        model="qwen3.7-plus",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": [
                {"type": "text", "text": f"Goal: {goal}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ]},
        ],
    )
    # Raises json.JSONDecodeError if the reply is anything other than bare JSON.
    return json.loads(resp.choices[0].message.content)

Confirm the exact model ID in the Model Studio docs before shipping, since identifiers shift.
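Because `next_action` calls `json.loads` on the raw reply, any stray prose or code fence around the JSON raises an exception. One way to harden this is to extract the JSON object and check it against the four actions in the system prompt. The `parse_action` helper and `SCHEMA` table below are illustrative names, not part of any SDK.

```python
import json

# Required fields and their types for each action in the prompt's vocabulary.
SCHEMA = {
    "click": {"x": int, "y": int},
    "type": {"text": str},
    "scroll": {"dy": int},
    "done": {"reason": str},
}


def parse_action(raw: str) -> dict:
    """Extract and validate one JSON action from the model's reply.

    Raises ValueError if no JSON object is found or the schema is violated.
    """
    # Tolerate surrounding prose or code fences by slicing the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object in reply")
    try:
        action = json.loads(raw[start:end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc

    kind = action.get("action")
    if not isinstance(kind, str) or kind not in SCHEMA:
        raise ValueError(f"unknown action: {kind!r}")
    for name, expected in SCHEMA[kind].items():
        value = action.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"field {name!r} must be {expected.__name__}")
    return action


reply = '```json\n{"action": "click", "x": 100, "y": 200}\n```'
print(parse_action(reply))  # {'action': 'click', 'x': 100, 'y': 200}
```

A caller can catch `ValueError`, re-prompt once with a short "JSON only" reminder, and give up if the second reply also fails.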

The full loop with Playwright

Playwright drives a real browser, so the agent acts on actual pages. One detail saves you a lot of pain: make the screenshot resolution match the viewport, so the coordinates the model returns map one-to-one onto the page and you skip the scaling math.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    # Fixed viewport so screenshot pixels and click coordinates line up.
    page = browser.new_page(viewport={"width": 1280, "height": 800})
    page.goto("https://example.com")

    goal = "Open the pricing page and find the cheapest plan"

    for step in range(15):                 # hard cap on steps
        shot = page.screenshot()           # 1280x800 PNG, matches viewport
        action = next_action(goal, shot)
        print(step, action)

        kind = action.get("action")
        if kind == "done":
            break
        if kind == "click":
            page.mouse.click(action["x"], action["y"])
        elif kind == "type":
            # Types into whichever element currently has focus.
            page.keyboard.type(action["text"])
        elif kind == "scroll":
            page.mouse.wheel(0, action["dy"])
        else:
            # Stop instead of silently spending steps on an action we can't run.
            print("Unknown action, stopping:", action)
            break

        page.wait_for_timeout(800)         # let the UI settle

    browser.close()

That’s a real agent. It will navigate a site toward a goal, one grounded action at a time. The same pattern works for desktop apps if you swap Playwright for a desktop driver and screenshot the OS window instead.
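The loop above depends on the screenshot and the viewport being the same size. If they differ, for instance because you downscale the image before sending it, the returned coordinates need to be scaled by the ratio between the two. A minimal sketch, with `scale_point` as a hypothetical helper name:

```python
def scale_point(x, y, image_size, viewport_size):
    """Map a point from screenshot pixels to viewport pixels.

    image_size and viewport_size are (width, height) tuples.
    """
    img_w, img_h = image_size
    view_w, view_h = viewport_size
    # Scale each axis by its own ratio, then clamp inside the viewport.
    px = round(x * view_w / img_w)
    py = round(y * view_h / img_h)
    px = min(max(px, 0), view_w - 1)
    py = min(max(py, 0), view_h - 1)
    return px, py


# The model saw a 640x400 image; the browser viewport is 1280x800.
print(scale_point(320, 200, (640, 400), (1280, 800)))  # (640, 400)
```

Clamping guards against a coordinate that lands on or past the image edge, which would otherwise become a click outside the page.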

Cost and reliability

Screenshots are the expensive part. Each one is converted to input tokens, and a 1280-pixel-wide image runs to a few thousand tokens, so images dominate the cost of a 15-step loop and that cost adds up across runs. To keep it down:

Downscale and crop. Send the smallest image the model can still read. Crop to the relevant panel when you can.
Cap the loop. Always bound the step count, as the example does, so a confused agent can’t run forever.
Verify after acting. Treat each action as a hypothesis. The next screenshot confirms whether it worked, and the loop self-corrects.
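To put rough numbers on the input cost of one run, you can multiply it out. The $0.40 per million input tokens and the 15-step cap come from this guide; the 3,000 tokens per screenshot and 300 tokens of prompt text are assumptions chosen for illustration, and output tokens are left out.

```python
PRICE_PER_MILLION_INPUT = 0.40   # USD, the input price quoted in this guide


def estimate_input_cost(steps, tokens_per_image, prompt_tokens=0):
    """Estimate the input-token cost in USD for one agent run.

    Assumes every step sends one screenshot plus the same prompt text,
    and that no earlier screenshots are resent.
    """
    total_tokens = steps * (tokens_per_image + prompt_tokens)
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT


# ASSUMPTION: 3,000 tokens per screenshot and 300 prompt tokens per step.
cost = estimate_input_cost(steps=15, tokens_per_image=3000, prompt_tokens=300)
print(f"${cost:.4f} per run")  # $0.0198 per run
```

Measure the real token counts from the `usage` field of your own responses before trusting an estimate like this; the function only shows which levers (step count and image size) move the total.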

Our guide on reducing agent token costs goes deeper, and our notes on agentic workflow wiring cover where these loops break in practice.

When the agent gets stuck

Three failures show up constantly, and each has a cheap fix:

The model returns prose instead of JSON. Re-prompt with a short “reply with JSON only” reminder and retry once before giving up. A strict schema plus a repair step catches almost all of these.
A click misses its target. The next screenshot shows nothing changed, so add a rule that retries with a fresh screenshot instead of blindly repeating the same coordinates.
The loop spins without progress. Track the last few actions; if they repeat, stop and surface the screenshot for a human. The step cap is your backstop.
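The "track the last few actions" rule can be sketched with a small fixed-size window. `StuckDetector` and its three-action default are assumptions for illustration; tune the window to your own tasks.

```python
import json
from collections import deque


class StuckDetector:
    """Flags a loop that keeps emitting the same action."""

    def __init__(self, window=3):
        self.window = window
        self.recent = deque(maxlen=window)   # keeps only the last `window` actions

    def update(self, action: dict) -> bool:
        """Record an action; return True if the last `window` actions are identical."""
        # Serialize with sorted keys so key order doesn't affect the comparison.
        self.recent.append(json.dumps(action, sort_keys=True))
        return len(self.recent) == self.window and len(set(self.recent)) == 1


detector = StuckDetector(window=3)
click = {"action": "click", "x": 400, "y": 300}
results = [detector.update(click) for _ in range(3)]
print(results)  # [False, False, True]
```

When `update` returns True, break out of the loop and surface the latest screenshot for a human rather than spending the remaining steps.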

Safety

A computer-use agent clicks things for real. Before it touches anything that matters:

Run it in a sandbox or a throwaway browser profile, never your logged-in production session.
Require human confirmation for destructive actions like delete, send, or pay.
Log every action with its screenshot so you can audit what the agent did and why.
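The last two guardrails can be sketched in a few lines. Everything here is an assumption for illustration: the keyword list, the `needs_confirmation` and `log_action` names, and the `target_label` argument, which stands for whatever text your driver can read from the element under the click.

```python
import json
import re
import time

# ASSUMPTION: a simple keyword match; extend the list for your own application.
RISKY = re.compile(r"\b(delete|send|pay)\b")


def needs_confirmation(action: dict, target_label: str = "") -> bool:
    """True if typed text or the clicked element's label mentions a risky word."""
    text = f"{action.get('text', '')} {target_label}".lower()
    return RISKY.search(text) is not None


def log_action(log_path, step, action, screenshot_path):
    """Append one JSON line per action so a run can be audited later."""
    record = {
        "ts": time.time(),
        "step": step,
        "action": action,
        "screenshot": str(screenshot_path),
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


click = {"action": "click", "x": 900, "y": 40}
print(needs_confirmation(click, "Delete account"))  # True
print(needs_confirmation(click, "Pricing"))         # False
```

In the loop, call `log_action` before every step and pause for a human whenever `needs_confirmation` returns True. A keyword check is a coarse filter, so keep the sandbox as the real safety boundary.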

Test the agent’s calls with Apidog

Most agent failures trace back to one question: did the model return a valid action? Before you wire up Playwright, pin that down. Use Apidog to send a sample screenshot to Qwen 3.7 Plus, inspect the raw JSON it returns, and tune your system prompt until the action schema comes back clean every time. Store your Model Studio key per environment, and mock the endpoint so you can build the loop without burning tokens on every test run. When the full loop is chaining calls, Apidog’s AI agent debugger shows the sequence so you can find the step that derailed.

To generate UI code from a design instead of driving one, see our companion guide on screenshot-to-code with Qwen 3.7 Plus.

Download Apidog to test and debug the model calls behind your agent.

FAQ

What’s a computer-use agent? Software that perceives a screen through screenshots, decides an action with a model, and executes it through an automation driver, looping until a goal is met.

Can Qwen 3.7 Plus control my desktop? The model only returns actions. You execute them with a driver. Pair it with Playwright for browsers or a desktop automation library for native apps.

How much does each step cost? Mostly the screenshot. A single screen image can run to a few thousand input tokens at $0.40 per million, so downscaling and capping the loop are the main cost levers.

Is it reliable enough for production? For bounded, well-defined tasks with verification after each step, yes. For open-ended control of critical systems, keep a human in the loop and sandbox everything.

Do I need to scale the coordinates? Not if your screenshot resolution matches your viewport. If they differ, scale the returned coordinates by the ratio between them.

The bottom line

A computer-use agent is a short loop around one capable model, and Qwen 3.7 Plus gives you the grounding and the price to run it. Build the loop, cap it, sandbox it, and verify each step. Then test the model calls in Apidog so the “decide” step is solid before the agent starts clicking.
