How to build a computer-use agent with Qwen 3.7 Plus

Build a working computer-use / GUI agent with Qwen 3.7 Plus: the perceive-decide-act loop, a strict JSON action prompt, a runnable Playwright example, plus cost, reliability, and safety guardrails.

Ashley Innocent

3 June 2026

Qwen 3.7 Plus scores 79.0 on ScreenSpot Pro, the benchmark for looking at a screenshot and returning the exact pixel coordinates to click. That single skill is what turns a chat model into a computer-use agent: software that sees a screen, decides what to do, and does it. This guide builds a working one in Python, end to end.

We’ll cover the agent loop, the prompt that gets reliable actions out of the model, a runnable browser example with Playwright, and the cost and safety guardrails you need before pointing it at anything real. If you want the model background first, see our Qwen 3.7 Plus overview; for the raw request format, the Qwen 3.7 Plus API guide covers multimodal payloads. You’ll test the agent’s calls in Apidog as you go.

TL;DR

A computer-use agent runs a loop: screenshot the screen, send it to Qwen 3.7 Plus with a goal, get back a structured action like click (x, y), execute that action with a driver such as Playwright, then repeat until the goal is met. Plus is a strong fit because of its GUI grounding and low multimodal price. The hard parts aren’t the model; they’re capping the loop, scaling coordinates, controlling token cost, and sandboxing actions so a wrong click can’t do damage.

What a computer-use agent actually does

Strip away the hype and it’s four steps on repeat:

  1. Perceive: capture a screenshot of the current screen or page.
  2. Decide: send the screenshot and the goal to the model, and get the next action.
  3. Act: execute that action (click, type, scroll) through an automation driver.
  4. Check: take a new screenshot and decide whether the goal is done.

The model is the “decide” step. Everything else is plumbing you control.
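
To make the four steps concrete, here is a minimal sketch of the loop with each step passed in as a plain function. The names run_agent, perceive, decide, and act are ours for illustration, not part of any library. The check step is folded into the next pass: the fresh screenshot goes back to the model, which reports done when the goal is met.

```python
from typing import Callable

Action = dict  # e.g. {"action": "click", "x": 10, "y": 20}


def run_agent(
    goal: str,
    perceive: Callable[[], bytes],           # returns a screenshot as PNG bytes
    decide: Callable[[str, bytes], Action],  # the model call
    act: Callable[[Action], None],           # the automation driver
    max_steps: int = 15,                     # hard cap so the loop always ends
) -> list[Action]:
    """Run perceive -> decide -> act until the model says done or the cap hits."""
    history: list[Action] = []
    for _ in range(max_steps):
        shot = perceive()
        action = decide(goal, shot)
        history.append(action)
        if action.get("action") == "done":   # the model reports the goal is met
            break
        act(action)
    return history
```

Because the steps are injected, you can run the loop against scripted actions before spending a single token on the real model.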

Why Qwen 3.7 Plus fits

Three reasons. Its GUI grounding is frontier-tier, so it returns usable coordinates instead of vague descriptions. It handles hybrid GUI-and-CLI workflows, so the same agent can click a button and run a shell command. And at $0.40 per million input tokens, it’s cheap enough to run the many vision calls an agent loop needs. For how it stacks against the text-only flagship, see our Qwen 3.7 Plus vs Max comparison.

The decide step: getting a clean action

The trick is to constrain the model to a small action vocabulary and force JSON output. Loose prose is hard to execute; a strict schema is not.

import os, json, base64
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

SYSTEM = """You are a GUI agent. You see a screenshot and a goal.
Reply with ONE JSON action and nothing else:
{"action": "click", "x": <int>, "y": <int>}
{"action": "type", "text": "<string>"}
{"action": "scroll", "dy": <int>}
{"action": "done", "reason": "<string>"}
Coordinates are pixels in the screenshot you were given."""

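def next_action(goal, png_bytes):
    """Send the goal and one screenshot to the model; return the parsed action."""
    # Inline the PNG as a base64 data URL so no separate upload is needed.
    b64 = base64.b64encode(png_bytes).decode()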
    resp = client.chat.completions.create(
        model="qwen3.7-plus",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": [
                {"type": "text", "text": f"Goal: {goal}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ]},
        ],
    )
    # json.loads raises json.JSONDecodeError if the reply is not pure JSON.
    return json.loads(resp.choices[0].message.content)

Confirm the exact model ID in the Model Studio docs before shipping, since identifiers shift.
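
A reply with a stray word or a Markdown fence around it will make json.loads raise, so it helps to validate before acting. The sketch below is one way to do that, not an official schema: parse_action and SCHEMA are hypothetical names, and the fence-stripping step assumes the model sometimes wraps its JSON in a fence despite the prompt.

```python
import json

# Required fields and their types per action, mirroring the system prompt.
SCHEMA = {
    "click": {"x": int, "y": int},
    "type": {"text": str},
    "scroll": {"dy": int},
    "done": {"reason": str},
}


def parse_action(raw: str) -> dict:
    """Parse a model reply into a validated action dict, or raise ValueError."""
    text = raw.strip()
    fence = "`" * 3
    # Assumption: the reply may arrive wrapped in a Markdown code fence.
    if text.startswith(fence):
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:]
    try:
        action = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(action, dict):
        raise ValueError("reply must be a JSON object")
    kind = action.get("action")
    if not isinstance(kind, str) or kind not in SCHEMA:
        raise ValueError(f"unknown action: {kind!r}")
    for name, expected in SCHEMA[kind].items():
        value = action.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"field {name!r} must be {expected.__name__}")
    return action
```

In next_action you would return parse_action(resp.choices[0].message.content) and treat a ValueError as a signal to retry or stop, rather than letting a malformed action reach the driver.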

The full loop with Playwright

Playwright drives a real browser, so the agent acts on actual pages. One detail saves you a lot of pain: make the screenshot resolution match the viewport, so the coordinates the model returns map one to one and you skip the scaling math.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page(viewport={"width": 1280, "height": 800})
    page.goto("https://example.com")

    goal = "Open the pricing page and find the cheapest plan"

    for step in range(15):                 # hard cap on steps
        shot = page.screenshot()           # 1280x800 PNG, matches viewport
        action = next_action(goal, shot)
        print(step, action)

        if action["action"] == "done":
            break
        if action["action"] == "click":
            page.mouse.click(action["x"], action["y"])
        elif action["action"] == "type":
            page.keyboard.type(action["text"])
        elif action["action"] == "scroll":
            page.mouse.wheel(0, action["dy"])

        page.wait_for_timeout(800)         # let the UI settle

    browser.close()

That’s a real agent. It will navigate a site toward a goal, one grounded action at a time. The same pattern works for desktop apps if you swap Playwright for a desktop driver and screenshot the OS window instead.
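
If you do end up with a screenshot whose size differs from the viewport, for example because you downscaled it before sending, the scaling math is one ratio per axis. This is a minimal sketch, assuming the model's coordinates are pixels in the image it received. scale_to_viewport is a hypothetical helper, and the clamping is our addition.

```python
def scale_to_viewport(x, y, shot_size, viewport_size):
    """Map a point from screenshot pixels to viewport pixels."""
    shot_w, shot_h = shot_size
    view_w, view_h = viewport_size
    sx = x * view_w / shot_w
    sy = y * view_h / shot_h
    # Clamp so a slightly out-of-range prediction still lands inside the page.
    sx = min(max(sx, 0), view_w - 1)
    sy = min(max(sy, 0), view_h - 1)
    return round(sx), round(sy)


# A 640x400 screenshot of a 1280x800 viewport doubles both coordinates.
print(scale_to_viewport(320, 200, (640, 400), (1280, 800)))  # (640, 400)
```

Pass the result to page.mouse.click in place of the raw action["x"] and action["y"].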

Cost and reliability

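Screenshots are the expensive part. Each one is converted to tokens, and a 1280-wide image runs to a few thousand tokens, so a 15-step loop sends tens of thousands of input tokens through the API, and that adds up across many runs. The two main levers are downscaling the screenshot and capping the loop.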

Our guide on reducing agent token costs goes deeper, and our notes on agentic workflow wiring cover where these loops break in practice.
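
For a rough sense of scale, the arithmetic is short enough to script. The sketch below uses the $0.40 per million input price quoted above and assumes 3,000 tokens per screenshot as a stand-in for "a few thousand". Measure your own token counts before budgeting. It ignores output tokens, and loop_input_cost is a hypothetical helper.

```python
INPUT_PRICE_PER_MILLION = 0.40  # USD, the input price quoted in this article


def loop_input_cost(steps, tokens_per_screenshot, prompt_tokens=0):
    """Estimate input-token spend in USD for one agent run."""
    total_tokens = steps * (tokens_per_screenshot + prompt_tokens)
    return total_tokens * INPUT_PRICE_PER_MILLION / 1_000_000


# Assumption: 3,000 tokens per screenshot; 15 steps is 45,000 input tokens.
print(f"${loop_input_cost(15, 3000):.4f}")  # $0.0180
```

One run is cheap under these assumptions; the bill comes from running the loop thousands of times, which is why halving the screenshot tokens or the step cap matters.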

When the agent gets stuck

Three failures show up constantly, and each has a cheap fix:

Safety

A computer-use agent clicks things for real. Before it touches anything that matters, sandbox it, cap the loop, verify each step, and keep a human in the loop for critical systems.
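
One guardrail that is easy to sketch is a host allowlist: stop acting as soon as the browser leaves the sites you meant to automate. This is our illustration of sandboxing, not a complete safety layer. ALLOWED_HOSTS, host_allowed, and safe_to_act are hypothetical names, and the allowlist contents are placeholders.

```python
from urllib.parse import urlparse

ALLOWED_HOSTS = {"example.com"}  # hypothetical allowlist for this sketch


def host_allowed(url: str, allowed=ALLOWED_HOSTS) -> bool:
    """True if the URL's host is an allowed host or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in allowed)


def safe_to_act(action: dict, current_url: str) -> bool:
    """Refuse every action once the page is outside the allowlist."""
    if action.get("action") == "done":
        return True  # finishing is always allowed
    return host_allowed(current_url)
```

In the Playwright loop, a check such as `if not safe_to_act(action, page.url): break` placed before the click, type, and scroll branches would halt the run on the first off-site page.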

Test the agent’s calls with Apidog

Most agent failures trace back to one question: did the model return a valid action? Before you wire up Playwright, pin that down. Use Apidog to send a sample screenshot to Qwen 3.7 Plus, inspect the raw JSON it returns, and tune your system prompt until the action schema comes back clean every time. Store your Model Studio key per environment, and mock the endpoint so you can build the loop without burning tokens on every test run. When the full loop is chaining calls, Apidog’s AI agent debugger shows the sequence so you can find the step that derailed.

To generate UI code from a design instead of driving one, see our companion guide on screenshot-to-code with Qwen 3.7 Plus.

Download Apidog to test and debug the model calls behind your agent.

FAQ

What’s a computer-use agent? Software that perceives a screen through screenshots, decides an action with a model, and executes it through an automation driver, looping until a goal is met.

Can Qwen 3.7 Plus control my desktop? The model only returns actions. You execute them with a driver. Pair it with Playwright for browsers or a desktop automation library for native apps.

How much does each step cost? Mostly the screenshot. A single screen image can run to a few thousand input tokens at $0.40 per million, so downscaling and capping the loop are the main cost levers.

Is it reliable enough for production? For bounded, well-defined tasks with verification after each step, yes. For open-ended control of critical systems, keep a human in the loop and sandbox everything.

Do I need to scale the coordinates? Not if your screenshot resolution matches your viewport. If they differ, scale the returned coordinates by the ratio between them.

The bottom line

A computer-use agent is a short loop around one capable model, and Qwen 3.7 Plus gives you the grounding and the price to run it. Build the loop, cap it, sandbox it, and verify each step. Then test the model calls in Apidog so the “decide” step is solid before the agent starts clicking.
