TL;DR
Qwen3.5 is Alibaba's 397-billion-parameter vision-language model built on a Mixture of Experts (MoE) architecture. You can access it for free through NVIDIA's GPU-accelerated endpoints by registering for the NVIDIA Developer Program. This guide walks you through obtaining your API key, making your first calls, and integrating Qwen3.5's multimodal capabilities into your applications.
Introduction
Alibaba's Qwen3.5 represents a significant leap in multimodal AI. This 397-billion-parameter model combines a Mixture of Experts (MoE) architecture with Gated Delta Networks, delivering powerful reasoning while keeping the active parameter count at just 17 billion. The result is a model that can understand images, navigate user interfaces, and handle complex multimodal tasks, all accessible through a free API.
The best part? You can start using Qwen3.5 for free right now through NVIDIA's developer platform. Whether you're building AI agents, developing visual reasoning applications, or exploring multimodal AI, this guide will walk you through every step.
What is Qwen3.5 VLM?
Qwen3.5 is Alibaba's first native vision-language model in the Qwen3.5 series, designed specifically for building autonomous agents. Unlike previous VLMs that were adapted from text-only models, Qwen3.5 was built from the ground up for multimodal reasoning and UI navigation.

Key Specifications
| Specification | Value |
|---|---|
| Total Parameters | 397 billion |
| Active Parameters | 17 billion |
| Activation Rate | 4.28% |
| Expert Count | 512 experts |
| Experts per Token | 11 (10 routed + 1 shared) |
| Input Context | 256K (extensible to 1M) |
| Languages Supported | 200+ |
| Architecture | MoE + Gated Delta Networks |

What Makes Qwen3.5 Special
Mixture of Experts (MoE) architecture means only a subset of the model's parameters are active for any given input. This makes the model computationally efficient while maintaining the capacity for complex reasoning across all 397B parameters.
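The routing idea behind MoE can be sketched in a few lines of Python. This is a toy illustration only (random gate scores, not Qwen3.5's actual router); it shows how a gate selects the top-10 routed experts for each token, with one shared expert always active, matching the 11-experts-per-token figure below.

```python
import random

NUM_EXPERTS = 512   # routed experts in the pool
TOP_K = 10          # routed experts activated per token
SHARED = 1          # always-active shared expert

def route_token(gate_logits):
    """Pick the top-k routed experts for one token from gate scores."""
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    return ranked[:TOP_K]  # the shared expert is added unconditionally downstream

# Toy gate scores for a single token
random.seed(0)
logits = [random.random() for _ in range(NUM_EXPERTS)]
active = route_token(logits)
print(f"Experts active for this token: {len(active) + SHARED} of {NUM_EXPERTS + SHARED}")
```

Only the selected experts run their feed-forward computation for that token, which is why a 397B-parameter model can get away with roughly 17B active parameters per step.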
Native Multimodal Agent Capabilities set Qwen3.5 apart from other VLMs:
- Understands and navigates user interfaces
- Performs visual reasoning on mobile and web interfaces
- Handles complex coding tasks
- Powers chat applications with multimodal understanding
Ideal Use Cases
- Coding and Web Development: Write and debug code with visual context
- Visual Reasoning: Analyze screenshots, photos, and UI elements
- Chat Applications: Build conversational AI with multimodal understanding
- Complex Search: Search across images and text simultaneously
- UI Automation: Navigate and interact with interfaces autonomously
NVIDIA Developer Program: Get Your Free API Key
NVIDIA provides free access to Qwen3.5 through their GPU-accelerated endpoints. Here's how to get started:
Step 1: Join NVIDIA Developer Program
- Visit build.nvidia.com
- Click Sign In or Create Account
- Register for the NVIDIA Developer Program (free)
- Verify your email address

Step 2: Get Your API Key
- After logging in, navigate to your account settings
- Find API Keys or NVIDIA API Key
- Copy your API key (it starts with `nvapi-`)
- Store it securely (you'll need it for authentication)

Important: Never expose your API key in client-side code. Use environment variables or a backend server to store it securely.
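In practice, that means reading the key from the environment rather than hard-coding it. A minimal helper (the `NVIDIA_API_KEY` variable name is just a convention used in this guide, not something the API requires):

```python
import os

def auth_headers():
    """Build Authorization headers from the NVIDIA_API_KEY environment variable."""
    api_key = os.getenv("NVIDIA_API_KEY")
    if not api_key:
        raise RuntimeError("Set NVIDIA_API_KEY before calling the API")
    return {
        "Authorization": f"Bearer {api_key}",
        "Accept": "application/json",
    }
```

Export the key once in your shell (`export NVIDIA_API_KEY=nvapi-...`) and every script can share it without the key ever appearing in source control.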
Step 3: Test Your Access
You can test Qwen3.5 directly in your browser at build.nvidia.com/qwen/qwen3.5-397b-a17b. This lets you experiment with prompts and evaluate the model with your own data before writing any code.

Your First Qwen3.5 API Call
Now let's make your first API call to Qwen3.5. The API is compatible with OpenAI's format, making it easy to integrate into existing applications.
Basic API Call
```python
import requests

# Configuration
invoke_url = "https://integrate.api.nvidia.com/v1/chat/completions"
api_key = "YOUR_NVIDIA_API_KEY"  # Replace with your API key

headers = {
    "Authorization": f"Bearer {api_key}",
    "Accept": "application/json",
}

# Payload - simple text-only request
payload = {
    "messages": [
        {
            "role": "user",
            "content": "What are the key features of Qwen3.5 VLM?"
        }
    ],
    "model": "qwen/qwen3.5-397b-a17b",
    "max_tokens": 1024,
    "temperature": 0.7,
}

# Make the request
session = requests.Session()
response = session.post(invoke_url, headers=headers, json=payload)
response.raise_for_status()

# Print the response
result = response.json()
print(result['choices'][0]['message']['content'])
```
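The JSON response follows the OpenAI chat-completions shape. If you'd rather not index into it blindly, a small helper (assuming the standard `choices`/`message`/`content` layout) makes unexpected error payloads visible instead of raising a cryptic `KeyError`:

```python
def extract_reply(result):
    """Return the assistant's text from an OpenAI-style chat completion,
    or raise with the raw payload so API errors are easy to spot."""
    try:
        return result["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError):
        raise ValueError(f"Unexpected response shape: {result!r}")
```

This also pays off when the endpoint returns an error object (for example on an invalid key), since the full payload ends up in the exception message.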
Making Multimodal Requests (With Images)
To use Qwen3.5's vision capabilities, include image data in your request:
```python
import base64
import requests

# Function to encode an image to base64
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# Encode your image
image_base64 = encode_image("screenshot.png")

invoke_url = "https://integrate.api.nvidia.com/v1/chat/completions"
api_key = "YOUR_NVIDIA_API_KEY"

headers = {
    "Authorization": f"Bearer {api_key}",
    "Accept": "application/json",
}

# Multimodal request with an image
payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_base64}"}
                },
                {
                    "type": "text",
                    "text": "What do you see in this image? Describe the UI elements."
                }
            ]
        }
    ],
    "model": "qwen/qwen3.5-397b-a17b",
    "max_tokens": 1024,
}

response = requests.post(invoke_url, headers=headers, json=payload)
response.raise_for_status()
result = response.json()
print(result['choices'][0]['message']['content'])
```
Code Examples in Python and JavaScript
Python: Complete Integration Example
```python
import base64
import os
import requests
from requests.exceptions import RequestException


class QwenClient:
    """Python client for the Qwen3.5 API"""

    def __init__(self, api_key=None):
        self.api_key = api_key or os.getenv("NVIDIA_API_KEY")
        self.endpoint = "https://integrate.api.nvidia.com/v1/chat/completions"
        self.model = "qwen/qwen3.5-397b-a17b"

    def chat(self, message, system_prompt=None, **kwargs):
        """Send a chat message to Qwen3.5"""
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json"
        }
        messages = []
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user", "content": message})
        payload = {
            "messages": messages,
            "model": self.model,
            "max_tokens": kwargs.get("max_tokens", 2048),
            "temperature": kwargs.get("temperature", 0.7),
            "top_p": kwargs.get("top_p", 0.9),
        }
        # Enable thinking mode if requested
        if kwargs.get("thinking", False):
            payload["chat_template_kwargs"] = {"thinking": True}
        try:
            response = requests.post(
                self.endpoint,
                headers=headers,
                json=payload,
                timeout=kwargs.get("timeout", 60)
            )
            response.raise_for_status()
            return response.json()
        except RequestException as e:
            return {"error": str(e)}

    def chat_with_image(self, message, image_path, **kwargs):
        """Send a chat message with an image to Qwen3.5"""
        with open(image_path, "rb") as f:
            image_base64 = base64.b64encode(f.read()).decode("utf-8")
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json"
        }
        payload = {
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
                    {"type": "text", "text": message}
                ]
            }],
            "model": self.model,
            "max_tokens": kwargs.get("max_tokens", 2048),
            "temperature": kwargs.get("temperature", 0.7),
        }
        response = requests.post(self.endpoint, headers=headers, json=payload)
        response.raise_for_status()
        return response.json()


# Usage example
client = QwenClient(api_key="YOUR_NVIDIA_API_KEY")

# Text-only chat
result = client.chat("Explain Mixture of Experts architecture in simple terms")
print(result['choices'][0]['message']['content'])

# Multimodal chat
result = client.chat_with_image(
    "What UI elements are in this screenshot?",
    "screenshot.png"
)
print(result['choices'][0]['message']['content'])
```
JavaScript/Node.js: Complete Integration Example
```javascript
const axios = require('axios');

class QwenClient {
  constructor(apiKey) {
    this.apiKey = apiKey;
    this.endpoint = 'https://integrate.api.nvidia.com/v1/chat/completions';
    this.model = 'qwen/qwen3.5-397b-a17b';
  }

  async chat(message, options = {}) {
    const { systemPrompt, temperature = 0.7, maxTokens = 2048, thinking = false } = options;
    const messages = [];
    if (systemPrompt) {
      messages.push({ role: 'system', content: systemPrompt });
    }
    messages.push({ role: 'user', content: message });
    const payload = {
      messages,
      model: this.model,
      temperature,
      max_tokens: maxTokens,
      ...(thinking && { chat_template_kwargs: { thinking: true } })
    };
    try {
      const response = await axios.post(this.endpoint, payload, {
        headers: {
          'Authorization': `Bearer ${this.apiKey}`,
          'Content-Type': 'application/json'
        },
        timeout: 60000
      });
      return response.data;
    } catch (error) {
      console.error('API Error:', error.response?.data || error.message);
      throw error;
    }
  }

  async chatWithImage(message, imageBase64, options = {}) {
    const { temperature = 0.7, maxTokens = 2048 } = options;
    const payload = {
      messages: [{
        role: 'user',
        content: [
          { type: 'image_url', image_url: { url: `data:image/png;base64,${imageBase64}` } },
          { type: 'text', text: message }
        ]
      }],
      model: this.model,
      temperature,
      max_tokens: maxTokens
    };
    const response = await axios.post(this.endpoint, payload, {
      headers: {
        'Authorization': `Bearer ${this.apiKey}`,
        'Content-Type': 'application/json'
      }
    });
    return response.data;
  }
}

// Usage (wrapped in an async IIFE because top-level await
// is not available in CommonJS modules)
(async () => {
  const client = new QwenClient(process.env.NVIDIA_API_KEY);

  // Text chat
  const result = await client.chat('What is the advantage of MoE architecture?');
  console.log(result.choices[0].message.content);

  // With thinking enabled
  const deepResult = await client.chat('Explain how reasoning works in LLMs', {
    thinking: true
  });
  console.log(deepResult.choices[0].message.content);
})();
```
Advanced Features: Thinking Mode and Tool Calling
Thinking Mode
Qwen3.5 supports an advanced "thinking" mode that enables the model to show its reasoning process. This is particularly useful for complex problem-solving tasks.
```python
# Reuses invoke_url, headers, and session from the basic example above
payload = {
    "messages": [{"role": "user", "content": "Solve this step by step: If a train travels 120km in 2 hours, what is its speed?"}],
    "model": "qwen/qwen3.5-397b-a17b",
    "chat_template_kwargs": {"thinking": True},
    "max_tokens": 4096,
}

response = session.post(invoke_url, headers=headers, json=payload)
response.raise_for_status()
result = response.json()
print(result['choices'][0]['message']['content'])
```
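Depending on the serving stack, the reasoning may come back in a separate field or inline in the content, wrapped in `<think>...</think>` tags. The tag format here is an assumption about one common convention, so check what your endpoint actually returns; a helper for the inline case:

```python
import re

def split_thinking(text):
    """Split inline <think>...</think> reasoning from the final answer.
    Returns (reasoning, answer); reasoning is "" when no tags are present."""
    match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if not match:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

This lets you log or display the chain of reasoning separately from the answer you show users.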
Tool Calling
Qwen3.5 supports function calling through OpenAI-compatible tools. This enables you to build agentic applications that can execute real actions.
```python
import json

# Reuses invoke_url, headers, and session from the basic example above

# Define tools for the model to use
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"}
                },
                "required": ["location"]
            }
        }
    }
]

payload = {
    "messages": [
        {"role": "user", "content": "What's the weather like in Tokyo?"}
    ],
    "model": "qwen/qwen3.5-397b-a17b",
    "tools": tools,
    "tool_choice": "auto"
}

response = session.post(invoke_url, headers=headers, json=payload)
response.raise_for_status()
result = response.json()

# Check whether the model wants to call a tool
message = result['choices'][0]['message']
if message.get('tool_calls'):
    tool_call = message['tool_calls'][0]
    print(f"Model wants to call: {tool_call['function']['name']}")
    # Arguments arrive as a JSON string
    print(f"Arguments: {json.loads(tool_call['function']['arguments'])}")
```
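After executing the function yourself, you send its result back so the model can compose a final answer. A sketch of building that follow-up message list, following the OpenAI tool-message convention (the example values below are hypothetical, mirroring the weather request above):

```python
import json

def tool_result_messages(user_msg, assistant_msg, tool_call, result_value):
    """Build the follow-up message list: user turn, the assistant's tool call
    echoed back verbatim, then your tool's result as a role="tool" message."""
    return [
        {"role": "user", "content": user_msg},
        assistant_msg,
        {
            "role": "tool",
            "tool_call_id": tool_call["id"],
            "content": json.dumps(result_value),
        },
    ]

# Hypothetical values standing in for a real API response and weather lookup
assistant_msg = {
    "role": "assistant",
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"location": "Tokyo"}'},
    }],
}
messages = tool_result_messages(
    "What's the weather like in Tokyo?",
    assistant_msg,
    assistant_msg["tool_calls"][0],
    {"location": "Tokyo", "temp_c": 18},
)
# POST these messages back (with the same tools list) to get the final answer
```

The second request uses the same endpoint and `tools` list as the first; the model sees the tool output and replies in natural language.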
Understanding Rate Limits and Pricing
Current Free Tier (NVIDIA Developer Program)
| Feature | Limit |
|---|---|
| API Access | Free with registration |
| GPU-Accelerated Endpoints | Included |
| Browser Testing | Unlimited |
| Rate Limits | Check developer dashboard |
What This Means for You
- No credit card required: Just register for the free NVIDIA Developer Program
- GPU-accelerated: Requests run on NVIDIA Blackwell GPUs
- Production-ready: Same endpoints used for production workloads
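Whatever the exact limits turn out to be, it's worth handling HTTP 429 (rate limited) gracefully. A sketch of exponential backoff around the earlier request code; the delay schedule is a reasonable default of this guide, not an NVIDIA recommendation, and the `post` parameter exists mainly to make the function easy to test:

```python
import time
import requests

def post_with_backoff(url, headers, payload, max_retries=4, post=None):
    """POST with exponential backoff on 429 responses: waits 1s, 2s, 4s, 8s."""
    if post is None:
        post = requests.post
    for attempt in range(max_retries + 1):
        response = post(url, headers=headers, json=payload, timeout=60)
        if response.status_code != 429 or attempt == max_retries:
            response.raise_for_status()  # surface any non-429 error
            return response.json()
        time.sleep(2 ** attempt)  # back off before retrying
```

Drop-in usage: `result = post_with_backoff(invoke_url, headers, payload)`.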
Scaling to Production
When you're ready to move beyond free tier:
- NVIDIA NIM: Deploy containerized models anywhere (cloud, on-premises, hybrid)
- NeMo: Customize the model for your specific domain
- Enterprise support: Contact NVIDIA for dedicated infrastructure
Production Deployment with NVIDIA NIM
NVIDIA NIM (NVIDIA Inference Microservices) makes it easy to take Qwen3.5 from development to production.

What is NIM?
NIM provides pre-built, optimized containers for AI inference. Each NIM microservice packages:
- The model with performance optimizations
- Standardized APIs (OpenAI-compatible)
- Deployment flexibility (cloud, on-premises, edge)
Deploying Qwen3.5 with NIM
```bash
# Pull the Qwen3.5 NIM container
docker pull nvcr.io/nim/qwen/qwen3.5-397b-a17b:latest

# Run the container
docker run --gpus all --rm -p 8000:8000 \
  -e NVIDIA_API_KEY=$NVIDIA_API_KEY \
  nvcr.io/nim/qwen/qwen3.5-397b-a17b:latest
```
Now your model is running locally at http://localhost:8000/v1/chat/completions.
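Calling the local endpoint looks the same as calling the hosted one, just with a different base URL (and typically no bearer token for a local container, though that depends on your NIM configuration). A sketch:

```python
import requests

LOCAL_URL = "http://localhost:8000/v1/chat/completions"

def local_payload(message, model="qwen/qwen3.5-397b-a17b", max_tokens=1024):
    """Build an OpenAI-style payload for the local NIM endpoint."""
    return {
        "messages": [{"role": "user", "content": message}],
        "model": model,
        "max_tokens": max_tokens,
    }

RUN_LIVE = False  # flip to True once the container above is running

if RUN_LIVE:
    response = requests.post(LOCAL_URL, json=local_payload("Hello from local NIM"))
    response.raise_for_status()
    print(response.json()["choices"][0]["message"]["content"])
```

Because the interface is OpenAI-compatible, code written against the hosted endpoint should need only the URL change.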
Benefits of NIM
- Anywhere deployment: Run on-premises, in cloud, or hybrid
- Optimized performance: Tuned for NVIDIA GPU inference
- Consistent APIs: OpenAI-compatible interface
- Scalable: Scale from dev to production seamlessly
Customization with NVIDIA NeMo
For domain-specific applications, you can fine-tune Qwen3.5 using NVIDIA NeMo.
NeMo Framework Capabilities
- High-throughput fine-tuning: PyTorch-native training
- LoRA support: Memory-efficient customization
- Multinode training: Slurm and Kubernetes support
- Hugging Face integration: Direct training on existing checkpoints
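LoRA's memory savings are easy to see with a back-of-the-envelope example. This is a generic illustration of the low-rank-adapter idea (not NeMo-specific code): the pretrained weight `W` stays frozen and only the small factors `B` and `A` are trained.

```python
import numpy as np

d_out, d_in, rank = 512, 512, 8

# Frozen pretrained weight and trainable low-rank factors
W = np.random.randn(d_out, d_in) * 0.01
B = np.zeros((d_out, rank))            # zero init, so the adapter starts as a no-op
A = np.random.randn(rank, d_in) * 0.01

def lora_forward(x):
    """y = W x + B A x  -- only A and B receive gradient updates."""
    return W @ x + B @ (A @ x)

full_params = d_out * d_in
lora_params = rank * (d_out + d_in)
print(f"Trainable params: {lora_params:,} vs {full_params:,} "
      f"({100 * lora_params / full_params:.2f}%)")
```

The trainable parameter count scales with `rank * (d_out + d_in)` instead of `d_out * d_in`, which is what makes fine-tuning a 397B-parameter model tractable on modest hardware.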
Example: Fine-tuning for Medical VQA
NVIDIA provides a technical tutorial for fine-tuning Qwen3.5 on radiological datasets for medical Visual Question Answering. This demonstrates how to adapt the model for specialized domains like healthcare.
Conclusion
Qwen3.5 represents an exciting opportunity to use a cutting-edge multimodal AI model at no cost through NVIDIA's developer platform. With its 397B parameter MoE architecture, native vision capabilities, and free API access, it's an excellent choice for:
- Building multimodal AI agents
- Developing visual reasoning applications
- Creating coding assistants with visual context
- Automating UI navigation tasks
Getting started is simple: register for the NVIDIA Developer Program, get your API key, and start building.
If you're building applications that integrate with Qwen3.5 or other AI APIs, Apidog provides the testing infrastructure you need. Test your API integrations, validate responses, manage environment variables, and automate your testing workflows with Apidog's comprehensive platform.
FAQ
Is Qwen3.5 really free to use?
Yes, NVIDIA provides free access to Qwen3.5 GPU-accelerated endpoints through their Developer Program. No credit card is required. Simply register at build.nvidia.com to get your API key.
What makes Qwen3.5 different from other VLMs?
Qwen3.5 was built specifically for autonomous agents, not adapted from a text-only model. Its Mixture of Experts architecture (397B total, 17B active) provides powerful reasoning while remaining computationally efficient. It's particularly good at UI navigation and visual reasoning tasks.
Can I use Qwen3.5 for commercial projects?
Check the current licensing terms on NVIDIA's platform. For production use, consider NVIDIA NIM for deployment or contact NVIDIA about enterprise options.
What's the difference between the free tier and NIM?
The free tier (Developer Program) uses NVIDIA-hosted endpoints. NIM lets you deploy the model yourself using containers, whether on-premises, in your cloud, or hybrid environments. NIM is designed for production-scale deployments.
How do I handle rate limiting?
The free tier has certain rate limits. For higher limits, consider upgrading to production access through NVIDIA NIM or contacting NVIDIA about enterprise options.
Can I fine-tune Qwen3.5?
Yes! NVIDIA NeMo framework provides tools for fine-tuning Qwen3.5 on your domain-specific data. This includes LoRA for memory-efficient customization and multinode support for large-scale training.