How to use the Grok image to video API (step-by-step guide)

Learn how to use the Grok image-to-video API step by step: animate images, poll async results, control resolution and duration, and test your flow with Apidog.

Ashley Innocent

3 April 2026

TL;DR

The Grok image-to-video API uses the grok-imagine-video model to animate a static image into a video clip. You POST your image URL, a prompt, and optional settings to https://api.x.ai/v1/videos/generations. The API returns a request_id right away. You then poll GET /v1/videos/{request_id} until status becomes "done". Duration ranges from 1 to 15 seconds. Pricing starts at $0.05 per second for 480p output.

Introduction

On January 28, 2026, xAI launched the grok-imagine-video model for public API access. Within that first month, the model generated 1.2 billion videos and ranked number one on the Artificial Analysis text-to-video leaderboard. Image-to-video is one of its flagship capabilities: you hand the API a photograph and a descriptive prompt, and it animates the photograph into a short video clip ready to download as an MP4.

That async flow, where you submit a job and then poll for completion, introduces a testing challenge many developers skip over. Your integration isn't finished when the first POST returns 200. It's finished when you've confirmed the polling loop handles "processing", "done", and "failed" states correctly under real network conditions.

Apidog's Test Scenarios solve this directly. You can build a chained sequence: post to /v1/videos/generations, extract the request_id, loop the poll request until status == "done", then assert the video URL is present. Download Apidog free to follow the testing walkthrough later in this guide.

What is the Grok image to video API?

The Grok image-to-video API is part of xAI's video generation product. It lives under the grok-imagine-video model and accepts an image as the starting frame of the output video. The model studies the image content and the text prompt, then generates natural motion to animate the scene.

The API endpoint is:

POST https://api.x.ai/v1/videos/generations

Authentication uses a standard Bearer token:

Authorization: Bearer YOUR_XAI_API_KEY

You get your key from the xAI console. The same API surface also supports text-to-video (omit the image parameter), video extensions, and video edits.

How the image-to-video process works

The image parameter in the request body designates the first frame of the output video. The model does not replace the image. It starts from it. Every pixel in the first frame comes from your source image. The model then predicts how that scene would move forward in time based on your prompt.

For example: you provide a photograph of a mountain lake at sunrise. Your prompt says "gentle ripples spread across the water as morning mist drifts." The first frame of the output video is your photograph. Subsequent frames show the water and mist animating according to the prompt.

This is different from text-to-video, where the model generates the first frame itself. Image-to-video gives you exact control over the starting scene.

You should choose image-to-video when:

  - You have existing product photos, landscapes, or portraits you want to bring to motion.
  - Your brand assets need consistent visual identity in the first frame.
  - You want motion to feel grounded in a real or specific scene.

You should choose text-to-video when:

  - You're exploring visual ideas without a reference image.
  - You want the model to decide the scene composition entirely.
  - Speed of iteration matters more than first-frame precision.

Prerequisites

Before making your first call, you need:

  1. An xAI account at console.x.ai.
  2. An API key from the xAI console. Keep this in an environment variable, not hardcoded.
  3. Python 3.8+ or Node.js 18+ (the code examples in this guide use Python and curl; the raw HTTP calls translate directly to fetch in Node.js).
  4. A publicly accessible image URL, or a base64-encoded image as a data URI.

Set your key as an environment variable:

export XAI_API_KEY="your_key_here"

Install the xAI Python SDK if you want the higher-level client:

pip install xai-sdk

For raw HTTP calls, the only extra dependency is the requests package in Python; Node.js 18+ ships fetch built in.
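
A quick sanity check in Python catches a missing key before your first request fails with a 401 (a minimal sketch; the variable name matches the export above):

```python
import os

def load_api_key(env_var: str = "XAI_API_KEY") -> str:
    """Read the xAI API key from the environment, failing fast if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f'{env_var} is not set; run: export {env_var}="your_key_here"')
    return key
```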

Making your first image-to-video request

Using curl

curl -X POST https://api.x.ai/v1/videos/generations \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "grok-imagine-video",
    "prompt": "Gentle waves move across the surface, morning mist rises slowly",
    "image": {
      "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/24701-nature-natural-beauty.jpg/1280px-24701-nature-natural-beauty.jpg"
    },
    "duration": 6,
    "resolution": "720p",
    "aspect_ratio": "16:9"
  }'

The response comes back immediately with a request_id:

{
  "request_id": "d97415a1-5796-b7ec-379f-4e6819e08fdf"
}

The video is not ready yet. Generation happens asynchronously in xAI's infrastructure. You need to poll for the result.

Using Python (raw requests)

import os
import requests

api_key = os.environ["XAI_API_KEY"]

payload = {
    "model": "grok-imagine-video",
    "prompt": "Gentle waves move across the surface, morning mist rises slowly",
    "image": {
        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/24701-nature-natural-beauty.jpg/1280px-24701-nature-natural-beauty.jpg"
    },
    "duration": 6,
    "resolution": "720p",
    "aspect_ratio": "16:9"
}

headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}

response = requests.post(
    "https://api.x.ai/v1/videos/generations",
    json=payload,
    headers=headers
)

response.raise_for_status()  # surface 4xx/5xx errors before parsing the body
data = response.json()
request_id = data["request_id"]
print(f"Job started: {request_id}")

Using a base64 image

If your image is local or not publicly accessible, encode it as a data URI:

import base64

with open("my_image.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

payload["image"] = {
    "url": f"data:image/jpeg;base64,{encoded}"
}
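
To avoid the MIME-prefix mistakes covered under common errors later, a small helper can build the data URI from the file extension (a sketch handling only JPEG and PNG):

```python
import base64
from pathlib import Path

# Map common extensions to the MIME types the data URI prefix needs.
_MIME_TYPES = {".jpg": "image/jpeg", ".jpeg": "image/jpeg", ".png": "image/png"}

def image_to_data_uri(path: str) -> str:
    """Encode a local image file as a base64 data URI for the image.url field."""
    suffix = Path(path).suffix.lower()
    mime = _MIME_TYPES.get(suffix)
    if mime is None:
        raise ValueError(f"Unsupported image type: {suffix}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```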

Polling for the result

Video generation is asynchronous. The API returns request_id while your video renders on xAI's servers. You must poll the status endpoint:

GET https://api.x.ai/v1/videos/{request_id}

The status field moves through these values:

Status         Meaning
"processing"   Video is still rendering
"done"         Video is ready, URL is in the response
"failed"       Something went wrong

A completed response looks like this:

{
  "status": "done",
  "video": {
    "url": "https://vidgen.x.ai/....mp4",
    "duration": 6
  },
  "progress": 100
}

Full Python polling loop

import time
import requests

def poll_video(request_id: str, api_key: str, interval: int = 5) -> dict:
    url = f"https://api.x.ai/v1/videos/{request_id}"
    headers = {"Authorization": f"Bearer {api_key}"}

    while True:
        response = requests.get(url, headers=headers)
        data = response.json()
        status = data.get("status")

        print(f"Status: {status} | Progress: {data.get('progress', 0)}%")

        if status == "done":
            return data["video"]
        elif status == "failed":
            raise RuntimeError(f"Video generation failed for {request_id}")

        time.sleep(interval)

# Usage
video = poll_video(request_id, api_key)
print(f"Video URL: {video['url']}")
print(f"Duration: {video['duration']}s")

Keep the polling interval at 5 seconds or higher. The API has a rate limit of 60 requests per minute (1 per second). Tight polling on multiple jobs simultaneously can burn that budget quickly.
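
Once polling returns the video URL, you can stream the MP4 to disk. A minimal sketch with requests (the chunk size and output filename are arbitrary choices):

```python
import requests

def download_video(url: str, out_path: str = "output.mp4") -> str:
    """Stream the generated MP4 to a local file and return the path."""
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        with open(out_path, "wb") as f:
            # Write in chunks so large clips never sit fully in memory.
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
    return out_path
```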

Using the xAI Python SDK

The xai-sdk library wraps the async pattern for you. client.video.generate() submits the job and blocks until the video is ready, handling all polling internally:

from xai_sdk import Client
import os

client = Client(api_key=os.environ["XAI_API_KEY"])

video = client.video.generate(
    model="grok-imagine-video",
    prompt="Gentle waves move across the surface, morning mist rises slowly",
    image={"url": "https://example.com/landscape.jpg"},
    duration=6,
    resolution="720p",
    aspect_ratio="16:9"
)

print(f"Video URL: {video.url}")
print(f"Duration: {video.duration}s")

The SDK handles the polling loop, status checks, and error propagation. Use this approach when you want clean application code without managing HTTP polling yourself.

For fine-grained control over polling intervals, retry strategies, or logging, the raw requests approach gives you more flexibility.

Controlling resolution, duration, and aspect ratio

The Grok video API gives you direct control over the output format.

Duration

The duration parameter accepts integers from 1 to 15 seconds. The default is 6.

"duration": 10

Longer videos cost more. A 10-second clip costs roughly 10 times a 1-second clip at the same resolution.

Resolution

Two options are available:

Value    Description
"480p"   Default. Lower cost, faster generation.
"720p"   Higher quality. Costs $0.07/sec vs $0.05/sec.

"resolution": "720p"

Aspect ratio

The aspect_ratio parameter controls the output frame dimensions:

Value    Use case
"16:9"   Default. Widescreen for landscape scenes.
"9:16"   Vertical for mobile or social stories.
"1:1"    Square for Instagram or social thumbnails.
"4:3"    Classic photography or presentation format.
"3:4"    Portrait photography.
"3:2"    Standard photography crop.
"2:3"    Tall portrait format.

When you provide an image, the aspect ratio defaults to match the source image's dimensions. Set it explicitly to override or crop.


Using reference images for style guidance

The reference_images parameter is distinct from the image parameter. Understanding the difference is important.

image: The source photograph that becomes the first frame of the video. The model animates from this starting point.

reference_images: An array of up to 7 images that guide the style, content, or visual context of the generated video. These are not frames in the output. They influence how the model renders motion and appearance.

Use reference_images when you want the output video to adopt visual characteristics from existing assets, but not as the starting frame:

{
  "model": "grok-imagine-video",
  "prompt": "A product rotating slowly on a clean white surface",
  "image": {
    "url": "https://example.com/product-shot.jpg"
  },
  "reference_images": [
    {"url": "https://example.com/brand-style-reference-1.jpg"},
    {"url": "https://example.com/lighting-reference.jpg"}
  ],
  "duration": 6,
  "resolution": "720p"
}

In this example, product-shot.jpg is the first frame. The reference images guide the lighting and stylistic treatment.

You can supply reference images without a first-frame image at all. In that case, the model generates a text-to-video output while drawing style guidance from the references.

Extending and editing videos

The API supports two additional operations beyond initial generation.

Extending a video

POST /v1/videos/extensions takes an existing video and generates additional seconds from where it left off. This is useful for creating longer clips from multiple generation passes, staying within the 15-second per-call limit.

curl -X POST https://api.x.ai/v1/videos/extensions \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "grok-imagine-video",
    "video_id": "your_original_request_id",
    "prompt": "The mist continues to lift as sunlight breaks through",
    "duration": 5
  }'

The response follows the same async pattern: poll GET /v1/videos/{request_id} for the extended clip.
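
In Python, the extension call can be wrapped in a small helper that returns the new request_id for polling (a sketch mirroring the curl example above; the response shape is assumed to match the generation endpoint's):

```python
import requests

# Base URL as a module-level constant so it can be overridden for staging or tests.
API_BASE = "https://api.x.ai/v1"

def extend_video(video_id: str, prompt: str, duration: int, api_key: str) -> str:
    """Submit an extension job for an existing video and return the new request_id."""
    response = requests.post(
        f"{API_BASE}/videos/extensions",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        json={
            "model": "grok-imagine-video",
            "video_id": video_id,
            "prompt": prompt,
            "duration": duration,
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["request_id"]
```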

Editing a video

POST /v1/videos/edits applies prompt-guided modifications to an existing video. You can change specific aspects of the content or motion without regenerating from scratch.

curl -X POST https://api.x.ai/v1/videos/edits \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "grok-imagine-video",
    "video_id": "your_original_request_id",
    "prompt": "Change the sky to a dramatic sunset with deep orange tones"
  }'

Both extensions and edits are asynchronous and use the same polling pattern.

Pricing breakdown: what a 10-second video costs

The xAI video API charges for two components: the input image processing and the output video duration.

Component        Cost
Input image      $0.002 per image
Output at 480p   $0.05 per second
Output at 720p   $0.07 per second

Example: 10-second video at 720p: $0.002 (input image) + 10 × $0.07 (output) = $0.702 total.

Example: 6-second video at 480p (default settings): $0.002 + 6 × $0.05 = $0.302 total.

The input image charge applies each time you submit a generation request, even if you re-use the same image URL. Plan your generation calls accordingly if you're iterating on the same base image.

Text-to-video (no image parameter) omits the $0.002 input charge but otherwise follows the same per-second pricing.
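
The pricing table translates into a small estimator you can use to budget generation runs (a sketch using only the rates listed above):

```python
RATES = {"480p": 0.05, "720p": 0.07}  # dollars per output second
IMAGE_INPUT_COST = 0.002              # dollars per submitted input image

def estimate_cost(duration: int, resolution: str = "480p",
                  with_image: bool = True) -> float:
    """Estimate the cost of one generation call from the published rates."""
    per_second = RATES[resolution]
    image_cost = IMAGE_INPUT_COST if with_image else 0.0
    return round(duration * per_second + image_cost, 3)
```

For example, estimate_cost(10, "720p") reproduces the $0.702 figure from the worked example above.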

How to test your Grok video API integration with Apidog

The async pattern creates a testing challenge that simple one-shot request tests can't cover. You need to verify that:

  1. The generation request returns a request_id.
  2. The polling request correctly handles "processing" status while waiting.
  3. The final response has status == "done" and a non-empty video URL.

Apidog's Test Scenarios chain these steps together in one automated flow. Here's how to build it:

Step 1: Create a new Test Scenario

In Apidog, open the Tests module and click the + button to create a new scenario. Name it "Grok image-to-video async flow."

Step 2: Add the generation request

Add a custom POST request step:

{
  "model": "grok-imagine-video",
  "prompt": "Gentle mist rises from the water as light filters through the trees",
  "image": {
    "url": "https://example.com/your-test-image.jpg"
  },
  "duration": 6,
  "resolution": "480p"
}

Step 3: Extract the request_id

After the POST step, add an Extract Variable processor. Configure it to read the JSONPath $.request_id from the response body into a variable named video_request_id.

Apidog stores the extracted value in {{video_request_id}} for use in later steps.

Step 4: Build the polling loop

Add a For loop processor. Inside the loop, add the poll request: a GET step to https://api.x.ai/v1/videos/{{video_request_id}} with the same Authorization header.

Add an Extract Variable processor inside the loop to capture the current status: read the JSONPath $.status into a variable named video_status.

Add a Wait processor (5000ms) after the status extraction to avoid hitting the rate limit.

Set the loop's Break If condition: {{video_status}} == "done".

Step 5: Assert the video URL

After the For loop, add a final GET step to the same poll endpoint, then an Assertion processor that checks $.video.url exists and is not empty.

This assertion confirms the video URL is present before your test passes.

For a deeper look at how to test async APIs with Apidog, including more complex polling patterns and CI/CD integration, see Apidog's dedicated guide on the topic.

Running the scenario

Click Run in the test scenario view. Apidog executes the POST, extracts the request_id, loops the poll until status == "done", and then evaluates your assertions. The test report shows each step's status and timing.

You can plug this scenario into your CI/CD pipeline with the Apidog CLI:

apidog run --scenario grok-video-async-flow --env production

Common errors and fixes

401 Unauthorized

Your API key is missing or invalid. Check the Authorization header format: Bearer YOUR_XAI_API_KEY. Confirm the key is active in the xAI console.

422 Unprocessable Entity

The request body is malformed. Common causes: the model field is missing, the prompt is empty, or the image.url is not accessible. Test the image URL in a browser before using it.

Image URL not accessible

xAI's servers must be able to fetch the image URL at generation time. Private URLs, localhost addresses, or URLs behind authentication will fail. Use a public CDN or a base64 data URI instead.

Status stays at "processing" indefinitely

A generation can take anywhere from 30 seconds to several minutes depending on resolution and duration. If status stays in "processing" beyond 10 minutes, the job may have stalled. Submit a new request. The xAI API does not currently expose a timeout signal separately from "failed".
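
To guard against a stalled job in code, wrap the polling loop in a hard deadline (a sketch based on the raw-requests pattern earlier; the 10-minute default mirrors the guidance above, and the base_url parameter exists only to make the function testable):

```python
import time
import requests

def poll_with_timeout(request_id: str, api_key: str, interval: float = 5,
                      timeout: float = 600,
                      base_url: str = "https://api.x.ai/v1") -> dict:
    """Poll the status endpoint, giving up after `timeout` seconds total."""
    headers = {"Authorization": f"Bearer {api_key}"}
    deadline = time.monotonic() + timeout

    while time.monotonic() < deadline:
        data = requests.get(f"{base_url}/videos/{request_id}",
                            headers=headers, timeout=30).json()
        status = data.get("status")
        if status == "done":
            return data["video"]
        if status == "failed":
            raise RuntimeError(f"Video generation failed for {request_id}")
        time.sleep(interval)  # stay well under the rate limit

    raise TimeoutError(f"Job {request_id} still processing after {timeout}s")
```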

Rate limit errors (429)

The API allows 60 requests per minute and 1 per second. If you're polling multiple jobs concurrently, stagger your requests. Add a time.sleep(1) between poll calls at minimum.
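
For bursty polling across several jobs, exponential backoff on 429 responses is safer than a fixed sleep (a sketch; the retry count and delays are arbitrary choices):

```python
import time
import requests

def get_with_backoff(url: str, headers: dict, max_retries: int = 5,
                     initial_delay: float = 1.0) -> requests.Response:
    """GET with exponential backoff whenever the API answers 429."""
    delay = initial_delay
    for _ in range(max_retries):
        response = requests.get(url, headers=headers, timeout=30)
        if response.status_code != 429:
            return response
        time.sleep(delay)  # wait before retrying
        delay *= 2         # 1s, 2s, 4s, ...
    raise RuntimeError(f"Still rate-limited after {max_retries} retries: {url}")
```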

Base64 upload rejected

Make sure your data URI includes the correct MIME type prefix. Use data:image/jpeg;base64, for JPEG files and data:image/png;base64, for PNG files.

Aspect ratio mismatch

When you set an explicit aspect_ratio that differs significantly from your source image's proportions, the model may crop or letterbox. Match the aspect ratio to your source image for best results.

Conclusion

The Grok image-to-video API gives you a direct path from a static photograph to a short animated clip. You POST the image and prompt, receive a request_id, poll until done, and download the MP4. The grok-imagine-video model ranked at the top of the Artificial Analysis leaderboard in January 2026. Over a billion videos were generated in that single month. That scale reflects how capable the underlying model is.

The async polling pattern is where most integrations go wrong. A proper test in Apidog's Test Scenarios covers the Extract Variable step, the polling loop with break condition, and a final URL assertion. That combination catches issues before they reach production.

Start building your integration with Apidog free. No credit card required.

FAQ

What model name do I use for the Grok image-to-video API?

The model name is grok-imagine-video. Pass it as the model field in your POST request body.

What's the difference between the image and reference_images parameters?

The image parameter sets the first frame of the output video. The model animates forward from that starting image. The reference_images array provides style and content guidance without being used as a frame. You can combine both in the same request.

How long does video generation take?

Generation time varies by duration and resolution. A 6-second 480p video typically takes 1 to 3 minutes. A 15-second 720p video may take 4 to 8 minutes. Poll every 5 seconds to check status without burning your rate limit.

Can I use a local file as the source image?

Yes. Encode your local file as a base64 data URI: data:image/jpeg;base64,{encoded_bytes}. Pass that string as the url value inside the image object.

What happens if I don't specify aspect_ratio?

When you provide an image parameter, the aspect ratio defaults to match the source image's native proportions. When generating text-to-video without an image, the default is 16:9.

How much does a 10-second 720p video cost?

The input image costs $0.002. The output costs 10 × $0.07 = $0.70. Total: $0.702 per video.

What are the rate limits?

The API allows 60 requests per minute and 1 request per second. This covers both the generation POST and polling GET requests combined.

Can I extend a video beyond 15 seconds?

Yes, using the POST /v1/videos/extensions endpoint. You generate an initial clip up to 15 seconds, then extend it with additional generation passes. Each extension also follows the async polling pattern.
