TL;DR
Reference video in Seedance 2.0 lets you anchor motion — camera moves, character choreography, timing — to an existing clip rather than describing everything in text. Use 3-8 second reference clips: single shot, no jump cuts, clean H.264 compression. Keep text prompts short (three adjectives or fewer for style). Text describes what the reference can’t show; the reference handles the motion. If your output drifts or ignores the reference, follow the troubleshooting ladder in this guide.
Introduction
Text-only video generation works well for loose concepts: atmospheric scenes, exploratory directions, varied visual approaches. When the motion is already decided — the specific timing of a gesture, a camera push-in, a walk cycle — text descriptions are imprecise.
Reference video closes that gap. You provide a clip that shows what you want, and Seedance 2.0 reinterprets the motion into the new scene you’ve described.
This guide covers when reference video helps versus when text alone is better, how to prepare effective reference clips, and how to fix the most common issues.
When to use reference video
Reference video works best for:
- Micro-gestures: Precise timing like “a thumb tap” or “a nod that lands on beat three.” Text can’t capture the exact timing; a reference clip can.
- Choreography: Consistent motion patterns like walks with a specific cadence or a repeated physical routine.
- Camera moves: Subtle operations like slow push-ins, controlled orbits, or specific framing changes. These are hard to describe precisely.
- Beat-matching: Synchronizing actions to audio cues. The model can read timing from a reference clip better than from a text description.
Text-only is better for:
- Loose concepts or atmospheric pieces where variety is good
- Exploring different visual directions for the same content
- When you don’t have an appropriate reference clip and the motion is simple enough to describe
Preparing reference clips
A good reference clip has these characteristics:
Length: 3-8 seconds. Shorter clips give the model too little information. Longer clips risk reducing model confidence and producing inconsistent output.
Continuity: No edits, no jump cuts, no cuts of any kind. A single continuous shot from start to finish.
Compression: Clean H.264 without macro-blocking artifacts. Compressed or re-encoded clips with visible artifacting produce worse results.
Subject clarity: Plain backgrounds and steady lighting help the model read the subject’s silhouette and movement clearly. Busy backgrounds compete with the subject for the model’s attention.
Checklist before uploading a reference clip:
- [ ] 3-8 seconds long
- [ ] Single continuous shot, no cuts
- [ ] Clean compression, no visible blocking
- [ ] Subject visible against background
- [ ] Steady lighting throughout
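The mechanical parts of this checklist can be automated as a quick pre-flight check. A minimal sketch, assuming you have already extracted the clip's properties (for example with ffprobe) into a plain dict; the function name and field names are illustrative, not part of any API:

```python
def preflight(meta):
    """Return a list of checklist failures for a reference clip.

    `meta` is a dict of clip properties, e.g. extracted with ffprobe:
    duration (seconds), width/height (pixels), cuts (detected shot changes).
    """
    problems = []
    if not 3 <= meta["duration"] <= 8:
        problems.append("duration outside 3-8 s")
    if meta.get("cuts", 0) > 0:
        problems.append("clip contains cuts; use a single continuous shot")
    if min(meta["width"], meta["height"]) < 720:
        problems.append("resolution below 720p")
    return problems

# A 12-second 480p clip fails two checks:
print(preflight({"duration": 12, "width": 854, "height": 480}))
# → ['duration outside 3-8 s', 'resolution below 720p']
```

Subject clarity and steady lighting still need a human eye; this only catches the measurable failures before you spend a generation credit.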
Prompting with a reference clip
When combining a reference clip with a text prompt, the text should complement rather than repeat the reference.
Focus the text on what the reference doesn’t show:
The reference handles motion and timing. Use text for:
- Style descriptors (lighting, color palette, visual tone)
- Subject identity (who or what appears in the new scene)
- Camera context (if not already clear from the reference)
- One or two constraints
Optimal prompt structure:
Style: [2-3 descriptors for lighting and palette]
Subject: [identity description using stable visible features]
Camera: [if different from reference]
Reference intent: "Respect motion from reference: reinterpret texture and color."
Must not: [one specific constraint if needed]
Example:
Reference clip: a person walking with a specific measured pace
Text prompt:
Style: warm afternoon light, golden tones
Subject: a man in a gray suit, early 40s, confident posture
Respect motion from reference: reinterpret texture and color.
Must not: change walking pace
The three-adjective limit:
Using more than three style descriptors creates conflicting instructions: the model tries to incorporate all of them and often satisfies none well. Pick the three most important descriptors and drop the rest.
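The prompt structure above lends itself to a small builder that enforces the three-descriptor limit. A sketch, with the field names taken from the template in this guide (the function itself is illustrative):

```python
def build_prompt(style, subject, camera=None, must_not=None):
    """Assemble a reference-video prompt, capping style descriptors at three."""
    if len(style) > 3:
        raise ValueError("use at most three style descriptors")
    lines = [
        "Style: " + ", ".join(style),
        "Subject: " + subject,
    ]
    if camera:  # only when it differs from the reference
        lines.append("Camera: " + camera)
    lines.append("Respect motion from reference: reinterpret texture and color.")
    if must_not:  # one specific constraint
        lines.append("Must not: " + must_not)
    return "\n".join(lines)

print(build_prompt(
    ["warm afternoon light", "golden tones"],
    "a man in a gray suit, early 40s, confident posture",
    must_not="change walking pace",
))
```

Raising on a fourth descriptor is a deliberate choice: it forces the trimming decision at authoring time rather than letting the model arbitrate conflicting styles.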
API usage via WaveSpeedAI
Seedance 2.0 is accessible via WaveSpeedAI’s API. The reference video endpoint:
POST https://api.wavespeed.ai/api/v2/seedance/v2/image-to-video
Authorization: Bearer {{WAVESPEED_API_KEY}}
Content-Type: application/json
{
"prompt": "Warm afternoon light, golden tones. A man in a gray suit walks forward. Respect motion from reference.",
"image_url": "https://example.com/subject-reference.jpg",
"reference_video_url": "https://example.com/motion-reference.mp4",
"duration": 5,
"aspect_ratio": "16:9"
}
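The same request can be issued from Python with only the standard library. A minimal sketch: the endpoint path and field names mirror the JSON above, the example URLs are placeholders, and the send is guarded so nothing is posted unless a WAVESPEED_API_KEY is present in the environment:

```python
import json
import os
import urllib.request

API_URL = "https://api.wavespeed.ai/api/v2/seedance/v2/image-to-video"

def build_request(prompt, image_url, reference_video_url,
                  duration=5, aspect_ratio="16:9"):
    """Build the POST body for a reference-video generation."""
    return {
        "prompt": prompt,
        "image_url": image_url,
        "reference_video_url": reference_video_url,
        "duration": duration,
        "aspect_ratio": aspect_ratio,
    }

body = build_request(
    "Warm afternoon light, golden tones. A man in a gray suit walks forward. "
    "Respect motion from reference.",
    "https://example.com/subject-reference.jpg",
    "https://example.com/motion-reference.mp4",
)

api_key = os.environ.get("WAVESPEED_API_KEY")
if api_key:  # only send when a key is configured
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode(),
        headers={"Authorization": "Bearer " + api_key,
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp))
```

Keeping the payload builder separate from the HTTP call makes it easy to validate request bodies in tests without touching the network.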
Testing with Apidog
Set up a test collection before building your integration.
Environment setup:
Create an Apidog environment with WAVESPEED_API_KEY as a Secret variable.
Two-request flow:
Request 1 starts the generation. Request 2 polls for completion.
Request 1:
POST https://api.wavespeed.ai/api/v2/seedance/v2/image-to-video
Authorization: Bearer {{WAVESPEED_API_KEY}}
Content-Type: application/json
{
"prompt": "{{motion_prompt}}",
"image_url": "{{subject_image}}",
"reference_video_url": "{{reference_clip}}",
"duration": {{duration}},
"aspect_ratio": "16:9"
}
In the Tests tab, extract the job ID for polling:
pm.environment.set("job_id", pm.response.json().id);
Request 2:
GET https://api.wavespeed.ai/api/v2/predictions/{{job_id}}
Authorization: Bearer {{WAVESPEED_API_KEY}}
Assert that the response body field status equals "completed".
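In code, this two-request flow reduces to a small polling loop. A sketch with the status fetch injected as a callable (for example, a function that GETs /api/v2/predictions/{job_id} and returns the status field); note the "failed" status string is an assumption, since this guide only documents "completed":

```python
import time

def wait_for_completion(fetch_status, interval=2.0, timeout=300.0):
    """Poll a status callable until the job completes, fails, or times out.

    `fetch_status` returns the prediction's current status string.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status == "completed":
            return status
        if status == "failed":  # assumed failure status; verify against the API
            raise RuntimeError("generation failed")
        time.sleep(interval)
    raise TimeoutError("generation did not finish in time")

# Simulate a job that completes on the third poll:
statuses = iter(["queued", "processing", "completed"])
print(wait_for_completion(lambda: next(statuses), interval=0.0))
# → completed
```

Injecting the fetcher keeps the loop testable offline and lets you reuse it across endpoints.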
Troubleshooting guide
Motion jitter
- Trim the clip to remove unintended micro-adjustments at the edges
- Reduce visual noise in the source footage
- Stabilize during capture rather than adding stabilization in post
- Shorten reference length to 3-5 seconds
- Simplify the text prompt (remove descriptors that might conflict)
Reference ignored (output doesn’t follow the reference clip)
- Exaggerate the move slightly and center the subject in frame
- Include only one type of motion per clip (don’t mix camera moves with character movement)
- Explicitly call out the move in the text: “copy camera movement from reference”
- Extract the cleanest 2-3 second span from the reference clip
- Use reference marks (tape on a surface) for parallax clarity in camera move references
Style drift (output doesn’t match intended aesthetic)
- Reduce style descriptors to two or three
- Add a single static reference frame alongside the video reference
- Simplify patterns and busy details in the reference clip
- Keep settings consistent across renders
- Lock the motion first (get the motion right before iterating on appearance)
Rights and consent
Reference video with identifiable people requires consent. Practical requirements:
- Written consent from anyone whose motion or likeness appears in the reference clip
- Guardian signatures for minors
- Verify that filming locations permit commercial use
- Exclude prominent logos or third-party marks from the reference
- Keep records: dates, consent notes, clip versions
These apply to both the reference clip and any identifiable subjects who appear in the generated output.
FAQ
Does the reference video replace the image reference?
They serve different purposes. The image reference anchors subject appearance (who appears in the scene). The video reference anchors motion (how subjects and camera move). Use both when you want to control appearance and motion independently.
How long should the reference clip be?
3-8 seconds. Too short: the model has insufficient motion information. Too long: model confidence drops and output becomes inconsistent.
Can I use a reference clip from a different genre?
Yes. You can use a reference clip of a person walking from one context and generate a robot character walking with that same gait. The motion transfers; the visual content is replaced by your text description and subject reference.
What resolution should the reference clip be?
720p or higher. Very low-resolution reference clips provide less motion information and produce lower-quality transfers.
Can I generate multiple clips from the same reference?
Yes. The same reference clip can drive multiple generations with different prompts. This is useful for generating multiple scene variations with consistent motion.