How to Use Fish Audio S2 API: A Complete Guide with Apidog

Fish Audio S2 API is a production-grade text-to-speech REST API powered by a 4-billion-parameter model trained on 10 million hours of audio. It supports voice cloning, streaming, and roughly 50 languages. To use the Fish Audio S2 API efficiently, including sending requests, managing references, and running unit tests, you can use Apidog to explore, document, and validate every endpoint.

Introduction

AI-generated voice has crossed a threshold. Modern TTS models no longer sound like robots: they whisper, laugh, and shift tone mid-sentence. The Fish Audio S2 API sits at the frontier of this shift: a 4B-parameter model trained on over 10 million hours of multilingual audio, capable of producing speech that is difficult to tell apart from a human recording.

Whether you're building a podcast automation tool, an interactive voice assistant, or a real-time dubbing pipeline, integrating the Fish Audio S2 API into your stack requires more than a single POST request. You need to understand authentication, reference audio management, streaming behavior, and, critically, how to write reliable unit tests so your integration doesn't break silently in production.

Before your first Fish Audio S2 API call, download Apidog for free. Visually test emotion tags, streaming chunks, voice cloning payloads, and binary audio responses in seconds, with no code needed. Mock, validate, and listen inline so your TTS integration works from day one.

What Is Fish Audio S2 API?

The Fish Audio S2 API is the HTTP interface to Fish Speech S2-Pro, an open-source TTS system built around a Dual-Autoregressive (Dual-AR) architecture. The model separates semantic generation (4B parameters, slow AR along the time axis) from residual codebook generation (400M parameters, fast AR along the depth axis), enabling high-quality synthesis at a real-time factor of 0.195 on a single NVIDIA H200.
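A real-time factor is compute time divided by audio duration, so values below 1 mean faster-than-real-time synthesis. The short sketch below works through that arithmetic for the 0.195 figure quoted above; the function names are illustrative and not part of Fish Speech.

```python
def audio_seconds_per_compute_second(rtf: float) -> float:
    """Invert a real-time factor (RTF = compute time / audio duration)."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


def compute_seconds_for(audio_seconds: float, rtf: float) -> float:
    """Estimate the compute time needed to synthesize a clip of a given length."""
    return audio_seconds * rtf


if __name__ == "__main__":
    rtf = 0.195  # figure quoted above for a single NVIDIA H200
    print(f"{audio_seconds_per_compute_second(rtf):.2f} s of audio per s of compute")
    print(f"{compute_seconds_for(60, rtf):.1f} s of compute for a 60 s clip")
```

These are back-of-the-envelope estimates; actual throughput depends on hardware, batch load, and the flags used at server startup.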

Key Fish Audio S2 API capabilities:

| Feature | Detail |
| --- | --- |
| Languages | ~50 (English, Chinese, Japanese, Korean, Arabic, French, German, and more) |
| Voice cloning | 10–30 second reference audio, no fine-tuning required |
| Inline emotion control | Natural-language tags: `[laugh]`, `[whispers]`, `[super happy]` |
| Multi-speaker generation | Native `<\|speaker:i\|>` token support |
| Streaming | Real-time chunked audio via `"streaming": true` |
| Output formats | WAV, MP3, PCM |
| Authentication | Bearer token (`Authorization: Bearer YOUR_API_KEY`) |

The Fish Audio S2 API base URL after local deployment is http://127.0.0.1:8080. All endpoints fall under the /v1/ namespace.

Getting Started with Fish Audio S2 API and Apidog

Prerequisites for Fish Audio S2 API

Before making your first Fish Audio S2 API call, you need two things running: a deployed Fish Speech S2-Pro server and an API client capable of handling binary audio responses.

Start the Fish Audio S2 API server:

python tools/api_server.py \
  --llama-checkpoint-path checkpoints/s2-pro \
  --decoder-checkpoint-path checkpoints/s2-pro/codec.pth \
  --listen 0.0.0.0:8080 \
  --compile \
  --half \
  --api-key YOUR_API_KEY \
  --workers 4

The --compile flag activates torch.compile optimization, which cuts inference latency by roughly 10× but adds a one-time warmup cost on first launch. The --half flag enables FP16 for reduced GPU memory usage.

Once the server is up, verify it with a health check:

curl http://127.0.0.1:8080/v1/health
# {"status":"ok"}
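Because --compile adds a warmup cost, the server may not answer right after launch. The following is a minimal sketch, using only the Python standard library, that polls the health endpoint until it reports ok. The helper names, retry count, and delay are illustrative assumptions, and it assumes the health route needs no token, as in the curl call above.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "http://127.0.0.1:8080"


def server_is_healthy(base_url: str = BASE_URL, timeout: float = 2.0) -> bool:
    """Return True if GET /v1/health answers with {"status": "ok"}."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/health", timeout=timeout) as resp:
            return json.load(resp).get("status") == "ok"
    except (OSError, ValueError, AttributeError):
        # Connection refused, timeout, or a body that is not the expected JSON object
        return False


def wait_until_healthy(
    check: Callable[[], bool] = server_is_healthy,
    attempts: int = 30,
    delay: float = 2.0,
) -> bool:
    """Poll `check` until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # give the server time to finish warming up
    return False
```

Calling `wait_until_healthy()` at the start of a test session avoids spurious failures caused by hitting the server before it is ready.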

Setting Up the Fish Audio S2 API in Apidog

Download Apidog for free and create a new HTTP project. Add the base URL http://127.0.0.1:8080 under Environments. Then set a global header:

Authorization: Bearer YOUR_API_KEY

Apidog stores this at the environment level, so every Fish Audio S2 API request you send inherits the token automatically, with no manual header pasting per request. This is especially useful when you have multiple Fish Audio S2 API environments (local dev, staging, production) to switch between.
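The same environment-level idea can be mirrored in client code. The sketch below is one possible approach using only the standard library: the staging and production URLs and the `FISH_API_KEY` variable name are hypothetical placeholders, and only the local URL comes from this guide.

```python
import json
import os
import urllib.request
from typing import Optional

# Hypothetical environment table: only the local URL comes from this guide.
ENVIRONMENTS = {
    "local": "http://127.0.0.1:8080",
    "staging": "https://tts-staging.example.com",
    "production": "https://tts.example.com",
}


def build_request(
    env: str,
    path: str,
    payload: Optional[dict] = None,
    api_key: Optional[str] = None,
) -> urllib.request.Request:
    """Create a request carrying the Bearer token for the chosen environment."""
    # FISH_API_KEY is an assumed variable name, not something the server defines.
    key = api_key or os.environ.get("FISH_API_KEY", "YOUR_API_KEY")
    headers = {"Authorization": f"Bearer {key}"}
    data = None
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        ENVIRONMENTS[env] + path,
        data=data,
        headers=headers,
        method="POST" if data is not None else "GET",
    )
```

Passing the result to `urllib.request.urlopen` sends the request; switching environments then only requires changing the `env` argument.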

Making Your First Fish Audio S2 API Request in Apidog

Testing the Fish Audio S2 API Text-to-Speech Endpoint

The primary Fish Audio S2 API endpoint is POST /v1/tts. In Apidog, create a new request with this URL, set method to POST, and use the following JSON body:

{
  "text": "Hello! This is a test of the Fish Audio S2 API.",
  "format": "wav",
  "streaming": false,
  "temperature": 0.8,
  "top_p": 0.8,
  "repetition_penalty": 1.1,
  "max_new_tokens": 1024
}

Full Fish Audio S2 API TTS request schema:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `text` | string | required | Text to synthesize |
| `format` | string | `"wav"` | Output audio format: `wav`, `mp3`, `pcm` |
| `chunk_length` | int | 200 | Synthesis chunk size (100–300) |
| `seed` | int | null | Fix seed for reproducible output |
| `streaming` | bool | false | Return audio in real-time chunks |
| `max_new_tokens` | int | 1024 | Maximum tokens to generate |
| `temperature` | float | 0.8 | Sampling randomness (0.1–1.0) |
| `top_p` | float | 0.8 | Nucleus sampling threshold (0.1–1.0) |
| `repetition_penalty` | float | 1.1 | Penalize repeated sequences (0.9–2.0) |
| `use_memory_cache` | string | `"off"` | Cache reference encoding in memory |
Hit Send in Apidog. The Fish Audio S2 API returns raw audio bytes. Apidog automatically detects the audio/wav response and renders an inline audio player, so you can listen to the generated speech directly in the interface without writing a single line of client code.
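Outside Apidog, the returned bytes can be sanity-checked in code. The sketch below uses Python's built-in `wave` module to read basic properties from a WAV response; it assumes a non-streaming, PCM-encoded WAV, and `describe_wav` is an illustrative helper name.

```python
import io
import wave


def describe_wav(audio_bytes: bytes) -> dict:
    """Parse WAV bytes and return basic properties; raises wave.Error if the data is not a PCM WAV."""
    with wave.open(io.BytesIO(audio_bytes), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_seconds": frames / rate,
        }
```

Given a response body from POST /v1/tts with `"format": "wav"`, `describe_wav(response.content)` offers a stronger check than byte length alone, since it fails on truncated or non-audio payloads.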

Voice Cloning with Fish Audio S2 API

Uploading a Reference Audio to Fish Audio S2 API via Apidog

The Fish Audio S2 API supports zero-shot voice cloning through the references field in the TTS request. You pass a base64-encoded audio clip alongside its transcript, and the model clones that voice for the output.

First, upload a named reference using POST /v1/references/add:

{
  "id": "my-voice-clone",
  "text": "This is the reference transcription matching the audio.",
  "audio": "<base64-encoded-wav-bytes>"
}

In Apidog, use the Binary body type to upload the audio file directly, or switch to Form Data to pass the file and text fields together. The Fish Audio S2 API returns:

{
  "success": true,
  "message": "Reference added successfully",
  "reference_id": "my-voice-clone"
}
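The JSON body for POST /v1/references/add shown earlier can be assembled from a file on disk. This is a minimal sketch that follows the `id`, `text`, and `audio` fields used in this guide; `build_reference_payload` is a hypothetical helper name.

```python
import base64
from pathlib import Path


def build_reference_payload(reference_id: str, transcript: str, audio_path: str) -> dict:
    """Return the JSON body for POST /v1/references/add with base64-encoded audio."""
    audio_bytes = Path(audio_path).read_bytes()
    return {
        "id": reference_id,
        "text": transcript,  # should match what is spoken in the clip
        "audio": base64.b64encode(audio_bytes).decode("ascii"),
    }
```

Base64 inflates the payload by about a third, so a 10–30 second clip stays small enough to send as ordinary JSON.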

Now reference it in your TTS calls using reference_id:

{
  "text": "This sentence will be spoken in the cloned voice.",
  "reference_id": "my-voice-clone",
  "format": "mp3"
}

Apidog's Reference Management panel (under Collections) lets you save this request as a reusable template, so you can swap voices by simply changing the reference_id value, which is useful when testing multiple cloned voices against the same script.

How to Unit Test Fish Audio S2 API Integrations

Why Unit Tests Matter for Fish Audio S2 API

A Fish Audio S2 API integration has several failure modes that are invisible without automated unit tests: a reference ID that no longer exists, a temperature value out of range, a streaming response that's consumed incorrectly, or an audio format mismatch. Unit tests catch these regressions before they reach users.

Writing Unit Tests for Fish Audio S2 API with Python

Here's a Python unit test suite covering the core Fish Audio S2 API flows using pytest and httpx:

import pytest
import httpx
import base64

BASE_URL = "http://127.0.0.1:8080"
API_KEY = "YOUR_API_KEY"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}


class TestFishAudioS2API:
    """Unit tests for Fish Audio S2 API endpoints."""

    def test_health_check(self):
        """Unit test: Fish Audio S2 API health endpoint returns ok."""
        response = httpx.get(f"{BASE_URL}/v1/health", headers=HEADERS)
        assert response.status_code == 200
        assert response.json()["status"] == "ok"

    def test_tts_basic_request(self):
        """Unit test: Fish Audio S2 API TTS returns binary audio."""
        payload = {
            "text": "Unit test: verifying Fish Audio S2 API TTS output.",
            "format": "wav",
            "seed": 42,  # Fixed seed for deterministic unit test output
        }
        response = httpx.post(
            f"{BASE_URL}/v1/tts",
            json=payload,
            headers=HEADERS,
            timeout=60,
        )
        assert response.status_code == 200
        assert response.headers["content-type"] == "audio/wav"
        assert len(response.content) > 1000  # Minimum viable audio size

    def test_tts_invalid_temperature_raises_error(self):
        """Unit test: Fish Audio S2 API rejects out-of-range temperature."""
        payload = {"text": "test", "temperature": 99.0}
        response = httpx.post(
            f"{BASE_URL}/v1/tts",
            json=payload,
            headers=HEADERS,
            timeout=30,
        )
        assert response.status_code == 422  # Validation error expected

    def test_reference_add_and_list(self):
        """Unit test: Fish Audio S2 API reference management endpoints."""
        # Encode the reference clip as base64 for the JSON body
        with open("test_reference.wav", "rb") as f:
            audio_b64 = base64.b64encode(f.read()).decode()

        add_response = httpx.post(
            f"{BASE_URL}/v1/references/add",
            json={
                "id": "unit-test-voice",
                "text": "This is a unit test reference audio.",
                "audio": audio_b64,
            },
            headers=HEADERS,
            timeout=30,
        )
        try:
            assert add_response.json()["success"] is True

            # Verify reference appears in list
            list_response = httpx.get(
                f"{BASE_URL}/v1/references/list", headers=HEADERS
            )
            assert "unit-test-voice" in list_response.json()["reference_ids"]
        finally:
            # Cleanup: delete the reference even if an assertion above failed
            httpx.request(
                "DELETE",
                f"{BASE_URL}/v1/references/delete",
                json={"reference_id": "unit-test-voice"},
                headers=HEADERS,
            )

Run the unit test suite with:

pytest test_fish_audio_s2_api.py -v
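The reference test above opens test_reference.wav, which has to exist before the suite runs. If no recording is at hand, a placeholder file can be generated as sketched below. This is a stand-in only: it writes a plain sine tone, not speech, and whether the server accepts or usefully clones such a clip is an assumption, so a real 10–30 second recording is the better fixture. The function name and default values are illustrative.

```python
import math
import struct
import wave


def write_test_tone(path="test_reference.wav", seconds=10.0, sample_rate=16000, freq=220.0):
    """Write a mono 16-bit sine tone to `path`; a placeholder fixture, not real speech."""
    n_frames = int(seconds * sample_rate)
    frames = bytearray()
    for i in range(n_frames):
        # 30% of full scale keeps the tone well below clipping
        sample = int(0.3 * 32767 * math.sin(2 * math.pi * freq * i / sample_rate))
        frames += struct.pack("<h", sample)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return path
```

Calling `write_test_tone()` once before running pytest creates the file in the current directory.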

Running Fish Audio S2 API Unit Tests with Apidog

Beyond Python unit tests, Apidog has a built-in Test Scenarios (automated testing) feature that runs the same Fish Audio S2 API checks without a local Python environment. In Apidog:

1. Open your Fish Audio S2 API collection.
2. Click Test Scenarios → New Scenario.
3. Add requests: health check → TTS request → reference add → reference list.
4. In the Assertions tab for the TTS request, add:
   - Response status = 200
   - Response header content-type contains audio
   - Response time < 30000ms
5. Click Run to execute the full unit test sequence.

Apidog generates a pass/fail report for each Fish Audio S2 API assertion, with response timings and diff views. You can export this report or schedule it to run on a CI trigger, which makes Apidog a unit test runner for your Fish Audio S2 API without any test framework boilerplate.

Advanced Fish Audio S2 API Features

Streaming Audio from Fish Audio S2 API

For real-time playback applications, the Fish Audio S2 API supports chunked streaming. Set "streaming": true in your request body:

import httpx

with httpx.stream(
    "POST",
    "http://127.0.0.1:8080/v1/tts",
    json={
        "text": "Streaming audio from the Fish Audio S2 API in real time.",
        "format": "wav",
        "streaming": True,
    },
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    timeout=None,
) as response:
    response.raise_for_status()  # Fail early instead of saving an error body as audio
    with open("streamed_output.wav", "wb") as audio_file:
        for chunk in response.iter_bytes(chunk_size=4096):
            audio_file.write(chunk)

The Fish Audio S2 API starts returning audio bytes before the full synthesis completes; time-to-first-audio is approximately 100ms. This makes it viable for live voice applications where a user expects immediate feedback.
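To check time-to-first-audio in your own setup, the chunk iterator can be timed as it is consumed. The helper below is a sketch that works with any iterable of byte chunks; `measure_stream` and its `started_at` parameter are illustrative names, and the measured latency will include network and client overhead.

```python
import time
from typing import Iterable, Optional, Tuple


def measure_stream(
    chunks: Iterable[bytes], started_at: Optional[float] = None
) -> Tuple[Optional[float], float, int]:
    """Consume chunks; return (seconds to first non-empty chunk, total seconds, total bytes)."""
    start = time.perf_counter() if started_at is None else started_at
    first_chunk_at = None
    total_bytes = 0
    for chunk in chunks:
        if not chunk:
            continue  # skip empty chunks so they do not count as first audio
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - start
        total_bytes += len(chunk)
    return first_chunk_at, time.perf_counter() - start, total_bytes
```

With the streaming request above, record `time.perf_counter()` just before opening the stream and pass it as `started_at` along with `response.iter_bytes()`, so the measurement covers the request itself. Note that this helper discards the audio; it is meant for timing runs, not for saving output.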

Inline Emotion Control via Fish Audio S2 API

The Fish Audio S2 API accepts natural-language emotion tags directly in the text field:

{
  "text": "[whispers] The secret is hidden here. [super happy] I found it!",
  "format": "wav"
}

No special parameter is needed: the model interprets the bracketed tags as prosody instructions. Valid tag examples from the Fish Speech source: [laugh], [cough], [pitch up], [professional broadcast tone], [whisper in small voice].
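Since the tags live inside the text itself, it can help to separate them from the spoken words, for example when logging requests or producing subtitles. The sketch below treats any bracketed span as a tag, which is a simplifying assumption; the helper names are illustrative.

```python
import re

# Matches a bracketed span with no nested brackets, e.g. "[super happy]"
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def extract_tags(text: str) -> list:
    """Return the bracketed tags in order of appearance, without the brackets."""
    return [tag.strip() for tag in TAG_PATTERN.findall(text)]


def strip_tags(text: str) -> str:
    """Remove the tags and collapse leftover whitespace."""
    return re.sub(r"\s+", " ", TAG_PATTERN.sub("", text)).strip()
```

Text that uses square brackets for other purposes would be misread by this pattern, so it is best applied only to scripts written with these tags in mind.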

Conclusion

The Fish Audio S2 API exposes a genuinely production-grade TTS engine through a clean REST interface. From basic synthesis to zero-shot voice cloning and real-time streaming, the six endpoints cover the full range of voice generation workflows a developer needs. The keys to a reliable integration are: setting the right sampling parameters (temperature, top_p, repetition_penalty), managing reference audio lifecycle correctly, and maintaining a unit test suite that validates each endpoint's contract.

Apidog compresses the learning curve dramatically. Use it to send your first Fish Audio S2 API request in under two minutes, listen to binary audio responses inline, generate copy-paste client code, and run automated unit tests against every Fish Audio S2 API endpoint without configuring a test framework. When you're ready to share the API spec with your team or document the Fish Audio S2 API integration for stakeholders, Apidog's auto-generated documentation keeps everything in sync.

Download Apidog for free and import the Fish Audio S2 API collection to start testing today.

FAQ

What is the Fish Audio S2 API? The Fish Audio S2 API is the REST interface to Fish Speech S2-Pro, a 4B-parameter text-to-speech model trained on 10 million hours of audio. It supports voice cloning, streaming, emotion control, and roughly 50 languages via HTTP endpoints under /v1/.

How do I authenticate with the Fish Audio S2 API? Send a Bearer token in every request header: Authorization: Bearer YOUR_API_KEY. The API key is configured at server startup via the --api-key flag. Apidog lets you store this token at the environment level so it applies automatically to all Fish Audio S2 API requests.

Can I unit test Fish Audio S2 API integrations without writing code? Yes. Apidog's Test Scenarios feature lets you build and run unit tests against any Fish Audio S2 API endpoint through a visual interface. You define assertions (status code, response time, header values) and Apidog executes them on demand or on a CI schedule, with no test framework setup required.

What audio formats does the Fish Audio S2 API support? The Fish Audio S2 API returns audio in WAV, MP3, or PCM formats. Specify the format with the "format" field in your TTS request body. WAV is the default.

How does voice cloning work in Fish Audio S2 API? Upload a 10–30 second reference audio clip and its transcript to POST /v1/references/add. Then pass the reference ID to any TTS request via "reference_id". The Fish Audio S2 API clones that voice without any fine-tuning or additional model training.

What is the real-time factor of the Fish Audio S2 API? On a single NVIDIA H200, the Fish Audio S2 API achieves an RTF (real-time factor) of 0.195 with streaming enabled, meaning it generates about 5 seconds of audio per second of compute. Time-to-first-audio is approximately 100ms.

How do I test Fish Audio S2 API responses in Apidog? When the Fish Audio S2 API returns binary audio, Apidog automatically renders an inline audio player. You don't need to save the file locally to verify the output: you can listen, check response headers, and add assertions all from the same Apidog request panel.