ElevenLabs Audio

Generate multi-speaker voiceovers, single-speaker narration, and sound effects with ElevenLabs models through the VicSee API. Three models are available: V3 Text to Dialogue, TTS Turbo 2.5, and Sound Effect V2.

API-only models. ElevenLabs audio models are available exclusively through the VicSee API.

Pricing

Model                 Credits   Description
V3 Text to Dialogue   15        Multi-speaker voiceover (max 1,000 characters total)
TTS Turbo 2.5         8         Single-speaker narration (max 1,000 characters)
Sound Effect V2       5         Sound effects from text (max 22 seconds)

Credits are deducted only on successful generation. All models output MP3 audio.
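The pricing above can be turned into a quick budget check before submitting a batch. The following is a minimal sketch, assuming the per-generation costs in the table map to the model IDs documented in the sections below; `CREDITS_PER_GENERATION` and `estimate_credits` are illustrative names, not part of the VicSee API.

```python
# Credit cost per successful generation, copied from the pricing table.
# The dictionary and function names are illustrative, not part of the API.
CREDITS_PER_GENERATION = {
    "elevenlabs-text-to-dialogue-v3": 15,
    "elevenlabs-text-to-speech-turbo-2-5": 8,
    "elevenlabs-sound-effect-v2": 5,
}


def estimate_credits(planned: dict[str, int]) -> int:
    """Return the total credits for a plan of {model_id: number_of_generations}."""
    total = 0
    for model, count in planned.items():
        if model not in CREDITS_PER_GENERATION:
            raise ValueError(f"unknown model: {model}")
        if count < 0:
            raise ValueError("generation count must be non-negative")
        total += CREDITS_PER_GENERATION[model] * count
    return total
```

Because credits are deducted only on success, the result is an upper bound for a batch in which some generations fail.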

Endpoint

POST https://vicsee.com/api/v1/generate

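See Authentication for API key setup. In the example requests below, model is a top-level field and all other parameters are nested under input.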


V3 Text to Dialogue

Generate multi-speaker voiceovers for ads, podcasts, and video narration. Each speaker gets their own voice.

Parameters


model · string · required

Must be "elevenlabs-text-to-dialogue-v3".


dialogue · array · required

Array of dialogue entries. Each entry has text (what to say) and voice (which voice to use).

Constraints:

  • Total characters across all text entries must not exceed 1,000
  • Each entry must have both text and voice fields

Example:

[
  { "text": "Welcome to Agency HQ!", "voice": "Adam" },
  { "text": "Let's create something amazing.", "voice": "Sarah" }
]
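Checking these constraints before sending a request avoids a round trip for input that would be rejected. The sketch below is one possible client-side check, not the API's own validation. It assumes characters are counted as Unicode code points (what Python's `len` returns) and that an empty array is invalid; the source states neither. `validate_dialogue` is a hypothetical helper name.

```python
MAX_DIALOGUE_CHARS = 1000  # total across all entries, per the constraint above


def validate_dialogue(dialogue: list[dict]) -> int:
    """Check the documented constraints and return the total character count."""
    if not dialogue:
        # Assumption: an empty dialogue is not useful and is treated as invalid.
        raise ValueError("dialogue must contain at least one entry")
    total = 0
    for index, entry in enumerate(dialogue):
        for field in ("text", "voice"):
            if not entry.get(field):
                raise ValueError(f"entry {index} is missing '{field}'")
        total += len(entry["text"])
    if total > MAX_DIALOGUE_CHARS:
        raise ValueError(
            f"dialogue has {total} characters; the limit is {MAX_DIALOGUE_CHARS}"
        )
    return total
```

Returning the count makes it easy to see how much of the 1,000-character budget a script has left.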

stability · number · optional

Controls voice stability. This is an enum, not a continuous range.

Supported values: 0, 0.5, 1

Default: 0.5
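Since only three values are supported, a client that exposes a slider or free-form number may want to map it to the nearest allowed value first. The source does not say whether the API rejects or rounds other values, so the sketch below is only one way to stay within the enum; `snap_dialogue_stability` is a hypothetical helper.

```python
DIALOGUE_STABILITY_VALUES = (0, 0.5, 1)  # the supported values listed above


def snap_dialogue_stability(value: float) -> float:
    """Return the supported stability value closest to `value`.

    On an exact tie (for example 0.25) the lower value wins, because
    `min` keeps the first of several equally close candidates.
    """
    return min(DIALOGUE_STABILITY_VALUES, key=lambda allowed: abs(allowed - value))
```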


language_code · string · optional

ISO 639-1 language code (e.g., "en", "es", "fr").


Example Request

curl -X POST https://vicsee.com/api/v1/generate \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "elevenlabs-text-to-dialogue-v3",
    "input": {
      "dialogue": [
        { "text": "Welcome to Agency HQ!", "voice": "Adam" },
        { "text": "Let us create something amazing.", "voice": "Sarah" }
      ],
      "stability": 0.5,
      "language_code": "en"
    }
  }'
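For callers who prefer Python to curl, the same request can be assembled with the standard library. This is a sketch that mirrors the curl example above, building the request without sending it; `build_generate_request` is a hypothetical helper name, and the URL, headers, and body shape are taken from the example.

```python
import json
import urllib.request

API_URL = "https://vicsee.com/api/v1/generate"


def build_generate_request(api_key: str, model: str, model_input: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request shaped like the curl example."""
    body = json.dumps({"model": model, "input": model_input}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the payload easy to inspect or log first.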

TTS Turbo 2.5

Single-speaker text-to-speech. Fast, natural narration with fine-grained voice controls.

Parameters


model · string · required

Must be "elevenlabs-text-to-speech-turbo-2-5".


text · string · required

The text to speak. Maximum 1,000 characters per generation.
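Longer scripts have to be split across several generations. The sketch below shows one way to do that, preferring sentence boundaries and hard-wrapping only when a single sentence exceeds the limit. It assumes the limit counts Unicode code points; `split_for_tts` is a hypothetical helper, not part of the API.

```python
import re

MAX_TTS_CHARS = 1000  # documented per-generation limit


def split_for_tts(text: str, limit: int = MAX_TTS_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-wrap any single sentence that is longer than the limit on its own.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            # Adding this sentence would overflow, so start a new chunk with it.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk becomes its own request, so a script split into three chunks costs three generations.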


voice · string · optional

Voice preset name or voice ID.

Example presets: "Rachel", "Adam", "Sarah", "Antoni", "Bella", "Domi", "Elli", "Josh"

ElevenLabs provides 100+ voices. Pass any valid voice name or ID.


stability · number · optional

Voice stability, a continuous value from 0 to 1. Lower values are more expressive; higher values are more consistent.

Default: 0.5


similarity_boost · number · optional

Voice clarity and similarity to the original voice. 0-1 continuous range.

Default: 0.75


style · number · optional

Style exaggeration. 0-1 continuous range. Higher values amplify the style of the original voice.

Default: 0


speed · number · optional

Speech speed multiplier.

Range: 0.7 - 1.2

Default: 1.0


language_code · string · optional

ISO 639-1 language code to enforce a specific language (e.g., "en", "es", "fr"). If omitted, the model auto-detects language from the text.
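The documented ranges for the TTS Turbo 2.5 parameters can be enforced on the client before a request is made. The following is a sketch under two assumptions the source does not confirm: the ranges are inclusive at both ends, and omitted parameters fall back to the server-side defaults. `TTS_RANGES` and `build_tts_input` are hypothetical names.

```python
from typing import Optional

# Documented (minimum, maximum) for each optional numeric parameter.
TTS_RANGES = {
    "stability": (0.0, 1.0),
    "similarity_boost": (0.0, 1.0),
    "style": (0.0, 1.0),
    "speed": (0.7, 1.2),
}


def build_tts_input(text: str, voice: Optional[str] = None,
                    language_code: Optional[str] = None, **settings: float) -> dict:
    """Return the `input` object for TTS Turbo 2.5, rejecting out-of-range values."""
    if not text or len(text) > 1000:
        raise ValueError("text must be 1 to 1,000 characters")
    payload: dict = {"text": text}
    if voice is not None:
        payload["voice"] = voice
    if language_code is not None:
        payload["language_code"] = language_code
    for name, value in settings.items():
        if name not in TTS_RANGES:
            raise ValueError(f"unknown setting: {name}")
        low, high = TTS_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
        payload[name] = value
    return payload
```

For example, `build_tts_input("Coca-Cola. Open Happiness.", voice="Rachel", speed=1.0)` yields an object that can be placed under `input` in the request body.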


Example Request

curl -X POST https://vicsee.com/api/v1/generate \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "elevenlabs-text-to-speech-turbo-2-5",
    "input": {
      "text": "Coca-Cola. Open Happiness.",
      "voice": "Rachel",
      "stability": 0.5,
      "similarity_boost": 0.75,
      "speed": 1.0
    }
  }'

Sound Effect V2

Generate sound effects from text descriptions. Ideal for background audio, ambient sounds, and foley effects.

Parameters


model · string · required

Must be "elevenlabs-sound-effect-v2".


text · string · required

Description of the sound effect to generate. Maximum 5,000 characters (model limit).

Good prompts are specific:

  • "Busy city street with car horns honking and crowd chatter"
  • "Thunder rumbling in the distance, followed by heavy rain"
  • "Old wooden door creaking open slowly"

duration_seconds · number · optional

Duration of the generated sound effect in seconds.

Range: 0.5 - 22 (step 0.1)

If omitted, the model determines the optimal duration based on the prompt.
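A requested duration can be brought into the documented range and step before it is sent. The source does not say how the API treats out-of-range or finer-grained values, so clamping and rounding on the client, as sketched below, is an assumption about what is convenient rather than a description of server behavior. `normalize_duration` is a hypothetical helper.

```python
def normalize_duration(seconds: float) -> float:
    """Clamp to the documented 0.5-22 second range and round to the 0.1 step."""
    clamped = min(max(seconds, 0.5), 22.0)
    return round(clamped, 1)
```

To let the model choose the duration, leave `duration_seconds` out of the request instead of calling this helper.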


loop · boolean · optional

Whether to generate a seamless looping sound effect.

Default: false


prompt_influence · number · optional

How closely the output follows the prompt. Higher values produce less variation between generations.

Range: 0 - 1

Default: 0.3


Example Request

curl -X POST https://vicsee.com/api/v1/generate \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "elevenlabs-sound-effect-v2",
    "input": {
      "text": "Busy city street with car horns and crowd chatter",
      "duration_seconds": 10,
      "loop": false,
      "prompt_influence": 0.3
    }
  }'

Response

All three models return the same response format.

Success (200)

{
  "success": true,
  "data": {
    "id": "task_abc123xyz",
    "model": "elevenlabs-text-to-dialogue-v3",
    "status": "pending",
    "creditsUsed": 15,
    "creditsRemaining": 485,
    "createdAt": "2026-02-23T12:00:00.000Z"
  }
}

Poll for completion using the Tasks API.

Task Complete

{
  "success": true,
  "data": {
    "id": "task_abc123xyz",
    "model": "elevenlabs-text-to-dialogue-v3",
    "status": "completed",
    "output": {
      "url": "https://cdn.vicsee.com/dew/abc123.mp3",
      "format": "mp3"
    },
    "createdAt": "2026-02-23T12:00:00.000Z",
    "completedAt": "2026-02-23T12:00:30.000Z"
  }
}
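The two responses above suggest a simple polling loop: keep fetching the task until its status is "completed", then read the MP3 URL from output. The sketch below deliberately takes the fetch step as a caller-supplied function, because the Tasks API endpoint is not described in this section. `wait_for_audio_url` and `fetch_task` are hypothetical names, and treating `"success": false` as a failure is an assumption about the error shape.

```python
import time
from typing import Callable


def wait_for_audio_url(fetch_task: Callable[[], dict],
                       interval: float = 2.0, timeout: float = 120.0) -> str:
    """Call `fetch_task` until the task completes, then return the MP3 URL."""
    deadline = time.monotonic() + timeout
    while True:
        response = fetch_task()
        if not response.get("success"):
            # Assumption: a failed task or request is reported with success=false.
            raise RuntimeError(f"task request failed: {response!r}")
        task = response.get("data", {})
        if task.get("status") == "completed":
            return task["output"]["url"]
        # Any other status (such as "pending") means the task is still running.
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not complete in time")
        time.sleep(interval)
```

In real use, `fetch_task` would wrap an authenticated call to the Tasks API for a specific task ID. Injecting it also makes the loop easy to test with canned responses.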
