Voice API

Standalone Voice API

The Standalone Voice API is the recommended public-facing surface for building voice features into paid applications. It provides predictable authentication, direct preview calls, and a realtime stream endpoint for live UX.

Base URL
https://api.msjoratio.com

All Voice API requests are made to this production hostname. The service runs behind a Caddy reverse proxy that terminates TLS. For the platform REST API (TTS generation, keys, billing), see the Platform API Reference.

Authentication

Two authentication headers are accepted. Use either format:

X-API-Key: <VOICE_API_AUTH_TOKEN>
Authorization: Bearer <VOICE_API_AUTH_TOKEN>
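
As a minimal sketch, the two header formats can be wrapped in one helper so callers pick a scheme in a single place. The function name buildAuthHeaders and its scheme argument are illustrative and not part of the API; the token is assumed to come from server-side configuration.

```javascript
// Hypothetical helper: builds request headers for either accepted auth format.
function buildAuthHeaders(token, scheme = "x-api-key") {
  if (typeof token !== "string" || token.length === 0) {
    throw new Error("A non-empty API token is required");
  }
  const headers = { "Content-Type": "application/json" };
  if (scheme === "bearer") {
    headers["Authorization"] = `Bearer ${token}`;
  } else if (scheme === "x-api-key") {
    headers["X-API-Key"] = token;
  } else {
    throw new Error(`Unknown auth scheme: ${scheme}`);
  }
  return headers;
}
```

Only one of the two headers is needed per request, so the helper never sets both.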

Product Recommendations

  • One project-level API key per workspace or account
  • Token-based balance and rate-limit rules enforced by your backend or gateway
  • Browser apps should use a backend proxy or short-lived token model — never embed permanent keys in client bundles

Endpoint Families

13 endpoints across 5 families: health, voice lifecycle, conditioning, preview generation, and validation.

Method  Path                         Purpose
GET     /health                      Liveness and storage path summary
GET     /version                     Service and XTTS model version
POST    /voices/analyze              Analyze uploaded audio
POST    /voices/create               Create voice identity from uploaded audio
POST    /voices/reference-stack      Build merged reference stack
POST    /voices/embedding            Build reference assets
POST    /voices/conditioning         Generate or fetch conditioning payloads
POST    /voices/validate             Validate preview against voice
POST    /voices/preview              Generate completed preview response
POST    /voices/preview/stream       Realtime MP3 streaming preview
GET     /voices/{voice_id}           Fetch registry entry
GET     /voices/{voice_id}/manifest  Fetch manifest
POST    /voices/{voice_id}/bundle    Export bundle
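
A client can keep the paths from the table in one map and fill in {voice_id} when needed. The sketch below is one way to do that; the key names and the endpointUrl helper are hypothetical, and only the paths and base URL come from this page.

```javascript
// Path templates copied from the endpoint table; the key names are illustrative.
const VOICE_API_BASE_URL = "https://api.msjoratio.com";
const ENDPOINTS = {
  health: "/health",
  version: "/version",
  analyze: "/voices/analyze",
  create: "/voices/create",
  referenceStack: "/voices/reference-stack",
  embedding: "/voices/embedding",
  conditioning: "/voices/conditioning",
  validate: "/voices/validate",
  preview: "/voices/preview",
  previewStream: "/voices/preview/stream",
  voice: "/voices/{voice_id}",
  voiceManifest: "/voices/{voice_id}/manifest",
  voiceBundle: "/voices/{voice_id}/bundle",
};

// Fills {placeholders} in a path template and returns an absolute URL.
function endpointUrl(name, params = {}) {
  if (!Object.prototype.hasOwnProperty.call(ENDPOINTS, name)) {
    throw new Error(`Unknown endpoint: ${name}`);
  }
  const path = ENDPOINTS[name].replace(/\{(\w+)\}/g, (_, key) => {
    if (params[key] === undefined) {
      throw new Error(`Missing path parameter: ${key}`);
    }
    // Encode so that unusual voice IDs cannot alter the path structure.
    return encodeURIComponent(String(params[key]));
  });
  return VOICE_API_BASE_URL + path;
}
```

For example, endpointUrl("voiceManifest", { voice_id: "default_narrator" }) yields the manifest URL for that voice.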

Admin Operations

Voice lifecycle management is handled through the admin portal and internal tooling:

  • Provision platform voices from reference audio
  • Rebuild conditioning embeddings and reference stacks
  • Upgrade or downgrade voice quality tiers
  • Train, fine-tune, and promote core XTTS engine versions

Open the admin provisioning console at /admin/voice-provisioning.

Health & Version

GET /health

Returns a lightweight service report: service name, version, data root path, and voices directory path. Use for liveness probes and deployment verification.

GET /version

Returns the service version and XTTS model identifier.

Voice Lifecycle Endpoints

POST /voices/analyze

Quick technical analysis of an audio file before promoting it into a voice identity. Use this for quality checks and audio inspection.

POST /voices/create

The primary voice registration entrypoint. Creates a voice identity from uploaded audio:

Field         Type      Description
voice_id      string    Unique identifier for the voice
audio         file      Uploaded audio file (WAV/MP3)
display_name  string    Human-readable voice label (optional)
style_tags    string[]  Descriptive style tags (optional)
consent_mode  string    Consent status for the voice (optional)
notes         string    Internal notes (optional)
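
Because the endpoint takes an uploaded file, a request is most likely sent as a multipart form. The sketch below assumes multipart/form-data and assumes style_tags is sent as repeated form fields; neither is confirmed on this page, and the helper name and its argument names are hypothetical.

```javascript
// Sketch only: assumes /voices/create accepts multipart/form-data and that
// style_tags is sent as one repeated form field per tag.
function buildCreateVoiceForm({
  voiceId,
  audio,
  fileName = "reference.wav",
  displayName,
  styleTags = [],
  consentMode,
  notes,
}) {
  if (!voiceId) throw new Error("voice_id is required");
  if (!(audio instanceof Blob)) throw new Error("audio must be a Blob or File");

  const form = new FormData();
  form.append("voice_id", voiceId);
  form.append("audio", audio, fileName);
  // Optional fields are added only when the caller provides them.
  if (displayName) form.append("display_name", displayName);
  for (const tag of styleTags) form.append("style_tags", tag);
  if (consentMode) form.append("consent_mode", consentMode);
  if (notes) form.append("notes", notes);
  return form;
}
```

When the form is passed as the fetch body, leave Content-Type unset so the runtime can add the multipart boundary itself.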

POST /voices/reference-stack

Builds a cleaned and merged reference stack from multiple uploaded audio files. Produces a more stable identity source than a single clip.

POST /voices/embedding

Builds reusable reference assets (speaker embedding) for a given voice and version. These can be passed to preview endpoints to skip re-computation.

POST /voices/conditioning

Returns speaker embedding and GPT conditioning latent payloads. Can rebuild them when missing. Supports deterministic XTTS reuse instead of recomputing identity assets on every call.
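
One way to reuse conditioning across calls is a small client-side cache keyed by voice and version. This is a sketch under assumptions: the stored payload shape (speaker_embedding_json, gpt_cond_latent_json) mirrors the preview request parameters rather than a documented response schema, and all function names are hypothetical.

```javascript
// In-memory cache of conditioning payloads keyed by voice and version.
// Assumed payload shape: { speaker_embedding_json, gpt_cond_latent_json }.
const conditioningCache = new Map();

function conditioningKey(voiceId, version = "latest") {
  return `${voiceId}@${version}`;
}

// Stores a payload obtained earlier from the conditioning endpoint.
function rememberConditioning(voiceId, version, payload) {
  conditioningCache.set(conditioningKey(voiceId, version), payload);
}

// Returns a copy of a preview body with cached conditioning attached, if any.
function attachConditioning(body) {
  const cached = conditioningCache.get(conditioningKey(body.voice_id, body.version));
  if (!cached) return { ...body };
  return {
    ...body,
    speaker_embedding_json: cached.speaker_embedding_json,
    gpt_cond_latent_json: cached.gpt_cond_latent_json,
  };
}
```

Entries stored without an explicit version can go stale when a newer voice version is published, so pinning the version is the safer choice for cached payloads.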

GET /voices/{voice_id}

Fetch the registry entry for a specific voice. Returns voice metadata, version info, and configuration.

GET /voices/{voice_id}/manifest

Fetch the full manifest for a voice, including all versions, reference stacks, and asset metadata.

Preview Endpoints

Request Parameters (both preview endpoints)

Name                    Type     Required  Description
voice_id                string   Required  Voice identifier from the registry
text                    string   Required  Text to synthesize into speech
version                 string   Optional  Voice version to use. Defaults to latest.
include_audio_base64    boolean  Optional  If true, includes base64-encoded audio in the response (JSON preview only)
persist_preview         boolean  Optional  If true, saves the preview to server-side storage
speech_rate             float    Optional  Speech rate multiplier. Default 1.0
temperature             float    Optional  Sampling temperature for generation. Default 0.7
top_p                   float    Optional  Nucleus sampling parameter. Default 0.9
speaker_embedding_json  string   Optional  Pre-computed speaker embedding JSON. Skips embedding computation.
gpt_cond_latent_json    string   Optional  Pre-computed GPT conditioning latent JSON. Skips conditioning computation.
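
A small builder can enforce the required fields and apply the documented defaults before a request is sent. The helper below is an illustrative sketch, not part of the API; the defaults are copied from the table above, and the server applies them anyway when the fields are omitted.

```javascript
// Defaults copied from the parameter table; the helper itself is hypothetical.
const PREVIEW_DEFAULTS = { speech_rate: 1.0, temperature: 0.7, top_p: 0.9 };

function buildPreviewBody(options) {
  const { voice_id, text } = options;
  if (typeof voice_id !== "string" || voice_id.trim() === "") {
    throw new Error("voice_id is required");
  }
  if (typeof text !== "string" || text.trim() === "") {
    throw new Error("text is required");
  }
  // Drop undefined values so they cannot overwrite a default.
  const provided = Object.fromEntries(
    Object.entries(options).filter(([, value]) => value !== undefined)
  );
  const body = { ...PREVIEW_DEFAULTS, ...provided };
  for (const field of ["speech_rate", "temperature", "top_p"]) {
    if (typeof body[field] !== "number" || !Number.isFinite(body[field])) {
      throw new Error(`${field} must be a finite number`);
    }
  }
  return body;
}
```

The result can be passed to JSON.stringify and used as the request body for either preview endpoint.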

POST /voices/preview

Returns a completed preview payload. The response includes the voice ID and version, the output path, generation metrics, and audio_base64 when include_audio_base64 is true.

Best for:

  • Waveform analysis and metadata display
  • Easier persistence and debugging
  • History and metrics panels
  • Storing output for later replay

Response Schema

{
  "voice_id": "default_narrator",
  "version": "v1",
  "output_path": "/data/previews/abc123.wav",
  "audio_base64": "UklGRi4A...",
  "metrics": {
    "duration_seconds": 2.3,
    "generation_time_ms": 480,
    "rtf": 0.21
  }
}
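
The sample metrics are consistent with rtf being a real-time factor, that is, generation time divided by audio duration (0.48 s / 2.3 s ≈ 0.21), although this page does not define the field. Under that assumption, the sketch below recomputes the value and decodes audio_base64 into bytes; both helper names are hypothetical.

```javascript
// Assumes "rtf" means real-time factor: generation time / audio duration.
function realTimeFactor(metrics) {
  return metrics.generation_time_ms / 1000 / metrics.duration_seconds;
}

// Decodes a base64 string into raw bytes.
// atob is available in browsers and as a global in recent Node.js versions.
function decodeAudioBase64(audioBase64) {
  const binary = atob(audioBase64);
  return Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
}
```

A value below 1.0 would mean the audio was generated faster than it takes to play back.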

cURL Example

curl -X POST "https://api.msjoratio.com/voices/preview" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $VOICE_API_AUTH_TOKEN" \
  -d '{
    "voice_id": "default_narrator",
    "text": "Hello from the Voice API.",
    "include_audio_base64": true,
    "persist_preview": true,
    "speech_rate": 1.0
  }'

JavaScript Example

// apiKey is assumed to be supplied server-side or via a short-lived token flow.
const response = await fetch("https://api.msjoratio.com/voices/preview", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "X-API-Key": apiKey,
  },
  body: JSON.stringify({
    voice_id: "default_narrator",
    text: "Hello from the Voice API.",
    include_audio_base64: true,
  }),
});
if (!response.ok) {
  throw new Error(`Preview request failed: ${response.status}`);
}
const data = await response.json();

// Play audio from base64
const audio = new Audio("data:audio/wav;base64," + data.audio_base64);
audio.play();

POST /voices/preview/stream

Streams audio/mpeg bytes in realtime with Cache-Control: no-store. Custom response headers include voice ID and version.

Best for:

  • Immediate browser playback
  • Low-latency preview UIs
  • Streaming QA workflows
  • “Listen before you spend” modals

Response Details

  • Content-Type: audio/mpeg
  • Cache-Control: no-store
  • X-Voice-Id: voice identifier
  • X-Voice-Version: version used
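
Client code can check the content type and pick up the voice headers before starting playback. The helper below is a sketch; its name and returned field names are illustrative, and it accepts any object that exposes a standard Headers instance, such as a fetch Response.

```javascript
// Reads the documented response headers from a fetch Response-like object.
function readStreamInfo(response) {
  const contentType = response.headers.get("Content-Type") || "";
  if (!contentType.toLowerCase().startsWith("audio/mpeg")) {
    throw new Error(`Unexpected Content-Type: ${contentType || "(missing)"}`);
  }
  return {
    voiceId: response.headers.get("X-Voice-Id"),
    version: response.headers.get("X-Voice-Version"),
  };
}
```

In a browser making a cross-origin request, custom headers such as X-Voice-Id are readable only if the server lists them in Access-Control-Expose-Headers; whether this service does so is not stated here.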

cURL Example

curl -X POST "https://api.msjoratio.com/voices/preview/stream" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $VOICE_API_AUTH_TOKEN" \
  --output preview.mp3 \
  -d '{
    "voice_id": "default_narrator",
    "text": "This is a realtime preview stream.",
    "speech_rate": 1.0
  }'

JavaScript Example

const response = await fetch("https://api.msjoratio.com/voices/preview/stream", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "X-API-Key": apiKey,
  },
  body: JSON.stringify({
    voice_id: "default_narrator",
    text: "Streaming audio in realtime.",
  }),
});
if (!response.ok) {
  throw new Error(`Stream request failed: ${response.status}`);
}

// response.blob() resolves only after the full audio/mpeg body has arrived,
// so playback in this example starts once the download completes.
const blob = await response.blob();
const url = URL.createObjectURL(blob);
const audio = new Audio(url);
audio.play();

Typical UI Patterns

  • "Generate preview" player
  • "Listen before you spend" modal
  • Voice picker with live stream

Validation & Bundle

POST /voices/validate

Compares a preview clip against a stored voice identity and returns a validation report. Use to verify that generated audio matches the expected voice profile.

POST /voices/{voice_id}/bundle

Exports a distributable or archival bundle for a voice version. Includes all reference assets, embeddings, and metadata.

Data Model

The core request model for previews includes these key fields. All are passed in the JSON request body.

voice_id, text, version, include_audio_base64, persist_preview, speech_rate, temperature, top_p, speaker_embedding_json, gpt_cond_latent_json

Tip: Pre-compute speaker_embedding_json and gpt_cond_latent_json via the /voices/conditioning endpoint, then pass them to preview calls to skip re-computation and improve latency.

Operational Notes

Use /voices/preview when:

  • You want a single completed JSON response
  • You need metrics and stored output metadata
  • You want to inspect or store the audio data

Use /voices/preview/stream when:

  • Low-latency playback matters
  • You are building a live preview player
  • You don't need metadata alongside audio

Deployment Architecture

  • Production domain: api.msjoratio.com
  • Caddy terminates TLS publicly
  • App binds locally on the VM; Caddy reverse proxies to the Voice API app port