Standalone Voice API
The Standalone Voice API is the recommended public-facing product surface for building voice features into paid applications. It offers predictable auth, direct preview calls, and a realtime stream endpoint for live UX.
All Voice API requests are made to the production hostname, api.msjoratio.com. The service runs behind a Caddy reverse proxy with TLS termination. For the platform REST API (TTS generation, keys, billing), see the Platform API Reference.
Two authentication headers are accepted. Use either format:
Product Recommendations
- One project-level API key per workspace or account
- Token-based balance and rate-limit rules enforced by your backend or gateway
- Browser apps should use a backend proxy or short-lived token model — never embed permanent keys in client bundles
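The backend-proxy recommendation can be sketched as a small server-side helper that attaches the key before forwarding a browser request. This is a minimal sketch under stated assumptions: `buildUpstreamRequest` and `PROXY_ALLOWED_PATHS` are hypothetical names, and the auth header name and value are supplied by the caller because the accepted header formats are not reproduced in this section.

```javascript
// Base URL built from the production hostname named in this document.
const VOICE_API_BASE = "https://api.msjoratio.com";

// Hypothetical allow-list: only preview endpoints are reachable via the proxy.
const PROXY_ALLOWED_PATHS = new Set(["/voices/preview", "/voices/preview/stream"]);

// Build the request a backend would send on behalf of a browser client.
// ASSUMPTION: authHeader = { name, value } is whatever header format the
// service accepts; it is read from server-side config, never from the client.
function buildUpstreamRequest(path, clientBody, authHeader) {
  if (!PROXY_ALLOWED_PATHS.has(path)) {
    throw new Error(`Path not allowed through proxy: ${path}`);
  }
  return {
    url: VOICE_API_BASE + path,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        [authHeader.name]: authHeader.value, // added server-side only
      },
      body: JSON.stringify(clientBody),
    },
  };
}
```

The returned `url` and `init` can be passed directly to `fetch(url, init)` from the backend, so the permanent key never appears in a client bundle.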
13 endpoints across 5 families: health, voice lifecycle, conditioning, preview generation, and validation.
| Method | Path | Purpose |
|---|---|---|
| GET | /health | Liveness and storage path summary |
| GET | /version | Service and XTTS model version |
| POST | /voices/analyze | Analyze uploaded audio |
| POST | /voices/create | Create voice identity from uploaded audio |
| POST | /voices/reference-stack | Build merged reference stack |
| POST | /voices/embedding | Build reference assets |
| POST | /voices/conditioning | Generate or fetch conditioning payloads |
| POST | /voices/validate | Validate preview against voice |
| POST | /voices/preview | Generate completed preview response |
| POST | /voices/preview/stream | Realtime MP3 streaming preview |
| GET | /voices/{voice_id} | Fetch registry entry |
| GET | /voices/{voice_id}/manifest | Fetch manifest |
| POST | /voices/{voice_id}/bundle | Export bundle |
Voice lifecycle management is handled through the admin portal and internal tooling:
Open the admin provisioning console at /admin/voice-provisioning.
GET /health
Returns a lightweight service report: service name, version, data root path, and voices directory path. Use for liveness probes and deployment verification.
GET /version
Returns the service version and XTTS model identifier.
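A liveness probe against /health might look like the sketch below. It only checks for a 2xx status, since the exact field names of the health report are not listed here. `isVoiceApiAlive` is a hypothetical helper; whether /health requires an auth header is not stated, so headers are optional and caller-supplied.

```javascript
const VOICE_API_BASE = "https://api.msjoratio.com";

// Resolve to true when GET /health answers with a 2xx status in time.
// fetchImpl is injectable so the probe can be exercised without network access.
async function isVoiceApiAlive({ headers = {}, timeoutMs = 3000, fetchImpl = fetch } = {}) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchImpl(`${VOICE_API_BASE}/health`, {
      headers,
      signal: controller.signal,
    });
    return res.ok;
  } catch {
    return false; // network error or timeout counts as "not alive"
  } finally {
    clearTimeout(timer);
  }
}
```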
POST /voices/analyze
Quick technical analysis of an audio file before promoting it into a voice identity. Use this for quality checks and audio inspection.
POST /voices/create
The primary voice registration entrypoint. Creates a voice identity from uploaded audio:
| Field | Type | Description |
|---|---|---|
| voice_id | string | Unique identifier for the voice |
| audio | file | Uploaded audio file (WAV/MP3) |
| display_name | string | Human-readable voice label (optional) |
| style_tags | string[] | Descriptive style tags (optional) |
| consent_mode | string | Consent status for the voice (optional) |
| notes | string | Internal notes (optional) |
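The fields above could be assembled into an upload as follows. This is a sketch, not the service's confirmed wire format: it assumes the endpoint accepts multipart/form-data and that style_tags is sent as a JSON-encoded string. `buildCreateVoiceForm` is a hypothetical helper.

```javascript
// Build the form for POST /voices/create from the documented fields.
// ASSUMPTIONS: multipart/form-data transport; style_tags as a JSON string.
function buildCreateVoiceForm({
  voiceId,
  audio,
  filename = "audio.wav",
  displayName,
  styleTags,
  consentMode,
  notes,
}) {
  if (!voiceId) throw new TypeError("voice_id is required");
  if (!(audio instanceof Blob)) throw new TypeError("audio must be a Blob or File");

  const form = new FormData();
  form.append("voice_id", voiceId);
  form.append("audio", audio, filename);
  // Optional fields are only sent when the caller provides them.
  if (displayName !== undefined) form.append("display_name", displayName);
  if (styleTags !== undefined) form.append("style_tags", JSON.stringify(styleTags));
  if (consentMode !== undefined) form.append("consent_mode", consentMode);
  if (notes !== undefined) form.append("notes", notes);
  return form;
}
```

When sending the form with `fetch`, pass it as `body` and leave `Content-Type` unset so the runtime generates the multipart boundary.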
POST /voices/reference-stack
Builds a cleaned and merged reference stack from multiple uploaded audio files. Produces a more stable identity source than a single clip.
POST /voices/embedding
Builds reusable reference assets (speaker embedding) for a given voice and version. These can be passed to preview endpoints to skip re-computation.
POST /voices/conditioning
Returns speaker embedding and GPT conditioning latent payloads. Can rebuild them when missing. Supports deterministic XTTS reuse instead of recomputing identity assets on every call.
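One way to reuse conditioning payloads is to cache them per voice and version, then attach them to later preview requests. The sketch below assumes the payloads have already been fetched; it does not assume any particular response field names. `rememberConditioning` and `attachConditioning` are hypothetical helpers, and the preview fields they fill in are the documented `speaker_embedding_json` and `gpt_cond_latent_json` string parameters.

```javascript
// In-memory cache of conditioning payloads keyed by "voice_id@version".
const conditioningCache = new Map();

// The preview parameters are typed as strings, so serialize anything else.
function toJsonString(value) {
  return typeof value === "string" ? value : JSON.stringify(value);
}

function conditioningKey(voiceId, version) {
  return `${voiceId}@${version ?? "latest"}`;
}

// Store payloads obtained from the conditioning endpoint.
function rememberConditioning(voiceId, version, speakerEmbedding, gptCondLatent) {
  conditioningCache.set(conditioningKey(voiceId, version), {
    speaker_embedding_json: toJsonString(speakerEmbedding),
    gpt_cond_latent_json: toJsonString(gptCondLatent),
  });
}

// Return a copy of the preview body with cached payloads attached, if any.
function attachConditioning(previewBody) {
  const cached = conditioningCache.get(
    conditioningKey(previewBody.voice_id, previewBody.version)
  );
  return cached ? { ...previewBody, ...cached } : previewBody;
}
```

Entries cached without an explicit version are stored under "latest" and may go stale when a new voice version is published, so pinning a version is the safer choice.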
GET /voices/{voice_id}
Fetch the registry entry for a specific voice. Returns voice metadata, version info, and configuration.
GET /voices/{voice_id}/manifest
Fetch the full manifest for a voice, including all versions, reference stacks, and asset metadata.
Request Parameters (both preview endpoints)
| Name | Type | Required | Description |
|---|---|---|---|
| voice_id | string | Required | Voice identifier from the registry |
| text | string | Required | Text to synthesize into speech |
| version | string | Optional | Voice version to use. Defaults to latest. |
| include_audio_base64 | boolean | Optional | If true, includes base64-encoded audio in the response (JSON preview only) |
| persist_preview | boolean | Optional | If true, saves the preview to server-side storage |
| speech_rate | float | Optional | Speech rate multiplier. Default 1.0 |
| temperature | float | Optional | Sampling temperature for generation. Default 0.7 |
| top_p | float | Optional | Nucleus sampling parameter. Default 0.9 |
| speaker_embedding_json | string | Optional | Pre-computed speaker embedding JSON. Skips embedding computation. |
| gpt_cond_latent_json | string | Optional | Pre-computed GPT conditioning latent JSON. Skips conditioning computation. |
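A request body for either preview endpoint can be built from the table above. The sketch below checks the two required fields and rejects names that are not in the table. `buildPreviewBody` is a hypothetical helper; omitted optional fields are left for the service to default (1.0, 0.7, and 0.9 for `speech_rate`, `temperature`, and `top_p`).

```javascript
// Optional parameters documented for both preview endpoints.
const PREVIEW_OPTIONAL_FIELDS = [
  "version",
  "include_audio_base64",
  "persist_preview",
  "speech_rate",
  "temperature",
  "top_p",
  "speaker_embedding_json",
  "gpt_cond_latent_json",
];

// Build a JSON-serializable preview request body.
function buildPreviewBody(voiceId, text, options = {}) {
  if (typeof voiceId !== "string" || voiceId === "") {
    throw new TypeError("voice_id is required");
  }
  if (typeof text !== "string" || text === "") {
    throw new TypeError("text is required");
  }
  const body = { voice_id: voiceId, text };
  for (const [key, value] of Object.entries(options)) {
    if (!PREVIEW_OPTIONAL_FIELDS.includes(key)) {
      throw new RangeError(`Unknown preview field: ${key}`);
    }
    if (value !== undefined) body[key] = value; // skip unset options
  }
  return body;
}
```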
POST /voices/preview
Returns a completed preview payload. The response includes the output path, generation metrics, optional audio_base64, and the voice ID and version.
Best for:
- Waveform analysis and metadata display
- Easier persistence and debugging
- History and metrics panels
- Storing output for later replay
Response Schema
{
  "voice_id": "default_narrator",
  "version": "v1",
  "output_path": "/data/previews/abc123.wav",
  "audio_base64": "UklGRi4A...",
  "metrics": {
    "duration_seconds": 2.3,
    "generation_time_ms": 480,
    "rtf": 0.21
  }
}
JavaScript Example
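A call to /voices/preview might look like the following sketch. It posts a JSON body, checks the status, and decodes the optional `audio_base64` field into bytes. `requestPreview` and `decodeAudioBase64` are hypothetical helpers, and `authHeaders` stands in for whichever accepted authentication header the caller uses, since the exact formats are not shown in this section.

```javascript
const VOICE_API_BASE = "https://api.msjoratio.com";

// Decode the optional audio_base64 field into raw bytes.
function decodeAudioBase64(audioBase64) {
  return Uint8Array.from(atob(audioBase64), (ch) => ch.charCodeAt(0));
}

// POST a preview request and return the parsed payload plus decoded audio.
// ASSUMPTION: authHeaders is an object holding the caller's auth header.
async function requestPreview(body, authHeaders = {}, fetchImpl = fetch) {
  const res = await fetchImpl(`${VOICE_API_BASE}/voices/preview`, {
    method: "POST",
    headers: { "Content-Type": "application/json", ...authHeaders },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`Preview failed with HTTP ${res.status}`);
  const payload = await res.json();
  return {
    ...payload,
    // audio_base64 is only present when include_audio_base64 was requested.
    audioBytes: payload.audio_base64 ? decodeAudioBase64(payload.audio_base64) : null,
  };
}
```

In a browser, the decoded bytes can be wrapped in a `Blob` and played through an object URL. On a server they can be written to disk.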
POST /voices/preview/stream
Streams audio/mpeg bytes in realtime with Cache-Control: no-store. Custom response headers include voice ID and version.
Best for:
- Immediate browser playback
- Low-latency preview UIs
- Streaming QA workflows
- “Listen before you spend” modals
Response Details
- Content-Type: audio/mpeg
- Cache-Control: no-store
- X-Voice-Id: voice identifier
- X-Voice-Version: version used
JavaScript Example
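Consuming the stream might look like the sketch below. It reads the response body chunk by chunk and reports the voice ID and version from the custom headers. `streamPreview` is a hypothetical helper, `authHeaders` is a caller-supplied placeholder for the accepted authentication header, and `onChunk` is where a player or buffer would receive the MP3 bytes.

```javascript
const VOICE_API_BASE = "https://api.msjoratio.com";

// POST to the stream endpoint and hand each audio chunk to onChunk as it arrives.
// ASSUMPTION: authHeaders is an object holding the caller's auth header.
async function streamPreview(body, authHeaders, onChunk, fetchImpl = fetch) {
  const res = await fetchImpl(`${VOICE_API_BASE}/voices/preview/stream`, {
    method: "POST",
    headers: { "Content-Type": "application/json", ...authHeaders },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`Stream failed with HTTP ${res.status}`);

  const reader = res.body.getReader();
  let totalBytes = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    totalBytes += value.byteLength;
    onChunk(value); // Uint8Array of audio/mpeg bytes
  }
  return {
    voiceId: res.headers.get("X-Voice-Id"),
    version: res.headers.get("X-Voice-Version"),
    totalBytes,
  };
}
```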
Typical UI Patterns
POST /voices/validate
Compares a preview clip against a stored voice identity and returns a validation report. Use to verify that generated audio matches the expected voice profile.
POST /voices/{voice_id}/bundle
Exports a distributable or archival bundle for a voice version. Includes all reference assets, embeddings, and metadata.
The core request model for previews consists of the key fields listed under Request Parameters. All are passed in the JSON request body.
Fetch speaker_embedding_json and gpt_cond_latent_json via the /voices/conditioning endpoint, then pass them to preview calls to skip re-computation and improve latency.
Use /voices/preview when:
- You want a single completed JSON response
- You need metrics and stored output metadata
- You want to inspect or store the audio data
Use /voices/preview/stream when:
- Low-latency playback matters
- Building a live preview player
- You don't need metadata alongside audio
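The two lists above can be condensed into a small selection helper. This is an illustrative sketch: `choosePreviewPath` and its option names are hypothetical. Preferring the JSON endpoint whenever metadata is needed is a design choice made here, because the stream endpoint is described as returning audio without accompanying metadata.

```javascript
// Pick a preview endpoint path from the caller's needs.
function choosePreviewPath({
  needsMetrics = false,
  needsStoredOutput = false,
  lowLatencyPlayback = false,
} = {}) {
  // Metrics and stored output are only described for the JSON response.
  if (needsMetrics || needsStoredOutput) return "/voices/preview";
  return lowLatencyPlayback ? "/voices/preview/stream" : "/voices/preview";
}
```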
Deployment Architecture
- Production domain: api.msjoratio.com
- Caddy terminates TLS publicly
- App binds locally on the VM; Caddy reverse proxies to the Voice API app port