Voice API

Standalone Voice API

The Standalone Voice API is the recommended public-facing surface for building voice features into paid applications. It offers predictable authentication, direct preview calls, and a realtime stream endpoint for live UX.

Base URL

https://api.msjoratio.com

All Voice API requests are made to this production hostname. The service runs behind a Caddy reverse proxy with TLS termination. For the platform REST API (TTS generation, keys, billing) see the Platform API Reference.

Authentication

Two authentication headers are accepted. Use either format:

X-API-Key: <VOICE_API_AUTH_TOKEN>

Authorization: Bearer <VOICE_API_AUTH_TOKEN>
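As a minimal sketch of the two header formats above, the helper below returns the header object for whichever format a client prefers. The function name and the `scheme` option are illustrative assumptions, not part of the API.

```javascript
// Build the auth header for a Voice API request.
// `scheme` is a hypothetical option name introduced for this example:
// "x-api-key" yields the X-API-Key header, "bearer" yields Authorization.
function buildAuthHeaders(token, scheme = "x-api-key") {
  if (typeof token !== "string" || token.length === 0) {
    throw new Error("A Voice API token is required");
  }
  if (scheme === "bearer") {
    return { Authorization: `Bearer ${token}` };
  }
  if (scheme === "x-api-key") {
    return { "X-API-Key": token };
  }
  throw new Error(`Unknown auth scheme: ${scheme}`);
}
```

On a server, the token would typically come from an environment variable such as VOICE_API_AUTH_TOKEN rather than from source code.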

Product Recommendations

One project-level API key per workspace or account
Token-based balance and rate-limit rules enforced by your backend or gateway
Browser apps should use a backend proxy or short-lived token model — never embed permanent keys in client bundles
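The backend-proxy recommendation can be sketched as a pure function that turns a browser payload into the upstream request, attaching the key on the server side only. The allow-list of fields and the function name are assumptions made for illustration; a real proxy would also apply its own balance and rate-limit checks before forwarding.

```javascript
const BASE_URL = "https://api.msjoratio.com";

// Fields a browser client may set. This allow-list is an illustrative choice,
// not something the API prescribes.
const ALLOWED_FIELDS = ["voice_id", "text", "version", "speech_rate"];

// Convert an untrusted client payload into the request the backend sends
// upstream. The permanent key is added here and never reaches the browser.
function toUpstreamRequest(clientBody, serverToken) {
  const body = {};
  for (const field of ALLOWED_FIELDS) {
    if (clientBody[field] !== undefined) {
      body[field] = clientBody[field];
    }
  }
  return {
    url: `${BASE_URL}/voices/preview/stream`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "X-API-Key": serverToken,
      },
      body: JSON.stringify(body),
    },
  };
}
```

The returned `url` and `init` can be passed straight to `fetch` inside whatever route handler the backend framework provides.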

Endpoint Families

13 endpoints across 5 families: health, voice lifecycle, conditioning, preview generation, and validation.

Method	Path	Purpose
GET	/health	Liveness and storage path summary
GET	/version	Service and XTTS model version
POST	/voices/analyze	Analyze uploaded audio
POST	/voices/create	Create voice identity from uploaded audio
POST	/voices/reference-stack	Build merged reference stack
POST	/voices/embedding	Build reference assets
POST	/voices/conditioning	Generate or fetch conditioning payloads
POST	/voices/validate	Validate preview against voice
POST	/voices/preview	Generate completed preview response
POST	/voices/preview/stream	Realtime MP3 streaming preview
GET	/voices/{voice_id}	Fetch registry entry
GET	/voices/{voice_id}/manifest	Fetch manifest
POST	/voices/{voice_id}/bundle	Export bundle

Admin Operations

Voice lifecycle management is handled through the admin portal and internal tooling:

Provision platform voices from reference audio

Rebuild conditioning embeddings and reference stacks

Upgrade or downgrade voice quality tiers

Train, fine-tune, and promote core XTTS engine versions

Open the admin provisioning console at /admin/voice-provisioning.

Health & Version

GET /health

Returns a lightweight service report: service name, version, data root path, and voices directory path. Use for liveness probes and deployment verification.
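A liveness probe against this endpoint might look like the sketch below. It assumes the endpoint answers without an auth header and returns JSON; the report's field names are not documented on this page, so the body is passed through unparsed beyond JSON decoding. The `fetchImpl` parameter is a hypothetical injection point that makes the probe testable offline.

```javascript
const BASE_URL = "https://api.msjoratio.com";

// Liveness probe sketch with a timeout.
// fetchImpl defaults to the global fetch (Node 18+ or a browser).
async function probeHealth(fetchImpl = fetch, timeoutMs = 3000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchImpl(`${BASE_URL}/health`, {
      signal: controller.signal,
    });
    // Field names in the report are not assumed; it is returned as-is.
    const report = res.ok ? await res.json() : null;
    return { alive: res.ok, status: res.status, report };
  } catch {
    // Network failure or timeout: treat the service as not alive.
    return { alive: false, status: 0, report: null };
  } finally {
    clearTimeout(timer);
  }
}
```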

GET /version

Returns the service version and XTTS model identifier.

Voice Lifecycle Endpoints

POST /voices/analyze

Runs a quick technical analysis of an audio file before it is promoted into a voice identity. Use it for quality checks and audio inspection.

POST /voices/create

The primary voice registration entrypoint. Creates a voice identity from uploaded audio:

Field	Type	Description
voice_id	string	Unique identifier for the voice
audio	file	Uploaded audio file (WAV/MP3)
display_name	string	Human-readable voice label (optional)
style_tags	string[]	Descriptive style tags (optional)
consent_mode	string	Consent status for the voice (optional)
notes	string	Internal notes (optional)
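The fields above could be assembled into an upload as sketched below. This assumes the endpoint accepts multipart/form-data and that `style_tags` is sent as repeated form fields; neither detail is confirmed on this page, and the function name is illustrative.

```javascript
// Build a form for POST /voices/create.
// Assumptions (not confirmed by the reference): multipart/form-data encoding,
// and style_tags encoded as one repeated field per tag.
function buildCreateVoiceForm({
  voiceId,
  audioBytes,
  filename,
  displayName,
  styleTags = [],
  consentMode,
  notes,
}) {
  const form = new FormData();
  form.append("voice_id", voiceId);
  form.append("audio", new Blob([audioBytes]), filename);
  // Optional fields are only sent when provided.
  if (displayName) form.append("display_name", displayName);
  for (const tag of styleTags) form.append("style_tags", tag);
  if (consentMode) form.append("consent_mode", consentMode);
  if (notes) form.append("notes", notes);
  return form;
}
```

When the form is passed as the `body` of a `fetch` call, leave the Content-Type header unset so the runtime can add the multipart boundary itself.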

POST /voices/reference-stack

Builds a cleaned and merged reference stack from multiple uploaded audio files. Produces a more stable identity source than a single clip.

POST /voices/embedding

Builds reusable reference assets (speaker embedding) for a given voice and version. These can be passed to preview endpoints to skip re-computation.

POST /voices/conditioning

Returns the speaker embedding and GPT conditioning latent payloads for a voice, rebuilding them when they are missing. This supports deterministic XTTS reuse instead of recomputing identity assets on every call.

GET /voices/{voice_id}

Fetch the registry entry for a specific voice. Returns voice metadata, version info, and configuration.

GET /voices/{voice_id}/manifest

Fetch the full manifest for a voice, including all versions, reference stacks, and asset metadata.

Preview Endpoints

Request Parameters (both preview endpoints)

Name	Type	Required	Description
voice_id	string	Required	Voice identifier from the registry
text	string	Required	Text to synthesize into speech
version	string	Optional	Voice version to use. Defaults to latest.
include_audio_base64	boolean	Optional	If true, includes base64-encoded audio in the response (JSON preview only)
persist_preview	boolean	Optional	If true, saves the preview to server-side storage
speech_rate	float	Optional	Speech rate multiplier. Default 1.0
temperature	float	Optional	Sampling temperature for generation. Default 0.7
top_p	float	Optional	Nucleus sampling parameter. Default 0.9
speaker_embedding_json	string	Optional	Pre-computed speaker embedding JSON. Skips embedding computation.
gpt_cond_latent_json	string	Optional	Pre-computed GPT conditioning latent JSON. Skips conditioning computation.
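A client-side check of these parameters before sending a request could look like the sketch below. The function name is illustrative, and the validation only covers the types listed in the table; value ranges are not documented here, so none are enforced.

```javascript
// Expected JavaScript type for each optional preview parameter,
// taken from the parameter table (float maps to "number").
const OPTIONAL_PREVIEW_FIELDS = {
  version: "string",
  include_audio_base64: "boolean",
  persist_preview: "boolean",
  speech_rate: "number",
  temperature: "number",
  top_p: "number",
  speaker_embedding_json: "string",
  gpt_cond_latent_json: "string",
};

// Build a preview request body, rejecting missing or mistyped fields.
function buildPreviewBody(params) {
  const { voice_id, text } = params;
  if (typeof voice_id !== "string" || voice_id === "") {
    throw new TypeError("voice_id is required");
  }
  if (typeof text !== "string" || text === "") {
    throw new TypeError("text is required");
  }
  const body = { voice_id, text };
  for (const [name, type] of Object.entries(OPTIONAL_PREVIEW_FIELDS)) {
    const value = params[name];
    if (value === undefined) continue; // omitted, so the server default applies
    if (typeof value !== type) {
      throw new TypeError(`${name} must be a ${type}`);
    }
    body[name] = value;
  }
  return body;
}
```

Omitting an optional field rather than sending the documented default keeps the request valid if the server-side defaults change later.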

POST /voices/preview

Returns a completed preview payload. The response includes the output path, generation metrics, optional audio_base64, and the voice ID and version used.

Best for:

Waveform analysis and metadata display
Easier persistence and debugging
History and metrics panels
Storing output for later replay

Response Schema

{
  "voice_id": "default_narrator",
  "version": "v1",
  "output_path": "/data/previews/abc123.wav",
  "audio_base64": "UklGRi4A...",
  "metrics": {
    "duration_seconds": 2.3,
    "generation_time_ms": 480,
    "rtf": 0.21
  }
}
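To illustrate how a client might consume this payload, the sketch below decodes the optional audio and recomputes the real-time factor. It assumes `rtf` means generation time divided by audio duration, which is consistent with the sample values (0.48 s / 2.3 s ≈ 0.21) but is not stated on this page. `Buffer` is Node-specific; a browser would need a different base64 decoder.

```javascript
// Summarize a /voices/preview response (Node.js sketch).
function summarizePreview(response) {
  const { duration_seconds, generation_time_ms, rtf } = response.metrics;
  // Assumed definition: seconds spent generating per second of audio.
  const computedRtf = generation_time_ms / 1000 / duration_seconds;
  // audio_base64 is only present when include_audio_base64 was requested.
  const audio = response.audio_base64
    ? Buffer.from(response.audio_base64, "base64")
    : null;
  return {
    reportedRtf: rtf,
    computedRtf,
    fasterThanRealtime: computedRtf < 1,
    audioBytes: audio ? audio.length : 0,
  };
}
```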

cURL Example

curl -X POST "https://api.msjoratio.com/voices/preview" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $VOICE_API_AUTH_TOKEN" \
  -d '{
    "voice_id": "default_narrator",
    "text": "Hello from the Voice API.",
    "include_audio_base64": true,
    "persist_preview": true,
    "speech_rate": 1.0
  }'

JavaScript Example
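The following is a minimal sketch of the same request as the cURL example, written for a server-side runtime with a global `fetch` (Node 18+). The function name and the injectable `fetchImpl` parameter are assumptions added for illustration and testing; error responses are not documented here, so the sketch only reports the HTTP status.

```javascript
const BASE_URL = "https://api.msjoratio.com";

// Request a completed preview and return the parsed JSON payload.
async function generatePreview(token, body, fetchImpl = fetch) {
  const res = await fetchImpl(`${BASE_URL}/voices/preview`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-API-Key": token,
    },
    body: JSON.stringify(body),
  });
  if (!res.ok) {
    throw new Error(`Preview request failed with HTTP ${res.status}`);
  }
  return res.json();
}

// Example call (run server-side so the key stays out of client bundles):
// generatePreview(process.env.VOICE_API_AUTH_TOKEN, {
//   voice_id: "default_narrator",
//   text: "Hello from the Voice API.",
//   include_audio_base64: true,
//   persist_preview: true,
//   speech_rate: 1.0,
// }).then((preview) => console.log(preview.metrics));
```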

POST /voices/preview/stream

Streams audio/mpeg bytes in realtime with Cache-Control: no-store. Custom response headers include voice ID and version.

Best for:

Immediate browser playback
Low-latency preview UIs
Streaming QA workflows
“Listen before you spend” modals

Response Details

Content-Type: audio/mpeg
Cache-Control: no-store
X-Voice-Id: voice identifier
X-Voice-Version: version used

cURL Example

curl -X POST "https://api.msjoratio.com/voices/preview/stream" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $VOICE_API_AUTH_TOKEN" \
  --output preview.mp3 \
  -d '{
    "voice_id": "default_narrator",
    "text": "This is a realtime preview stream.",
    "speech_rate": 1.0
  }'

JavaScript Example
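A streaming client could be sketched as below: it reads the custom response headers, then hands each MP3 chunk to a callback as it arrives. The function name, the `onChunk` callback, and the injectable `fetchImpl` are illustrative assumptions; the sketch relies on the standard `Response.body` reader available in Node 18+ and modern browsers.

```javascript
const BASE_URL = "https://api.msjoratio.com";

// Stream a preview and pass each audio/mpeg chunk (Uint8Array) to onChunk.
async function streamPreview(token, body, onChunk, fetchImpl = fetch) {
  const res = await fetchImpl(`${BASE_URL}/voices/preview/stream`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-API-Key": token,
    },
    body: JSON.stringify(body),
  });
  if (!res.ok) {
    throw new Error(`Stream request failed with HTTP ${res.status}`);
  }
  // Custom headers documented under Response Details.
  const voiceId = res.headers.get("X-Voice-Id");
  const version = res.headers.get("X-Voice-Version");

  const reader = res.body.getReader();
  let totalBytes = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    totalBytes += value.byteLength;
    onChunk(value);
  }
  return { voiceId, version, totalBytes };
}
```

In a browser, `onChunk` might append to a MediaSource buffer for immediate playback; on a server it might write to a file or forward the bytes to the client.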

Typical UI Patterns

"Generate preview" player

"Listen before you spend" modal

Voice picker with live stream

Validation & Bundle

POST /voices/validate

Compares a preview clip against a stored voice identity and returns a validation report. Use to verify that generated audio matches the expected voice profile.

POST /voices/{voice_id}/bundle

Exports a distributable or archival bundle for a voice version. Includes all reference assets, embeddings, and metadata.

Data Model

The core request model for previews includes these key fields. All are passed in the JSON request body.

voice_id, text, version, include_audio_base64, persist_preview, speech_rate, temperature, top_p, speaker_embedding_json, gpt_cond_latent_json

Tip: Pre-compute speaker_embedding_json and gpt_cond_latent_json via the /voices/conditioning endpoint, then pass them to preview calls to skip re-computation and improve latency.
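The tip can be sketched as a small helper that merges a conditioning result into a preview request body. The response field names `speaker_embedding` and `gpt_cond_latent` are hypothetical, since this page does not show the /voices/conditioning response schema; adjust them to match the actual payload.

```javascript
// Attach pre-computed identity assets to a preview request body.
// `speaker_embedding` and `gpt_cond_latent` are hypothetical field names
// for the /voices/conditioning response.
function withConditioning(previewBody, conditioning) {
  return {
    ...previewBody,
    // Both preview parameters are documented as JSON strings.
    speaker_embedding_json: JSON.stringify(conditioning.speaker_embedding),
    gpt_cond_latent_json: JSON.stringify(conditioning.gpt_cond_latent),
  };
}
```

Fetching the conditioning payload once per voice version and reusing it across preview calls is what saves the recomputation; the helper itself only handles the serialization step.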

Operational Notes

Use /voices/preview when:

You want a single completed JSON response
You need metrics and stored output metadata
You want to inspect or store the audio data

Use /voices/preview/stream when:

Low-latency playback matters
You are building a live preview player
You don't need metadata alongside audio
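The guidance above can be expressed as a small decision helper. The option names are illustrative assumptions; the rule encoded here is that any need for metrics or stored output selects the JSON endpoint, and streaming is chosen only when latency matters and metadata does not.

```javascript
// Pick a preview endpoint from the caller's requirements.
// Option names are hypothetical and exist only for this sketch.
function choosePreviewEndpoint({
  needsMetrics = false,
  needsStoredOutput = false,
  lowLatency = false,
} = {}) {
  // The stream response carries audio bytes only, so metadata needs win.
  if (needsMetrics || needsStoredOutput) {
    return "/voices/preview";
  }
  return lowLatency ? "/voices/preview/stream" : "/voices/preview";
}
```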

Deployment Architecture

Production domain: api.msjoratio.com
Caddy terminates TLS publicly
App binds locally on the VM; Caddy reverse proxies to the Voice API app port
