Text-to-Speech
Overview
The text-to-speech endpoint in Livepeer utilizes Parler-TTS, specifically parler-tts/parler-tts-large-v1. This model can generate speech with customizable characteristics such as voice type, speaking style, and audio quality.
Basic Usage Instructions
For a detailed understanding of the text-to-speech endpoint and to experiment with the API, see the Livepeer AI API Reference.
To use the text-to-speech feature, submit a POST request to the /text-to-speech endpoint. Here’s an example of how to structure your request:
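The sketch below uses curl; the gateway host and Bearer token are placeholders for illustration, and the exact request and response shapes are documented in the Livepeer AI API Reference.

```bash
# Minimal sketch of a text-to-speech request.
# <GATEWAY_URL> and <YOUR_API_TOKEN> are placeholders.
curl -X POST "https://<GATEWAY_URL>/text-to-speech" \
  -H "Authorization: Bearer <YOUR_API_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
        "model_id": "parler-tts/parler-tts-large-v1",
        "text": "Hello from the Livepeer AI network!",
        "description": "A female speaker with a clear, expressive voice and no background noise."
      }'
```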
Request Parameters
- model_id: The ID of the text-to-speech model to use. Currently, this should be set to "parler-tts/parler-tts-large-v1".
- text: The text you want to convert to speech.
- description: A description of the desired voice characteristics. This can include details about the speaker’s voice, speaking style, and audio quality.
Voice Customization
You can customize the generated voice by adjusting the description parameter. Some aspects you can control include:
- Speaker identity (e.g., “Jon’s voice”)
- Speaking style (e.g., “monotone”, “expressive”)
- Speaking speed (e.g., “slightly fast”)
- Audio quality (e.g., “very close recording”, “no background noise”)
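For example, a description such as “Jon’s voice is monotone yet slightly fast in delivery, with a very close recording and no background noise” combines several of these aspects into a single prompt.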
The checkpoint was trained on 34 speakers. The full list of available speakers includes: Laura, Gary, Jon, Lea, Karen, Rick, Brenda, David, Eileen, Jordan, Mike, Yann, Joy, James, Eric, Lauren, Rose, Will, Jason, Aaron, Naomie, Alisa, Patrick, Jerry, Tina, Jenna, Bill, Tom, Carol, Barbara, Rebecca, Anna, Bruce, and Emily.
However, the model performs better with certain speakers. A list of the top 20 speakers for each model variant, ranked by their average speaker similarity scores, can be found in the Parler-TTS documentation.
Limitations and Considerations
- The maximum length of the input text may be limited; the default Parler-TTS training configuration targets clips of at most 30 seconds and text of at most 600 characters (see https://github.com/huggingface/parler-tts/blob/main/training/README.md#3-training). For long-form content, split your text into smaller chunks, as in the sketch after this list.
- While the model supports various voice characteristics, the exact replication of a specific speaker’s voice is not guaranteed.
- The quality of the generated speech can vary based on the complexity of the input text and the specificity of the voice description.
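As a rough sketch of chunking, assuming the input is plain text and using fold and jq purely for illustration (the gateway host and token are placeholders):

```bash
# Wrap a long text into chunks of at most 600 characters at word boundaries,
# then request speech for each chunk separately.
fold -s -w 600 long_text.txt | while IFS= read -r chunk; do
  curl -X POST "https://<GATEWAY_URL>/text-to-speech" \
    -H "Authorization: Bearer <YOUR_API_TOKEN>" \
    -H "Content-Type: application/json" \
    -d "$(jq -n --arg text "$chunk" \
          '{model_id: "parler-tts/parler-tts-large-v1",
            text: $text,
            description: "A female speaker with a clear, expressive voice."}')"
done
```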
Orchestrator Configuration
To configure your Orchestrator to serve the text-to-speech pipeline, refer to the Orchestrator Configuration guide.
System Requirements
The following system requirements are recommended for optimal performance:
- NVIDIA GPU with at least 12GB of VRAM.
Recommended Pipeline Pricing
We are planning to simplify the pricing in the future so orchestrators can set one AI price per compute unit and have the system automatically scale based on the model’s compute requirements.
The pricing for the text-to-speech pipeline is based on competitor pricing. However, we strongly encourage orchestrators to set their own pricing based on their costs and requirements. Setting a competitive price will help attract more jobs, as Gateways can set their maximum price for a job. The current recommended pricing for this pipeline is 1.5e-6 USD per character (for example, a 1,000-character request would cost 0.0015 USD).
Pipeline-Specific Image
To serve the text-to-speech pipeline, you must use a pipeline-specific AI Runner container. Pull the required container from Docker Hub using the following command:
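Assuming the image follows the livepeer/ai-runner:<pipeline> tagging convention used by other pipeline-specific images (verify the exact tag on Docker Hub):

```bash
# Pull the pipeline-specific AI Runner image for text-to-speech.
docker pull livepeer/ai-runner:text-to-speech
```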
API Reference
Explore the text-to-speech endpoint and experiment with the API in the Livepeer AI API Reference.