Audio-to-Text
Overview
The audio-to-text pipeline converts audio from media files into text using speech recognition models from HuggingFace’s automatic-speech-recognition (ASR) pipeline.
Models
Warm Models
The current warm model requested for the audio-to-text pipeline is:
- openai/whisper-large-v3: Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation.
For faster responses with a different audio-to-text model, ask Orchestrators to load it on their GPU via the ai-video channel in the Discord server.
On-Demand Models
The following models have been tested and verified for the audio-to-text pipeline:
If a specific model you wish to use is not listed, please submit a feature request on GitHub to get the model verified and added to the list.
Basic Usage Instructions
For a detailed understanding of the audio-to-text endpoint and to experiment with the API, see the Livepeer AI API Reference.
To create an audio transcript using the audio-to-text pipeline, submit a POST request to the Gateway’s audio-to-text API endpoint:
curl -X POST "https://<GATEWAY_IP>/audio-to-text" \
-F model_id=openai/whisper-large-v3 \
-F audio=@<PATH_TO_FILE>
In this command:
- <GATEWAY_IP> should be replaced with your AI Gateway’s IP address.
- model_id is the speech recognition model used for audio transcription.
- audio is the path to the audio file to be transcribed.
  - Supported file types: mp4, webm, mp3, flac, wav, and m4a
  - Maximum request size: 50 MB
For additional optional parameters, refer to the Livepeer AI API Reference.
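The file-type and size limits above can be checked locally before a request is sent. The following is a minimal Python sketch, not part of the Livepeer tooling: the function name check_audio_file is hypothetical, "50 MB" is assumed to mean 50 * 1024 * 1024 bytes, and the check looks only at the file on disk, so it ignores the small overhead that multipart encoding adds to the request.

```python
import os

# Extensions listed in the documentation above.
SUPPORTED_EXTENSIONS = {".mp4", ".webm", ".mp3", ".flac", ".wav", ".m4a"}
# Assumption: "50 MB" is interpreted as 50 * 1024 * 1024 bytes.
MAX_REQUEST_BYTES = 50 * 1024 * 1024


def check_audio_file(path):
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    # Compare the extension case-insensitively against the supported set.
    extension = os.path.splitext(path)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")
    if not os.path.isfile(path):
        problems.append("file does not exist")
    elif os.path.getsize(path) > MAX_REQUEST_BYTES:
        problems.append("file exceeds the 50 MB request limit")
    return problems
```

Calling check_audio_file on the path you intend to pass as the audio field returns an empty list when both checks pass. The Gateway remains the authority on what it accepts, so treat this only as an early warning.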
After execution, the Orchestrator processes the request and returns the response to the Gateway:
{
  "chunks": [
    {
      "text": " Explore the power of automatic speech recognition",
      "timestamp": [0, 1.35]
    },
    {
      "text": " By extracting the text from audio",
      "timestamp": [1.35, 2.07]
    }
  ],
  "text": " Explore the power of automatic speech recognition By extracting the text from audio"
}
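The response pairs each transcribed chunk with a [start, end] timestamp in seconds. As an illustration, the Python sketch below turns such a response into timestamped lines. The helper name format_chunks is hypothetical, and the handling of a missing end time is a defensive assumption rather than documented API behavior.

```python
import json


def format_chunks(response):
    """Render each chunk as '[start - end] text', one string per chunk."""
    lines = []
    for chunk in response.get("chunks", []):
        start, end = chunk["timestamp"]
        # Assumption: an end time may be missing (null); label it instead of failing.
        end_label = f"{end:.2f}s" if end is not None else "end"
        lines.append(f"[{start:.2f}s - {end_label}] {chunk['text'].strip()}")
    return lines


# The example response shown above, parsed from JSON.
sample = json.loads("""
{
  "chunks": [
    {"text": " Explore the power of automatic speech recognition", "timestamp": [0, 1.35]},
    {"text": " By extracting the text from audio", "timestamp": [1.35, 2.07]}
  ],
  "text": " Explore the power of automatic speech recognition By extracting the text from audio"
}
""")

for line in format_chunks(sample):
    print(line)
```

The chunk text carries a leading space, which the sketch strips before printing. The top-level text field already holds the full transcript, so the per-chunk view is only needed when timing matters, for example when producing captions.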
Orchestrator Configuration
To configure your Orchestrator to serve the audio-to-text pipeline, refer to the Orchestrator Configuration guide. The following system requirements are recommended for optimal performance:
- NVIDIA GPU with at least 12GB of VRAM.
API Reference
Explore the audio-to-text endpoint and experiment with the API in the Livepeer AI API Reference.