Overview
The audio-to-text pipeline converts audio from media files into text, using speech recognition models from HuggingFace’s automatic-speech-recognition (ASR) pipeline.
Models
Warm Models
The current warm model requested for the audio-to-text pipeline is:
- openai/whisper-large-v3: Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation.
For faster responses with a different audio-to-text model, ask Orchestrators to load it on their GPU via the ai-video channel on the Discord server.
On-Demand Models
The following models have been tested and verified for the audio-to-text pipeline:
If a specific model you wish to use is not listed, please submit a feature request on GitHub to get the model verified and added to the list.
Tested and Verified Models
- openai/whisper-large-v3: A high-performance ASR model by OpenAI.
Basic Usage Instructions
For a detailed understanding of the audio-to-text endpoint and to experiment with the API, see the Livepeer AI API Reference.
To use the audio-to-text pipeline, submit a POST request to the Gateway’s audio-to-text API endpoint.
In this request:
- <GATEWAY_IP> should be replaced with your AI Gateway’s IP address.
- model_id is the model used for audio transcription.
- audio is the path to the audio file to be transcribed.
- Supported file types: mp4, webm, mp3, flac, wav and m4a
- Maximum request size: 50 MB
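The request described above can be sketched in Python using only the standard library. This is a minimal illustration, not the Gateway's documented client: the URL path /audio-to-text, the http scheme, the multipart/form-data encoding, and the reading of "50 MB" as 50 MiB are all assumptions, and build_request is a hypothetical helper name. The function checks the file extension and the request size against the limits listed above, then builds the POST request without sending it.

```python
import mimetypes
import os
import urllib.request
import uuid

# Limits taken from the list above.
SUPPORTED_EXTENSIONS = {".mp4", ".webm", ".mp3", ".flac", ".wav", ".m4a"}
MAX_REQUEST_BYTES = 50 * 1024 * 1024  # assumption: "50 MB" interpreted as 50 MiB


def build_request(gateway_ip: str, model_id: str, filename: str,
                  audio: bytes) -> urllib.request.Request:
    """Build (but do not send) a multipart POST carrying model_id and audio."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {ext or '(none)'}")

    boundary = uuid.uuid4().hex
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model_id"\r\n\r\n'
        f"{model_id}\r\n"
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="audio"; '
        f'filename="{os.path.basename(filename)}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode()
    body = head + audio + f"\r\n--{boundary}--\r\n".encode()

    # The limit is stated for the whole request, so check the encoded body.
    if len(body) > MAX_REQUEST_BYTES:
        raise ValueError(f"request is {len(body)} bytes, over the 50 MB limit")

    # Assumption: the endpoint is served at http://<GATEWAY_IP>/audio-to-text.
    return urllib.request.Request(
        f"http://{gateway_ip}/audio-to-text",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

To send the request, read the audio file as bytes, pass it to build_request together with your Gateway's address and openai/whisper-large-v3 as the model_id, and hand the result to urllib.request.urlopen. Consult the Livepeer AI API Reference for the actual endpoint path and response format.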
Orchestrator Configuration
To configure your Orchestrator to serve the audio-to-text pipeline, refer to the Orchestrator Configuration guide.
System Requirements
The following system requirements are recommended for optimal performance:
- NVIDIA GPU with at least 12GB of VRAM.
Recommended Pipeline Pricing
We are planning to simplify the pricing in the future so orchestrators can set one AI price per compute unit and have the system automatically scale based on the model’s compute requirements.
The recommended pricing for the audio-to-text pipeline is based on competitor pricing. However, we strongly encourage orchestrators to set their own pricing based on their costs and requirements. Setting a competitive price will help attract more jobs, as Gateways can set their maximum price for a job. The currently recommended pricing for this pipeline is 0.02e-6 USD per millisecond of audio input.
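To put the per-millisecond figure in perspective, one hour of audio is 3,600,000 ms, which at 0.02e-6 USD per millisecond comes to 0.072 USD. The short sketch below does the same arithmetic for any clip length. The function name audio_cost_usd is hypothetical, and the calculation assumes billing is strictly linear in audio duration, which the text above implies but does not spell out.

```python
from decimal import Decimal

# Recommended price quoted above: 0.02e-6 USD per millisecond of audio input.
PRICE_USD_PER_MS = Decimal("0.02e-6")


def audio_cost_usd(duration_seconds: float,
                   price_per_ms: Decimal = PRICE_USD_PER_MS) -> Decimal:
    """Return the cost in USD for a clip of the given length.

    Assumes cost scales linearly with duration (an assumption, not a
    documented billing rule).
    """
    duration_ms = Decimal(str(duration_seconds)) * 1000
    return duration_ms * price_per_ms


print(audio_cost_usd(60))    # one minute of audio
print(audio_cost_usd(3600))  # one hour of audio
```

Decimal is used instead of float so that very small per-millisecond prices do not pick up rounding noise; an Orchestrator comparing its own price only needs to pass a different price_per_ms.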
API Reference
Explore the audio-to-text endpoint and experiment with the API in the Livepeer AI API Reference.