Overview

The audio-to-text pipeline converts audio from media files into text using automatic speech recognition (ASR) models served through Hugging Face’s automatic-speech-recognition pipeline.

Models

Warm Models

The current warm model requested for the audio-to-text pipeline is:

  • openai/whisper-large-v3: Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation.

For faster responses with a different audio-to-text model, ask Orchestrators to load it on their GPUs via the ai-video channel in the Discord server.

On-Demand Models

The following models have been tested and verified for the audio-to-text pipeline:

If a specific model you wish to use is not listed, please submit a feature request on GitHub to get the model verified and added to the list.

Basic Usage Instructions

For a detailed understanding of the audio-to-text endpoint and to experiment with the API, see the Livepeer AI API Reference.

To create an audio transcript using the audio-to-text pipeline, submit a POST request to the Gateway’s audio-to-text API endpoint:

curl -X POST "https://<GATEWAY_IP>/audio-to-text" \
    -F model_id=openai/whisper-large-v3 \
    -F audio=@<PATH_TO_FILE>

In this command:

  • <GATEWAY_IP> should be replaced with your AI Gateway’s IP address.
  • model_id is the ASR model used for transcription.
  • audio is the path to the audio file to be transcribed.
  • Supported file types: mp4, webm, mp3, flac, wav and m4a.
  • Maximum request size: 50 MB.
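The file-type and size limits listed above can be checked locally before a request is sent. The following is a minimal sketch, not part of the Livepeer tooling: the function name check_audio_file is hypothetical, the check uses the file extension only, and it assumes "50 MB" means 50 × 1024 × 1024 bytes.

```python
from pathlib import Path
from typing import List

# Extensions and limit taken from the parameter list above.
SUPPORTED_EXTENSIONS = {".mp4", ".webm", ".mp3", ".flac", ".wav", ".m4a"}
# Assumption: "50 MB" is interpreted as 50 MiB.
MAX_REQUEST_BYTES = 50 * 1024 * 1024


def check_audio_file(path: str) -> List[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    file = Path(path)
    # Compare the extension case-insensitively; the file contents are not inspected.
    if file.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {file.suffix or '(none)'}")
    if not file.is_file():
        problems.append("file does not exist")
    elif file.stat().st_size > MAX_REQUEST_BYTES:
        # The limit applies to the whole request, so a file just under the
        # limit may still be rejected once the form overhead is added.
        problems.append("file exceeds the 50 MB request limit")
    return problems
```

Calling check_audio_file("<PATH_TO_FILE>") and submitting the request only when the returned list is empty avoids uploading a file the Gateway would likely reject.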

For additional optional parameters, refer to the Livepeer AI API Reference.
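The same request can also be built from Python. This is a sketch using only the standard library, and it mirrors the two form fields of the curl command (model_id and audio). The helper names build_audio_to_text_request and transcribe are hypothetical, and any authentication your Gateway may require is not covered.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_audio_to_text_request(gateway_url, audio_path,
                                model_id="openai/whisper-large-v3"):
    """Build (but do not send) a multipart/form-data POST for /audio-to-text."""
    boundary = uuid.uuid4().hex
    audio = Path(audio_path)
    # Fall back to a generic type when the extension is not recognized.
    content_type = mimetypes.guess_type(audio.name)[0] or "application/octet-stream"
    body = b"".join([
        # Form field: model_id
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="model_id"\r\n\r\n',
        model_id.encode() + b"\r\n",
        # Form field: audio (file upload)
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="audio"; '
        f'filename="{audio.name}"\r\n'.encode(),
        f"Content-Type: {content_type}\r\n\r\n".encode(),
        audio.read_bytes() + b"\r\n",
        # Closing boundary
        f"--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{gateway_url.rstrip('/')}/audio-to-text",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(gateway_url, audio_path):
    """Send the request and return the decoded JSON response."""
    request = build_audio_to_text_request(gateway_url, audio_path)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, transcribe("https://<GATEWAY_IP>", "<PATH_TO_FILE>") sends the same fields as the curl command. Building the request separately from sending it makes the form body easy to inspect without a running Gateway.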

After execution, the Orchestrator processes the request and returns the response to the Gateway:

{
  "chunks": [
    {
      "text": " Explore the power of automatic speech recognition",
      "timestamp": [0, 1.35]
    },
    {
      "text": " By extracting the text from audio",
      "timestamp": [1.35, 2.07]
    }
  ],
  "text": " Explore the power of automatic speech recognition By extracting the text from audio"
}
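The response above can be consumed chunk by chunk. The sketch below parses the sample response and prints each chunk with its time range. It assumes the timestamps are in seconds. The helper names are hypothetical, and treating a missing timestamp bound as unknown is a defensive assumption, not something the response format above documents.

```python
import json

# Sample response copied from the example above.
SAMPLE_RESPONSE = """
{
  "chunks": [
    {"text": " Explore the power of automatic speech recognition",
     "timestamp": [0, 1.35]},
    {"text": " By extracting the text from audio",
     "timestamp": [1.35, 2.07]}
  ],
  "text": " Explore the power of automatic speech recognition By extracting the text from audio"
}
"""


def format_time(value):
    """Format a timestamp, assumed to be in seconds; None is shown as unknown."""
    return "?" if value is None else f"{value:.2f}s"


def format_chunks(response):
    """Return one '[start - end] text' line per chunk in the response."""
    lines = []
    for chunk in response.get("chunks", []):
        start, end = chunk["timestamp"]
        # Chunk text carries a leading space in the sample, so strip it.
        lines.append(f"[{format_time(start)} - {format_time(end)}] {chunk['text'].strip()}")
    return lines


transcript = json.loads(SAMPLE_RESPONSE)
for line in format_chunks(transcript):
    print(line)
```

The top-level text field holds the full transcript, so transcript["text"].strip() is enough when timestamps are not needed.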

Orchestrator Configuration

To configure your Orchestrator to serve the audio-to-text pipeline, refer to the Orchestrator Configuration guide. The following system requirements are recommended for optimal performance:

API Reference

Explore the audio-to-text endpoint and experiment with the API in the Livepeer AI API Reference.