Overview

The audio-to-text pipeline converts audio from media files into text, using speech recognition models from HuggingFace’s automatic-speech-recognition (ASR) pipeline.

Models

Warm Models

The current warm model requested for the audio-to-text pipeline is:

  • openai/whisper-large-v3: Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation.

For faster responses with a different audio-to-text model, ask Orchestrators to load it on their GPU via the ai-video channel in the Discord server.

On-Demand Models

The following models have been tested and verified for the audio-to-text pipeline:

If a specific model you wish to use is not listed, please submit a feature request on GitHub to get the model verified and added to the list.

Basic Usage Instructions

For a detailed understanding of the audio-to-text endpoint and to experiment with the API, see the AI Subnet API Reference.

To create an audio transcript using the audio-to-text pipeline, submit a POST request to the Gateway’s audio-to-text API endpoint:

curl -X POST "https://<gateway-ip>/audio-to-text" \
    -F model_id=openai/whisper-large-v3 \
    -F audio=@<PATH_TO_FILE>

In this command:

  • <gateway-ip> should be replaced with your AI Gateway’s IP address.
  • model_id is the speech recognition model used for transcription.
  • audio is the path to the audio file to be transcribed.
  • Supported file types: mp4, webm, mp3, flac, wav and m4a.
  • Maximum request size: 50 MB.
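The file type and size limits above can be checked locally before a request is sent. The following is a minimal sketch, not part of the pipeline itself: the function name check_audio_file and both constants are hypothetical, and it assumes "50 MB" means 50 × 1024 × 1024 bytes, which the documentation does not specify. Because the limit applies to the whole request, a file just under the limit may still be rejected once the form overhead is added.

```python
from pathlib import Path

# Extensions taken from the supported file types listed above.
SUPPORTED_EXTENSIONS = {".mp4", ".webm", ".mp3", ".flac", ".wav", ".m4a"}
# Assumption: "50 MB" is interpreted as 50 MiB; the docs do not say which.
MAX_REQUEST_BYTES = 50 * 1024 * 1024


def check_audio_file(path):
    """Return a list of problems; an empty list means the file looks uploadable."""
    file = Path(path)
    problems = []
    if file.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {file.suffix or '(none)'}")
    if not file.is_file():
        problems.append("file does not exist")
    elif file.stat().st_size > MAX_REQUEST_BYTES:
        problems.append("file exceeds the 50 MB request limit")
    return problems
```

An empty list from check_audio_file("speech.mp3") suggests the file is safe to submit; any other result names what to fix first.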

For additional optional parameters, refer to the AI Subnet API Reference.
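For callers who prefer a script over curl, the same multipart POST can be sketched in Python using only the standard library. This is an illustration under stated assumptions rather than an official client: build_multipart and transcribe are hypothetical helper names, the gateway URL is whatever your Gateway exposes, and the code assumes the endpoint accepts exactly the two form fields shown in the curl command (model_id and audio) and replies with JSON.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_multipart(fields, file_field, file_path):
    """Encode text fields plus one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    path = Path(file_path)
    file_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{path.name}"\r\nContent-Type: {file_type}\r\n\r\n'.encode()
        + path.read_bytes()
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(gateway_url, audio_path, model_id="openai/whisper-large-v3", timeout=120):
    """POST the audio file to <gateway_url>/audio-to-text and return the parsed JSON."""
    body, content_type = build_multipart({"model_id": model_id}, "audio", audio_path)
    request = urllib.request.Request(
        f"{gateway_url.rstrip('/')}/audio-to-text",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling transcribe("https://<gateway-ip>", "<PATH_TO_FILE>") with real values mirrors the curl command above; the timeout of 120 seconds is an arbitrary choice for long recordings.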

After execution, the Orchestrator processes the request and returns the response to the Gateway:

{
    "chunks": [
        {
            "text": " Explore the power of automatic speech recognition",
            "timestamp": [
                0,
                1.35
            ]
        },
        {
            "text": " By extracting the text from audio",
            "timestamp": [
                1.35,
                2.07
            ]
        }
    ],
    "text": " Explore the power of automatic speech recognition By extracting the text from audio"
}
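The response pairs each chunk of text with a [start, end] timestamp in seconds, which makes it straightforward to produce a timestamped transcript. The sketch below shows one way to do that; format_chunks is a hypothetical helper, and treating a missing end time as open-ended is an assumption, since the example response above always includes both values.

```python
def format_chunks(response):
    """Turn an audio-to-text response into one timestamped line per chunk."""
    lines = []
    for chunk in response.get("chunks", []):
        start, end = chunk["timestamp"]
        # Assumption: an end time of None means the chunk runs to the end of the audio.
        end_label = f"{end:.2f}s" if end is not None else "end"
        lines.append(f"[{start:.2f}s - {end_label}] {chunk['text'].strip()}")
    return lines


# Sample data copied from the example response above.
sample_response = {
    "chunks": [
        {"text": " Explore the power of automatic speech recognition", "timestamp": [0, 1.35]},
        {"text": " By extracting the text from audio", "timestamp": [1.35, 2.07]},
    ],
    "text": " Explore the power of automatic speech recognition By extracting the text from audio",
}

for line in format_chunks(sample_response):
    print(line)
# [0.00s - 1.35s] Explore the power of automatic speech recognition
# [1.35s - 2.07s] By extracting the text from audio
```

The top-level text field already holds the full transcript, so the chunks are only needed when timing information matters, for example when building captions.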

API Reference

Explore the audio-to-text endpoint and experiment with the API in the AI Subnet API Reference.