Audio and vision pipelines have lower competition than diffusion pipelines. An operator who adds audio-to-text or image-to-text earns from a less saturated market while using GPU resources that would otherwise sit idle between diffusion jobs.

Four non-diffusion, non-LLM pipelines are available on the Livepeer AI network: audio-to-text, text-to-speech, image-to-text, and segment-anything-2. All use the standard livepeer/ai-runner container — the same one diffusion pipelines use. go-livepeer manages the container lifecycle automatically. Each pipeline has a different VRAM footprint and a different pricing unit. Each section below includes the complete aiModels.json entry required to enable that pipeline.

Pipeline overview

audio-to-text (Whisper)

audio-to-text transcribes audio to text with timestamps. The network-standard model is openai/whisper-large-v3, which most gateway operators request by default. Running a non-standard model means fewer jobs routed your way.
VRAM: ~3 GB warm
Pricing unit: Per millisecond of audio input
Competitive note: Whisper is VRAM-efficient. A 12 GB or 24 GB card supports a warm Whisper deployment alongside a diffusion model when those workloads are split across available GPU headroom.

aiModels.json entry

audio-to-text entry
{
  "pipeline": "audio-to-text",
  "model_id": "openai/whisper-large-v3",
  "price_per_unit": 12882811,
  "pixels_per_unit": 1,
  "warm": true
}
price_per_unit here is in wei per millisecond of audio. 12882811 wei per millisecond works out to approximately $0.0000014 per second of audio at late-2025 ETH/USD rates.
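To sanity-check a price_per_unit value, convert wei per pricing unit into USD at an assumed ETH/USD rate. The helper below is a generic sketch; the $3000/ETH figure is an illustrative placeholder, not a network value — substitute the current market rate when estimating real earnings.

```python
# Convert a pipeline's price_per_unit (wei per pricing unit) into USD.
# The ETH/USD rate is a placeholder for illustration only.

WEI_PER_ETH = 10**18

def price_in_usd(wei_per_unit: int, units: int, eth_usd: float) -> float:
    """USD cost for `units` pricing units at `wei_per_unit` wei each."""
    eth = wei_per_unit * units / WEI_PER_ETH
    return eth * eth_usd

# audio-to-text: 12882811 wei per millisecond -> cost of one minute of audio
minute_ms = 60 * 1000
cost = price_in_usd(12_882_811, minute_ms, eth_usd=3000.0)
print(f"1 minute of audio costs ${cost:.6f} at an assumed $3000/ETH")
```

The same helper works for any of the pipelines on this page: pass the entry's price_per_unit and the number of pricing units (milliseconds, characters, or pixels) consumed.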

Testing

After restarting the AI worker, check container health:
Check audio-to-text containers and logs
docker ps --filter name=livepeer-ai-runner
docker logs <audio-to-text-container> --tail 50
Verify registration:
Verify audio-to-text registration
# Your address should appear under audio-to-text at tools.livepeer.cloud
tools.livepeer.cloud/ai/network-capabilities
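Once registered, you can exercise the pipeline end-to-end through a gateway that routes to your orchestrator. The sketch below assumes the gateway exposes an /audio-to-text route taking a multipart form with audio and model_id fields, which matches the Livepeer AI gateway API at the time of writing — verify the route and field names against your gateway's API reference. The gateway hostname is a placeholder; the helper only builds the request shape.

```python
# Build (but do not send) an audio-to-text request. Route and field names
# are assumptions based on the Livepeer AI gateway API -- verify locally.

def build_audio_to_text_request(gateway: str, audio_path: str,
                                model_id: str = "openai/whisper-large-v3"):
    """Return (url, form fields, file fields) for a multipart POST."""
    url = f"{gateway.rstrip('/')}/audio-to-text"
    fields = {"model_id": model_id}
    files = {"audio": audio_path}  # pass open(audio_path, "rb") when sending
    return url, fields, files

url, fields, files = build_audio_to_text_request(
    "https://<your-gateway>", "sample.mp3")
print(url)

# To send for real (requires the `requests` package):
#   import requests
#   resp = requests.post(url, data=fields,
#                        files={"audio": open("sample.mp3", "rb")})
#   print(resp.json())
```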

text-to-speech

text-to-speech synthesises natural speech from text input. Demand is growing as AI video narration use cases expand on the network.
VRAM: Varies by model
Pricing unit: Per character, or per millisecond of output audio (model-dependent)
Model: suno/bark is the documented baseline model for this pipeline.

aiModels.json entry

text-to-speech entry
{
  "pipeline": "text-to-speech",
  "model_id": "suno/bark",
  "price_per_unit": 5960465
}
price_per_unit is in wei per pricing unit. Adjust based on the per-character or per-millisecond rate for your model.

Testing

After startup, verify the container is running and the pipeline appears registered at tools.livepeer.cloud/ai/network-capabilities under text-to-speech.
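Unlike audio-to-text, this pipeline takes its input as a JSON body rather than an uploaded file. The sketch below builds such a body; the /text-to-speech route and the model_id and text field names are assumptions based on the pipeline's documented inputs, and the gateway hostname is a placeholder — confirm both against your gateway's API reference.

```python
# Build a text-to-speech request body. Route and field names are
# assumptions -- confirm them against your gateway's API reference.
import json

def build_text_to_speech_request(gateway: str, text: str,
                                 model_id: str = "suno/bark"):
    """Return (url, JSON body) for a text-to-speech POST."""
    url = f"{gateway.rstrip('/')}/text-to-speech"
    body = {"model_id": model_id, "text": text}
    return url, json.dumps(body)

url, body = build_text_to_speech_request(
    "https://<your-gateway>", "Hello from the Livepeer AI network.")
print(url)
print(body)
```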

image-to-text

image-to-text generates text descriptions from images using a vision-language model. The low VRAM requirement makes this the most accessible AI pipeline for operators without high-end GPUs.
VRAM: ~1–2 GB (BLIP large)
Pricing unit: Per input pixel
Entry point: Runs on 4 GB GPUs. Operators below the 24 GB diffusion threshold still participate through image-to-text and audio-to-text.

aiModels.json entry

image-to-text entry
{
  "pipeline": "image-to-text",
  "model_id": "Salesforce/blip-image-captioning-large",
  "price_per_unit": 1192093,
  "warm": true
}
1192093 wei per input pixel is approximately $0.000125 per megapixel at late-2025 ETH/USD rates. Image-to-text pricing is lower than diffusion pipelines because the compute cost is lower.
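Because the pricing unit is the input pixel, the fee scales with image resolution. A quick sanity check, with the ETH/USD rate as an illustrative placeholder:

```python
# Cost of captioning one image at per-pixel pricing. The ETH/USD rate is
# a placeholder for illustration, not a network value.
WEI_PER_ETH = 10**18
WEI_PER_PIXEL = 1_192_093  # image-to-text price_per_unit from this page

def image_cost_usd(width: int, height: int, eth_usd: float) -> float:
    """USD cost of one width x height image at per-pixel pricing."""
    return WEI_PER_PIXEL * width * height / WEI_PER_ETH * eth_usd

print(f"1024x1024 image: ${image_cost_usd(1024, 1024, 3000.0):.5f}")
```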

Testing

Inspect image-to-text container logs
docker logs <image-to-text-container> --tail 30
Check tools.livepeer.cloud/ai/network-capabilities for registration status.

segment-anything-2

segment-anything-2 (SAM2) performs promptable segmentation: given an image or video frame and a point or bounding box prompt, it returns pixel masks for the identified object or region. The pipeline is compute-intensive and has lower competition than diffusion pipelines.
VRAM: 12–24 GB depending on model variant
Pricing unit: Per input pixel
Model variants: SAM2 has multiple size variants. facebook/sam2-hiera-large is the standard choice.

aiModels.json entry

segment-anything-2 entry
{
  "pipeline": "segment-anything-2",
  "model_id": "facebook/sam2-hiera-large",
  "price_per_unit": 4768371
}
segment-anything-2 usually stays cold until demand justifies the VRAM cost. The model then loads on the first request.

Testing

After the AI worker starts, verify the pipeline container is running:
Check segment-anything-2 containers
docker ps --filter name=livepeer-ai-runner
Check registration at tools.livepeer.cloud/ai/network-capabilities under segment-anything-2.
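SAM2 requests carry prompt data alongside the image. The sketch below assumes point prompts are passed as JSON-encoded form fields named point_coords and point_labels next to the image in a multipart request; those field names are an assumption based on the runner's API at the time of writing — verify them against your gateway's API reference before use.

```python
# Build form fields for a promptable-segmentation request. Field names are
# assumptions about the runner's multipart API -- verify before relying on
# them.
import json

def build_sam2_fields(model_id: str = "facebook/sam2-hiera-large",
                      points=None, labels=None):
    """points: [[x, y], ...] pixel coordinates; labels: 1 = foreground."""
    fields = {"model_id": model_id}
    if points is not None:
        fields["point_coords"] = json.dumps(points)
        fields["point_labels"] = json.dumps(labels or [1] * len(points))
    return fields

fields = build_sam2_fields(points=[[320, 240]])
print(fields)
```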

Running multiple pipelines

Audio and vision pipelines run alongside diffusion pipelines on the same node when the VRAM budget supports them. Example configuration for a node with a diffusion model warm and Whisper also warm:
Multi-pipeline aiModels.json
[
  {
    "pipeline": "text-to-image",
    "model_id": "SG161222/RealVisXL_V4.0_Lightning",
    "price_per_unit": 4768371,
    "warm": true
  },
  {
    "pipeline": "audio-to-text",
    "model_id": "openai/whisper-large-v3",
    "price_per_unit": 12882811,
    "pixels_per_unit": 1,
    "warm": true
  },
  {
    "pipeline": "image-to-text",
    "model_id": "Salesforce/blip-image-captioning-large",
    "price_per_unit": 1192093
  }
]
In this configuration, text-to-image and audio-to-text are marked warm (together they fit within a 24 GB VRAM budget), and image-to-text is cold, loading on first request.
Note, however, that during the Beta phase only one warm model per GPU is supported, so a single physical GPU keeps either text-to-image or audio-to-text warm, not both. Split the two warm models across separate GPUs, or mark one of them cold. Check the startup logs for Error loading warm model if a warm model fails to load.
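The one-warm-model-per-GPU constraint can be checked before restarting the worker. The sketch below is a planning helper, not part of go-livepeer; the VRAM figures are the rough estimates from this page (an assumed ~10 GB for an SDXL-class diffusion model), not measured values.

```python
# Planning helper: assign each warm model its own GPU (Beta-phase rule)
# while checking the model fits the card's VRAM. VRAM figures are rough
# estimates from this page, not measured values.

GPU_VRAM_GB = {0: 24, 1: 24}                   # physical GPUs and capacity
WARM_VRAM_GB = {
    "SG161222/RealVisXL_V4.0_Lightning": 10,   # assumed SDXL-class footprint
    "openai/whisper-large-v3": 3,
}

def assign_warm_models(gpus: dict, models: dict) -> dict:
    """Map each warm model to a dedicated GPU; raise if impossible."""
    assignment, free = {}, dict(gpus)
    for model, need in models.items():
        gpu = next((g for g, cap in free.items() if cap >= need), None)
        if gpu is None:
            raise RuntimeError(f"no free GPU with {need} GB for {model}")
        assignment[model] = gpu
        free.pop(gpu)                          # Beta: one warm model per GPU
    return assignment

print(assign_warm_models(GPU_VRAM_GB, WARM_VRAM_GB))
```

Running it with a single-GPU map raises immediately, mirroring the Error loading warm model you would otherwise only see in the startup logs.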
Last modified on March 16, 2026