Audio and vision pipelines have lower competition than diffusion pipelines. An operator who adds audio-to-text or image-to-text earns from a less saturated market while using GPU resources that would otherwise sit idle between diffusion jobs.

Four non-diffusion, non-LLM pipelines are available on the Livepeer AI network: audio-to-text, text-to-speech, image-to-text, and segment-anything-2. All use the standard livepeer/ai-runner container — the same one diffusion pipelines use. go-livepeer manages the container lifecycle automatically. Each pipeline has a different VRAM footprint and a different pricing unit. Each section below includes the complete aiModels.json entry required to enable that pipeline.

Pipeline overview

audio-to-text (Whisper)

audio-to-text transcribes audio to text with timestamps. The network-standard model is openai/whisper-large-v3, which most gateway operators request by default. Running a non-standard model means fewer jobs routed your way.
VRAM: ~3 GB warm
Pricing unit: Per millisecond of audio input
Competitive note: Whisper is VRAM-efficient. A 12 GB or 24 GB card supports a warm Whisper deployment alongside a diffusion model when those workloads are split across available GPU headroom.

aiModels.json entry

audio-to-text entry
{
  "pipeline": "audio-to-text",
  "model_id": "openai/whisper-large-v3",
  "price_per_unit": 12882811,
  "pixels_per_unit": 1,
  "warm": true
}
price_per_unit here is in wei per millisecond of audio. 12882811 wei per millisecond works out to approximately $0.0000014 per second of audio at late-2025 ETH/USD rates.
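To sanity-check a price_per_unit value, convert wei per pricing unit into USD at an assumed ETH/USD rate. The helper below is a generic sketch; the $3000/ETH figure is an illustrative placeholder, not a network value — substitute the current market rate when estimating real earnings.

```python
# Convert a pipeline's price_per_unit (wei per pricing unit) into USD.
# The ETH/USD rate is a placeholder for illustration only.

WEI_PER_ETH = 10**18

def price_in_usd(wei_per_unit: int, units: int, eth_usd: float) -> float:
    """USD cost for `units` pricing units at `wei_per_unit` wei each."""
    eth = wei_per_unit * units / WEI_PER_ETH
    return eth * eth_usd

# audio-to-text: 12882811 wei per millisecond -> cost of one minute of audio
minute_ms = 60 * 1000
cost = price_in_usd(12_882_811, minute_ms, eth_usd=3000.0)
print(f"1 minute of audio costs ${cost:.6f} at an assumed $3000/ETH")
```

The same helper works for any of the pipelines on this page: pass the entry's price_per_unit and the number of pricing units (milliseconds, characters, or pixels) consumed.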

Testing

After restarting the AI worker, check container health:
Check audio-to-text containers and logs
docker ps --filter name=livepeer-ai-runner
docker logs <audio-to-text-container> --tail 50
Verify registration:
Verify audio-to-text registration
# Your address should appear under audio-to-text at tools.livepeer.cloud
tools.livepeer.cloud/ai/network-capabilities
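Once registered, you can exercise the pipeline end-to-end through a gateway that routes to your orchestrator. The sketch below assumes the gateway exposes an /audio-to-text route taking a multipart form with audio and model_id fields, which matches the Livepeer AI gateway API at the time of writing — verify the route and field names against your gateway's API reference. The gateway hostname is a placeholder; the helper only builds the request shape.

```python
# Build (but do not send) an audio-to-text request. Route and field names
# are assumptions based on the Livepeer AI gateway API -- verify locally.

def build_audio_to_text_request(gateway: str, audio_path: str,
                                model_id: str = "openai/whisper-large-v3"):
    """Return (url, form fields, file fields) for a multipart POST."""
    url = f"{gateway.rstrip('/')}/audio-to-text"
    fields = {"model_id": model_id}
    files = {"audio": audio_path}  # pass open(audio_path, "rb") when sending
    return url, fields, files

url, fields, files = build_audio_to_text_request(
    "https://<your-gateway>", "sample.mp3")
print(url)

# To send for real (requires the `requests` package):
#   import requests
#   resp = requests.post(url, data=fields,
#                        files={"audio": open("sample.mp3", "rb")})
#   print(resp.json())
```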

text-to-speech

text-to-speech synthesises natural speech from text input. Demand is growing as AI video narration use cases expand on the network.
VRAM: Varies by model
Pricing unit: Per character, or per millisecond of output audio (model-dependent)
Model: suno/bark is the documented baseline model for this pipeline.

aiModels.json entry

text-to-speech entry
{
  "pipeline": "text-to-speech",
  "model_id": "suno/bark",
  "price_per_unit": 5960465
}
price_per_unit is in wei per pricing unit. Adjust based on the per-character or per-millisecond rate for your model.

Testing

After startup, verify the container is running and the pipeline appears registered at tools.livepeer.cloud/ai/network-capabilities under text-to-speech.
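Unlike audio-to-text, this pipeline takes its input as a JSON body rather than an uploaded file. The sketch below builds such a body; the /text-to-speech route and the model_id and text field names are assumptions based on the pipeline's documented inputs, and the gateway hostname is a placeholder — confirm both against your gateway's API reference.

```python
# Build a text-to-speech request body. Route and field names are
# assumptions -- confirm them against your gateway's API reference.
import json

def build_text_to_speech_request(gateway: str, text: str,
                                 model_id: str = "suno/bark"):
    """Return (url, JSON body) for a text-to-speech POST."""
    url = f"{gateway.rstrip('/')}/text-to-speech"
    body = {"model_id": model_id, "text": text}
    return url, json.dumps(body)

url, body = build_text_to_speech_request(
    "https://<your-gateway>", "Hello from the Livepeer AI network.")
print(url)
print(body)
```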

image-to-text

image-to-text generates text descriptions from images using a vision-language model. The low VRAM requirement makes this the most accessible AI pipeline for operators without high-end GPUs.
VRAM: ~1–2 GB (BLIP large)
Pricing unit: Per input pixel
Entry point: Runs on 4 GB GPUs. Operators below the 24 GB diffusion threshold still participate through image-to-text and audio-to-text.

aiModels.json entry

image-to-text entry
{
  "pipeline": "image-to-text",
  "model_id": "Salesforce/blip-image-captioning-large",
  "price_per_unit": 1192093,
  "warm": true
}
1192093 wei per input pixel is approximately $0.000125 per megapixel at late-2025 ETH/USD rates. Image-to-text pricing is lower than diffusion pipelines because the compute cost is lower.
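Because the pricing unit is the input pixel, the fee scales with image resolution. A quick sanity check, with the ETH/USD rate as an illustrative placeholder:

```python
# Cost of captioning one image at per-pixel pricing. The ETH/USD rate is
# a placeholder for illustration, not a network value.
WEI_PER_ETH = 10**18
WEI_PER_PIXEL = 1_192_093  # image-to-text price_per_unit from this page

def image_cost_usd(width: int, height: int, eth_usd: float) -> float:
    """USD cost of one width x height image at per-pixel pricing."""
    return WEI_PER_PIXEL * width * height / WEI_PER_ETH * eth_usd

print(f"1024x1024 image: ${image_cost_usd(1024, 1024, 3000.0):.5f}")
```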

Testing

Inspect image-to-text container logs
docker logs <image-to-text-container> --tail 30
Check tools.livepeer.cloud/ai/network-capabilities for registration status.

segment-anything-2

segment-anything-2 (SAM2) performs promptable segmentation: given an image or video frame and a point or bounding box prompt, it returns pixel masks for the identified object or region. The pipeline is compute-intensive and has lower competition than diffusion pipelines.
VRAM: 12–24 GB depending on model variant
Pricing unit: Per input pixel
Model variants: SAM2 has multiple size variants. facebook/sam2-hiera-large is the standard choice.

aiModels.json entry

segment-anything-2 entry
{
  "pipeline": "segment-anything-2",
  "model_id": "facebook/sam2-hiera-large",
  "price_per_unit": 4768371
}
segment-anything-2 usually stays cold until demand justifies the VRAM cost. The model then loads on the first request.

Testing

After the AI worker starts, verify the pipeline container is running:
Check segment-anything-2 containers
docker ps --filter name=livepeer-ai-runner
Check registration at tools.livepeer.cloud/ai/network-capabilities under segment-anything-2.
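SAM2 requests carry prompt data alongside the image. The sketch below assumes point prompts are passed as JSON-encoded form fields named point_coords and point_labels next to the image in a multipart request; those field names are an assumption based on the runner's API at the time of writing — verify them against your gateway's API reference before use.

```python
# Build form fields for a promptable-segmentation request. Field names are
# assumptions about the runner's multipart API -- verify before relying on
# them.
import json

def build_sam2_fields(model_id: str = "facebook/sam2-hiera-large",
                      points=None, labels=None):
    """points: [[x, y], ...] pixel coordinates; labels: 1 = foreground."""
    fields = {"model_id": model_id}
    if points is not None:
        fields["point_coords"] = json.dumps(points)
        fields["point_labels"] = json.dumps(labels or [1] * len(points))
    return fields

fields = build_sam2_fields(points=[[320, 240]])
print(fields)
```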

Running multiple pipelines

Audio and vision pipelines run alongside diffusion pipelines on the same node when the VRAM budget supports them. Example configuration for a node with a diffusion model warm and Whisper also warm:
Multi-pipeline aiModels.json
[
  {
    "pipeline": "text-to-image",
    "model_id": "SG161222/RealVisXL_V4.0_Lightning",
    "price_per_unit": 4768371,
    "warm": true
  },
  {
    "pipeline": "audio-to-text",
    "model_id": "openai/whisper-large-v3",
    "price_per_unit": 12882811,
    "pixels_per_unit": 1,
    "warm": true
  },
  {
    "pipeline": "image-to-text",
    "model_id": "Salesforce/blip-image-captioning-large",
    "price_per_unit": 1192093
  }
]
In this configuration, text-to-image and audio-to-text are marked warm (together they fit within a 24 GB VRAM budget), and image-to-text is cold, loading on first request.
Note, however, that during the Beta phase only one warm model per GPU is supported, so a single physical GPU keeps either text-to-image or audio-to-text warm, not both. Split the two warm models across separate GPUs, or mark one of them cold. Check the startup logs for Error loading warm model if a warm model fails to load.
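The one-warm-model-per-GPU constraint can be checked before restarting the worker. The sketch below is a planning helper, not part of go-livepeer; the VRAM figures are the rough estimates from this page (an assumed ~10 GB for an SDXL-class diffusion model), not measured values.

```python
# Planning helper: assign each warm model its own GPU (Beta-phase rule)
# while checking the model fits the card's VRAM. VRAM figures are rough
# estimates from this page, not measured values.

GPU_VRAM_GB = {0: 24, 1: 24}                   # physical GPUs and capacity
WARM_VRAM_GB = {
    "SG161222/RealVisXL_V4.0_Lightning": 10,   # assumed SDXL-class footprint
    "openai/whisper-large-v3": 3,
}

def assign_warm_models(gpus: dict, models: dict) -> dict:
    """Map each warm model to a dedicated GPU; raise if impossible."""
    assignment, free = {}, dict(gpus)
    for model, need in models.items():
        gpu = next((g for g, cap in free.items() if cap >= need), None)
        if gpu is None:
            raise RuntimeError(f"no free GPU with {need} GB for {model}")
        assignment[model] = gpu
        free.pop(gpu)                          # Beta: one warm model per GPU
    return assignment

print(assign_warm_models(GPU_VRAM_GB, WARM_VRAM_GB))
```

Running it with a single-GPU map raises immediately, mirroring the Error loading warm model you would otherwise only see in the startup logs.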
Last modified on March 16, 2026