Four non-diffusion, non-LLM pipelines are available on the Livepeer AI network:
audio-to-text, text-to-speech, image-to-text, and segment-anything-2. All use the standard livepeer/ai-runner container — the same one diffusion pipelines use. go-livepeer manages the container lifecycle automatically.
Each pipeline has a different VRAM footprint and a different pricing unit. Each section below includes the complete aiModels.json entry required to enable that pipeline.
Pipeline overview
audio-to-text (Whisper)
audio-to-text transcribes audio to text with timestamps. The network-standard model is openai/whisper-large-v3, which most gateway operators request by default. Running a non-standard model means fewer jobs routed your way.
VRAM: ~3 GB warm
Pricing unit: Per millisecond of audio input
Competitive note: Whisper is VRAM-efficient. A 12 GB or 24 GB card supports a warm Whisper deployment alongside a diffusion model when those workloads are split across available GPU headroom.
aiModels.json entry
audio-to-text entry
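A minimal entry sketch, assuming the standard go-livepeer aiModels.json fields (pipeline, model_id, price_per_unit, warm) — confirm the exact schema against your go-livepeer version's documentation:

```json
[
  {
    "pipeline": "audio-to-text",
    "model_id": "openai/whisper-large-v3",
    "price_per_unit": 12882811,
    "warm": true
  }
]
```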
price_per_unit here is in wei per millisecond of audio. 12882811 wei is approximately $0.0000014 per second of audio at late-2025 ETH/USD rates.
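The conversion above can be sanity-checked with a few lines of arithmetic. The ETH/USD rate below is an illustrative assumption chosen to reproduce the quoted figure under a naive wei-to-dollar conversion; real pricing may involve additional scaling, so treat this only as an arithmetic check:

```python
# Naive conversion: wei per millisecond of audio -> USD per second.
# Assumptions: 1 ETH = 1e18 wei; eth_usd is an illustrative rate that
# reproduces the figure quoted above, not a market quote.
WEI_PER_ETH = 10**18
price_wei_per_ms = 12_882_811
eth_usd = 108  # assumed rate for illustration

wei_per_second = price_wei_per_ms * 1_000
usd_per_second = wei_per_second / WEI_PER_ETH * eth_usd
print(f"~${usd_per_second:.7f} per second of audio")  # ~$0.0000014
```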
Testing
After restarting the AI worker, check container health:
Check audio-to-text containers and logs
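A quick sketch, assuming Docker is managing the runner containers. go-livepeer assigns container names itself, so filtering by the livepeer/ai-runner image is the more reliable way to find them:

```shell
# List runner containers by image (names are auto-assigned by go-livepeer).
docker ps --filter "ancestor=livepeer/ai-runner" || echo "docker not available"

# Then tail the logs of a container listed above (substitute the real ID):
# docker logs -f <container-id>
```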
Verify audio-to-text registration
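One way to confirm the pipeline was picked up, assuming file-based logging — the log path below is an example, so substitute your own (or use journalctl under systemd). Exact log wording varies by go-livepeer version, so matching on the pipeline name is a reasonable first check:

```shell
# The log path is an example; adjust for your deployment.
grep -i "audio-to-text" /var/log/livepeer/orchestrator.log 2>/dev/null | tail -n 20
```

The network capabilities page at tools.livepeer.cloud/ai/network-capabilities remains the authoritative external check.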
text-to-speech
text-to-speech synthesises natural speech from text input. Demand is growing as AI video narration use cases expand on the network.
VRAM: Varies by model
Pricing unit: Per character, or per millisecond of output audio (model-dependent)
Model: suno/bark is the documented baseline model for this pipeline.
aiModels.json entry
text-to-speech entry
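A sketch of the entry, assuming the same aiModels.json fields as the other pipelines. The price_per_unit shown is a placeholder — set it from your model's per-character or per-millisecond rate:

```json
[
  {
    "pipeline": "text-to-speech",
    "model_id": "suno/bark",
    "price_per_unit": 1000000,
    "warm": false
  }
]
```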
price_per_unit is in wei per pricing unit. Adjust based on the per-character or per-millisecond rate for your model.
Testing
After startup, verify the container is running and the pipeline appears registered at tools.livepeer.cloud/ai/network-capabilities under text-to-speech.
image-to-text
image-to-text generates text descriptions from images using a vision-language model. The low VRAM requirement makes this the most accessible AI pipeline for operators without high-end GPUs.
VRAM: ~1–2 GB (BLIP large)
Pricing unit: Per input pixel
Entry point: Runs on 4 GB GPUs. Operators below the 24 GB diffusion threshold still participate through image-to-text and audio-to-text.
aiModels.json entry
image-to-text entry
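A minimal entry sketch using the per-pixel price quoted below. The model ID assumes the commonly used BLIP large checkpoint — confirm the exact ID your runner version expects:

```json
[
  {
    "pipeline": "image-to-text",
    "model_id": "Salesforce/blip-image-captioning-large",
    "price_per_unit": 1192093,
    "warm": true
  }
]
```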
1192093 wei per input pixel is approximately $0.000125 per megapixel at late-2025 ETH/USD rates. Image-to-text pricing is lower than diffusion pipelines because the compute cost is lower.
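The same naive conversion checks this figure: wei per pixel scaled to a megapixel. As before, the ETH rate is an illustrative assumption that reproduces the quoted number, not a market quote:

```python
# Naive conversion: wei per input pixel -> USD per megapixel.
# Assumes 1 ETH = 1e18 wei; eth_usd is assumed for illustration only.
price_wei_per_pixel = 1_192_093
eth_usd = 105  # assumed rate for illustration

usd_per_megapixel = price_wei_per_pixel * 1_000_000 / 10**18 * eth_usd
print(f"~${usd_per_megapixel:.6f} per megapixel")  # ~$0.000125
```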
Testing
Inspect image-to-text container logs
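A sketch assuming Docker-managed containers; since go-livepeer assigns container names automatically, locate the container by image first (when several pipelines run, the log output itself identifies which pipeline a container serves):

```shell
# Locate a runner container by image and show its recent log output.
CID=$(docker ps -q --filter "ancestor=livepeer/ai-runner" 2>/dev/null | head -n 1)
if [ -n "$CID" ]; then
  docker logs --tail 50 "$CID"
else
  echo "no livepeer/ai-runner container is running"
fi
```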
segment-anything-2
segment-anything-2 (SAM2) performs promptable segmentation — given an image or video frame and a point or bounding box prompt, it returns pixel masks for the identified object or region. The pipeline is compute-intensive and has lower competition than diffusion pipelines.
VRAM: 12–24 GB depending on model variant
Pricing unit: Per input pixel
Model variants: SAM2 has multiple size variants; facebook/sam2-hiera-large is the standard choice.
aiModels.json entry
segment-anything-2 entry
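A sketch of the entry with warm set to false, matching the cold-start strategy described below. The per-pixel price here is a placeholder — set it to your own rate:

```json
[
  {
    "pipeline": "segment-anything-2",
    "model_id": "facebook/sam2-hiera-large",
    "price_per_unit": 1192093,
    "warm": false
  }
]
```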
segment-anything-2 usually stays cold until demand justifies the VRAM cost. The model then loads on the first request.
Testing
After the AI worker starts, verify the pipeline container is running:
Check segment-anything-2 containers
Confirm the pipeline also appears at tools.livepeer.cloud/ai/network-capabilities under segment-anything-2.
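A quick status check, assuming Docker; filter by image since go-livepeer assigns the container names:

```shell
# Print a compact status line per runner container.
docker ps --filter "ancestor=livepeer/ai-runner" \
  --format "{{.ID}}  {{.Status}}  {{.Names}}" 2>/dev/null \
  || echo "docker not available"
```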
Running multiple pipelines
Audio and vision pipelines run alongside diffusion pipelines on the same node when the VRAM budget supports them. Example configuration for a 24 GB card with diffusion warm and Whisper also warm:
Multi-pipeline aiModels.json
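A sketch of a combined file under the same assumed schema. The text-to-image model ID and its price are placeholders (substitute the diffusion model and rate you actually run); the audio-to-text and image-to-text values reuse the figures from their sections above:

```json
[
  {
    "pipeline": "text-to-image",
    "model_id": "ByteDance/SDXL-Lightning",
    "price_per_unit": 4768371,
    "warm": true
  },
  {
    "pipeline": "audio-to-text",
    "model_id": "openai/whisper-large-v3",
    "price_per_unit": 12882811,
    "warm": true
  },
  {
    "pipeline": "image-to-text",
    "model_id": "Salesforce/blip-image-captioning-large",
    "price_per_unit": 1192093,
    "warm": false
  }
]
```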
text-to-image and audio-to-text are warm (both fit within 24 GB across their respective VRAM budgets). image-to-text is cold and loads on first request.
Related pages
LLM Pipeline Setup
The Ollama-based runner for text generation on 8 GB VRAM GPUs.
Diffusion Pipeline Setup
text-to-image, image-to-image, image-to-video, and upscale pipeline configuration.
AI Model Management
Warm vs cold strategy, VRAM allocation, and optimisation flags.
Pricing Strategy
Per-pipeline pricing in aiModels.json, wei vs USD notation, and competitive positioning.