Audio and vision pipelines face less competition than diffusion pipelines. An operator who adds audio-to-text or image-to-text earns from a less saturated market while using GPU resources that would otherwise sit idle between diffusion jobs.

***

Four non-diffusion, non-LLM pipelines are available on the Livepeer AI network: `audio-to-text`, `text-to-speech`, `image-to-text`, and `segment-anything-2`. All use the standard `livepeer/ai-runner` container – the same one diffusion pipelines use. Go-livepeer manages the container lifecycle automatically.

Each pipeline has a different VRAM footprint and a different pricing unit. Each section below includes the complete `aiModels.json` entry required to enable that pipeline.

## Pipeline overview

| Pipeline | VRAM | Pricing unit | Entry GPU |
| --- | --- | --- | --- |
| `audio-to-text` | \~3 GB (Whisper large-v3) | Per millisecond of audio | 12 GB recommended; runs on 8 GB |
| `text-to-speech` | Varies by model | Per character or per ms of output audio | 8 GB+ |
| `image-to-text` | \~1–2 GB (BLIP) | Per input pixel | 4 GB |
| `segment-anything-2` | 12–24 GB (variant-dependent) | Per input pixel | 12 GB+ |

## audio-to-text (Whisper)

`audio-to-text` transcribes audio to text with timestamps. The network-standard model is `openai/whisper-large-v3`, which most Gateway operators request by default. Running a non-standard model means fewer jobs are routed your way.

**VRAM:** \~3 GB warm\
**Pricing unit:** Per millisecond of audio input\
**Competitive note:** Whisper is VRAM-efficient. A 12 GB or 24 GB card has enough headroom to host a warm Whisper deployment alongside a diffusion model.

### aiModels.json entry

```json icon="code" title="audio-to-text entry" theme={"theme":{"light":"github-light","dark":"dark-plus"}}
{
  "pipeline": "audio-to-text",
  "model_id": "openai/whisper-large-v3",
  "price_per_unit": 12882811,
  "pixels_per_unit": 1,
  "warm": true
}
```

`price_per_unit` here is in wei per millisecond of audio.
`12882811` wei is approximately \$0.0000014 per second of audio at late-2025 ETH/USD rates.

### Testing

After restarting the AI worker, check container health:

```bash icon="terminal" title="Check audio-to-text containers and logs" theme={"theme":{"light":"github-light","dark":"dark-plus"}}
# List the running AI runner containers
docker ps --filter name=livepeer-ai-runner
# Tail the logs of one container; replace the placeholder with a name or ID
# taken from the `docker ps` output above
docker logs --tail 50 <container-name-or-id>
```

Verify registration:

```bash icon="terminal" title="Verify audio-to-text registration" theme={"theme":{"light":"github-light","dark":"dark-plus"}}
# Your address should appear under audio-to-text at tools.livepeer.cloud
```

[tools.livepeer.cloud/ai/network-capabilities](https://tools.livepeer.cloud/ai/network-capabilities)

## text-to-speech

`text-to-speech` synthesises natural speech from text input. Demand is growing as AI video narration use cases expand on the network.

**VRAM:** Varies by model\
**Pricing unit:** Per character, or per millisecond of output audio (model-dependent)\
**Model:** `suno/bark` is the documented baseline model for this pipeline.

### aiModels.json entry

```json icon="code" title="text-to-speech entry" theme={"theme":{"light":"github-light","dark":"dark-plus"}}
{
  "pipeline": "text-to-speech",
  "model_id": "suno/bark",
  "price_per_unit": 5960465
}
```

`price_per_unit` is in wei per pricing unit. Adjust it to match the per-character or per-millisecond rate for your model.

### Testing

After startup, verify that the container is running and that the pipeline appears as registered at [tools.livepeer.cloud/ai/network-capabilities](https://tools.livepeer.cloud/ai/network-capabilities) under `text-to-speech`.

## image-to-text

`image-to-text` generates text descriptions from images using a vision-language model. The low VRAM requirement makes this the most accessible AI pipeline for operators without high-end GPUs.

**VRAM:** \~1–2 GB (BLIP large)\
**Pricing unit:** Per input pixel\
**Entry point:** Runs on 4 GB GPUs.
Operators below the 24 GB diffusion threshold can still participate through `image-to-text` and `audio-to-text`.

### aiModels.json entry

```json icon="code" title="image-to-text entry" theme={"theme":{"light":"github-light","dark":"dark-plus"}}
{
  "pipeline": "image-to-text",
  "model_id": "Salesforce/blip-image-captioning-large",
  "price_per_unit": 1192093,
  "warm": true
}
```

`1192093` wei per input pixel is approximately \$0.000125 per megapixel at late-2025 ETH/USD rates. Image-to-text pricing is lower than diffusion pricing because the compute cost is lower.

### Testing

```bash icon="terminal" title="Inspect image-to-text container logs" theme={"theme":{"light":"github-light","dark":"dark-plus"}}
docker logs image-to-text-container --tail 30
```

Check [tools.livepeer.cloud/ai/network-capabilities](https://tools.livepeer.cloud/ai/network-capabilities) for registration status.

## segment-anything-2

`segment-anything-2` (SAM2) performs promptable segmentation – given an image or video frame and a point or bounding box prompt, it returns pixel masks for the identified object or region. The pipeline is compute-intensive and faces less competition than diffusion pipelines.

**VRAM:** 12–24 GB depending on model variant\
**Pricing unit:** Per input pixel\
**Model variants:** SAM2 has multiple size variants. `facebook/sam2-hiera-large` is the standard choice.

### aiModels.json entry

```json icon="code" title="segment-anything-2 entry" theme={"theme":{"light":"github-light","dark":"dark-plus"}}
{
  "pipeline": "segment-anything-2",
  "model_id": "facebook/sam2-hiera-large",
  "price_per_unit": 4768371
}
```

`segment-anything-2` usually stays cold until demand justifies the VRAM cost. The model then loads on the first request.

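
Because `segment-anything-2` is priced per input pixel, the cost of a request scales with the resolution of the input. The following is a minimal sketch of that arithmetic using the `price_per_unit` from the entry above. It assumes one billed unit per input pixel (width × height), which may not match how the network counts units in every case, and the helper name `request_cost_wei` is hypothetical rather than part of go-livepeer.

```python
WEI_PER_ETH = 10**18


def request_cost_wei(width: int, height: int, price_per_unit: int) -> int:
    """Return the cost in wei of one request billed per input pixel.

    Assumption: the billed unit count is simply width * height.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if price_per_unit < 0:
        raise ValueError("price_per_unit must not be negative")
    return width * height * price_per_unit


# price_per_unit taken from the segment-anything-2 entry above
cost = request_cost_wei(1024, 1024, price_per_unit=4768371)
print(f"{cost} wei = {cost / WEI_PER_ETH:.3e} ETH")
```

Under these assumptions a 1024×1024 input costs 4,999,999,389,696 wei, or roughly 0.000005 ETH. Converting that to USD requires a current ETH/USD rate, which is deliberately left out of the sketch.
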
### Testing

After the AI worker starts, verify that the pipeline container is running:

```bash icon="terminal" title="Check segment-anything-2 containers" theme={"theme":{"light":"github-light","dark":"dark-plus"}}
docker ps --filter name=livepeer-ai-runner
```

Check registration at [tools.livepeer.cloud/ai/network-capabilities](https://tools.livepeer.cloud/ai/network-capabilities) under `segment-anything-2`.

## Running multiple pipelines

Audio and vision pipelines run alongside diffusion pipelines on the same node when the VRAM budget supports them.

Example configuration for a 24 GB card with diffusion warm and Whisper also warm:

```json icon="code" title="Multi-pipeline aiModels.json" theme={"theme":{"light":"github-light","dark":"dark-plus"}}
[
  {
    "pipeline": "text-to-image",
    "model_id": "SG161222/RealVisXL_V4.0_Lightning",
    "price_per_unit": 4768371,
    "warm": true
  },
  {
    "pipeline": "audio-to-text",
    "model_id": "openai/whisper-large-v3",
    "price_per_unit": 12882811,
    "pixels_per_unit": 1,
    "warm": true
  },
  {
    "pipeline": "image-to-text",
    "model_id": "Salesforce/blip-image-captioning-large",
    "price_per_unit": 1192093
  }
]
```

In this configuration, `text-to-image` and `audio-to-text` are both marked warm, and their combined VRAM footprint fits within 24 GB. `image-to-text` is cold and loads on first request.

During the Beta phase, however, only one warm model per GPU is supported. A single physical GPU can therefore keep either `text-to-image` or `audio-to-text` warm, not both. Split them across separate GPUs, or keep one cold. Check the logs at startup for `Error loading warm model`.

## Related pages

- The Ollama-based runner for text generation on 8 GB VRAM GPUs.
- text-to-image, image-to-image, image-to-video, and upscale pipeline configuration.
- Warm vs cold strategy, VRAM allocation, and optimisation flags.
- Per-pipeline pricing in aiModels.json, wei vs USD notation, and competitive positioning.