Low-LPT entry path: AI inference is often a better starting point than solo video orchestration when stake is limited. Capability, pricing, and latency matter more than active-set position for many AI jobs.
How the network routes AI jobs
Applications never communicate with orchestrators directly. Every request flows through a gateway, which handles authentication, pricing negotiation, and routing to qualified nodes.
How gateway selection actually works
Gateways discover orchestrators through the OrchestratorInfo structure, which your node advertises during gateway discovery. The key fields that determine whether you receive AI jobs are your supported capabilities and the price you set for each.
Gateway pricing is a hard gate. Gateways configure the maximum price they will pay per capability with the -maxPricePerCapability flag, which takes a JSON map of price caps. A pipeline priced above a gateway's maximum receives no jobs from that gateway, regardless of hardware quality.
Before setting prices in aiModels.json, check what prices the major gateways are using. See Models and VRAM Reference for a pricing reference table and Gateway Orchestrator Offerings for the full capability discovery protocol documentation.
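To make the pricing discussion concrete, here is a minimal sketch of what an aiModels.json entry looks like, using the text-to-image model recommended later on this page. The field names (pipeline, model_id, price_per_unit, warm) follow the commonly documented aiModels.json shape, and the price value is purely illustrative; verify both against the current reference pages before deploying.

```json
[
  {
    "pipeline": "text-to-image",
    "model_id": "SG161222/RealVisXL_V4.0_Lightning",
    "price_per_unit": 4000000,
    "warm": true
  }
]
```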
For the complete list of supported pipelines and their model architectures, see AI Model Support in the Developers section.
The two workload types
The most important distinction for operators is between batch AI and live-video AI. These are different job types with different hardware profiles, different runtime architectures, and different operational characteristics.
Batch AI
Request-response inference. An application sends a prompt or media file, your node processes it and returns the result. Includes text-to-image, audio-to-text, image-to-video, LLM completions, and more.
Cascade live-video AI
Continuous frame-by-frame video transformation. Live video streams in, processed video streams out with sub-100ms latency. Used for live AI effects, generative video overlays, and streaming AI agents.
Comparison
AI pipeline types
Livepeer’s AI worker supports ten pipeline types. Each pipeline handles a specific class of inference task, with its own model format, VRAM floor, and pricing unit.
text-to-image — Generate images from text prompts
The most widely used batch AI pipeline on the network. Takes a text prompt and sampling parameters, returns a generated image.
Minimum VRAM: 24 GB
Pricing unit: Per output pixel
Recommended model:
SG161222/RealVisXL_V4.0_Lightning
Typical hardware: RTX 3090, RTX 4090, A5000
Diffusion models (Stable Diffusion, SDXL variants) run natively on the managed livepeer/ai-runner container. The Lightning and Turbo variants reduce step count to deliver results in under 2 seconds on an RTX 4090.
Source: SG161222/RealVisXL_V4.0_Lightning on HuggingFace
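Because this pipeline is priced per output pixel, your effective charge scales with resolution. A minimal sketch of the arithmetic, using a made-up price of 4 wei per output pixel (not a network-recommended value):

```python
# Illustrative per-pixel pricing math; the 4 wei rate below is a made-up example.
def job_price_wei(price_per_pixel_wei: int, width: int, height: int) -> int:
    """Total price in wei for one generated image at the given resolution."""
    return price_per_pixel_wei * width * height

# A 1024x1024 SDXL image at 4 wei per output pixel:
print(job_price_wei(4, 1024, 1024))  # 4194304 wei
```

The same per-pixel logic applies to the other pixel-priced pipelines below; only the rate and whether input or output pixels are counted changes.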
image-to-image — Style transfer and transformation
Takes an input image and applies diffusion-based transformation, style transfer, or enhancement. Used for artistic style application, image enhancement, and controlled generation.
Minimum VRAM: 24 GB
Pricing unit: Per output pixel
Recommended model: SDXL variants,
ByteDance/SDXL-Lightning
Typical hardware: RTX 3090, RTX 4090
image-to-video — Animate a still image
Generates a short video clip from a single input image. Significantly more VRAM- and compute-intensive than image-to-image.
Minimum VRAM: 24 GB
Pricing unit: Per output pixel
Typical hardware: RTX 4090, A100
image-to-text — Vision-language captioning
image-to-text — Vision-language captioning
Takes an image and returns a text description. Lower VRAM floor makes this accessible to operators with older consumer cards.
Minimum VRAM: 4 GB
Pricing unit: Per input pixel
Recommended model:
Salesforce/blip-image-captioning-large
Typical hardware: RTX 2060, GTX 1080 (as secondary pipeline)
audio-to-text — Speech recognition and transcription
Runs Whisper-class speech recognition with timestamps. Widely used for transcription, captioning, and audio search.
Minimum VRAM: 12 GB
Pricing unit: Per millisecond of audio
Recommended model:
openai/whisper-large-v3
Typical hardware: RTX 3060 12 GB, RTX 3080 10 GB
Source: openai/whisper-large-v3 on HuggingFace
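Per-millisecond audio pricing trips up operators used to per-pixel units. A quick sketch of what a clip costs at a hypothetical rate (the 2 wei figure is illustrative, not a recommendation):

```python
# Hypothetical rate: convert a per-millisecond audio price into a clip price.
def transcription_price_wei(price_per_ms_wei: int, duration_seconds: float) -> int:
    """Price in wei to transcribe a clip of the given duration."""
    return int(price_per_ms_wei * duration_seconds * 1000)

# A 90-second clip at 2 wei per millisecond of audio:
print(transcription_price_wei(2, 90))  # 180000 wei
```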
segment-anything-2 — Promptable segmentation
Pixel-level object segmentation using SAM2. Takes a prompt (point, box, or mask) and returns a segmentation mask over the input image or video frame.
Recommended model: SAM2 variants
Source: facebookresearch/segment-anything-2 on GitHub
text-to-speech — Natural speech synthesis
text-to-speech — Natural speech synthesis
Converts text to natural speech audio. Growing use case for AI-generated video narration and interactive media.
Pricing unit: Per character / per millisecond of output audio
upscale — Resolution enhancement
upscale — Resolution enhancement
Upscales low-resolution input to high resolution using diffusion-based super-resolution.
Recommended model:
stabilityai/stable-diffusion-x4-upscaler
Pricing unit: Per input pixel
llm — Large language model inference
OpenAI-compatible text completion endpoint backed by an Ollama-based runner. Runs quantised LLMs with as little as 8 GB VRAM, making it accessible to operators with older consumer GPUs that are unsuitable for diffusion pipelines.
Minimum VRAM: 8 GB
Pricing unit: Per custom unit (typically per million tokens)
Recommended model:
meta-llama/Meta-Llama-3.1-8B-Instruct (via Ollama)
Typical hardware: GTX 1070 Ti, GTX 1080, RTX 2060
The LLM pipeline uses a separate runner architecture from the standard livepeer/ai-runner image. See Batch AI Setup for the Ollama deployment guide.
Source: Cloud SPE Ollama runner blog post
live-video-to-video — Cascade streaming AI
Continuous frame-by-frame transformation of live video streams. This pipeline takes a WebRTC stream as input and returns a transformed WebRTC stream with sub-100ms per-frame latency.
Minimum VRAM: 24 GB recommended
Pricing unit: Per frame
Runtime:
livepeer/ai-runner:live-base + ComfyStream
Typical hardware: RTX 4090, A100, H100
This pipeline powers the Cascade architecture — Livepeer’s live-video AI system. It supports live AI effects, live style transfer, and streaming AI agents.
Source: ComfyStream on GitHub
Hardware by workload type
These are minimum requirements. Running at the minimum will result in longer cold-start times and reduced job competitiveness. The figures below reflect production-ready recommendations.
What you build and what the network supplies
The Livepeer protocol handles the hard parts of running an inference marketplace. As an orchestrator, you do need to:
- Run and maintain GPU infrastructure
- Configure aiModels.json with your supported pipelines and pricing
- Keep your primary models warm and your node performant
- Stay competitive on latency and pricing
You do not need to:
- Build a marketplace or API
- Implement authentication or billing
- Handle service discovery
- Build brand recognition
Network participation
To verify your pipelines are visible to the network and check live capability coverage:
- Network capabilities: tools.livepeer.cloud/ai/network-capabilities
- Orchestrator performance: explorer.livepeer.org
Watch: AI on Livepeer
Encode Club Live Video AI Bootcamp
Full session from the Q1 2025 bootcamp covering ComfyStream, live AI video pipelines, and orchestrator setup for Cascade workloads.
ComfyStream Demo
Live demonstration of ComfyStream running live-video AI effects through a Livepeer orchestrator.
Next steps
Batch AI Setup
Configure pipelines, aiModels.json, the Ollama LLM runner, and BYOC external containers.
Cascade Setup
Deploy the live-video-to-video pipeline with ComfyStream for live-video AI effects.
Model Hosting and VRAM
VRAM planning, warm model strategy, pricing, and aiModels.json reference.
Batch AI Setup
Upgrade path for existing transcoding orchestrators adding AI pipelines.