AI Inference Operations - Livepeer Docs

Livepeer’s AI subnet launched in Q3 2024 and has grown into a major source of new fee revenue for Orchestrators. It turns GPU nodes into open, composable inference infrastructure that serves image generation, live-video effects, and large language model completions. AI workloads reach your node through Gateway routing, capability advertisement, and container-based inference. The core operator distinction is between batch inference and live-video inference because the hardware profile and routing logic differ.

Low-LPT entry path: AI inference is often a better starting point than solo video orchestration when stake is limited. Capability, pricing, and latency matter more than active-set position for many AI jobs.

How the network routes AI jobs

Applications never communicate with Orchestrators directly. Every request flows through a Gateway, which handles authentication, pricing negotiation, and routing to qualified nodes.

Application / User
       ↓
Gateway (routing, auth, pricing, QoS)
       ↓
Orchestrator (go-livepeer + AI Worker)
       ↓
AI Runner container (your GPU + model)
       ↓
Result returned through gateway

Your Orchestrator advertises capabilities – which pipelines it supports and at what price – and Gateways route matching jobs to it. You never build a marketplace, billing system, or authentication layer. You run excellent inference infrastructure. When a Gateway selects your Orchestrator, it is because your combination of capability, pricing, latency, and uptime made you the best option for that specific request.

How Gateway selection actually works

Gateways discover Orchestrators through the OrchestratorInfo structure, which your node broadcasts and updates on-chain. The key fields that determine whether you receive AI jobs are: Gateway pricing is a hard gate. Gateways configure a maximum price they will pay per capability using the -maxPricePerCapability JSON flag. A pipeline priced above that maximum receives no jobs from that Gateway, regardless of hardware quality. Before setting prices in aiModels.json, check what prices the major Gateways are using. See Models and VRAM Reference for a pricing reference table and Gateway Orchestrator Offerings for the full capability discovery protocol documentation. For the complete list of supported pipelines and their model architectures, see AI Model Support in the Developers section.

The two workload types

The most important distinction for operators is between batch AI and live-video AI. These are different job types with different hardware profiles, different runtime architectures, and different operational characteristics.

Batch AI

Request-response inference. An application sends a prompt or media file, your node processes it and returns the result. Includes text-to-image, audio-to-text, image-to-video, LLM completions, and more.

Cascade live-video AI

Continuous frame-by-frame video transformation. Live video streams in, processed video streams out with sub-100ms latency. Used for live AI effects, generative video overlays, and streaming AI agents.

Comparison

AI pipeline types

Livepeer’s AI worker supports ten pipeline types. Each pipeline handles a specific class of inference task, with its own model format, VRAM floor, and pricing unit.

text-to-image - Generate images from text prompts

The most widely used batch AI pipeline on the network. Takes a text prompt and sampling parameters, returns a generated image.Minimum VRAM: 24 GB Pricing unit: Per output pixel Recommended model: SG161222/RealVisXL_V4.0_Lightning Typical hardware: RTX 3090, RTX 4090, A5000Diffusion models (Stable Diffusion, SDXL variants) run natively on the managed livepeer/ai-runner container. The Lightning and Turbo variants reduce step count to deliver results in under 2 seconds on an RTX 4090.Source: SG161222/RealVisXL_V4.0_Lightning on HuggingFace

image-to-image - Style transfer and transformation

Takes an input image and applies diffusion-based transformation, style transfer, or enhancement. Used for artistic style application, image enhancement, and controlled generation.Minimum VRAM: 24 GB Pricing unit: Per output pixel Recommended model: SDXL variants, ByteDance/SDXL-Lightning Typical hardware: RTX 3090, RTX 4090

image-to-video - Animate a still image

Generates a short video clip from a single input image. Significantly more VRAM and compute-intensive than image-to-image.Minimum VRAM: 24 GB Pricing unit: Per output pixel Typical hardware: RTX 4090, A100

image-to-text - Vision-language captioning

Takes an image and returns a text description. Lower VRAM floor makes this accessible to operators with older consumer cards.Minimum VRAM: 4 GB Pricing unit: Per input pixel Recommended model: Salesforce/blip-image-captioning-large Typical hardware: RTX 2060, GTX 1080 (as secondary pipeline)

audio-to-text - Speech recognition and transcription

Runs Whisper-class speech recognition with timestamps. Widely used for transcription, captioning, and audio search.Minimum VRAM: 12 GB Pricing unit: Per millisecond of audio Recommended model: openai/whisper-large-v3 Typical hardware: RTX 3060 12 GB, RTX 3080 10 GBSource: openai/whisper-large-v3 on HuggingFace

segment-anything-2 - Promptable segmentation

Pixel-level object segmentation using SAM2. Takes a prompt (point, box, or mask) and returns a segmentation mask over the input image or video frame.Recommended model: SAM2 variants Source: facebookresearch/segment-anything-2 on GitHub

text-to-speech - Natural speech synthesis

Converts text to natural speech audio. Growing use case for AI-generated video narration and interactive media.Pricing unit: Per character / per millisecond of output audio

upscale - Resolution enhancement

Upscales low-resolution input to high resolution using diffusion-based super-resolution.Recommended model: stabilityai/stable-diffusion-x4-upscaler Pricing unit: Per input pixel

llm - Large language model inference

OpenAI-compatible text completion endpoint backed by an Ollama-based runner. Runs quantised LLMs with as little as 8 GB VRAM, making it accessible to operators with older consumer GPUs that are unsuitable for diffusion pipelines.Minimum VRAM: 8 GB Pricing unit: Per custom unit (typically per million tokens) Recommended model: meta-llama/Meta-Llama-3.1-8B-Instruct (via Ollama) Typical hardware: GTX 1070 Ti, GTX 1080, RTX 2060The LLM pipeline uses a separate runner architecture from the standard livepeer/ai-runner image. See Batch AI Setup for the Ollama deployment guide.Source: Cloud SPE Ollama runner blog post

live-video-to-video - Cascade streaming AI

Continuous frame-by-frame transformation of live video streams. This pipeline takes a WebRTC stream as input and returns a transformed WebRTC stream with sub-100ms per-frame latency.Minimum VRAM: 24 GB recommended Pricing unit: Per frame Runtime: livepeer/ai-runner:live-base + ComfyStream Typical hardware: RTX 4090, A100, H100This pipeline powers the Cascade architecture – Livepeer’s live-video AI system. It supports live AI effects, live style transfer, and streaming AI agents.Source: ComfyStream on GitHub

Hardware by workload type

These are minimum requirements. Running at the minimum will result in longer cold-start times and reduced job competitiveness. The figures below reflect production-ready recommendations.

For detailed VRAM planning, warm model strategy, and multi-pipeline configuration, see Model Hosting and VRAM Planning.

What you build and what the network supplies

The Livepeer Protocol handles the hard parts of running an inference marketplace. : You do need to:

Run and maintain GPU infrastructure
Configure aiModels.json with your supported pipelines and pricing
Keep your primary models warm and your node performant
Stay competitive on latency and pricing

The network already supplies:

Build a marketplace or API
Implement authentication or billing
Handle service discovery
Build brand recognition

Gateways provide all of that. Your competitive advantage is performance: lower latency, better-tuned models, higher uptime, specialised capabilities.

Network participation

To verify your pipelines are visible to the network and check live capability coverage:

Network capabilities: tools.Livepeer.cloud/ai/network-capabilities
Orchestrator performance: explorer.livepeer.org

The network capabilities tool shows all registered Orchestrators and their advertised pipelines currently visible on the network. Before your Orchestrator receives AI jobs, it must appear there under at least one pipeline.

Watch: AI on Livepeer

Encode Club Live Video AI Bootcamp

Full session from the Q1 2025 bootcamp covering ComfyStream, live AI video pipelines, and Orchestrator setup for Cascade workloads.

ComfyStream Demo

Live demonstration of ComfyStream running live-video AI effects through a Livepeer Orchestrator.

Next steps

Batch AI Setup

Configure pipelines, aiModels.json, the Ollama LLM runner, and BYOC external containers.

Cascade Setup

Deploy the live-video-to-video pipeline with ComfyStream for live-video AI effects.

Model Hosting and VRAM

VRAM planning, warm model strategy, pricing, and aiModels.json reference.

Batch AI Setup

Upgrade path for existing transcoding Orchestrators adding AI pipelines.

​How the network routes AI jobs

​How Gateway selection actually works

​The two workload types

Batch AI

Cascade live-video AI

​Comparison

​AI pipeline types

​Hardware by workload type

​What you build and what the network supplies

​Network participation

​Watch: AI on Livepeer

Encode Club Live Video AI Bootcamp

ComfyStream Demo

​Next steps

Batch AI Setup

Cascade Setup

Model Hosting and VRAM

Batch AI Setup

How the network routes AI jobs

How Gateway selection actually works

The two workload types

Comparison

AI pipeline types

Hardware by workload type

What you build and what the network supplies

Network participation

Watch: AI on Livepeer

Next steps