Cascade inference is fundamentally different from batch inference. Instead of processing a discrete request and returning a result, your node continuously transforms live video — receiving frames from a WebRTC stream, running inference on each frame, and streaming processed frames back with sub-100ms latency. This is the Cascade architecture: the live-video-to-video pipeline that powers live AI video effects, live style transfer, and streaming AI agents on the Livepeer network.

What Cascade is

Cascade is Livepeer’s live-video AI processing pipeline. The name refers to the architecture — video streams cascade through AI transformation nodes in the network, enabling live applications that previously required centralised infrastructure. Example applications on Cascade:
  • Daydream — generative AI video platform with live style application
  • StreamDiffusionTD — live diffusion via TouchDesigner
  • ComfyStream — browser-based ComfyUI pipelines with live video input
  • OBS plugins — live AI effects applied to streaming content
Source: Livepeer AI Subnet Announcement — Mirror.xyz · ComfyStream on GitHub · Daydream

How it differs from batch AI

The key difference is the continuous frame loop. Your pipeline receives frames as they arrive from the upstream WebRTC stream and must process and emit them quickly enough to avoid buffering. At 30 fps, the per-frame budget is 33 ms; at 24 fps, roughly 42 ms.
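The per-frame budget is simply the reciprocal of the target frame rate. A quick sanity check:

```python
def frame_budget_ms(fps: float) -> float:
    """Per-frame processing budget in milliseconds for a target frame rate."""
    return 1000.0 / fps

# The entire loop (decode + inference + encode) must fit inside this budget.
for fps in (24, 30, 60):
    print(f"{fps} fps -> {frame_budget_ms(fps):.1f} ms per frame")
```

Anything that pushes a single frame past this budget causes buffering and, eventually, dropped frames downstream.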

Prerequisites

Cascade has stricter hardware requirements than batch inference:
  • GPU: RTX 4090 (24 GB) strongly recommended. RTX 3090 (24 GB) is functional but with less headroom. A100/H100 for production multi-stream setups.
  • CPU: 8+ cores recommended. Frame decoding/encoding is CPU-bound.
  • Network: Low-latency connection. WebRTC streams are sensitive to packet loss and jitter.
  • CUDA: 12.0+
  • Docker with NVIDIA Container Toolkit
  • go-livepeer running with -aiWorker enabled
GPUs below 24 GB VRAM (e.g. RTX 3080 10 GB, RTX 3060 12 GB) are typically insufficient for Cascade inference at acceptable quality and frame rates. The combination of model weights, StreamDiffusion/ComfyUI overhead, and frame buffers exhausts available VRAM.

Architecture overview

Cascade architecture overview
Live video source (OBS, browser, camera)
        ↓ WebRTC ingest
Livepeer Gateway
        ↓ routes stream to capable orchestrator
go-livepeer AI Worker
        ↓ dispatches to live-video-to-video runner
ai-runner:live-base container
        ↓ ComfyStream / custom pipeline
Frame processing loop:
  receive frame → inference → emit frame
        ↓
Processed WebRTC stream
        ↓
Application / viewer
The orchestrator receives the WebRTC stream, passes it to the AI runner container, and streams the processed output back through the gateway to the application. From the application’s perspective, it’s an input stream in and a transformed stream out.

ComfyStream — the live-video pipeline runtime

ComfyStream is the primary runtime for live-video AI inference on Livepeer. It wraps ComfyUI’s node-based workflow system and adapts it for continuous frame processing. What ComfyStream adds over standard ComfyUI:
  • WebRTC frame ingestion and emission
  • Async frame queue for continuous processing
  • Warm model management to avoid per-frame load latency
  • Livepeer AI worker integration via the Pipeline interface
Compatible model types:
  • StreamDiffusion (optimised for live diffusion at 30+ fps)
  • Standard SDXL / SD 1.5 (lower fps, higher quality)
  • ControlNet variants (depth, pose, sketch, canny)
  • IP-Adapter (style reference)
  • DepthAnything / MiDaS (depth estimation)
  • SAM2 (live segmentation)
  • Any ComfyUI-compatible model loaded as a DAG node
Source: livepeer/comfystream on GitHub · ComfyUI-Stream-Pack

Setup

StreamDiffusion — live-video diffusion models

StreamDiffusion is the primary model architecture for live-video AI on Livepeer. It was designed specifically for continuous frame processing and achieves 30+ fps on an RTX 4090. How StreamDiffusion achieves live performance:
  1. Stream Batch — processes multiple frames simultaneously as a batch, amortising model overhead across frames
  2. Residual CFG — approximates classifier-free guidance with fewer forward passes
  3. Stochastic Similarity Filter — skips inference on frames that are sufficiently similar to the previous frame
  4. TinyVAE acceleration — uses a compressed VAE encoder/decoder for lower encode/decode latency
Source: cumulo-autumn/StreamDiffusion on GitHub · StreamDiffusion paper on arXiv
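The Stream Batch idea above can be sketched in a few lines (names and shapes here are illustrative, not the StreamDiffusion API): instead of one forward pass per frame, consecutive frames are stacked and pushed through the model together, so fixed per-call overhead is paid once per batch.

```python
import numpy as np

def run_model(batch: np.ndarray) -> np.ndarray:
    # Stand-in for a diffusion forward pass; here just an identity transform.
    return batch

def stream_batch(frames: list[np.ndarray], batch_size: int = 4) -> list[np.ndarray]:
    """Group consecutive frames and run them through the model as one batch."""
    outputs: list[np.ndarray] = []
    for i in range(0, len(frames), batch_size):
        batch = np.stack(frames[i:i + batch_size])  # shape (B, H, W, C)
        outputs.extend(run_model(batch))            # one model call per batch
    return outputs

frames = [np.zeros((512, 512, 3), dtype=np.uint8) for _ in range(10)]
processed = stream_batch(frames)
print(len(processed))  # 10 frames out, from only 3 model calls
```

With `batch_size=4`, ten frames cost three model invocations instead of ten; the real implementation interleaves denoising steps across the batch to keep latency bounded.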

ComfyUI workflow for StreamDiffusion

A typical ComfyStream workflow for live style application:
ComfyStream workflow example
{
  "nodes": [
    { "id": 1, "type": "LoadVideoInput", "outputs": ["frame"] },
    { "id": 2, "type": "VAEEncode", "inputs": ["frame"] },
    { "id": 3, "type": "StreamDiffusionSampler",
      "inputs": ["latent", "prompt", "model"],
      "config": { "num_inference_steps": 2, "guidance_scale": 1.2 }
    },
    { "id": 4, "type": "VAEDecode", "inputs": ["sampled_latent"] },
    { "id": 5, "type": "VideoOutput", "inputs": ["frame"] }
  ]
}
Steps as low as 2 support live performance. Quality vs latency is tunable through the workflow.

The Pipeline interface (custom pipelines)

For operators who want to build custom live-video AI processing beyond ComfyUI, the AI runner exposes a Python Pipeline interface. Custom pipelines extend this interface and are packaged as Docker images extending livepeer/ai-runner:live-base.
Custom live-video Pipeline interface example
import asyncio

from huggingface_hub import snapshot_download

from runner.live.pipelines import Pipeline
from runner.live.trickle import VideoFrame, VideoOutput

class MyLiveVideoPipeline(Pipeline):
    async def initialize(self, **params):
        # Load your model here (load_my_model is your own loader).
        # Use asyncio.to_thread() for blocking model load operations.
        self.model = await asyncio.to_thread(load_my_model, params)
        # Output queue consumed by get_processed_video_frame()
        self.frame_queue: asyncio.Queue[VideoOutput] = asyncio.Queue()

    async def put_video_frame(self, frame: VideoFrame, request_id: str):
        # Process one frame - this is called continuously
        result = await asyncio.to_thread(self.model.predict, frame.tensor)
        await self.frame_queue.put(VideoOutput(result, request_id))

    async def get_processed_video_frame(self) -> VideoOutput:
        return await self.frame_queue.get()

    async def update_params(self, **params):
        # Called when pipeline parameters are updated mid-stream
        pass

    async def stop(self):
        # Clean shutdown: release the model and any GPU memory here
        pass

    @classmethod
    def prepare_models(cls):
        # Download/compile models — called once during operator setup
        snapshot_download("my-org/my-model", ...)
Source: ai-runner Pipeline interface · scope-runner reference implementation
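The worker drives a pipeline roughly like this (a simplified sketch with toy stand-in classes, not the actual go-livepeer dispatch code): a feeder pushes frames in as they arrive while a separate consumer pulls processed frames for the outbound stream.

```python
import asyncio
from dataclasses import dataclass

# Minimal stand-ins for VideoFrame/VideoOutput so the sketch is self-contained.
@dataclass
class VideoFrame:
    tensor: list

@dataclass
class VideoOutput:
    tensor: list
    request_id: str

class EchoPipeline:
    """Toy pipeline mirroring the Pipeline interface: frames in, frames out."""
    async def initialize(self, **params):
        self.frame_queue: asyncio.Queue[VideoOutput] = asyncio.Queue()

    async def put_video_frame(self, frame: VideoFrame, request_id: str):
        # A real pipeline would run inference here before enqueueing.
        await self.frame_queue.put(VideoOutput(frame.tensor, request_id))

    async def get_processed_video_frame(self) -> VideoOutput:
        return await self.frame_queue.get()

async def run_stream(pipeline: EchoPipeline, frames: list[VideoFrame]) -> list[VideoOutput]:
    await pipeline.initialize()

    async def feed():  # producer: incoming WebRTC frames
        for frame in frames:
            await pipeline.put_video_frame(frame, request_id="stream-1")

    async def drain():  # consumer: outgoing processed stream
        return [await pipeline.get_processed_video_frame() for _ in frames]

    _, outputs = await asyncio.gather(feed(), drain())
    return outputs

outputs = asyncio.run(run_stream(EchoPipeline(), [VideoFrame([i]) for i in range(5)]))
print(len(outputs))  # 5
```

The producer and consumer run concurrently, which is why `put_video_frame` and `get_processed_video_frame` are separate methods rather than one request/response call.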

Integration requirements for custom pipelines

Custom live-video pipelines require two code changes in the upstream repositories:
  1. ai-worker/runner/dl_checkpoints.sh — add your pipeline to the model download script
  2. go-livepeer/ai/worker/docker.go — add your pipeline to the container image map (livePipelineToImage)
These changes are required because the current pipeline registry still uses a static mapping. See the full custom pipeline development guide in the ai-runner.md documentation.
The Livepeer team is working toward a fully dynamic plugin architecture that eliminates these manual upstream changes. Track progress on the ai-worker GitHub repository.

Model types for live-video inference

StreamDiffusion (primary)

Best for: Continuous style application, generative video effects, live prompt-to-video
  • lcm-lora variants for fastest inference
  • SD 1.5 base with Lightning LoRA
  • SDXL Turbo at reduced resolution

ControlNet variants

ControlNet conditioning allows style transfer guided by structure maps (depth, pose, edges) extracted from the input frame.
Source: DepthAnything on HuggingFace · DWPose · ControlNet paper

IP-Adapter (style reference)

IP-Adapter conditions generation on a reference image, enabling consistent style application across frames. Effective for brand-consistent visual transformation.
Source: tencent-ailab/IP-Adapter on GitHub

Performance tuning

Maximising fps

Cascade performance is dominated by per-frame inference latency. Key levers:
  • Model selection — use 1–2 step LCM or Lightning models instead of 20-step DDIM. The quality difference for streaming is acceptable; the latency difference is not.
  • Resolution — lowering resolution dramatically increases fps. 512×512 at 30 fps is achievable on an RTX 4090 with StreamDiffusion; 768×768 drops to ~20 fps; 1024×1024 to ~12 fps.
  • TensorRT compilation — for production deployments, compile the model to TensorRT engine format. One-time compilation overhead; 2–4× runtime speedup.
TensorRT compilation example
# In your Pipeline.prepare_models():
import torch
from torch2trt import torch2trt

# Trace the model once with a representative input, then save the engine
model_trt = torch2trt(model, [example_input], fp16_mode=True)
torch.save(model_trt.state_dict(), 'model_trt.pth')
Stochastic Similarity Filter: Enable in StreamDiffusion to skip inference on static or slow-moving frames. This usually lifts fps with no visible quality loss for typical video content.
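A minimal sketch of how such a similarity filter works (the threshold, the cosine metric, and the function names here are illustrative; StreamDiffusion's actual filter skips probabilistically):

```python
import numpy as np

def should_skip(prev: np.ndarray, curr: np.ndarray, threshold: float = 0.98) -> bool:
    """Skip inference when consecutive frames are nearly identical."""
    a = prev.astype(np.float32).ravel()
    b = curr.astype(np.float32).ravel()
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    if denom == 0:
        return True  # two blank frames: nothing new to render
    similarity = float(np.dot(a, b)) / denom  # cosine similarity in [0, 1]
    return similarity >= threshold

frame_a = np.full((64, 64, 3), 128, dtype=np.uint8)
frame_b = frame_a.copy()                      # static scene
frame_c = np.zeros((64, 64, 3), dtype=np.uint8)
frame_c[:32] = 255                            # large scene change

print(should_skip(frame_a, frame_b))  # True: reuse the previous output
print(should_skip(frame_a, frame_c))  # False: run inference
```

When a frame is skipped, the previous processed frame is re-emitted, so output fps stays constant while GPU work drops for static content.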

VRAM management

Unlike batch AI, live-video pipelines hold models in VRAM continuously for the duration of the stream. VRAM must be reserved for:
  • Model weights (~8–18 GB for SDXL-class)
  • Frame buffers (input + output, ~500 MB–1 GB per resolution)
  • ControlNet/LoRA adapters (~1–3 GB each)
  • Stream batch buffer (StreamDiffusion’s continuous frame queue)
On a 24 GB GPU, keep total loaded model + adapter weight below 20 GB to leave headroom for frame buffers and batch operations.

Multi-stream capacity

A single RTX 4090 usually handles 1–2 concurrent live-video streams depending on resolution and model complexity. For multi-stream capacity:
  • Multiple GPUs: The AI worker dispatches streams across multiple physical GPUs
  • GPU-per-stream pinning: ComfyStream assigns each stream its own GPU rather than splitting one model across GPUs
  • Scale-out: Run multiple orchestrator instances, each handling 1–2 streams, behind a gateway load balancer
Enterprise operators running 4090/A100 fleets monetise live-video AI at scale by advertising higher capacity values and maintaining multiple warm pipeline instances.

Troubleshooting

Low frame rate or high latency
  1. Model is too slow for target fps — try a lower-step model (LCM, Lightning) or reduce output resolution
  2. VRAM OOM on frame buffer — reduce stream_batch_size in StreamDiffusion config
  3. CPU bottleneck on encode/decode — WebRTC frame codec operations are CPU-bound; monitor CPU usage during streaming
  4. Network jitter — WebRTC is sensitive to packet loss; check your upstream network quality
Stream not reaching your orchestrator
  1. Confirm live-video-to-video appears on tools.livepeer.cloud/ai/network-capabilities under your orchestrator
  2. Verify the live-base container is running and healthy: docker ps --filter name=livepeer-ai-runner-live
  3. Check that your orchestrator’s serviceAddr is reachable from gateways — WebRTC ICE negotiation requires bidirectional reachability
  4. Confirm your node has WebRTC port access (typically UDP 8935 or your configured port)
Runner container fails to start
  1. Check model weights are present at the expected path (-aiModelsDir location)
  2. Check CUDA/driver compatibility — ComfyStream requires CUDA 12.0+
  3. Run the container manually to see startup output:
Check CUDA access inside the live-base container
docker run --gpus all --rm livepeer/ai-runner:live-base python -c "import torch; print(torch.cuda.is_available())"
This confirms CUDA is accessible inside the container.
Custom pipeline not recognized
  1. Verify livePipelineToImage in go-livepeer/ai/worker/docker.go includes your pipeline name
  2. Confirm dl_checkpoints.sh in ai-runner includes your pipeline’s model preparation step
  3. The model_id in aiModels.json must match the name field in your PipelineSpec exactly
  4. After rebuilding and redeploying, re-register capabilities with the network
Last modified on March 16, 2026