Cascade inference is fundamentally different from batch inference. Instead of processing a discrete request and returning a result, your node continuously transforms live video — receiving frames from a WebRTC stream, running inference on each frame, and streaming processed frames back with sub-100ms latency. This is the Cascade architecture: the live-video-to-video pipeline that powers live AI video effects, live style transfer, and streaming AI agents on the Livepeer network.

What Cascade is

Cascade is Livepeer’s live-video AI processing pipeline. The name refers to the architecture — video streams cascade through AI transformation nodes in the network, enabling live applications that previously required centralised infrastructure. Example applications on Cascade:
  • Daydream — generative AI video platform with live style application
  • StreamDiffusionTD — live diffusion via TouchDesigner
  • ComfyStream — browser-based ComfyUI pipelines with live video input
  • OBS plugins — live AI effects applied to streaming content
Source: Livepeer AI Subnet Announcement — Mirror.xyz · ComfyStream on GitHub · Daydream

How it differs from batch AI

The key difference is the continuous frame loop. Your pipeline receives frames as they arrive from the upstream WebRTC stream and must process and emit them quickly enough to avoid buffering. At 30 fps, the per-frame budget is 33 ms; at 24 fps, roughly 42 ms.
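The per-frame budget is simply the reciprocal of the target frame rate. A quick sanity check:

```python
def frame_budget_ms(fps: float) -> float:
    """Per-frame processing budget in milliseconds for a target frame rate."""
    return 1000.0 / fps

# The entire loop (decode + inference + encode) must fit inside this budget.
for fps in (24, 30, 60):
    print(f"{fps} fps -> {frame_budget_ms(fps):.1f} ms per frame")
```

Anything that pushes a single frame past this budget causes buffering and, eventually, dropped frames downstream.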

Prerequisites

Cascade has stricter hardware requirements than batch inference:
  • GPU: RTX 4090 (24 GB) strongly recommended. RTX 3090 (24 GB) is functional but with less headroom. A100/H100 for production multi-stream setups.
  • CPU: 8+ cores recommended. Frame decoding/encoding is CPU-bound.
  • Network: Low-latency connection. WebRTC streams are sensitive to packet loss and jitter.
  • CUDA: 12.0+
  • Docker with NVIDIA Container Toolkit
  • go-livepeer running with -aiWorker enabled
GPUs below 24 GB VRAM (e.g. RTX 3080 10 GB, RTX 3060 12 GB) are typically insufficient for Cascade inference at acceptable quality and frame rates. The combination of model weights, StreamDiffusion/ComfyUI overhead, and frame buffers exhausts available VRAM.

Architecture overview

Cascade architecture overview
Live video source (OBS, browser, camera)
        ↓ WebRTC ingest
Livepeer Gateway
        ↓ routes stream to capable orchestrator
go-livepeer AI Worker
        ↓ dispatches to live-video-to-video runner
ai-runner:live-base container
        ↓ ComfyStream / custom pipeline
Frame processing loop:
  receive frame → inference → emit frame
        ↓
Processed WebRTC stream
        ↓
Application / viewer
The orchestrator receives the WebRTC stream, passes it to the AI runner container, and streams the processed output back through the gateway to the application. From the application’s perspective, it’s an input stream in and a transformed stream out.

ComfyStream — the live-video pipeline runtime

ComfyStream is the primary runtime for live-video AI inference on Livepeer. It wraps ComfyUI’s node-based workflow system and adapts it for continuous frame processing. What ComfyStream adds over standard ComfyUI:
  • WebRTC frame ingestion and emission
  • Async frame queue for continuous processing
  • Warm model management to avoid per-frame load latency
  • Livepeer AI worker integration via the Pipeline interface
Compatible model types:
  • StreamDiffusion (optimised for live diffusion at 30+ fps)
  • Standard SDXL / SD 1.5 (lower fps, higher quality)
  • ControlNet variants (depth, pose, sketch, canny)
  • IP-Adapter (style reference)
  • DepthAnything / MiDaS (depth estimation)
  • SAM2 (live segmentation)
  • Any ComfyUI-compatible model loaded as a DAG node
Source: livepeer/comfystream on GitHub · ComfyUI-Stream-Pack

Setup

StreamDiffusion — live-video diffusion models

StreamDiffusion is the primary model architecture for live-video AI on Livepeer. It was designed specifically for continuous frame processing and achieves 30+ fps on an RTX 4090. How StreamDiffusion achieves live performance:
  1. Stream Batch — processes multiple frames simultaneously as a batch, amortising model overhead across frames
  2. Residual CFG — approximates classifier-free guidance with fewer forward passes
  3. Stochastic Similarity Filter — skips inference on frames that are sufficiently similar to the previous frame
  4. TinyVAE acceleration — uses a compressed VAE encoder/decoder for lower encode/decode latency
Source: cumulo-autumn/StreamDiffusion on GitHub · StreamDiffusion paper on arXiv
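The Stream Batch idea above can be sketched in a few lines (names and shapes here are illustrative, not the StreamDiffusion API): instead of one forward pass per frame, consecutive frames are stacked and pushed through the model together, so fixed per-call overhead is paid once per batch.

```python
import numpy as np

def run_model(batch: np.ndarray) -> np.ndarray:
    # Stand-in for a diffusion forward pass; here just an identity transform.
    return batch

def stream_batch(frames: list[np.ndarray], batch_size: int = 4) -> list[np.ndarray]:
    """Group consecutive frames and run them through the model as one batch."""
    outputs: list[np.ndarray] = []
    for i in range(0, len(frames), batch_size):
        batch = np.stack(frames[i:i + batch_size])  # shape (B, H, W, C)
        outputs.extend(run_model(batch))            # one model call per batch
    return outputs

frames = [np.zeros((512, 512, 3), dtype=np.uint8) for _ in range(10)]
processed = stream_batch(frames)
print(len(processed))  # 10 frames out, from only 3 model calls
```

With `batch_size=4`, ten frames cost three model invocations instead of ten; the real implementation interleaves denoising steps across the batch to keep latency bounded.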

ComfyUI workflow for StreamDiffusion

A typical ComfyStream workflow for live style application:
ComfyStream workflow example
{
  "nodes": [
    { "id": 1, "type": "LoadVideoInput", "outputs": ["frame"] },
    { "id": 2, "type": "VAEEncode", "inputs": ["frame"] },
    { "id": 3, "type": "StreamDiffusionSampler",
      "inputs": ["latent", "prompt", "model"],
      "config": { "num_inference_steps": 2, "guidance_scale": 1.2 }
    },
    { "id": 4, "type": "VAEDecode", "inputs": ["sampled_latent"] },
    { "id": 5, "type": "VideoOutput", "inputs": ["frame"] }
  ]
}
Steps as low as 2 support live performance. Quality vs latency is tunable through the workflow.

The Pipeline interface (custom pipelines)

For operators who want to build custom live-video AI processing beyond ComfyUI, the AI runner exposes a Python Pipeline interface. Custom pipelines extend this interface and are packaged as Docker images extending livepeer/ai-runner:live-base.
Custom live-video Pipeline interface example
import asyncio

from huggingface_hub import snapshot_download

from runner.live.pipelines import Pipeline
from runner.live.trickle import VideoFrame, VideoOutput

class MyLiveVideoPipeline(Pipeline):
    async def initialize(self, **params):
        # Load your model here (load_my_model is your own loader).
        # Use asyncio.to_thread() for blocking model load operations.
        self.model = await asyncio.to_thread(load_my_model, params)
        # Output queue consumed by get_processed_video_frame()
        self.frame_queue: asyncio.Queue[VideoOutput] = asyncio.Queue()

    async def put_video_frame(self, frame: VideoFrame, request_id: str):
        # Process one frame - this is called continuously
        result = await asyncio.to_thread(self.model.predict, frame.tensor)
        await self.frame_queue.put(VideoOutput(result, request_id))

    async def get_processed_video_frame(self) -> VideoOutput:
        return await self.frame_queue.get()

    async def update_params(self, **params):
        # Called when pipeline parameters are updated mid-stream
        pass

    async def stop(self):
        # Clean shutdown: release the model and any GPU memory here
        pass

    @classmethod
    def prepare_models(cls):
        # Download/compile models — called once during operator setup
        snapshot_download("my-org/my-model", ...)
Source: ai-runner Pipeline interface · scope-runner reference implementation
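The worker drives a pipeline roughly like this (a simplified sketch with toy stand-in classes, not the actual go-livepeer dispatch code): a feeder pushes frames in as they arrive while a separate consumer pulls processed frames for the outbound stream.

```python
import asyncio
from dataclasses import dataclass

# Minimal stand-ins for VideoFrame/VideoOutput so the sketch is self-contained.
@dataclass
class VideoFrame:
    tensor: list

@dataclass
class VideoOutput:
    tensor: list
    request_id: str

class EchoPipeline:
    """Toy pipeline mirroring the Pipeline interface: frames in, frames out."""
    async def initialize(self, **params):
        self.frame_queue: asyncio.Queue[VideoOutput] = asyncio.Queue()

    async def put_video_frame(self, frame: VideoFrame, request_id: str):
        # A real pipeline would run inference here before enqueueing.
        await self.frame_queue.put(VideoOutput(frame.tensor, request_id))

    async def get_processed_video_frame(self) -> VideoOutput:
        return await self.frame_queue.get()

async def run_stream(pipeline: EchoPipeline, frames: list[VideoFrame]) -> list[VideoOutput]:
    await pipeline.initialize()

    async def feed():  # producer: incoming WebRTC frames
        for frame in frames:
            await pipeline.put_video_frame(frame, request_id="stream-1")

    async def drain():  # consumer: outgoing processed stream
        return [await pipeline.get_processed_video_frame() for _ in frames]

    _, outputs = await asyncio.gather(feed(), drain())
    return outputs

outputs = asyncio.run(run_stream(EchoPipeline(), [VideoFrame([i]) for i in range(5)]))
print(len(outputs))  # 5
```

The producer and consumer run concurrently, which is why `put_video_frame` and `get_processed_video_frame` are separate methods rather than one request/response call.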

Integration requirements for custom pipelines

Custom live-video pipelines require two code changes in the upstream repositories:
  1. ai-worker/runner/dl_checkpoints.sh — add your pipeline to the model download script
  2. go-livepeer/ai/worker/docker.go — add your pipeline to the container image map (livePipelineToImage)
These changes are required because the current pipeline registry still uses a static mapping. See the full custom pipeline development guide in the ai-runner.md documentation.
The Livepeer team is working toward a fully dynamic plugin architecture that eliminates these manual upstream changes. Track progress on the ai-worker GitHub repository.

Model types for live-video inference

StreamDiffusion (primary)

Best for: Continuous style application, generative video effects, live prompt-to-video
  • lcm-lora variants for fastest inference
  • SD 1.5 base with Lightning LoRA
  • SDXL Turbo at reduced resolution

ControlNet variants

ControlNet conditioning allows style transfer guided by structure maps (depth, pose, edges) extracted from the input frame.
Source: DepthAnything on HuggingFace · DWPose · ControlNet paper

IP-Adapter (style reference)

IP-Adapter conditions generation on a reference image, enabling consistent style application across frames. Effective for brand-consistent visual transformation.
Source: tencent-ailab/IP-Adapter on GitHub

Performance tuning

Maximising fps

Cascade performance is dominated by per-frame inference latency. Key levers:
  • Model selection — use 1–2 step LCM or Lightning models instead of 20-step DDIM. The quality difference for streaming is acceptable; the latency difference is not.
  • Resolution — lowering resolution dramatically increases fps. 512×512 at 30 fps is achievable on an RTX 4090 with StreamDiffusion; 768×768 drops to ~20 fps; 1024×1024 to ~12 fps.
  • TensorRT compilation — for production deployments, compile the model to TensorRT engine format. One-time compilation overhead; 2–4× runtime speedup.
TensorRT compilation example
# In your Pipeline.prepare_models():
import torch
from torch2trt import torch2trt

# Trace the model once with a representative input, then save the engine
model_trt = torch2trt(model, [example_input], fp16_mode=True)
torch.save(model_trt.state_dict(), 'model_trt.pth')
Stochastic Similarity Filter: Enable in StreamDiffusion to skip inference on static or slow-moving frames. This usually lifts fps with no visible quality loss for typical video content.
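A minimal sketch of how such a similarity filter works (the threshold, the cosine metric, and the function names here are illustrative; StreamDiffusion's actual filter skips probabilistically):

```python
import numpy as np

def should_skip(prev: np.ndarray, curr: np.ndarray, threshold: float = 0.98) -> bool:
    """Skip inference when consecutive frames are nearly identical."""
    a = prev.astype(np.float32).ravel()
    b = curr.astype(np.float32).ravel()
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    if denom == 0:
        return True  # two blank frames: nothing new to render
    similarity = float(np.dot(a, b)) / denom  # cosine similarity in [0, 1]
    return similarity >= threshold

frame_a = np.full((64, 64, 3), 128, dtype=np.uint8)
frame_b = frame_a.copy()                      # static scene
frame_c = np.zeros((64, 64, 3), dtype=np.uint8)
frame_c[:32] = 255                            # large scene change

print(should_skip(frame_a, frame_b))  # True: reuse the previous output
print(should_skip(frame_a, frame_c))  # False: run inference
```

When a frame is skipped, the previous processed frame is re-emitted, so output fps stays constant while GPU work drops for static content.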

VRAM management

Unlike batch AI, live-video pipelines hold models in VRAM continuously for the duration of the stream. VRAM must be reserved for:
  • Model weights (~8–18 GB for SDXL-class)
  • Frame buffers (input + output, ~500 MB–1 GB per resolution)
  • ControlNet/LoRA adapters (~1–3 GB each)
  • Stream batch buffer (StreamDiffusion’s continuous frame queue)
On a 24 GB GPU, keep total loaded model + adapter weight below 20 GB to leave headroom for frame buffers and batch operations.

Multi-stream capacity

A single RTX 4090 usually handles 1–2 concurrent live-video streams depending on resolution and model complexity. For multi-stream capacity:
  • Multiple GPUs: The AI worker dispatches streams across multiple physical GPUs
  • GPU-per-stream pinning: ComfyStream assigns each stream its own GPU rather than splitting one model across GPUs
  • Scale-out: Run multiple orchestrator instances, each handling 1–2 streams, behind a gateway load balancer
Enterprise operators running 4090/A100 fleets monetise live-video AI at scale by advertising higher capacity values and maintaining multiple warm pipeline instances.

Troubleshooting

Low frame rate or high latency
  1. Model is too slow for target fps — try a lower-step model (LCM, Lightning) or reduce output resolution
  2. VRAM OOM on frame buffer — reduce stream_batch_size in StreamDiffusion config
  3. CPU bottleneck on encode/decode — WebRTC frame codec operations are CPU-bound; monitor CPU usage during streaming
  4. Network jitter — WebRTC is sensitive to packet loss; check your upstream network quality
Stream not reaching your orchestrator
  1. Confirm live-video-to-video appears on tools.livepeer.cloud/ai/network-capabilities under your orchestrator
  2. Verify the live-base container is running and healthy: docker ps --filter name=livepeer-ai-runner-live
  3. Check that your orchestrator’s serviceAddr is reachable from gateways — WebRTC ICE negotiation requires bidirectional reachability
  4. Confirm your node has WebRTC port access (typically UDP 8935 or your configured port)
Runner container fails to start
  1. Check model weights are present at the expected path (-aiModelsDir location)
  2. Check CUDA/driver compatibility — ComfyStream requires CUDA 12.0+
  3. Run the container manually to see startup output:
Check CUDA access inside the live-base container
docker run --gpus all --rm livepeer/ai-runner:live-base python -c "import torch; print(torch.cuda.is_available())"
This confirms CUDA is accessible inside the container.
Custom pipeline not recognized
  1. Verify livePipelineToImage in go-livepeer/ai/worker/docker.go includes your pipeline name
  2. Confirm dl_checkpoints.sh in ai-runner includes your pipeline’s model preparation step
  3. The model_id in aiModels.json must match the name field in your PipelineSpec exactly
  4. After rebuilding and redeploying, re-register capabilities with the network
Last modified on March 16, 2026