By the end of this tutorial you’ll have a working VTuber pipeline: webcam input flows into a ComfyStream workflow that extracts pose keypoints, conditions a StreamDiffusion model on those keypoints, and emits an animated avatar matching your pose in realtime. The output streams back to the browser via WebRTC. Latency budget is under 100ms end-to-end on adequate hardware. This is the Persona 1 and Persona 4 join: the agent builder who needs a face for their AI character, and the live-video builder who needs sub-second-latency video transformation. The Agent SPE built the first production avatar pipeline on this stack; the recipe below is what you replicate when you build the second.
Required Tools
- A working ComfyStream instance: RunPod, Docker, or local
- NVIDIA GPU with 16 GB+ VRAM. RTX 3090 minimum; RTX 4090 recommended for 25 FPS
- ComfyUI with the workflow editor accessible (port 8188 on a running ComfyStream container)
- A webcam, or an RTMP source publishing to the gateway
Pipeline Shape
A VTuber pipeline is a chain of four nodes inside one ComfyStream workflow. The conditioning stage can be swapped depending on the effect you want:

| Conditioning | Effect | When to use |
|---|---|---|
| DWPose | Pose-only, full character replacement | VTuber avatars, character swaps |
| Depth (DepthAnything / MiDaS) | Preserves 3D structure | Stylised camera subject, scene transformation |
| Canny / Sketch | Preserves edges | Line-art or comic effects |
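
As a rough sketch of the DWPose variant of this chain, the wiring can be written down as a list of edges and walked in order. The socket names are the ones used in the authoring steps of this tutorial, except `StreamDiffusionSampler.image`, which is an assumed name for the sampler output. The `EDGES` list and `chain_order` helper are illustrative only and are not part of ComfyStream.

```python
# Edges of the pose-conditioned chain, written as (source_output, target_input).
# The edge-list layout is illustrative, not a ComfyUI file format.
EDGES = [
    ("LoadTensor.image", "DWPose.image"),
    ("DWPose.pose_image", "StreamDiffusionSampler.control_image"),
    # Assumption: the sampler's output socket is called "image".
    ("StreamDiffusionSampler.image", "SaveTensor.image"),
]


def chain_order(edges):
    """Return node names in execution order for a simple linear chain."""
    sources = [src.split(".")[0] for src, _ in edges]
    targets = [dst.split(".")[0] for _, dst in edges]
    # The head of the chain is the node that never appears as a target.
    head = next(name for name in sources if name not in targets)
    next_node = dict(zip(sources, targets))
    order = [head]
    while order[-1] in next_node:
        order.append(next_node[order[-1]])
    return order


print(" -> ".join(chain_order(EDGES)))
```

Swapping the conditioning stage for Depth or Canny would change only the middle edges; the input and output nodes stay the same.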
Workflow Authoring
Open ComfyUI
With the ComfyStream container running, open the ComfyUI editor at http://localhost:8188. The blank workflow canvas appears.
Add the input and output nodes
Right-click the canvas, then search for and add:
- LoadTensor: pulls the live video frame
- SaveTensor: publishes the processed frame back
The eventual wiring is LoadTensor.image → (later, the model output) → SaveTensor.image.
Add pose extraction
Add DWPose Estimator (or OpenPose Estimator if DWPose is not available in your ComfyStream image). Connect LoadTensor.image → DWPose.image. The node outputs a pose_keypoints map and a rendered pose image.
DWPose is in ComfyUI-controlnet-aux, which ships in livepeer/comfystream by default.
Add StreamDiffusion with pose conditioning
Add three nodes:
- StreamDiffusionCheckpoint: loads the base diffusion model
- StreamDiffusionConfig: sets CFG, t-index, acceleration mode
- StreamDiffusionSampler: runs inference per frame
Connect DWPose.pose_image → StreamDiffusionSampler.control_image. Set control_type to openpose in the sampler config.
Set the avatar prompt
Add a CLIPTextEncode node and connect it to StreamDiffusionSampler.positive. Enter a prompt that describes the avatar.
Add a negative prompt to suppress artefacts. Connect the negative encode to StreamDiffusionSampler.negative.
First Run
Load the workflow in ComfyStream
Switch to the ComfyStream UI on port 8889. Pick your new workflow from the selector. First run triggers TensorRT compilation: 2-10 minutes depending on model size and GPU.
Connect your webcam
Pick your webcam from the camera input dropdown. The browser requests camera permission.
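
If either UI fails to load during these steps, a quick reachability check can rule out a container or port-mapping problem before you look at the workflow itself. The sketch below assumes the two ports named in this tutorial (8188 for the ComfyUI editor, 8889 for the ComfyStream UI) are exposed on `localhost`; the helper names are hypothetical. It only tests that a TCP connection succeeds, not that the service behind the port is healthy.

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Ports named in this tutorial; adjust if your container maps them differently.
UI_PORTS = {"ComfyUI editor": 8188, "ComfyStream UI": 8889}


def check_ui_ports(host: str = "localhost") -> dict:
    """Map each UI name to whether its port currently accepts connections."""
    return {name: port_open(host, port) for name, port in UI_PORTS.items()}


if __name__ == "__main__":
    for name, reachable in check_ui_ports().items():
        print(f"{name}: {'reachable' if reachable else 'not reachable'}")
```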
Latency Tuning
VTuber latency is the delay between your physical movement and the avatar’s responding movement. Under 100ms reads as instant; over 200ms reads as broken. Four levers move the number.
Model step count. A 25-step DDIM run is unusable in realtime. LCM (1-2 steps) and Lightning (4 steps) variants are the realistic options. The visual quality difference at 512x512 is acceptable; the latency difference is not.
Resolution. 512x512 at 30 FPS is achievable on an RTX 4090 with StreamDiffusion. 768x768 drops to ~20 FPS. 1024x1024 drops to ~12 FPS. Most VTuber output destinations (Twitch, YouTube Live) downscale anyway, so a 512x512 source is usually correct.
TensorRT compilation. Once per deployment, compile the model to a TensorRT engine. Runtime speedup is 2-4x with no visual quality cost. ComfyStream handles compilation automatically on first workflow load.
Stochastic Similarity Filter. StreamDiffusion’s SSF skips inference on near-identical frames. For VTuber content where the streamer often sits relatively still, SSF lifts effective FPS by 30-50% with no quality drop. Enable it in StreamDiffusionConfig with enable_similar_image_filter = true.
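
To put the quoted frame rates into the same units as the latency budget, the arithmetic can be sketched as follows. This assumes frames are processed one at a time, so the time spent on each frame is roughly the inverse of throughput; real pipelines add capture, pose extraction, and transport on top. The function names are illustrative, and the SSF lift is applied as a simple multiplier using the 30-50% range quoted above.

```python
def frame_interval_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def effective_fps(base_fps: float, ssf_lift: float) -> float:
    """Effective frame rate after a fractional SSF lift (0.30 means +30%)."""
    return base_fps * (1.0 + ssf_lift)


# Frame rates quoted in this section for an RTX 4090 with StreamDiffusion.
QUOTED_FPS = {"512x512": 30, "768x768": 20, "1024x1024": 12}

for resolution, fps in QUOTED_FPS.items():
    low = effective_fps(fps, 0.30)
    high = effective_fps(fps, 0.50)
    print(
        f"{resolution}: {frame_interval_ms(fps):.1f} ms per frame, "
        f"{low:.0f}-{high:.0f} FPS with SSF on mostly static input"
    )
```

At 12 FPS the frame interval alone is about 83 ms, which leaves very little of a 100ms budget for everything else; that is the practical reason to stay at 512x512.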
Agent-Controlled Avatars
The avatar so far mirrors your pose. To drive the avatar from an AI agent (text in, motion and speech out), three additions wire on top.
LLM for character voice. Route agent text generation through an LLM; the LLM Chatbot Tutorial covers the wire format.
Text-to-speech for character voice. Route the LLM output through the text-to-speech batch pipeline (parler-tts/parler-tts-large-v1). The TTS warm model accepts a description parameter for voice characteristics.
Lip-sync conditioning. For lip-synced output, add an audio-driven pose node that maps the TTS audio to lip and jaw keypoints. The Agent SPE production pipeline uses this approach; the exact node varies by model. As of Phase 4 the Daydream stack uses a custom audio-to-pose adapter not yet released as an open custom node.
For an end-to-end agent + avatar setup, see the Eliza Plugin Tutorial; Eliza handles character files, RAG, and multi-agent orchestration.
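
For the text-to-speech step, a request body might be assembled along these lines. This is a sketch under assumptions: the model ID and the `description` parameter come from this tutorial, while the `model_id` and `text` field names and the `build_tts_request` helper are hypothetical and should be checked against the gateway's API reference before use. The code only builds the JSON body; it does not send it anywhere.

```python
import json

# Model named in this tutorial for the text-to-speech batch pipeline.
TTS_MODEL_ID = "parler-tts/parler-tts-large-v1"


def build_tts_request(text: str, description: str) -> str:
    """Serialise a text-to-speech request body as JSON.

    "description" is the voice-characteristics parameter mentioned in this
    tutorial. The "model_id" and "text" field names are assumptions.
    """
    if not text.strip():
        raise ValueError("text must not be empty")
    payload = {
        "model_id": TTS_MODEL_ID,
        "text": text,
        "description": description,
    }
    return json.dumps(payload)


body = build_tts_request(
    text="Hello chat, welcome back to the stream!",
    description="A warm, energetic voice with clear articulation.",
)
print(body)
```

Keeping the voice description in one place per character makes it easier to hold the voice steady across responses, in the same way the avatar prompt holds the look steady.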
Production Considerations
Local execution proves the pipeline. Production shipping needs three more layers.
Dedicated GPU per stream. Real-time AI assigns a GPU to a stream for its full duration. A four-viewer concurrent test needs four GPUs in your orchestrator pool, or four orchestrators with one GPU each. Test under expected concurrency; orchestrator availability for live-video-to-video is lower than for batch pipelines at peak network load.
Gateway routing. Self-hosted gateways pick the orchestrator running your ComfyStream image. Production deployments either run a gateway pool or use a paid gateway provider with BYOC routing enabled.
Stream out to viewers. ComfyStream emits WebRTC. For broadcast to Twitch, YouTube Live, or a player, restream through OBS, a media server (e.g. MediaMTX), or a dedicated forwarder. Direct WebRTC-to-RTMP bridges are available; ComfyStream itself does not re-encode for those targets.
Full real-time hardening guidance is in Realtime AI Setup (Orchestrators tab; same primitives apply to single-user deployments).
Common Errors
DWPose node not found
ComfyUI-controlnet-aux is missing from the ComfyStream image you’re using. The official livepeer/comfystream:latest includes it. For custom builds, install with pip install -r requirements.txt from the ComfyUI-controlnet-aux repo and restart the server.
Output avatar drifts off-pose by 200ms
Pose extraction latency is added to inference latency. Drop to a smaller pose model (DWPose-S instead of DWPose-L), or use OpenPose if DWPose latency dominates. Confirm with nvidia-smi that GPU utilisation is high; low utilisation means a CPU bottleneck on pose extraction.
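
The nvidia-smi check can be scripted so you can sample utilisation while the stream is running. The sketch below uses nvidia-smi's `--query-gpu` and `--format=csv,noheader,nounits` options; the 60% threshold in `looks_cpu_bound` is an arbitrary illustrative cut-off, not a figure from ComfyStream or NVIDIA, and the helper names are hypothetical.

```python
import subprocess


def parse_gpu_stats(csv_line: str) -> dict:
    """Parse one 'utilization.gpu, memory.used' CSV line from nvidia-smi."""
    util, mem_used = (field.strip() for field in csv_line.split(","))
    return {"gpu_util_pct": int(util), "mem_used_mib": int(mem_used)}


def read_gpu_stats() -> list:
    """Query every visible GPU once; requires nvidia-smi on PATH."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=utilization.gpu,memory.used",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [parse_gpu_stats(line) for line in out.splitlines() if line.strip()]


def looks_cpu_bound(stats: dict, util_threshold_pct: int = 60) -> bool:
    """Flag a GPU whose utilisation is below an (assumed) threshold."""
    return stats["gpu_util_pct"] < util_threshold_pct


if __name__ == "__main__":
    for index, stats in enumerate(read_gpu_stats()):
        hint = "possible CPU bottleneck" if looks_cpu_bound(stats) else "GPU busy"
        print(f"GPU {index}: {stats['gpu_util_pct']}% util, "
              f"{stats['mem_used_mib']} MiB used ({hint})")
```

A single sample can mislead, so take several readings a second or so apart while the avatar is moving.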
Output avatar identity shifts frame-to-frame
Without IP-Adapter or LoRA, StreamDiffusion’s character identity drifts because each frame is sampled independently. Add an IP-Adapter node with a reference image of the target avatar, or train a character LoRA and load it via StreamDiffusionConfig.loras.
FPS stable but choppy playback
Network jitter or browser-side rendering. WebRTC is sensitive to packet loss; on a remote ComfyStream instance, confirm the UDP port range 1024-65535 is open both ways. For local instances with choppy playback, the issue is usually the browser; test in Chrome before debugging the pipeline.
OOM on first compilation
TensorRT compilation allocates extra VRAM temporarily. A 16 GB GPU running a workflow that fits in 14 GB at runtime may OOM at compile. Either compile on a larger GPU and copy the engine, or drop to a smaller base model.
Next Steps
Workflow Authoring
Deep guide on ComfyStream workflow construction, control flow, custom nodes.
Eliza Plugin Tutorial
Agent character files, RAG, multi-agent orchestration.
LLM Chatbot Tutorial
Wire LLM-driven dialogue into the avatar.
Realtime AI Setup (Operator)
Orchestrator-side primitives, capacity planning, GPU selection.