This guide covers how to measure those limits, decide when they have been reached, and scale in the right direction.
Scaling Signals
Before adjusting anything, confirm there is actually a capacity problem. These are the reliable indicators.
Session limit reached
The Prometheus metric livepeer_current_sessions_total approaching livepeer_max_sessions_total means the Gateway is at capacity and will start rejecting new sessions. This is the clearest signal.
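As a minimal sketch of this check, the two gauges can be compared directly. The helper names and the 5% margin below are illustrative, not part of Livepeer; the metric scraping itself is omitted.

```python
# Hypothetical helper: decide whether the Gateway is near its session limit,
# given values of the two Prometheus gauges named above
# (livepeer_current_sessions_total / livepeer_max_sessions_total).

def session_headroom(current_sessions: int, max_sessions: int) -> float:
    """Return remaining capacity as a fraction (0.0 = full, 1.0 = idle)."""
    if max_sessions <= 0:
        raise ValueError("max_sessions must be positive")
    return 1.0 - (current_sessions / max_sessions)

def at_capacity(current_sessions: int, max_sessions: int, margin: float = 0.05) -> bool:
    """True when less than `margin` (5% by default) of session capacity remains."""
    return session_headroom(current_sessions, max_sessions) < margin
```

For example, 48 of 50 sessions in use leaves only 4% headroom, which trips the default margin.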
GPU memory pressure
For Dual Gateways, the /hardware/stats endpoint reports GPU utilisation and memory. If VRAM usage is consistently above 85%, the Gateway is at risk of OOM failures on new model loads (AI) or performance degradation on concurrent transcoding segments (video).
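A sketch of the 85% rule applied to the endpoint's response is below. The JSON field names (`gpu_info`, `memory_used`, `memory_total`) are assumptions for illustration; adjust them to match the real response shape.

```python
# Flag GPUs whose VRAM usage exceeds 85%, given a parsed /hardware/stats
# response. Field names here are assumed, not confirmed Livepeer schema.

VRAM_PRESSURE_THRESHOLD = 0.85

def gpus_under_pressure(stats: dict) -> list:
    """Return the IDs of GPUs whose memory usage exceeds the threshold."""
    pressured = []
    for gpu_id, gpu in stats.get("gpu_info", {}).items():  # assumed field
        used = gpu["memory_used"]    # assumed field, bytes
        total = gpu["memory_total"]  # assumed field, bytes
        if total > 0 and used / total > VRAM_PRESSURE_THRESHOLD:
            pressured.append(gpu_id)
    return pressured
```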
Increasing latency under load
Rising transcoding latency or AI inference latency under load, combined with high Orchestrator swap rates, suggests the Orchestrators being routed to are also under pressure. This is an Orchestrator-side scaling problem; see the guide on expanding the Orchestrator pool.
Client rejections
Errors with OrchestratorCapped in the Gateway log indicate that a downstream Orchestrator has hit its own session limit and rejected the job. Either expand the Orchestrator pool or negotiate higher capacity with preferred Orchestrators.

Capacity Planning
Video transcoding
Per GPU, the practical limit for 1080p to 720p/480p/360p transcoding is approximately 8-12 concurrent sessions on a modern NVIDIA T4, and 15-25 on an RTX 3080 or equivalent. Run livepeer_bench on the specific hardware to get a precise number.

Multiply the per-GPU limit by the number of GPUs in the Orchestrator pool to get total network capacity; the Gateway can route to all of them.

For deposit sizing: video transcoding payments are per-pixel. Estimate the expected pixel throughput (resolution x frame rate x concurrent sessions x hours per day) and size the Arbitrum deposit to cover at least 24 hours of expected traffic.
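The deposit estimate above can be sketched as simple arithmetic. The function names are illustrative, and `price_per_pixel_wei` is an assumed figure you would take from your own Orchestrator pricing, not a value this guide supplies.

```python
# Back-of-envelope deposit sizing for per-pixel payments, following the
# formula in the text: resolution x frame rate x concurrent sessions x hours.
# All names here are illustrative; price_per_pixel_wei is an assumed rate.

def pixels_per_day(width: int, height: int, fps: int,
                   concurrent_sessions: int, hours_per_day: float) -> int:
    """Expected pixel throughput per day of traffic."""
    return int(width * height * fps * concurrent_sessions * hours_per_day * 3600)

def min_deposit_wei(daily_pixels: int, price_per_pixel_wei: int,
                    days_of_cover: float = 1.0) -> int:
    """Deposit covering at least `days_of_cover` days (24 h each) of traffic."""
    return int(daily_pixels * price_per_pixel_wei * days_of_cover)

# Example: 10 concurrent 1080p30 streams, 8 hours of traffic per day.
daily = pixels_per_day(1920, 1080, 30, 10, 8)
```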
AI inference
AI inference capacity depends entirely on the model and the Orchestrators in the pool. FLUX.1-dev requires approximately 12-16 GB of VRAM per concurrent inference; smaller SD 1.5 models can fit in 6 GB.

For off-chain AI Gateways, the bottleneck is the number of AI-capable Orchestrators available. Expand the -orchAddr list or use a discovery URL that returns a larger pool.

Set a monitoring alert at 70% of observed capacity. This gives time to provision additional Orchestrators before hitting the ceiling.
Alert thresholds