This guide covers how to measure those limits, decide when they have been reached, and scale in the right direction.
Scaling Signals
Before adjusting anything, confirm there is actually a capacity problem. These are the reliable indicators.
Session limit reached
The Prometheus metric livepeer_current_sessions_total approaching livepeer_max_sessions_total means the Gateway is at capacity and will start rejecting new sessions. This is the clearest signal.
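As a minimal sketch of this check, the two gauges can be compared directly. The helper names and the 5% margin below are illustrative, not part of Livepeer; the metric scraping itself is omitted.

```python
# Hypothetical helper: decide whether the Gateway is near its session limit,
# given values of the two Prometheus gauges named above
# (livepeer_current_sessions_total / livepeer_max_sessions_total).

def session_headroom(current_sessions: int, max_sessions: int) -> float:
    """Return remaining capacity as a fraction (0.0 = full, 1.0 = idle)."""
    if max_sessions <= 0:
        raise ValueError("max_sessions must be positive")
    return 1.0 - (current_sessions / max_sessions)

def at_capacity(current_sessions: int, max_sessions: int, margin: float = 0.05) -> bool:
    """True when less than `margin` (5% by default) of session capacity remains."""
    return session_headroom(current_sessions, max_sessions) < margin
```

For example, 48 of 50 sessions in use leaves only 4% headroom, which trips the default margin.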
GPU memory pressure
For Dual Gateways, the /hardware/stats endpoint reports GPU utilisation and memory. If VRAM usage is consistently above 85%, the Gateway is at risk of OOM failures on new model loads (AI) or performance degradation on concurrent transcoding segments (video).
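A sketch of the 85% rule applied to the endpoint's response is below. The JSON field names (`gpu_info`, `memory_used`, `memory_total`) are assumptions for illustration; adjust them to match the real response shape.

```python
# Flag GPUs whose VRAM usage exceeds 85%, given a parsed /hardware/stats
# response. Field names here are assumed, not confirmed Livepeer schema.

VRAM_PRESSURE_THRESHOLD = 0.85

def gpus_under_pressure(stats: dict) -> list:
    """Return the IDs of GPUs whose memory usage exceeds the threshold."""
    pressured = []
    for gpu_id, gpu in stats.get("gpu_info", {}).items():  # assumed field
        used = gpu["memory_used"]    # assumed field, bytes
        total = gpu["memory_total"]  # assumed field, bytes
        if total > 0 and used / total > VRAM_PRESSURE_THRESHOLD:
            pressured.append(gpu_id)
    return pressured
```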
Increasing latency under load
Rising transcoding latency or AI inference latency under load, combined with high Orchestrator swap rates, suggests the Orchestrators being routed to are also under pressure. This is an Orchestrator-side scaling problem; see the guide on expanding the Orchestrator pool.
Client rejections
Errors with OrchestratorCapped in the Gateway log indicate that a downstream Orchestrator has hit its own session limit and rejected the job. Either expand the Orchestrator pool or negotiate higher capacity with preferred Orchestrators.

Capacity Planning
Video transcoding
Per GPU, the practical limit for 1080p to 720p/480p/360p transcoding is approximately 8-12 concurrent sessions on a modern NVIDIA T4, and 15-25 on an RTX 3080 or equivalent. Run livepeer_bench on the specific hardware to get a precise number.

Multiply the per-GPU limit by the number of GPUs in the Orchestrator pool to get total network capacity; the Gateway can route to all of them.

For deposit sizing: video transcoding payments are per-pixel. Estimate the expected pixel throughput (resolution x frame rate x concurrent sessions x hours per day) and size the Arbitrum deposit to cover at least 24 hours of expected traffic.
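The deposit estimate above can be sketched as simple arithmetic. The function names are illustrative, and `price_per_pixel_wei` is an assumed figure you would take from your own Orchestrator pricing, not a value this guide supplies.

```python
# Back-of-envelope deposit sizing for per-pixel payments, following the
# formula in the text: resolution x frame rate x concurrent sessions x hours.
# All names here are illustrative; price_per_pixel_wei is an assumed rate.

def pixels_per_day(width: int, height: int, fps: int,
                   concurrent_sessions: int, hours_per_day: float) -> int:
    """Expected pixel throughput per day of traffic."""
    return int(width * height * fps * concurrent_sessions * hours_per_day * 3600)

def min_deposit_wei(daily_pixels: int, price_per_pixel_wei: int,
                    days_of_cover: float = 1.0) -> int:
    """Deposit covering at least `days_of_cover` days (24 h each) of traffic."""
    return int(daily_pixels * price_per_pixel_wei * days_of_cover)

# Example: 10 concurrent 1080p30 streams, 8 hours of traffic per day.
daily = pixels_per_day(1920, 1080, 30, 10, 8)
```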
AI inference
AI inference capacity depends entirely on the model and the Orchestrators in the pool. FLUX.1-dev requires approximately 12-16 GB of VRAM per concurrent inference; smaller SD 1.5 models can fit in 6 GB.

For off-chain AI Gateways, the bottleneck is the number of AI-capable Orchestrators available. Expand the -orchAddr list or use a discovery URL that returns a larger pool.

Set a monitoring alert at 70% of observed capacity. This gives time to provision additional Orchestrators before hitting the ceiling.
Alert thresholds