Warm the model that gateways request most often on your highest-demand pipeline. Check tools.livepeer.cloud weekly; demand shifts as new models are listed and gateways update their routing preferences.

AI model management covers the operational decisions made after models are downloaded: which models to keep warm in VRAM, how to allocate VRAM across multiple pipelines, when to rotate warm models based on demand changes, and which optimisation flags to apply for throughput gains. Model sourcing and downloading are covered separately in .

Warm vs cold strategy

Warm means the model weights are loaded into GPU VRAM at container startup. Job requests are served immediately with no loading latency. Cold keeps the container available while leaving the weights out of VRAM. The first request triggers a model load — typically 10 to 60 seconds depending on model size and NVMe storage speed — before inference begins.

Impact on job routing

Gateways track first-response latency per orchestrator. Nodes with fast first responses win more jobs. For latency-sensitive pipelines — particularly text-to-image and image-to-image — cold loading creates a competitive disadvantage on the first request of each session. The practical rule: warm your primary revenue pipeline. Keep secondary pipelines cold until VRAM capacity allows warming them.

Beta constraint: one warm model per GPU

During the Beta phase, only one warm model per GPU is supported. Setting "warm": true on more entries than you have GPUs causes the AI worker to log a conflict at startup and skip the excess entries. Check logs on startup for:
Check warm model startup logs
docker logs <ai-runner-container> 2>&1 | grep -i "warm\|conflict\|error"
The Error loading warm model message indicates a warm model conflict. Reduce "warm": true entries to match your GPU count.
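The constraint can be sanity-checked before startup by comparing warm entries against available GPUs. A minimal Python sketch (the helper names and the inline example entries are illustrative, not part of the worker):

```python
def count_warm(entries):
    """Count aiModels.json entries flagged to load into VRAM at startup."""
    return sum(1 for e in entries if e.get("warm"))

def check_warm_capacity(entries, gpu_count):
    """Beta constraint: at most one warm model per GPU."""
    warm = count_warm(entries)
    if warm > gpu_count:
        return f"conflict: {warm} warm entries but only {gpu_count} GPU(s)"
    return "ok"

# Example mirroring the Beta conflict: two warm entries on a one-GPU host.
models = [
    {"pipeline": "text-to-image", "model_id": "A", "warm": True},
    {"pipeline": "image-to-image", "model_id": "B", "warm": True},
]
print(check_warm_capacity(models, gpu_count=1))
```

On a two-GPU host the same configuration passes, since each warm model gets its own card.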

VRAM allocation

A 24 GB GPU holds one large diffusion model warm. In a multi-GPU system, a small pipeline (Whisper or BLIP) can be kept warm on a second card. Keep multiple large diffusion models off a single 24 GB GPU: the Beta constraint blocks a second warm model, and VRAM is insufficient anyway.

Model rotation by demand

Demand on the Livepeer AI network shifts over time. A model leading one week often falls back the next as new models are listed or gateway preferences change. Warm model selection should track demand and revenue opportunity.

Checking current demand

Visit tools.livepeer.cloud/ai/network-capabilities weekly. Filter by pipeline to see which models active gateways are requesting. Models with the most gateway registrations are receiving the most routing traffic. The Livepeer Explorer AI leaderboard shows per-orchestrator earnings data, which reveals which price tiers and pipelines are earning the most jobs.

Rotating the warm model

To swap which model is warm, update aiModels.json and restart the AI worker:
aiModels.json — rotate warm model
[
  {
    "pipeline": "text-to-image",
    "model_id": "SG161222/RealVisXL_V4.0_Lightning",
    "price_per_unit": 4768371,
    "warm": true
  },
  {
    "pipeline": "text-to-image",
    "model_id": "ByteDance/SDXL-Lightning",
    "price_per_unit": 4768371,
    "warm": false
  }
]
The model with "warm": true loads at startup. The cold entry remains available but incurs first-request latency. Restart only the AI worker container, leaving the full go-livepeer process running, to minimise downtime; then verify the new warm model shows Warm status at tools.livepeer.cloud/ai/network-capabilities:
Restart the AI worker
docker restart <ai-runner-container-name>
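The swap itself can be scripted so that exactly one entry per pipeline carries the warm flag. A hedged Python sketch (the function name is illustrative; reading and writing the actual aiModels.json file is left out):

```python
import json

def rotate_warm(entries, pipeline, new_warm_id):
    """Set warm=True on new_warm_id and warm=False on every other
    entry in the same pipeline, leaving other pipelines untouched."""
    for e in entries:
        if e.get("pipeline") == pipeline:
            e["warm"] = (e.get("model_id") == new_warm_id)
    return entries

# Example: promote SDXL-Lightning to warm for text-to-image.
models = [
    {"pipeline": "text-to-image",
     "model_id": "SG161222/RealVisXL_V4.0_Lightning", "warm": True},
    {"pipeline": "text-to-image",
     "model_id": "ByteDance/SDXL-Lightning", "warm": False},
]
rotate_warm(models, "text-to-image", "ByteDance/SDXL-Lightning")
print(json.dumps(models, indent=2))
```

Writing the result back to aiModels.json and restarting the AI worker container completes the rotation.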

Optimisation flags

Optimisation flags apply to warm diffusion models only: text-to-image, image-to-image, and upscale entries with "warm": true. They have no effect on cold models or on non-diffusion pipelines. Both flags are experimental; apply one at a time and verify output quality before serving jobs. SFAST and DEEPCACHE cannot be combined. Set only one, or neither, in optimization_flags.
Reserve DEEPCACHE for standard-step models. Lightning and Turbo variants (for example ByteDance/SDXL-Lightning and SG161222/RealVisXL_V4.0_Lightning) operate at 1 to 4 inference steps, so DEEPCACHE degrades output quality without adding speed.
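As an illustration, a single flag could be attached to the warm entry like this (the shape of the optimization_flags object shown here is an assumption; confirm the exact key names against the current aiModels.json schema before use):

```json
[
  {
    "pipeline": "text-to-image",
    "model_id": "SG161222/RealVisXL_V4.0_Lightning",
    "price_per_unit": 4768371,
    "warm": true,
    "optimization_flags": {
      "SFAST": true
    }
  }
]
```

SFAST rather than DEEPCACHE is shown because this is a Lightning variant, where DEEPCACHE would degrade output quality as noted above.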

Monitoring model loading

Verify model state after startup by checking the AI runner container logs:
Check model loading logs
docker logs <ai-runner-container> 2>&1 | grep -E "warm|loaded|error" | tail -30
Then confirm via the network registry at tools.livepeer.cloud/ai/network-capabilities: your orchestrator should appear under the correct pipeline with Warm status. A warm model missing from the registry after 5 minutes usually indicates one of these causes:
  • Error loading warm model — warm model conflict (too many "warm": true entries for available GPUs)
  • Container restart loops — check docker ps for restart counts
  • Model download still in progress — warm models must finish downloading before the container loads them
Last modified on March 16, 2026