This guide covers configuring aiModels.json, choosing and loading models, connecting external runners where needed, setting prices, and checking health and routing once the worker is live.
Use this guide once your orchestrator node is already running and connected to the network. Nodes still in initial setup should start with Run an Orchestrator.
Prerequisites
Before configuring AI pipelines, ensure:
- go-livepeer is running with the -aiWorker flag enabled
- NVIDIA Container Toolkit is installed and working (docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi)
- Docker is running with GPU access
- You have a ~/.lpData/aiModels.json file or know where you want to create one
How the AI worker runs pipelines
When go-livepeer starts with -aiWorker, it reads aiModels.json and starts a Docker container for each configured pipeline from the livepeer/ai-runner image. All batch pipelines use this image except llm, which runs on a separate Ollama-based runner. The AI worker manages container lifecycle: starting, health-checking, and restarting containers automatically.
aiModels.json — full reference
aiModels.json is the single file that controls everything about your AI worker: which pipelines you run, which models you load, whether they stay warm in VRAM, and how you price each job.
Default location: ~/.lpData/aiModels.json
Override location: set with -aiModels flag at startup
Minimal working example
Minimal aiModels.json example
A text-to-image pipeline with a competitive warm model.
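A sketch of what such a file can look like: one text-to-image entry with a warm model. The model ID comes from the examples elsewhere in this guide; the price is the illustrative wei value from the Pricing section, not a recommendation.

```json
[
  {
    "pipeline": "text-to-image",
    "model_id": "SG161222/RealVisXL_V4.0_Lightning",
    "price_per_unit": 4768371,
    "warm": true
  }
]
```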
Complete field reference
Supported pipelines and recommended models
text-to-image
Generate images from text prompts. The highest-demand pipeline on the network.
text-to-image aiModels.json entry
Alternative models:
- ByteDance/SDXL-Lightning — similar performance, different base
- stabilityai/stable-diffusion-xl-base-1.0 — higher quality, slower
image-to-image
Apply diffusion-based transformations, style transfer, or enhancement to an input image.
image-to-image aiModels.json entry
This pipeline can share GPU capacity with text-to-image when cold-loading is acceptable.
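An illustrative entry, left cold to save VRAM. The model ID is one of the SDXL-family models named in the text-to-image section; the price is a placeholder.

```json
{
  "pipeline": "image-to-image",
  "model_id": "ByteDance/SDXL-Lightning",
  "price_per_unit": 4768371,
  "warm": false
}
```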
image-to-video
Animate a still image into a short video clip. Compute-intensive — expect longer per-job times.
image-to-video aiModels.json entry
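A sketch of an entry for this pipeline. The Stable Video Diffusion model ID here is an assumption; confirm the supported model against the AI Model Support page before using it. The price is a placeholder.

```json
{
  "pipeline": "image-to-video",
  "model_id": "stabilityai/stable-video-diffusion-img2vid-xt-1-1",
  "price_per_unit": 4768371,
  "warm": false
}
```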
image-to-text
Generate text descriptions of images. Accessible to operators with older or lower-end GPUs.
image-to-text aiModels.json entry
Low VRAM needs make it practical to run image-to-text and audio-to-text side by side on one GPU.
Source: Salesforce/blip-image-captioning-large
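An entry using the BLIP model cited above; the price is a placeholder.

```json
{
  "pipeline": "image-to-text",
  "model_id": "Salesforce/blip-image-captioning-large",
  "price_per_unit": 4768371,
  "warm": true
}
```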
audio-to-text
Speech recognition and transcription with timestamps. Backed by Whisper-large-v3.
audio-to-text aiModels.json entry
openai/whisper-large-v3 is the current network standard for accuracy and is the model most gateway operators request.
Source: openai/whisper-large-v3
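An entry using the Whisper model cited above, kept warm since it fits in low VRAM; the price is a placeholder.

```json
{
  "pipeline": "audio-to-text",
  "model_id": "openai/whisper-large-v3",
  "price_per_unit": 4768371,
  "warm": true
}
```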
segment-anything-2
Promptable segmentation — returns pixel masks for objects or regions in an image or video frame.
segment-anything-2 aiModels.json entry
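An illustrative entry. The SAM 2 model ID shown is an assumption; check the AI Model Support page for the currently supported variant. The price is a placeholder.

```json
{
  "pipeline": "segment-anything-2",
  "model_id": "facebook/sam2-hiera-large",
  "price_per_unit": 4768371,
  "warm": false
}
```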
upscale
Upscale low-resolution images to high resolution using diffusion-based super-resolution.
upscale aiModels.json entry
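An illustrative entry; the diffusion upscaler model ID is an assumption to verify against the AI Model Support page, and the price is a placeholder.

```json
{
  "pipeline": "upscale",
  "model_id": "stabilityai/stable-diffusion-x4-upscaler",
  "price_per_unit": 4768371,
  "warm": false
}
```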
text-to-speech
Text-to-natural-speech synthesis. Growing use case for AI video narration.
text-to-speech aiModels.json entry
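An illustrative entry. The Parler-TTS model ID is an assumption, not confirmed by this guide; substitute whatever the AI Model Support page lists for text-to-speech. The price is a placeholder.

```json
{
  "pipeline": "text-to-speech",
  "model_id": "parler-tts/parler-tts-large-v1",
  "price_per_unit": 4768371,
  "warm": false
}
```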
LLM inference — the Ollama runner
The llm pipeline uses a different architecture from all other batch pipelines. Instead of the standard livepeer/ai-runner container, it uses an Ollama-based runner maintained by Cloud SPE. This enables quantised LLMs to run on GPUs with as little as 8 GB VRAM.
LLM pipeline flow
Why Ollama?
Standard diffusion pipelines require 24 GB VRAM and server-class GPUs. The Ollama runner opens participation to older consumer GPUs (GTX 1080, RTX 2060) that would otherwise contribute nothing to the AI network. Quantised LLMs — especially 7B and 8B parameter models — run efficiently within 8–12 GB VRAM.
Source: tztcloud/livepeer-ollama-runner on Docker Hub · Cloud SPE LLM pipeline guide
Setup
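A sketch of an llm entry that delegates to an external Ollama-based runner via the url field. The hostname matches the llm_runner example used in the troubleshooting section; the model ID and capacity are assumptions for illustration. Note that model_id stays a HuggingFace-style ID, not an Ollama tag.

```json
{
  "pipeline": "llm",
  "model_id": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "url": "http://llm_runner:8000",
  "warm": true,
  "capacity": 2
}
```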
Supported models via Ollama (at time of writing):
Warm vs cold models
Warm: The model is preloaded into GPU VRAM at container startup. Any job request is served immediately — no model loading latency.
Cold: The model is loaded on first request. The container runs, but the weights stay on disk until the first request triggers a model load, typically 10–60 seconds depending on model size and NVMe speed.
Impact on job assignment
Gateways track orchestrator latency. Nodes with fast first-response times win more jobs. For latency-sensitive pipelines — especially text-to-image and image-to-image — running cold puts you at a clear competitive disadvantage.
Rule of thumb: Warm your primary revenue pipeline. Cold the rest.
VRAM planning for warm models
A 24 GB GPU supports one large diffusion model warm, or a combination of smaller pipelines simultaneously. See Model Hosting and VRAM Planning for multi-model patterns.
Optimisation flags
optimization_flags apply only to warm: true diffusion models (text-to-image, image-to-image, upscale). Both flags are experimental. Primary references: Stable Fast and DeepCache.
SFAST — Stable Fast (up to 25% faster)
Enables the Stable Fast optimisation framework. Compiles the diffusion model’s compute graph on first run to eliminate redundant operations.
Best for: High-throughput operators with frequent repeated requests on the same model.
Source: chengzeyi/stable-fast on GitHub
- Speedup: Up to 25% faster inference
- Quality impact: None
- Tradeoff: First inference is slower (compilation overhead). Subsequent runs are faster.
SFAST optimization flag
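A sketch of a warm diffusion entry with SFAST enabled. The boolean flag format inside optimization_flags is an assumption; verify it against the field reference above. Model ID and price are illustrative.

```json
{
  "pipeline": "text-to-image",
  "model_id": "SG161222/RealVisXL_V4.0_Lightning",
  "price_per_unit": 4768371,
  "warm": true,
  "optimization_flags": {
    "SFAST": true
  }
}
```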
DEEPCACHE — Deferred computation (up to 50% faster)
Caches intermediate diffusion steps to reduce redundant recomputation across inference calls.
Skip Lightning and Turbo models here. These models are already step-optimised for 1–4 inference steps. Applying DEEPCACHE to them degrades output quality without a clear speed benefit.
Source: DeepCache paper and implementation
- Speedup: Up to 50% faster inference
- Quality impact: Minor (slight reduction in fine detail at high step counts)
- Tradeoff: Quality degradation is more noticeable at low step counts.
DEEPCACHE optimization flag
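A sketch pairing DEEPCACHE with a non-Lightning model, consistent with the warning above. The flag format is an assumption to verify against the field reference; the price is a placeholder.

```json
{
  "pipeline": "text-to-image",
  "model_id": "stabilityai/stable-diffusion-xl-base-1.0",
  "price_per_unit": 4768371,
  "warm": true,
  "optimization_flags": {
    "DEEPCACHE": true
  }
}
```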
Running multiple pipelines
A complete multi-pipeline aiModels.json for a node with one RTX 4090 (24 GB) and one RTX 2060 (8 GB):
Multi-pipeline aiModels.json example
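One sketch of how such a file might look. Model IDs and prices are illustrative placeholders, and the llm entry assumes an external Ollama runner as described in the LLM section above.

```json
[
  {
    "pipeline": "text-to-image",
    "model_id": "SG161222/RealVisXL_V4.0_Lightning",
    "price_per_unit": 4768371,
    "warm": true
  },
  {
    "pipeline": "image-to-image",
    "model_id": "ByteDance/SDXL-Lightning",
    "price_per_unit": 4768371,
    "warm": false
  },
  {
    "pipeline": "audio-to-text",
    "model_id": "openai/whisper-large-v3",
    "price_per_unit": 4768371,
    "warm": true
  },
  {
    "pipeline": "llm",
    "model_id": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "url": "http://llm_runner:8000",
    "warm": true
  }
]
```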
- RTX 4090: text-to-image warm; image-to-image loads cold on demand.
- RTX 2060: audio-to-text and llm warm (both are low-VRAM pipelines that fit within 8 GB).
BYOC external containers
The url field in any aiModels.json entry points to an external container that handles inference for that pipeline. The AI worker passes jobs through and polls the container’s /health endpoint at startup.
BYOC audio-to-text aiModels.json entry
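A sketch of an entry that delegates audio-to-text to an external container. The hostname and price are placeholders; capacity here assumes a container that can serve four concurrent transcriptions.

```json
{
  "pipeline": "audio-to-text",
  "model_id": "openai/whisper-large-v3",
  "price_per_unit": 4768371,
  "url": "http://my-whisper-host:8000",
  "capacity": 4
}
```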
capacity sets how many concurrent jobs the external container handles. Set it to match the container’s actual concurrency support. Default is 1.
External containers must:
- Expose a /health endpoint that returns HTTP 200
- Handle inference requests in the format the AI worker sends (same contract as livepeer/ai-runner)
Common external-container setups:
- Ollama runner (as above)
- Custom PyTorch / TensorRT / ONNX inference servers
- K8s clusters or GPU farms behind a load balancer
- Auto-scaling stacks (Docker Swarm, Nomad, Podman)
Pricing
AI inference pricing on Livepeer is set by operators and advertised on-chain. Gateways filter by maxPricePerUnit — jobs only reach orchestrators whose price falls below the gateway’s maximum.
Pricing units by pipeline
Setting competitive prices
Wei AI pricing example
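An illustrative entry using the wei price discussed in this section; the model ID is an example, not a recommendation.

```json
{
  "pipeline": "text-to-image",
  "model_id": "SG161222/RealVisXL_V4.0_Lightning",
  "price_per_unit": 4768371,
  "warm": true
}
```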
4768371 Wei is approximately 0.0005 USD per megapixel at ETH/USD rates from late 2025. To express prices directly in USD:
USD AI pricing example
Monitoring your pipelines
Check container health:
List AI runner containers
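A sketch, assuming runner containers carry ai-runner in their names (adjust the filter to match your naming):

```shell
# Show AI runner containers with their status and ports
docker ps --filter "name=ai-runner" --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
```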
Healthy containers show an Up status. Containers in a restart loop need an immediate log check:
Inspect AI runner logs
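Follow the logs of a single runner, substituting a name from the docker ps output:

```shell
# Stream the last 100 lines and follow new output for one runner container
docker logs -f --tail 100 <container_name>
```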
Troubleshooting
Primary NVIDIA toolkit reference for this section: NVIDIA Container Toolkit install guide.
Start the AI runner container
Most common causes:
- Wrong image tag — verify the livepeer/ai-runner image tag exists on Docker Hub. The -aiRunnerImage flag is deprecated; use -aiRunnerImageOverrides instead.
- VRAM OOM — the container starts, then crashes immediately after loading because warm: true exceeds available VRAM. Check docker logs <container_name> for OOM messages.
- NVIDIA Container Toolkit missing or misconfigured — run docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi. A passing result confirms the toolkit is working. Installation guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
Fix model ID loading errors
model_id must match the HuggingFace model ID exactly, including capitalisation and the / separator. Common mistakes:
- Lowercase when the actual ID is mixed case
- Missing the organisation prefix (RealVisXL_V4.0_Lightning instead of SG161222/RealVisXL_V4.0_Lightning)
- Using an Ollama tag (llama3.1:8b) directly as model_id instead of the HuggingFace ID
The AI worker polls each runner’s /health endpoint at startup. A model that is still downloading fails the health check and the entry is skipped.
Pipeline receiving no jobs
- Registration missing — confirm your capabilities appear on tools.livepeer.cloud/ai/network-capabilities. Missing entries usually mean the orchestrator needs to re-register capabilities after updating aiModels.json.
- Price too high — gateways don’t route to orchestrators above their maxPricePerUnit. Compare your price against active competitors on Livepeer Explorer.
- Model is cold — for competitive pipelines like text-to-image, set warm: true.
- Active-set gap — check your stake status on explorer.livepeer.org. AI pipeline jobs require the orchestrator to be in the active set.
OOM during inference
The model loaded successfully but a specific request causes an out-of-memory error mid-run. This typically happens when a request asks for unusually large output dimensions (e.g. text-to-image at 2048×2048 on a 24 GB GPU). Mitigations:
- Reduce maxSessions on your AI worker to limit concurrent jobs
- Set "capacity": 1 in the affected aiModels.json entry
- Consider DEEPCACHE or SFAST to reduce peak VRAM usage (diffusion pipelines only)
Restore Ollama LLM job flow
- Verify container reachability: from the host running your orchestrator, run curl http://llm_runner:8000/health — it should return HTTP 200
- Check Docker network: the orchestrator and llm_runner container must share a Docker network for the hostname to resolve
- Re-register capabilities with the network after updating aiModels.json
- Confirm on tools.livepeer.cloud/ai/network-capabilities that your orchestrator appears under the llm pipeline
SFAST causing first-request latency
This is expected behaviour. SFAST compiles the model graph on the first inference call, which takes longer than normal. Subsequent calls benefit from the compiled graph. If the compilation delay causes first-request job failures, disable SFAST and rely on native diffusion speed.
Watch: Batch AI on Livepeer
Canonical references for pipeline and model decisions
When configuring aiModels.json, two external references are authoritative:
For supported models and pipeline compatibility: The AI Model Support page in the Developers section lists every pipeline type, supported model architectures, minimum VRAM, and current network status. This is the single source of truth for “will this model work on the network?” Use it before experimenting with untested model IDs.
For understanding how gateways select your node: The Orchestrator Offerings reference documents the capability discovery protocol — specifically the capabilities_prices field structure and how gateways evaluate your node against their -maxPricePerCapability configuration. Before setting prices, confirm your prices fall within ranges that major gateways will accept.
For custom models outside the standard pipeline list: Bring Your Own Container (BYOC) covers building a custom Docker container with PyTrickle integration to run any model on the network. BYOC is the path for proprietary models, fine-tuned checkpoints, or models with non-standard inference architectures.
Related
AI Model Support
Canonical list of supported pipelines, model architectures, VRAM requirements, and network status. The authoritative reference before adding a new model.
Bring Your Own Container (BYOC)
Run any custom model on the network using PyTrickle — for models not covered by the standard AI runner containers.
Orchestrator Offerings Reference
How gateways discover orchestrators and evaluate capability/pricing — the selection algorithm that determines whether your node receives jobs.
Model Hosting and VRAM Planning
VRAM table by pipeline, warm model strategy, and multi-GPU configuration.
Cascade Setup
Deploy live-video-to-video pipelines for live streaming AI effects.
AI Workloads Overview
Pipeline types, batch vs live-video AI, and how jobs flow to your node.