Batch AI inference is the most accessible entry point to the Livepeer AI network. An application sends a request — a text prompt, an image, an audio file — and your node processes it and returns the result. You earn per-unit fees for every successful job. Run batch AI by configuring aiModels.json, choosing and loading models, connecting external runners where needed, setting prices, and checking health and routing once the worker is live.
Use this guide once your orchestrator node is already running and connected to the network. Nodes still in initial setup should start with Run an Orchestrator.

Prerequisites

Before configuring AI pipelines, ensure:
  • go-livepeer is running with the -aiWorker flag enabled
  • NVIDIA Container Toolkit is installed and working (docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi)
  • Docker is running with GPU access
  • You have a ~/.lpData/aiModels.json file or know where you want to create one

How the AI worker runs pipelines

When go-livepeer starts with -aiWorker, it reads aiModels.json and starts Docker containers for each configured pipeline:
go-livepeer
    ↓ reads aiModels.json
    ↓ pulls livepeer/ai-runner containers
GPU containers start per pipeline entry
    ↓ each container loads its model
AI worker advertises capabilities to network
    ↓
Gateways start routing matching jobs
The standard container image is livepeer/ai-runner. Except for the llm pipeline — which uses a separate Ollama-based runner — all batch pipelines use this image. The AI worker manages container lifecycle: starting, health-checking, and restarting containers automatically.

aiModels.json — full reference

aiModels.json is the single file that controls everything about your AI worker: which pipelines you run, which models you load, whether they stay warm in VRAM, and how you price each job.
Default location: ~/.lpData/aiModels.json
Override location: set with the -aiModels flag at startup

Minimal working example

Minimal aiModels.json example
[
  {
    "pipeline": "text-to-image",
    "model_id": "SG161222/RealVisXL_V4.0_Lightning",
    "price_per_unit": 4768371,
    "warm": true
  }
]
This single entry is enough to start earning from the text-to-image pipeline with a competitive warm model.
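A malformed aiModels.json stops pipelines from loading, so it is worth sanity-checking the file before restarting the node. A minimal Python sketch; the required-field list is an assumption based on the examples in this guide, not an official schema:

```python
import json

# Assumed minimum fields, inferred from the examples in this guide.
REQUIRED = ("pipeline", "model_id", "price_per_unit")

def validate_ai_models(path):
    """Return a list of problems found in an aiModels.json file."""
    with open(path) as f:
        entries = json.load(f)  # raises ValueError on malformed JSON
    if not isinstance(entries, list):
        return ["top-level value must be a JSON array"]
    problems = []
    for i, entry in enumerate(entries):
        for field in REQUIRED:
            if field not in entry:
                problems.append(f"entry {i}: missing '{field}'")
    return problems
```

Run it against your config path (e.g. `validate_ai_models("/root/.lpData/aiModels.json")`); an empty list means the basic shape is sound.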

Complete field reference

model_id must match the HuggingFace model ID exactly, including capitalisation and the organisation prefix. A typo here will cause the container to fail at model load time. During Beta, only one warm model per GPU is supported — setting warm: true on more entries than you have GPUs will cause a conflict at startup.

text-to-image

Generate images from text prompts. The highest-demand pipeline on the network.
text-to-image aiModels.json entry
{
  "pipeline": "text-to-image",
  "model_id": "SG161222/RealVisXL_V4.0_Lightning",
  "price_per_unit": 4768371,
  "warm": true
}
VRAM requirement: 24 GB
Pricing unit: Per output pixel
Why Lightning? SDXL-Lightning reduces inference to 4 steps (vs 20–50 for standard SDXL), delivering results in under 2 seconds on an RTX 4090. Gateways and users strongly prefer fast models for this pipeline.
Alternative models:
  • ByteDance/SDXL-Lightning — similar performance, different base
  • stabilityai/stable-diffusion-xl-base-1.0 — higher quality, slower
Source: SG161222/RealVisXL_V4.0_Lightning · ByteDance/SDXL-Lightning

image-to-image

Apply diffusion-based transformations, style transfer, or enhancement to an input image.
image-to-image aiModels.json entry
{
  "pipeline": "image-to-image",
  "model_id": "ByteDance/SDXL-Lightning",
  "price_per_unit": 4768371
}
VRAM requirement: 24 GB
Pricing unit: Per output pixel
Note: This fits on the same 24 GB GPU as text-to-image when cold-loading is acceptable.

image-to-video

Animate a still image into a short video clip. Compute-intensive — expect longer per-job times.
image-to-video aiModels.json entry
{
  "pipeline": "image-to-video",
  "model_id": "stabilityai/stable-video-diffusion-img2vid-xt-1-1",
  "price_per_unit": 9536742
}
VRAM requirement: 24 GB (32 GB or multi-GPU preferred for longer clips)
Pricing unit: Per output pixel

image-to-text

Generate text descriptions of images. Accessible to operators with older or lower-end GPUs.
image-to-text aiModels.json entry
{
  "pipeline": "image-to-text",
  "model_id": "Salesforce/blip-image-captioning-large",
  "price_per_unit": 1192093,
  "warm": true
}
VRAM requirement: 4 GB
Pricing unit: Per input pixel
Why this matters: Operators with 8–12 GB VRAM GPUs still contribute to the network through image-to-text and audio-to-text.
Source: Salesforce/blip-image-captioning-large

audio-to-text

Speech recognition and transcription with timestamps. Backed by Whisper-large-v3.
audio-to-text aiModels.json entry
{
  "pipeline": "audio-to-text",
  "model_id": "openai/whisper-large-v3",
  "price_per_unit": 12882811,
  "pixels_per_unit": 1,
  "warm": true
}
VRAM requirement: 12 GB
Pricing unit: Per millisecond of audio
Model: openai/whisper-large-v3 is the current network standard for accuracy and is the model most gateway operators request.
Source: openai/whisper-large-v3
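Because this pipeline prices per millisecond of audio, per-job fees are easy to estimate. A small sketch, assuming price_per_unit is denominated in wei per millisecond as described above:

```python
WEI_PER_ETH = 10**18
WEI_PER_MS = 12_882_811  # price_per_unit from the example entry above

def transcription_cost_eth(audio_seconds: float) -> float:
    """Estimated fee in ETH for transcribing a clip of the given length."""
    return WEI_PER_MS * audio_seconds * 1000 / WEI_PER_ETH

# A one-hour recording:
# transcription_cost_eth(3600) ≈ 4.6e-5 ETH
```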

segment-anything-2

Promptable segmentation — returns pixel masks for objects or regions in an image or video frame.
segment-anything-2 aiModels.json entry
{
  "pipeline": "segment-anything-2",
  "model_id": "facebook/sam2-hiera-large",
  "price_per_unit": 4768371
}
VRAM requirement: 12–24 GB depending on model variant
Pricing unit: Per input pixel
Source: facebookresearch/segment-anything-2

upscale

Upscale low-resolution images to high resolution using diffusion-based super-resolution.
upscale aiModels.json entry
{
  "pipeline": "upscale",
  "model_id": "stabilityai/stable-diffusion-x4-upscaler",
  "price_per_unit": 4768371,
  "warm": true,
  "optimization_flags": {
    "SFAST": true
  }
}
VRAM requirement: 16–24 GB
Pricing unit: Per input pixel

text-to-speech

Text-to-natural-speech synthesis. Growing use case for AI video narration.
text-to-speech aiModels.json entry
{
  "pipeline": "text-to-speech",
  "model_id": "suno/bark",
  "price_per_unit": 5960465
}
Pricing unit: Per character or per millisecond of output audio

LLM inference — the Ollama runner

The llm pipeline uses a different architecture from all other batch pipelines. Instead of the standard livepeer/ai-runner container, it uses an Ollama-based runner maintained by Cloud SPE. This enables quantised LLMs to run on GPUs with as little as 8 GB VRAM.
LLM pipeline flow
go-livepeer -> livepeer-ollama-runner -> ollama container -> quantised model

Why Ollama?

Standard diffusion pipelines require 24 GB VRAM and server-class GPUs. The Ollama runner opens participation to older consumer GPUs (GTX 1080, RTX 2060) that would otherwise contribute nothing to the AI network. Quantised LLMs — especially 7B and 8B parameter models — run efficiently within 8–12 GB VRAM.
Source: tztcloud/livepeer-ollama-runner on Docker Hub · Cloud SPE LLM pipeline guide

Setup

Supported models via Ollama (at time of writing):

Warm vs cold models

Warm: The model is preloaded into GPU VRAM at container startup. Any job request is served immediately, with no model-loading latency.
Cold: The model is loaded on first request. The container runs, but the weights stay on disk until the first request triggers a model load, typically 10–60 seconds depending on model size and NVMe speed.

Impact on job assignment

Gateways track orchestrator latency. Nodes with fast first-response times win more jobs. For latency-sensitive pipelines — especially text-to-image and image-to-image — running cold puts you at a clear competitive disadvantage. Rule of thumb: Warm your primary revenue pipeline. Cold the rest.
Beta constraint: Only one warm model per GPU is supported during the Beta phase. Setting warm: true on more entries than you have GPUs makes the AI worker log a conflict error at startup and skip the excess entries. Check logs for Error loading warm model to identify conflicts.
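A quick pre-flight check can catch this before startup. A sketch that counts warm entries against your GPU count (the GPU count is passed in explicitly, for example the line count of `nvidia-smi --list-gpus`):

```python
def warm_conflicts(entries, gpu_count):
    """Count warm entries beyond the one-warm-model-per-GPU Beta limit."""
    warm = sum(1 for e in entries if e.get("warm"))
    return max(0, warm - gpu_count)
```

Feed it the parsed aiModels.json array; a non-zero result means that many warm entries will be skipped at startup.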

VRAM planning for warm models

A 24 GB GPU supports one large diffusion model warm, or a combination of smaller pipelines simultaneously. See Model Hosting and VRAM Planning for multi-model patterns.

Optimisation flags

optimization_flags apply only to warm: true diffusion models (text-to-image, image-to-image, upscale). Both flags are experimental. Primary references: Stable Fast and DeepCache.
SFAST
Enables the Stable Fast optimisation framework. Compiles the diffusion model’s compute graph on first run to eliminate redundant operations.
  • Speedup: Up to 25% faster inference
  • Quality impact: None
  • Tradeoff: First inference is slower (compilation overhead). Subsequent runs are faster.
SFAST optimization flag
"optimization_flags": { "SFAST": true }
Best for: High-throughput operators with frequent repeated requests on the same model.
Source: chengzeyi/stable-fast on GitHub
DEEPCACHE
Caches intermediate diffusion steps to reduce redundant recomputation across inference calls.
  • Speedup: Up to 50% faster inference
  • Quality impact: Minor (slight reduction in fine detail at high step counts)
  • Tradeoff: Quality degradation is more noticeable at low step counts.
DEEPCACHE optimization flag
"optimization_flags": { "DEEPCACHE": true }
Skip Lightning and Turbo models here. These models are already step-optimised for 1–4 inference steps. Applying DEEPCACHE to them degrades output quality without a clear speed benefit.
Source: DeepCache paper and implementation
SFAST and DEEPCACHE cannot be combined. Choose one or neither.
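This constraint is easy to miss in a long config. A small lint sketch that rejects entries setting both flags:

```python
def check_flags(entries):
    """Return indexes of entries that set both SFAST and DEEPCACHE,
    which cannot be combined."""
    bad = []
    for i, entry in enumerate(entries):
        flags = entry.get("optimization_flags", {})
        if flags.get("SFAST") and flags.get("DEEPCACHE"):
            bad.append(i)
    return bad
```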

Running multiple pipelines

A complete multi-pipeline aiModels.json for a node with one RTX 4090 (24 GB) and one RTX 2060 (8 GB):
Multi-pipeline aiModels.json example
[
  {
    "pipeline": "text-to-image",
    "model_id": "SG161222/RealVisXL_V4.0_Lightning",
    "price_per_unit": 4768371,
    "warm": true,
    "optimization_flags": {
      "SFAST": true
    }
  },
  {
    "pipeline": "image-to-image",
    "model_id": "ByteDance/SDXL-Lightning",
    "price_per_unit": 4768371
  },
  {
    "pipeline": "audio-to-text",
    "model_id": "openai/whisper-large-v3",
    "price_per_unit": 12882811,
    "warm": true
  },
  {
    "pipeline": "llm",
    "model_id": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "warm": true,
    "price_per_unit": 0.18,
    "currency": "USD",
    "pixels_per_unit": 1000000,
    "url": "http://llm_runner:8000"
  }
]
In this example:
  • RTX 4090: text-to-image warm. image-to-image loads cold on demand.
  • RTX 2060: audio-to-text and llm warm (both are low-VRAM pipelines that fit within 8 GB).

BYOC external containers

The url field in any aiModels.json entry points to an external container that handles inference for that pipeline. The AI worker passes jobs through and polls the container’s /health endpoint at startup.
BYOC audio-to-text aiModels.json entry
{
  "pipeline": "audio-to-text",
  "model_id": "openai/whisper-large-v3",
  "price_per_unit": 12882811,
  "url": "http://my-whisper-container:8000",
  "token": "optional-bearer-token",
  "capacity": 2
}
capacity sets how many concurrent jobs the external container handles; set it to match the container’s actual concurrency support (default: 1). External containers must:
  1. Expose a /health endpoint that returns HTTP 200
  2. Handle inference requests in the format the AI worker sends (same contract as livepeer/ai-runner)
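To illustrate the first requirement, here is a toy Python stub that answers /health with HTTP 200. It is a sketch of the health contract only; a real BYOC container must also implement the inference endpoints, and the default port of 8000 simply matches the examples above:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading

class HealthHandler(BaseHTTPRequestHandler):
    """Minimal handler satisfying the /health contract: HTTP 200 when ready."""
    def do_GET(self):
        if self.path == "/health":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"OK")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the example quiet

def serve(port=8000):
    """Start the stub server in a background thread and return it."""
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

In a real container you would answer 200 only once model weights are fully loaded, since the AI worker skips entries whose health check fails at startup.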
Common uses:
  • Ollama runner (as above)
  • Custom PyTorch / TensorRT / ONNX inference servers
  • K8s clusters or GPU farms behind a load balancer
  • Auto-scaling stacks (Docker Swarm, Nomad, Podman)
For building and registering custom containers, see Hosting Models (BYOC).

Pricing

AI inference pricing on Livepeer is set by operators and advertised on-chain. Gateways filter by maxPricePerUnit — jobs only reach orchestrators whose price falls below the gateway’s maximum.

Pricing units by pipeline

Setting competitive prices

Wei AI pricing example
{
  "pipeline": "text-to-image",
  "model_id": "SG161222/RealVisXL_V4.0_Lightning",
  "price_per_unit": 4768371
}
4768371 Wei is approximately 0.0005 USD per megapixel at ETH/USD rates from late 2025. To express prices directly in USD:
USD AI pricing example
"price_per_unit": "0.5e-3USD",
"currency": "USD"
Check current competitive pricing on the Livepeer Explorer AI Leaderboard — per-orchestrator earnings data shows which price tiers are earning the most jobs. Prices above the active gateway ceiling receive no jobs.
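When comparing a Wei price against a USD target, the unit chain is: wei per pixel, to ETH per megapixel, to USD per megapixel. A helper sketch under the assumption that price_per_unit is wei per output pixel with pixels_per_unit of 1; the ETH/USD rate is an input you supply, not a network value:

```python
WEI_PER_ETH = 10**18
PIXELS_PER_MEGAPIXEL = 10**6

def usd_per_megapixel(wei_per_pixel: int, eth_usd: float) -> float:
    """Convert a per-pixel Wei price into USD per megapixel
    at a given ETH/USD rate."""
    return wei_per_pixel * PIXELS_PER_MEGAPIXEL / WEI_PER_ETH * eth_usd
```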

Monitoring your pipelines

Check container health:
List AI runner containers
docker ps --filter name=livepeer-ai-runner
All AI runner containers should show Up status. Containers in a restart loop need an immediate log check:
Inspect AI runner logs
docker logs <container_name> --tail 100
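The Up-status check can be automated. A sketch that takes lines produced by `docker ps --format '{{.Names}}\t{{.Status}}'` (standard Docker template syntax) and flags anything not running:

```python
def not_up(ps_lines):
    """Return names of containers whose status does not start with 'Up',
    given tab-separated name/status lines from docker ps."""
    flagged = []
    for line in ps_lines:
        name, _, status = line.partition("\t")
        if not status.startswith("Up"):
            flagged.append(name)
    return flagged
```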
Verify network registration: Visit tools.livepeer.cloud/ai/network-capabilities and search for your orchestrator address. Each pipeline you’ve configured should appear with its status (Warm / Cold). Key log messages to watch:

Troubleshooting

Primary NVIDIA toolkit reference for this section: NVIDIA Container Toolkit install guide.
Containers fail to start
Most common causes:
  1. Wrong image tag — verify the livepeer/ai-runner image tag exists on Docker Hub. The -aiRunnerImage flag is deprecated; use -aiRunnerImageOverrides instead.
  2. VRAM OOM — the container starts, then crashes immediately after loading because warm: true exceeds available VRAM. Check docker logs <container_name> for OOM messages.
  3. NVIDIA Container Toolkit missing or misconfigured — run docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi. A passing result confirms the toolkit path. The installation guide is here as a primary reference: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
Model fails to load
model_id must match the HuggingFace model ID exactly, including capitalisation and the / separator. Common mistakes:
  • Lowercase when the actual ID is mixed case
  • Missing the organisation prefix (RealVisXL_V4.0_Lightning instead of SG161222/RealVisXL_V4.0_Lightning)
  • Using an Ollama tag (llama3.1:8b) directly as model_id instead of the HuggingFace ID
For external containers, download the model inside the container before the AI worker starts. The AI worker polls /health at startup. A model that is still downloading fails the health check and the entry is skipped.
Not receiving jobs
  1. Registration missing — confirm your capabilities appear on tools.livepeer.cloud/ai/network-capabilities. Missing entries usually mean the orchestrator needs to re-register capabilities after updating aiModels.json.
  2. Price too high — gateways don’t route to orchestrators above their maxPricePerUnit. Compare your price against active competitors on Livepeer Explorer.
  3. Model is cold — for competitive pipelines like text-to-image, set warm: true.
  4. Active-set gap — check your stake status on explorer.livepeer.org. AI pipeline jobs require the orchestrator to be in the active set.
Out-of-memory mid-run
The model loaded successfully, but a specific request causes an out-of-memory error mid-run. This happens when a request asks for unusually large output dimensions (e.g. text-to-image at 2048×2048 on a 24 GB GPU).
Mitigations:
  • Reduce maxSessions on your AI worker to limit concurrent jobs
  • Set "capacity": 1 in the affected aiModels.json entry
  • Consider DEEPCACHE or SFAST to reduce peak VRAM usage (diffusion pipelines only)
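For BYOC containers, one further defence is rejecting oversized requests before they reach the model. A sketch, with the budget chosen arbitrarily for illustration and best tuned to your GPU's VRAM:

```python
MAX_OUTPUT_PIXELS = 1024 * 1024  # illustrative budget, not a network limit

def within_budget(width: int, height: int) -> bool:
    """Accept a request only if its output resolution fits the pixel budget."""
    return width * height <= MAX_OUTPUT_PIXELS
```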
LLM jobs not arriving
  1. Verify container reachability: from the host running your orchestrator, run curl http://llm_runner:8000/health — should return HTTP 200
  2. Check Docker network: the orchestrator and llm_runner container must share a Docker network for the hostname to resolve
  3. Re-register capabilities with the network after updating aiModels.json
  4. Confirm on tools.livepeer.cloud/ai/network-capabilities that your orchestrator appears under the llm pipeline
A slow first request after enabling SFAST is expected behaviour. SFAST compiles the model graph on the first inference call, which takes longer than normal. Subsequent calls benefit from the compiled graph. If first requests fail outright rather than just running slowly, disable SFAST and rely on native diffusion speed.


Canonical references for pipeline and model decisions

When configuring aiModels.json, three external references are authoritative:
For supported models and pipeline compatibility: The AI Model Support page in the Developers section lists every pipeline type, supported model architectures, minimum VRAM, and current network status. This is the single source of truth for “will this model work on the network?” Use it before experimenting with untested model IDs.
For understanding how gateways select your node: The Orchestrator Offerings reference documents the capability discovery protocol — specifically the capabilities_prices field structure and how gateways evaluate your node against their -maxPricePerCapability configuration. Before setting prices, confirm your prices fall within ranges that major gateways will accept.
For custom models outside the standard pipeline list: Bring Your Own Container (BYOC) covers building a custom Docker container with PyTrickle integration to run any model on the network. BYOC is the path for proprietary models, fine-tuned checkpoints, or models with non-standard inference architectures.
Last modified on March 16, 2026