This guide covers configuring aiModels.json, choosing and loading models, connecting external runners where needed, setting prices, and checking health and routing once the worker is live.
Use this guide once your orchestrator node is already running and connected to the network. Nodes still in initial setup should start with Run an Orchestrator.
Prerequisites
Before configuring AI pipelines, ensure:
- go-livepeer is running with the -aiWorker flag enabled
- NVIDIA Container Toolkit is installed and working (docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi)
- Docker is running with GPU access
- You have a ~/.lpData/aiModels.json file or know where you want to create one
How the AI worker runs pipelines
When go-livepeer starts with -aiWorker, it reads aiModels.json and starts a Docker container for each configured pipeline from the livepeer/ai-runner image. All batch pipelines use this image except llm, which runs on a separate Ollama-based runner. The AI worker manages container lifecycle: starting, health-checking, and restarting containers automatically.
aiModels.json — full reference
aiModels.json is the single file that controls everything about your AI worker: which pipelines you run, which models you load, whether they stay warm in VRAM, and how you price each job.
Default location: ~/.lpData/aiModels.json
Override location: set with -aiModels flag at startup
Minimal working example
Minimal aiModels.json example
A text-to-image pipeline with a competitive warm model.
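A sketch of what such a file can look like: one text-to-image entry with a warm model. The model ID comes from the examples elsewhere in this guide; the price is the illustrative wei value from the Pricing section, not a recommendation.

```json
[
  {
    "pipeline": "text-to-image",
    "model_id": "SG161222/RealVisXL_V4.0_Lightning",
    "price_per_unit": 4768371,
    "warm": true
  }
]
```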
Complete field reference
Supported pipelines and recommended models
text-to-image
Generate images from text prompts. The highest-demand pipeline on the network.
text-to-image aiModels.json entry
Alternative models:
- ByteDance/SDXL-Lightning — similar performance, different base
- stabilityai/stable-diffusion-xl-base-1.0 — higher quality, slower
image-to-image
Apply diffusion-based transformations, style transfer, or enhancement to an input image.
image-to-image aiModels.json entry
This pipeline can share GPU capacity with text-to-image when cold-loading is acceptable.
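An illustrative entry, left cold to save VRAM. The model ID is one of the SDXL-family models named in the text-to-image section; the price is a placeholder.

```json
{
  "pipeline": "image-to-image",
  "model_id": "ByteDance/SDXL-Lightning",
  "price_per_unit": 4768371,
  "warm": false
}
```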
image-to-video
Animate a still image into a short video clip. Compute-intensive — expect longer per-job times.
image-to-video aiModels.json entry
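A sketch of an entry for this pipeline. The Stable Video Diffusion model ID here is an assumption; confirm the supported model against the AI Model Support page before using it. The price is a placeholder.

```json
{
  "pipeline": "image-to-video",
  "model_id": "stabilityai/stable-video-diffusion-img2vid-xt-1-1",
  "price_per_unit": 4768371,
  "warm": false
}
```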
image-to-text
Generate text descriptions of images. Accessible to operators with older or lower-end GPUs.
image-to-text aiModels.json entry
Low VRAM needs make it practical to run image-to-text and audio-to-text side by side on one GPU.
Source: Salesforce/blip-image-captioning-large
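An entry using the BLIP model cited above; the price is a placeholder.

```json
{
  "pipeline": "image-to-text",
  "model_id": "Salesforce/blip-image-captioning-large",
  "price_per_unit": 4768371,
  "warm": true
}
```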
audio-to-text
Speech recognition and transcription with timestamps. Backed by Whisper-large-v3.
audio-to-text aiModels.json entry
openai/whisper-large-v3 is the current network standard for accuracy and is the model most gateway operators request.
Source: openai/whisper-large-v3
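An entry using the Whisper model cited above, kept warm since it fits in low VRAM; the price is a placeholder.

```json
{
  "pipeline": "audio-to-text",
  "model_id": "openai/whisper-large-v3",
  "price_per_unit": 4768371,
  "warm": true
}
```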
segment-anything-2
Promptable segmentation — returns pixel masks for objects or regions in an image or video frame.
segment-anything-2 aiModels.json entry
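An illustrative entry. The SAM 2 model ID shown is an assumption; check the AI Model Support page for the currently supported variant. The price is a placeholder.

```json
{
  "pipeline": "segment-anything-2",
  "model_id": "facebook/sam2-hiera-large",
  "price_per_unit": 4768371,
  "warm": false
}
```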
upscale
Upscale low-resolution images to high resolution using diffusion-based super-resolution.
upscale aiModels.json entry
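An illustrative entry; the diffusion upscaler model ID is an assumption to verify against the AI Model Support page, and the price is a placeholder.

```json
{
  "pipeline": "upscale",
  "model_id": "stabilityai/stable-diffusion-x4-upscaler",
  "price_per_unit": 4768371,
  "warm": false
}
```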
text-to-speech
Text-to-natural-speech synthesis. Growing use case for AI video narration.
text-to-speech aiModels.json entry
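An illustrative entry. The Parler-TTS model ID is an assumption, not confirmed by this guide; substitute whatever the AI Model Support page lists for text-to-speech. The price is a placeholder.

```json
{
  "pipeline": "text-to-speech",
  "model_id": "parler-tts/parler-tts-large-v1",
  "price_per_unit": 4768371,
  "warm": false
}
```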
LLM inference — the Ollama runner
The llm pipeline uses a different architecture from all other batch pipelines. Instead of the standard livepeer/ai-runner container, it uses an Ollama-based runner maintained by Cloud SPE. This enables quantised LLMs to run on GPUs with as little as 8 GB VRAM.
LLM pipeline flow
Why Ollama?
Standard diffusion pipelines require 24 GB VRAM and server-class GPUs. The Ollama runner opens participation to older consumer GPUs (GTX 1080, RTX 2060) that would otherwise contribute nothing to the AI network. Quantised LLMs — especially 7B and 8B parameter models — run efficiently within 8–12 GB VRAM.
Source: tztcloud/livepeer-ollama-runner on Docker Hub · Cloud SPE LLM pipeline guide
Setup
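A sketch of an llm entry that delegates to an external Ollama-based runner via the url field. The hostname matches the llm_runner example used in the troubleshooting section; the model ID and capacity are assumptions for illustration. Note that model_id stays a HuggingFace-style ID, not an Ollama tag.

```json
{
  "pipeline": "llm",
  "model_id": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "url": "http://llm_runner:8000",
  "warm": true,
  "capacity": 2
}
```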
Supported models via Ollama (at time of writing):
Warm vs cold models
Warm: The model is preloaded into GPU VRAM at container startup. Any job request is served immediately — no model loading latency.
Cold: The model is loaded on first request. The container runs, but the weights stay on disk until the first request triggers a model load, typically 10–60 seconds depending on model size and NVMe speed.
Impact on job assignment
Gateways track orchestrator latency. Nodes with fast first-response times win more jobs. For latency-sensitive pipelines — especially text-to-image and image-to-image — running cold puts you at a clear competitive disadvantage.
Rule of thumb: Warm your primary revenue pipeline. Cold the rest.
VRAM planning for warm models
A 24 GB GPU supports one large diffusion model warm, or a combination of smaller pipelines simultaneously. See Model Hosting and VRAM Planning for multi-model patterns.
Optimisation flags
optimization_flags apply only to warm: true diffusion models (text-to-image, image-to-image, upscale). Both flags are experimental. Primary references: Stable Fast and DeepCache.
SFAST — Stable Fast (up to 25% faster)
Enables the Stable Fast optimisation framework. Compiles the diffusion model’s compute graph on first run to eliminate redundant operations.
Best for: High-throughput operators with frequent repeated requests on the same model.
Source: chengzeyi/stable-fast on GitHub
- Speedup: Up to 25% faster inference
- Quality impact: None
- Tradeoff: First inference is slower (compilation overhead). Subsequent runs are faster.
SFAST optimization flag
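A sketch of a warm diffusion entry with SFAST enabled. The boolean flag format inside optimization_flags is an assumption; verify it against the field reference above. Model ID and price are illustrative.

```json
{
  "pipeline": "text-to-image",
  "model_id": "SG161222/RealVisXL_V4.0_Lightning",
  "price_per_unit": 4768371,
  "warm": true,
  "optimization_flags": {
    "SFAST": true
  }
}
```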
DEEPCACHE — Deferred computation (up to 50% faster)
Caches intermediate diffusion steps to reduce redundant recomputation across inference calls.
Skip Lightning and Turbo models here. These models are already step-optimised for 1–4 inference steps. Applying DEEPCACHE to them degrades output quality without a clear speed benefit.
Source: DeepCache paper and implementation
- Speedup: Up to 50% faster inference
- Quality impact: Minor (slight reduction in fine detail at high step counts)
- Tradeoff: Quality degradation is more noticeable at low step counts.
DEEPCACHE optimization flag
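A sketch pairing DEEPCACHE with a non-Lightning model, consistent with the warning above. The flag format is an assumption to verify against the field reference; the price is a placeholder.

```json
{
  "pipeline": "text-to-image",
  "model_id": "stabilityai/stable-diffusion-xl-base-1.0",
  "price_per_unit": 4768371,
  "warm": true,
  "optimization_flags": {
    "DEEPCACHE": true
  }
}
```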
Running multiple pipelines
A complete multi-pipeline aiModels.json for a node with one RTX 4090 (24 GB) and one RTX 2060 (8 GB):
Multi-pipeline aiModels.json example
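One sketch of how such a file might look. Model IDs and prices are illustrative placeholders, and the llm entry assumes an external Ollama runner as described in the LLM section above.

```json
[
  {
    "pipeline": "text-to-image",
    "model_id": "SG161222/RealVisXL_V4.0_Lightning",
    "price_per_unit": 4768371,
    "warm": true
  },
  {
    "pipeline": "image-to-image",
    "model_id": "ByteDance/SDXL-Lightning",
    "price_per_unit": 4768371,
    "warm": false
  },
  {
    "pipeline": "audio-to-text",
    "model_id": "openai/whisper-large-v3",
    "price_per_unit": 4768371,
    "warm": true
  },
  {
    "pipeline": "llm",
    "model_id": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "url": "http://llm_runner:8000",
    "warm": true
  }
]
```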
- RTX 4090: text-to-image warm; image-to-image loads cold on demand.
- RTX 2060: audio-to-text and llm warm (both are low-VRAM pipelines that fit within 8 GB).
BYOC external containers
The url field in any aiModels.json entry points to an external container that handles inference for that pipeline. The AI worker passes jobs through and polls the container’s /health endpoint at startup.
BYOC audio-to-text aiModels.json entry
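A sketch of an entry that delegates audio-to-text to an external container. The hostname and price are placeholders; capacity here assumes a container that can serve four concurrent transcriptions.

```json
{
  "pipeline": "audio-to-text",
  "model_id": "openai/whisper-large-v3",
  "price_per_unit": 4768371,
  "url": "http://my-whisper-host:8000",
  "capacity": 4
}
```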
capacity sets how many concurrent jobs the external container handles. Set it to match the container’s actual concurrency support. Default is 1.
External containers must:
- Expose a /health endpoint that returns HTTP 200
- Handle inference requests in the format the AI worker sends (same contract as livepeer/ai-runner)
Common external-container setups:
- Ollama runner (as above)
- Custom PyTorch / TensorRT / ONNX inference servers
- K8s clusters or GPU farms behind a load balancer
- Auto-scaling stacks (Docker Swarm, Nomad, Podman)
Pricing
AI inference pricing on Livepeer is set by operators and advertised on-chain. Gateways filter by maxPricePerUnit — jobs only reach orchestrators whose price falls below the gateway’s maximum.
Pricing units by pipeline
Setting competitive prices
Wei AI pricing example
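An illustrative entry using the wei price discussed in this section; the model ID is an example, not a recommendation.

```json
{
  "pipeline": "text-to-image",
  "model_id": "SG161222/RealVisXL_V4.0_Lightning",
  "price_per_unit": 4768371,
  "warm": true
}
```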
4768371 Wei is approximately 0.0005 USD per megapixel at ETH/USD rates from late 2025. To express prices directly in USD:
USD AI pricing example
Monitoring your pipelines
Check container health:
List AI runner containers
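A sketch, assuming runner containers carry ai-runner in their names (adjust the filter to match your naming):

```shell
# Show AI runner containers with their status and ports
docker ps --filter "name=ai-runner" --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
```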
Healthy containers show an Up status. Containers in a restart loop need an immediate log check:
Inspect AI runner logs
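Follow the logs of a single runner, substituting a name from the docker ps output:

```shell
# Stream the last 100 lines and follow new output for one runner container
docker logs -f --tail 100 <container_name>
```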
Troubleshooting
Primary NVIDIA toolkit reference for this section: NVIDIA Container Toolkit install guide.
Start the AI runner container
Most common causes:
- Wrong image tag — verify the livepeer/ai-runner image tag exists on Docker Hub. The -aiRunnerImage flag is deprecated; use -aiRunnerImageOverrides instead.
- VRAM OOM — the container starts, then crashes immediately after loading because warm: true exceeds available VRAM. Check docker logs <container_name> for OOM messages.
- NVIDIA Container Toolkit missing or misconfigured — run docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi. A passing result confirms the toolkit is working. Installation guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
Fix model ID loading errors
model_id must match the HuggingFace model ID exactly, including capitalisation and the / separator. Common mistakes:
- Lowercase when the actual ID is mixed case
- Missing the organisation prefix (RealVisXL_V4.0_Lightning instead of SG161222/RealVisXL_V4.0_Lightning)
- Using an Ollama tag (llama3.1:8b) directly as model_id instead of the HuggingFace ID
The AI worker polls each runner’s /health endpoint at startup. A model that is still downloading fails the health check and the entry is skipped.
Pipeline receiving no jobs
- Registration missing — confirm your capabilities appear on tools.livepeer.cloud/ai/network-capabilities. Missing entries usually mean the orchestrator needs to re-register capabilities after updating aiModels.json.
- Price too high — gateways don’t route to orchestrators above their maxPricePerUnit. Compare your price against active competitors on Livepeer Explorer.
- Model is cold — for competitive pipelines like text-to-image, set warm: true.
- Active-set gap — check your stake status on explorer.livepeer.org. AI pipeline jobs require the orchestrator to be in the active set.
OOM during inference
The model loaded successfully but a specific request causes an out-of-memory error mid-run. This typically happens when a request asks for unusually large output dimensions (e.g. text-to-image at 2048×2048 on a 24 GB GPU). Mitigations:
- Reduce maxSessions on your AI worker to limit concurrent jobs
- Set "capacity": 1 in the affected aiModels.json entry
- Consider DEEPCACHE or SFAST to reduce peak VRAM usage (diffusion pipelines only)
Restore Ollama LLM job flow
- Verify container reachability: from the host running your orchestrator, run curl http://llm_runner:8000/health — it should return HTTP 200
- Check Docker network: the orchestrator and llm_runner container must share a Docker network for the hostname to resolve
- Re-register capabilities with the network after updating aiModels.json
- Confirm on tools.livepeer.cloud/ai/network-capabilities that your orchestrator appears under the llm pipeline
SFAST causing first-request latency
This is expected behaviour. SFAST compiles the model graph on the first inference call, which takes longer than normal. Subsequent calls benefit from the compiled graph. If the compilation delay causes first-request job failures, disable SFAST and rely on native diffusion speed.
Watch: Batch AI on Livepeer
Canonical references for pipeline and model decisions
When configuring aiModels.json, two external references are authoritative:
For supported models and pipeline compatibility: The AI Model Support page in the Developers section lists every pipeline type, supported model architectures, minimum VRAM, and current network status. This is the single source of truth for “will this model work on the network?” Use it before experimenting with untested model IDs.
For understanding how gateways select your node: The Orchestrator Offerings reference documents the capability discovery protocol — specifically the capabilities_prices field structure and how gateways evaluate your node against their -maxPricePerCapability configuration. Before setting prices, confirm your prices fall within ranges that major gateways will accept.
For custom models outside the standard pipeline list: Bring Your Own Container (BYOC) covers building a custom Docker container with PyTrickle integration to run any model on the network. BYOC is the path for proprietary models, fine-tuned checkpoints, or models with non-standard inference architectures.
Related
AI Model Support
Canonical list of supported pipelines, model architectures, VRAM requirements, and network status. The authoritative reference before adding a new model.
Bring Your Own Container (BYOC)
Run any custom model on the network using PyTrickle — for models not covered by the standard AI runner containers.
Orchestrator Offerings Reference
How gateways discover orchestrators and evaluate capability/pricing — the selection algorithm that determines whether your node receives jobs.
Model Hosting and VRAM Planning
VRAM table by pipeline, warm model strategy, and multi-GPU configuration.
Cascade Setup
Deploy live-video-to-video pipelines for live streaming AI effects.
AI Workloads Overview
Pipeline types, batch vs live-video AI, and how jobs flow to your node.