GPU memory (VRAM) is the primary constraint for AI inference operators on Livepeer. The models you run, the number of pipelines you keep warm simultaneously, and your latency profile all follow from that VRAM budget. Use this reference for pipeline-level VRAM figures, warm model strategy, multi-GPU patterns, and complete aiModels.json field documentation.

Demand signals

VRAM is only one part of the earning equation. Start with a better question: which pipeline-model combinations are gateways currently routing, and can your hardware keep one of them warm at a competitive price? Weigh those two signals, gateway demand and competitive pricing, together before loading a model.
A lightweight pipeline with visible demand usually beats an impressive model sitting outside current gateway routing. Start from demand, then validate that the warm VRAM footprint fits your GPU with headroom.

VRAM by pipeline

These figures are production estimates based on operator deployments and community benchmarks. Actual usage varies with model variant, batch size, and resolution.
“Warm VRAM” = memory occupied while the model is resident and idle. “Peak inference VRAM” = maximum VRAM during active inference, and it often exceeds the idle footprint because of KV cache, activations, and output buffers.
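The distinction above can be turned into a simple budget check. A minimal Python sketch, with illustrative rather than benchmarked figures; the fits_warm helper and its headroom default are assumptions for illustration only:

```python
def fits_warm(gpu_vram_gb: float, warm_gb: float, peak_gb: float,
              headroom_gb: float = 1.0) -> bool:
    """Return True if a model can stay warm on this GPU with headroom.

    Budget against the peak inference footprint (warm weights plus
    KV cache, activations, and output buffers), not the idle warm
    footprint, and keep some headroom for the runtime itself.
    """
    return max(warm_gb, peak_gb) + headroom_gb <= gpu_vram_gb

# Illustrative figures only: SDXL-class model, ~10 GB warm, ~14 GB peak.
print(fits_warm(24.0, warm_gb=10.0, peak_gb=14.0))  # True on a 24 GB card
print(fits_warm(12.0, warm_gb=10.0, peak_gb=14.0))  # False on a 12 GB card
```

The same check applies to co-warming: sum the warm footprints, then budget against the largest single peak plus headroom.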

GPU reference by persona

Consumer GPU tier (8–12 GB VRAM)

RTX 2060 (6 GB), RTX 3060 (12 GB), RTX 2060 Super (8 GB), RTX 3060 Ti (8 GB)
Viable pipelines:
  • llm — Llama 8B Q4 via Ollama (~6–8 GB)
  • image-to-text — BLIP large (~2 GB)
  • audio-to-text — Whisper large-v3 (~3 GB) ✅ fits on 8 GB cards
  • segment-anything-2 — SAM2 base model (~4–6 GB)
Leave off this tier: text-to-image, image-to-image, image-to-video, live-video-to-video
Strategy for this tier: run audio-to-text and image-to-text warm simultaneously. Both fit easily in 8 GB and together give you two income streams. Add llm on a separate GPU when one is available.
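The co-warm arithmetic for this tier is worth making explicit. The footprints below are the approximate figures from the pipeline list above; the headroom buffer is an assumed value:

```python
# Approximate warm footprints (GB) from the pipeline list above.
whisper_gb = 3.0   # audio-to-text, Whisper large-v3
blip_gb = 2.0      # image-to-text, BLIP large
headroom_gb = 1.5  # assumed buffer for inference peaks

total = whisper_gb + blip_gb + headroom_gb
print(f"{total} GB needed, fits on an 8 GB card: {total <= 8.0}")
# 6.5 GB needed, fits on an 8 GB card: True
```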

Mid tier (16–20 GB VRAM)

RTX 3090 (24 GB, tight in practice once overheads are counted), RTX 3080 Ti (12 GB, insufficient), A5000 (24 GB)
Note: The effective threshold for diffusion pipelines is 24 GB. A nominal 16 GB card leaves insufficient headroom for SDXL warm plus inference peaks.
Viable pipelines on 24 GB:
  • text-to-image — SDXL Lightning warm ✅
  • image-to-image — SDXL Lightning cold (warm clashes with text-to-image)
  • audio-to-text — Whisper warm simultaneously (only 3 GB — fits alongside diffusion)
  • upscale — SD x4 upscaler warm
Strategy: Warm text-to-image as the primary pipeline. Co-warm audio-to-text as a secondary — they barely overlap in VRAM. Cold-load image-to-image on demand.

High tier (24 GB+ per GPU, multiple GPUs)

RTX 4090 (24 GB), A100 40/80 GB, H100 80 GB
Full pipeline access. With multiple GPUs:
  • RTX 4090 × 2: warm SDXL on GPU0, warm Whisper + LLM on GPU1
  • A100 80 GB: warms multiple diffusion models and runs live-video-to-video simultaneously
  • H100: full fleet for multi-stream live-video AI at production scale

Warm vs cold — when it matters

Pipelines where warm is competitively critical

  • text-to-image — cold load on SDXL takes 30–90 seconds. Gateways route to warm competitors first. Running cold puts the node out of contention.
  • live-video-to-video — cold loading mid-stream causes noticeable interruption. Keep this pipeline warm.
  • image-to-image — competitive warm advantage, though less severe than text-to-image.

Pipelines where cold loading is acceptable

  • audio-to-text — Whisper loads in ~3–5 seconds. First-request latency is tolerable for transcription use cases.
  • image-to-text — BLIP is very fast to load. Cold loading is acceptable.
  • segment-anything-2 — cold loading is workable for occasional requests; keep it warm for segmentation workloads that need uninterrupted response.

The Beta warm model constraint

During the Beta phase, only one warm model per GPU is supported. Setting warm: true on more aiModels.json entries than you have GPUs causes the AI worker to log a conflict at startup and skip the excess warm entries.
Example of what works: one RTX 4090, two entries, one warm slot available:
Single-GPU warm and cold example
[
  {
    "pipeline": "text-to-image",
    "model_id": "SG161222/RealVisXL_V4.0_Lightning",
    "price_per_unit": 4768371,
    "warm": true
  },
  {
    "pipeline": "audio-to-text",
    "model_id": "openai/whisper-large-v3",
    "price_per_unit": 12882811
  }
]
text-to-image is warm (your primary revenue pipeline); audio-to-text loads cold on demand.
Exception: Whisper is small enough (~3 GB) that some operators co-warm it alongside a diffusion model without conflict, but that behaviour is hardware-dependent. Monitor startup logs.
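The skip behaviour described above can be checked before startup. A hypothetical validation sketch; check_warm_slots is an illustration, not part of the Livepeer tooling:

```python
import json

def check_warm_slots(models_json: str, num_gpus: int) -> list:
    """Return the warm entries beyond the Beta limit of one warm
    model per GPU; per the docs, these would be skipped at startup."""
    entries = json.loads(models_json)
    warm = [e for e in entries if e.get("warm")]
    return warm[num_gpus:]  # entries past the available warm slots

config = """[
  {"pipeline": "text-to-image",
   "model_id": "SG161222/RealVisXL_V4.0_Lightning", "warm": true},
  {"pipeline": "audio-to-text",
   "model_id": "openai/whisper-large-v3", "warm": true}
]"""

# One GPU, two warm entries: the second warm entry is flagged.
for entry in check_warm_slots(config, num_gpus=1):
    print("would be skipped:", entry["pipeline"])
```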

Multi-GPU configuration

The AI worker assigns GPU resources based on the order of entries in aiModels.json and available device IDs. For explicit multi-GPU assignment, use the CUDA device environment variable approach when launching containers:
Assign workloads to specific GPUs
# GPU 0 handles diffusion pipelines
docker run -d --gpus '"device=0"' --name ai-runner-gpu0 ...

# GPU 1 handles Whisper and LLM
docker run -d --gpus '"device=1"' --name ai-runner-gpu1 ...
For a node with RTX 4090 (GPU0) + RTX 2060 (GPU1):
Multi-GPU aiModels.json example
[
  {
    "pipeline": "text-to-image",
    "model_id": "SG161222/RealVisXL_V4.0_Lightning",
    "price_per_unit": 4768371,
    "warm": true
  },
  {
    "pipeline": "image-to-image",
    "model_id": "ByteDance/SDXL-Lightning",
    "price_per_unit": 4768371
  },
  {
    "pipeline": "audio-to-text",
    "model_id": "openai/whisper-large-v3",
    "price_per_unit": 12882811,
    "warm": true,
    "url": "http://whisper-runner:8001"
  },
  {
    "pipeline": "llm",
    "model_id": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "warm": true,
    "price_per_unit": 0.18,
    "currency": "USD",
    "pixels_per_unit": 1000000,
    "url": "http://llm_runner:8000"
  }
]
In this configuration: GPU0 handles both diffusion pipelines (warm text-to-image, cold image-to-image). GPU1 handles Whisper and LLM via external containers on its own VRAM.

aiModels.json complete schema

Every field, its type, valid values, and behaviour:
pipeline
Type: string
Description: The pipeline identifier. Must exactly match a supported pipeline name.
Valid values:
Supported pipeline values
text-to-image
image-to-image
image-to-video
image-to-text
audio-to-text
segment-anything-2
text-to-speech
upscale
llm
live-video-to-video
model_id
Type: string
Description: HuggingFace model ID. Case-sensitive. Must include the organisation prefix.
Examples:
  • "SG161222/RealVisXL_V4.0_Lightning"
  • "RealVisXL_V4.0_Lightning" ❌ (missing org prefix)
  • "sg161222/realvisxl_v4.0_lightning" ❌ (wrong case)
  • "llama3.1:8b" ❌ (Ollama tag; use the HuggingFace ID here)
For llm pipeline, use the HuggingFace ID (meta-llama/Meta-Llama-3.1-8B-Instruct) even though the Ollama runner uses its own internal tag format.
price_per_unit
Type: integer or string
Description: Price per unit of work.
Integer format (value in Wei):
Wei price_per_unit example
"price_per_unit": 4768371
USD string format (scientific notation with a USD suffix):
USD price_per_unit example
"price_per_unit": "0.5e-3USD",
"currency": "USD"
When using USD notation, also set "currency": "USD".
Pricing unit per pipeline:
warm
Type: boolean
Default: false
Description: Preload the model into GPU VRAM at container startup.
Setting warm: true eliminates cold-start latency. One warm model per GPU during Beta. Models larger than the available VRAM fail at model load time when warm is true; check docker logs for OOM messages.
pixels_per_unit
Type: integer
Description: Units of work per pricing unit. Adjusts the effective per-unit cost granularity.
Used primarily with audio-to-text to set per-millisecond pricing:
audio-to-text pixels_per_unit example
"pixels_per_unit": 1
For llm, sets the token count per pricing unit:
LLM pixels_per_unit example
"pixels_per_unit": 1000000
(1 pricing unit = 1 million tokens)
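How pixels_per_unit scales the effective price can be shown with a small calculation. The job_cost helper is illustrative, not part of the AI worker:

```python
def job_cost(units_of_work: int, price_per_unit: float,
             pixels_per_unit: int) -> float:
    """Cost of a job: work is measured in base units (pixels,
    milliseconds, tokens); each pricing unit covers pixels_per_unit
    of those base units."""
    return units_of_work / pixels_per_unit * price_per_unit

# LLM example from above: 0.18 USD per 1,000,000 tokens.
# A 2,000-token response costs 0.00036 USD:
print(job_cost(2000, price_per_unit=0.18, pixels_per_unit=1_000_000))
```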
currency
Type: string
Description: Currency for price_per_unit. Required when using USD notation.
currency field example
"currency": "USD"
url
Type: string
Description: URL of an external container serving this pipeline. The AI worker treats the URL as a pass-through and polls /health at startup.
Use this for:
  • Ollama LLM runner
  • Custom inference servers
  • K8s clusters or GPU farms
url field example
"url": "http://llm_runner:8000"
The container at this URL must:
  1. Expose /health → return HTTP 200
  2. Handle inference requests in the format the AI worker sends
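The health contract in point 1 takes only a few lines of stdlib Python. This is a minimal sketch; a real BYOC container must also implement the inference routes from point 2, which are pipeline-specific and not shown here:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    """Minimal sketch of the /health contract an external container
    must satisfy; inference routes are omitted."""

    def do_GET(self):
        if self.path == "/health":
            self.send_response(200)  # the AI worker expects HTTP 200
            self.end_headers()
            self.wfile.write(b"OK")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep the example quiet
        pass

server = HTTPServer(("127.0.0.1", 0), HealthHandler)  # 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
print("health endpoint on port", server.server_address[1])
```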
token
Type: string
Description: Bearer token for authenticating with the external container.
token field example
"token": "my-secret-token"
capacity
Type: integer
Default: 1
Description: Maximum concurrent inference tasks from this container. Set based on the container’s actual concurrency support.
capacity field example
"capacity": 2
For Ollama with multiple loaded models or hardware built for parallel inference, increasing capacity improves throughput.
optimization_flags
Type: object
Description: Performance optimisations for warm diffusion models. Applies to text-to-image, image-to-image, and upscale pipelines with warm: true.
SFAST (Stable Fast, up to 25% speedup, no quality loss):
SFAST flag example
"optimization_flags": { "SFAST": true }
DEEPCACHE (up to 50% speedup, minor quality impact):
DEEPCACHE flag example
"optimization_flags": { "DEEPCACHE": true }
Choose one flag only. Skip DEEPCACHE on Lightning and Turbo models because those models are already step-optimised.
Sources: Stable Fast · DeepCache

Model selection and earnings

Model earnings diverge for four clear reasons:
  1. Pipeline demand — some pipelines receive more jobs from gateways
  2. Model match — gateways often request specific model IDs; running the requested model gets you the job
  3. Warm status — warm models win latency-competitive pipelines
  4. Pricing competitiveness — prices above the gateway’s maxPricePerUnit receive zero jobs
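Points 3 and 4 together act as a selection filter, which can be sketched as follows. This illustrates the selection pressure, not the actual gateway algorithm; the field names are simplified assumptions:

```python
def eligible(orchestrators, model_id, max_price_per_unit):
    """Drop nodes priced above maxPricePerUnit, then rank warm
    nodes ahead of cold ones for the requested model."""
    offers = [o for o in orchestrators
              if o["model_id"] == model_id
              and o["price_per_unit"] <= max_price_per_unit]
    # warm -> not warm is False -> sorts first
    return sorted(offers, key=lambda o: not o.get("warm", False))

nodes = [
    {"name": "a", "model_id": "sdxl", "price_per_unit": 500,  "warm": False},
    {"name": "b", "model_id": "sdxl", "price_per_unit": 900,  "warm": True},
    {"name": "c", "model_id": "sdxl", "price_per_unit": 1200, "warm": True},
]
print([o["name"] for o in eligible(nodes, "sdxl", max_price_per_unit=1000)])
# node c is priced out entirely; warm node b ranks ahead of cold node a
```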

Tracking performance

Use the Explorer AI leaderboard to compare your earnings against similar nodes. Nodes with identical hardware but different warm model selections often show significantly different earnings — the warm pipeline choice matters more than raw GPU performance.

Model selection heuristics

Pricing strategy

Understanding the market

Prices are set by operators and enforced at the gateway layer. A price_per_unit above a gateway’s maxPricePerUnit removes your node from that gateway’s job set, regardless of performance. The network is price-competitive. Setting prices too high means no jobs. Setting prices too low reduces earnings unnecessarily.

Reference pricing (late 2025)

These figures are approximate and shift with ETH/USD rates and network competition. Use explorer.livepeer.org for current market rates.
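Because quoted Wei prices track the ETH/USD rate, converting a USD target into a price_per_unit value is a routine calculation. A sketch, using an assumed exchange rate rather than a live quote:

```python
def usd_to_wei_per_unit(usd_per_unit: float, eth_usd: float) -> int:
    """Convert a USD price per unit into Wei (1 ETH = 10**18 Wei)."""
    return round(usd_per_unit / eth_usd * 10**18)

# A target of 0.5e-3 USD per unit at an assumed 3000 USD/ETH:
print(usd_to_wei_per_unit(0.5e-3, eth_usd=3000.0))
```

Re-run the conversion when the exchange rate moves materially; a fixed Wei price drifts away from its USD target otherwise.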

GPU economics at scale

A fully utilised RTX 4090 running text-to-image at competitive pricing and warm load earns strong fee revenue when demand, active-set position, and gateway routing align. For an economics illustration with current network utilisation figures, see Orchestrator Economics.

Hosting custom models (BYOC)

The url field in aiModels.json allows any inference server to serve a pipeline, including the standard livepeer/ai-runner containers. This is the BYOC (Bring Your Own Container) path. BYOC use cases:
  • Running models outside the HuggingFace catalogue
  • Fine-tuned proprietary models
  • Custom inference architectures (TensorRT, ONNX, OpenVINO)
  • Models hosted in K8s or a GPU farm behind a load balancer
The only requirement is the /health endpoint contract and matching the AI worker’s request format. See Hosting Models (BYOC) for the full BYOC guide.
Last modified on March 16, 2026