The llm pipeline is the entry point for operators with older or smaller GPUs. A quantised 8B parameter model runs within 8 GB of VRAM, opening AI participation to cards that cannot run diffusion models at all.

The llm pipeline uses a different architecture from all other Livepeer AI pipelines. Where diffusion and audio pipelines use the standard livepeer/ai-runner container, the LLM pipeline routes through an Ollama-based runner maintained by Cloud SPE. This enables quantised large language models to run on consumer GPUs with 8 GB of VRAM or more. The pipeline flow is:
LLM pipeline flow
go-livepeer → livepeer-ollama-runner → ollama container → quantised model
go-livepeer reaches the LLM stack over HTTP instead of managing model weights directly. The Ollama runner and Ollama container run as separate Docker services, and go-livepeer connects to them via the url field in aiModels.json.

Architecture split

All other batch AI pipelines (text-to-image, audio-to-text, segment-anything-2, text-to-speech) use the livepeer/ai-runner container. go-livepeer spawns that container automatically based on aiModels.json and manages its lifecycle. The llm pipeline requires you to run the Ollama stack manually:
  • Ollama container — the model runtime that loads and serves quantised LLM weights
  • livepeer-ollama-runner — a shim container that translates between go-livepeer’s AI worker protocol and the Ollama API
go-livepeer connects to the livepeer-ollama-runner via the url field. The runner must be reachable on a shared Docker network.
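The two services above are typically run with Docker Compose. The sketch below is illustrative only: the runner image name and the OLLAMA_BASE_URL variable are placeholders (substitute the image and configuration published by Cloud SPE), and the network name must match whatever network your go-livepeer container is attached to. The llm_runner service name corresponds to the url field shown later in aiModels.json.

```yaml
# docker-compose.yml — illustrative sketch, not an official configuration.
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama-models:/root/.ollama   # persist pulled model weights
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]      # GPU passthrough via NVIDIA Container Toolkit
  llm_runner:
    image: <cloud-spe-ollama-runner>   # placeholder — use the runner image from Cloud SPE
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434   # hypothetical variable; 11434 is Ollama's default port
    ports:
      - "8000:8000"
    depends_on:
      - ollama
networks:
  default:
    name: livepeer-ai                  # must match the network go-livepeer uses
volumes:
  ollama-models:
```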

Setup

Prerequisites

  • Docker and Docker Compose installed
  • NVIDIA Container Toolkit configured (for GPU passthrough)
  • An existing go-livepeer orchestrator with -aiWorker enabled
  • 8 GB or more of GPU VRAM (minimum for quantised 7B/8B models)

Model selection for 8 GB VRAM

Quantised models reduce precision (typically from float32 to 4-bit integer) to fit within smaller VRAM budgets with minimal quality loss. Ollama handles quantisation automatically via its model tags. For 8 GB VRAM GPUs, use llama3.1:8b or mistral:7b. Gemma 2 9B typically requires closer to 10 GB, so single 8 GB cards should stick to the 7B-8B class.
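As a back-of-envelope check on whether a quantised model fits, weights at 4-bit precision occupy roughly half a byte per parameter, with KV cache and runtime overhead on top. A rough sketch (the 8-billion-parameter count and 4-bit assumption are illustrative):

```shell
# Rough weight footprint for a 4-bit quantised 8B model:
# ~0.5 bytes per parameter; KV cache and activations add overhead on top.
awk -v params=8e9 'BEGIN { printf "weights: %.1f GiB\n", params * 0.5 / 1024^3 }'
# -> weights: 3.7 GiB
```

About 3.7 GiB of weights leaves headroom on an 8 GB card; a 9B model at similar quantisation lands noticeably higher once overhead is included, which is why the 7B-8B class is the safe choice here.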

Model ID mapping

The Ollama tag (llama3.1:8b) and the Livepeer model_id (meta-llama/Meta-Llama-3.1-8B-Instruct) are different naming conventions for the same model family. Ollama uses its own tag format internally; go-livepeer uses HuggingFace IDs for on-chain capability advertisement. Both identify the same underlying model. The aiModels.json entry uses the HuggingFace ID in model_id, while the ollama pull command uses the Ollama tag.
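When scripting deployments, the correspondence can be kept in a small lookup table. The pairs below are illustrative only — the Mistral HuggingFace ID in particular is an assumption, so confirm each pair against the model_id your orchestrator should advertise:

```shell
#!/usr/bin/env bash
# Illustrative Ollama-tag -> HuggingFace-ID lookup; verify each pair before use.
declare -A hf_id=(
  ["llama3.1:8b"]="meta-llama/Meta-Llama-3.1-8B-Instruct"
  ["mistral:7b"]="mistralai/Mistral-7B-Instruct-v0.3"
)
# The Ollama tag goes to `ollama pull`; the HuggingFace ID goes in aiModels.json.
echo "${hf_id[llama3.1:8b]}"
# -> meta-llama/Meta-Llama-3.1-8B-Instruct
```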

Pricing the LLM pipeline

LLM pricing differs from pixel-based pipelines: there are no pixels to meter, so pixels_per_unit is repurposed as a token-count proxy. Use USD notation:
LLM pricing in aiModels.json
{
  "pipeline": "llm",
  "model_id": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "price_per_unit": 0.18,
  "currency": "USD",
  "pixels_per_unit": 1000000,
  "warm": true,
  "url": "http://llm_runner:8000"
}
This example sets a rate of $0.18 per million tokens (price_per_unit of 0.18 USD per pixels_per_unit of 1,000,000, with each "pixel" standing in for one token), a competitive rate for 8B parameter models as of early 2026. Adjust based on your GPU’s inference throughput and current market rates. Check tools.livepeer.cloud/ai/network-capabilities for current LLM pipeline pricing from other orchestrators before setting your rate.
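To sanity-check a rate, the per-request charge works out to price_per_unit / pixels_per_unit per token. With the values above, a hypothetical 4,096-token completion costs:

```shell
# Cost of a 4096-token response at 0.18 USD per 1,000,000 tokens
awk -v price=0.18 -v per_unit=1000000 -v tokens=4096 \
  'BEGIN { printf "%.6f USD\n", price / per_unit * tokens }'
# -> 0.000737 USD
```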

Testing locally

After the stack is running, test the Ollama runner directly before routing live traffic:
Test LLM inference locally
# Check Ollama is running and the model is loaded
docker exec -it ollama ollama list

# Test inference via the runner (adjust port if different)
curl -X POST http://localhost:8000/llm \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1:8b", "prompt": "Hello"}'
Verify the runner health endpoint is responding:
Check the runner health endpoint
curl http://localhost:8000/health
# Expected: HTTP 200
Last modified on March 16, 2026