The llm pipeline uses a different architecture from all other Livepeer AI pipelines. Where the diffusion and audio pipelines use the standard livepeer/ai-runner container, the LLM pipeline routes through an Ollama-based runner maintained by Cloud SPE. This enables quantised large language models to run on consumer GPUs with 8 GB of VRAM or more.
The pipeline flow is: gateway → go-livepeer orchestrator → livepeer-ollama-runner → Ollama. The orchestrator discovers the external runner through the url field in aiModels.json.
Architecture split
All other batch AI pipelines (text-to-image, audio-to-text, segment-anything-2, text-to-speech) use the livepeer/ai-runner container. go-livepeer spawns that container automatically based on aiModels.json and manages its lifecycle.
The llm pipeline requires you to run the Ollama stack manually:
- Ollama container — the model runtime that loads and serves quantised LLM weights
- livepeer-ollama-runner — a shim container that translates between go-livepeer’s AI worker protocol and the Ollama API
go-livepeer connects to livepeer-ollama-runner via the url field. The runner must be reachable on a shared Docker network.
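The two containers can be brought up manually with Docker. A minimal sketch follows; the network name, runner image tag, port, and OLLAMA_BASE_URL variable are assumptions for illustration — check the Cloud SPE runner's documentation for the published values:

```shell
# Create a shared network so go-livepeer and the runner can resolve
# each other by container name (network name is an assumption).
docker network create livepeer-ai

# Ollama runtime with GPU access; model weights persist in a named volume.
docker run -d --name ollama --network livepeer-ai \
  --gpus all -v ollama:/root/.ollama ollama/ollama

# Shim runner bridging go-livepeer's AI worker protocol to the Ollama API.
# Image name, port, and env var are placeholders, not published values.
docker run -d --name livepeer-ollama-runner --network livepeer-ai \
  -p 8000:8000 \
  -e OLLAMA_BASE_URL=http://ollama:11434 \
  livepeer-ollama-runner:latest
```

Attach the go-livepeer orchestrator to the same network so the hostname in the url field resolves.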
Setup
Prerequisites
- Docker and Docker Compose installed
- NVIDIA Container Toolkit configured (for GPU passthrough)
- An existing go-livepeer orchestrator with -aiWorker enabled
- 8 GB or more of GPU VRAM (minimum for quantised 7B/8B models)
Model selection for 8 GB VRAM
Quantised models reduce precision (typically from float32 to 4-bit integer) to fit within smaller VRAM budgets with minimal quality loss. Ollama handles quantisation automatically via its model tags. For 8 GB VRAM GPUs, use llama3.1:8b or mistral:7b. Gemma 2 9B typically requires closer to 10 GB, so single 8 GB cards should stay in the 7B to 8B class.
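The matching pull commands (run them inside the Ollama container if you deployed it with Docker, e.g. via docker exec):

```shell
# Default Ollama tags resolve to 4-bit quantised builds that fit in 8 GB.
ollama pull llama3.1:8b
ollama pull mistral:7b
```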
Model ID mapping
The Ollama tag (llama3.1:8b) and the Livepeer model_id (meta-llama/Meta-Llama-3.1-8B-Instruct) are different naming conventions for the same model family. Ollama uses its own tag format internally; go-livepeer uses HuggingFace IDs for on-chain capability advertisement.
Both identify the same underlying model. The aiModels.json entry uses the HuggingFace ID in model_id, while the ollama pull command uses the Ollama tag.
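As a sketch, an aiModels.json entry for this pipeline might look like the following — the url value and warm flag are illustrative, and the exact schema should be verified against the aiModels.json reference:

```json
[
  {
    "pipeline": "llm",
    "model_id": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "url": "http://livepeer-ollama-runner:8000",
    "warm": true
  }
]
```

On the Ollama side, the same model is fetched with ollama pull llama3.1:8b.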
Pricing the LLM pipeline
LLM pricing differs from pixel-based pipelines. Use USD notation with pixels_per_unit as a token-count proxy:
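The proxy works like a metered unit price: a request's token count is divided by pixels_per_unit to get billable units, which are multiplied by the per-unit price. A minimal sketch of the arithmetic (the numbers are illustrative, not recommended prices):

```python
def llm_cost_usd(tokens: int, price_per_unit_usd: float, tokens_per_unit: int) -> float:
    """Cost = (tokens / tokens_per_unit) * price per unit."""
    units = tokens / tokens_per_unit
    return units * price_per_unit_usd

# Example: 1,500 tokens at $0.00025 per 1,000-token unit.
cost = llm_cost_usd(1500, 0.00025, 1000)
print(f"{cost:.8f}")  # prints 0.00037500
```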
LLM pricing in aiModels.json
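A sketch of such an entry — the price fields and the USD-suffix notation shown here should be verified against the go-livepeer aiModels.json reference, and the values are illustrative only:

```json
{
  "pipeline": "llm",
  "model_id": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "url": "http://livepeer-ollama-runner:8000",
  "price_per_unit": "0.00000025USD",
  "pixels_per_unit": 1000,
  "warm": true
}
```

Here pixels_per_unit stands in for "tokens per billing unit", so the entry charges per 1,000 tokens.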
Testing locally
After the stack is running, test the Ollama runner directly before routing live traffic.
Test LLM inference locally
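A hypothetical request, assuming the runner listens on port 8000 and accepts an OpenAI-style chat payload at /llm — adjust the path, port, and body to the runner's actual API:

```shell
curl -s http://localhost:8000/llm \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```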
Check the runner health endpoint
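A hypothetical probe, assuming the runner exposes a /health route on the same port (verify the route against the runner's documentation):

```shell
curl -s http://localhost:8000/health
```

A non-200 response or connection refusal usually means the runner and Ollama are not on the same Docker network.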
Related pages
AI Inference Operations
aiModels.json reference and full pipeline architecture including the url field for external containers.
Diffusion Pipeline Setup
text-to-image, image-to-image, and other diffusion pipelines requiring the standard ai-runner.
Audio and Vision Pipelines
audio-to-text, text-to-speech, image-to-text, and segment-anything-2 setup.
AI Model Management
Warm vs cold strategy and optimisation flags for AI pipelines.