The llm pipeline is the entry point for operators with older or smaller GPUs. A quantised 8B parameter model runs within 8 GB of VRAM, opening AI participation to cards that cannot run diffusion models at all.

The llm pipeline uses a different architecture from all other Livepeer AI pipelines. Where diffusion and audio pipelines use the standard livepeer/ai-runner container, the LLM pipeline routes through an Ollama-based runner maintained by Cloud SPE. This enables quantised large language models to run on consumer GPUs with 8 GB of VRAM or more. The pipeline flow is:
LLM pipeline flow
go-livepeer → livepeer-ollama-runner → ollama container → quantised model
go-livepeer reaches the LLM stack over HTTP instead of managing model weights directly. The Ollama runner and Ollama container run as separate Docker services, and go-livepeer connects to them via the url field in aiModels.json.

Architecture split

All other batch AI pipelines (text-to-image, audio-to-text, segment-anything-2, text-to-speech) use the livepeer/ai-runner container. go-livepeer spawns that container automatically based on aiModels.json and manages its lifecycle. The llm pipeline requires you to run the Ollama stack manually:
  • Ollama container — the model runtime that loads and serves quantised LLM weights
  • livepeer-ollama-runner — a shim container that translates between go-livepeer’s AI worker protocol and the Ollama API
go-livepeer connects to the livepeer-ollama-runner via the url field. The runner must be reachable on a shared Docker network.
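The two services above are typically run with Docker Compose. The sketch below is illustrative only: the runner image name and the OLLAMA_BASE_URL variable are placeholders (substitute the image and configuration published by Cloud SPE), and the network name must match whatever network your go-livepeer container is attached to. The llm_runner service name corresponds to the url field shown later in aiModels.json.

```yaml
# docker-compose.yml — illustrative sketch, not an official configuration.
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama-models:/root/.ollama   # persist pulled model weights
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]      # GPU passthrough via NVIDIA Container Toolkit
  llm_runner:
    image: <cloud-spe-ollama-runner>   # placeholder — use the runner image from Cloud SPE
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434   # hypothetical variable; 11434 is Ollama's default port
    ports:
      - "8000:8000"
    depends_on:
      - ollama
networks:
  default:
    name: livepeer-ai                  # must match the network go-livepeer uses
volumes:
  ollama-models:
```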

Setup

Prerequisites

  • Docker and Docker Compose installed
  • NVIDIA Container Toolkit configured (for GPU passthrough)
  • An existing go-livepeer orchestrator with -aiWorker enabled
  • 8 GB or more of GPU VRAM (minimum for quantised 7B/8B models)

Model selection for 8 GB VRAM

Quantised models reduce precision (typically from float32 to 4-bit integer) to fit within smaller VRAM budgets with minimal quality loss. Ollama handles quantisation automatically via its model tags. For 8 GB VRAM GPUs, use llama3.1:8b or mistral:7b. Gemma 2 9B typically requires closer to 10 GB, so single 8 GB cards should stick to the 7B-8B class.
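As a back-of-envelope check on whether a quantised model fits, weights at 4-bit precision occupy roughly half a byte per parameter, with KV cache and runtime overhead on top. A rough sketch (the 8-billion-parameter count and 4-bit assumption are illustrative):

```shell
# Rough weight footprint for a 4-bit quantised 8B model:
# ~0.5 bytes per parameter; KV cache and activations add overhead on top.
awk -v params=8e9 'BEGIN { printf "weights: %.1f GiB\n", params * 0.5 / 1024^3 }'
# -> weights: 3.7 GiB
```

About 3.7 GiB of weights leaves headroom on an 8 GB card; a 9B model at similar quantisation lands noticeably higher once overhead is included, which is why the 7B-8B class is the safe choice here.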

Model ID mapping

The Ollama tag (llama3.1:8b) and the Livepeer model_id (meta-llama/Meta-Llama-3.1-8B-Instruct) are different naming conventions for the same model family. Ollama uses its own tag format internally; go-livepeer uses HuggingFace IDs for on-chain capability advertisement. Both identify the same underlying model. The aiModels.json entry uses the HuggingFace ID in model_id, while the ollama pull command uses the Ollama tag.
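When scripting deployments, the correspondence can be kept in a small lookup table. The pairs below are illustrative only — the Mistral HuggingFace ID in particular is an assumption, so confirm each pair against the model_id your orchestrator should advertise:

```shell
#!/usr/bin/env bash
# Illustrative Ollama-tag -> HuggingFace-ID lookup; verify each pair before use.
declare -A hf_id=(
  ["llama3.1:8b"]="meta-llama/Meta-Llama-3.1-8B-Instruct"
  ["mistral:7b"]="mistralai/Mistral-7B-Instruct-v0.3"
)
# The Ollama tag goes to `ollama pull`; the HuggingFace ID goes in aiModels.json.
echo "${hf_id[llama3.1:8b]}"
# -> meta-llama/Meta-Llama-3.1-8B-Instruct
```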

Pricing the LLM pipeline

LLM pricing differs from pixel-based pipelines: there are no pixels to meter, so pixels_per_unit is repurposed as a token-count proxy. Use USD notation:
LLM pricing in aiModels.json
{
  "pipeline": "llm",
  "model_id": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "price_per_unit": 0.18,
  "currency": "USD",
  "pixels_per_unit": 1000000,
  "warm": true,
  "url": "http://llm_runner:8000"
}
This example sets a rate of $0.18 per million tokens (price_per_unit of 0.18 USD per pixels_per_unit of 1,000,000, with each "pixel" standing in for one token), a competitive rate for 8B parameter models as of early 2026. Adjust based on your GPU’s inference throughput and current market rates. Check tools.livepeer.cloud/ai/network-capabilities for current LLM pipeline pricing from other orchestrators before setting your rate.
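To sanity-check a rate, the per-request charge works out to price_per_unit / pixels_per_unit per token. With the values above, a hypothetical 4,096-token completion costs:

```shell
# Cost of a 4096-token response at 0.18 USD per 1,000,000 tokens
awk -v price=0.18 -v per_unit=1000000 -v tokens=4096 \
  'BEGIN { printf "%.6f USD\n", price / per_unit * tokens }'
# -> 0.000737 USD
```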

Testing locally

After the stack is running, test the Ollama runner directly before routing live traffic:
Test LLM inference locally
# Check Ollama is running and the model is loaded
docker exec -it ollama ollama list

# Test inference via the runner (adjust port if different)
curl -X POST http://localhost:8000/llm \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1:8b", "prompt": "Hello"}'
Verify the runner health endpoint is responding:
Check the runner health endpoint
curl http://localhost:8000/health
# Expected: HTTP 200
Last modified on March 16, 2026