Overview

The llm pipeline provides an OpenAI-compatible interface for text generation, designed to integrate seamlessly into media workflows.

Models

The llm pipeline supports any Hugging Face-compatible LLM model. Since models evolve quickly, the set of warm (preloaded) models on Orchestrators changes regularly.

To see which models are currently available, check the Network Capabilities dashboard.
At the time of writing, the most commonly available model is meta-llama/Meta-Llama-3.1-8B-Instruct.

For faster responses with a different LLM model, ask Orchestrators to load it on their GPUs via the ai-video channel in the Livepeer Discord Server.

Basic Usage Instructions

For a detailed understanding of the llm endpoint and to experiment with the API, see the Livepeer AI API Reference.

To generate text with the llm pipeline, send a POST request to the Gateway’s llm API endpoint:

curl -X POST "https://<GATEWAY_IP>/llm" \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      { "role": "user", "content": "Tell a robot story." }
    ]
  }'

In this command:

  • <GATEWAY_IP> should be replaced with your AI Gateway’s IP address.
  • <TOKEN> should be replaced with your API token.
  • model is the LLM model to use for generation.
  • messages is the conversation or prompt input for the model.

For additional optional parameters such as temperature, max_tokens, or stream, refer to the Livepeer AI API Reference.
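
For example, a request that tunes the sampling temperature and caps the response length only needs those fields added to the JSON body. The values shown here are illustrative; consult the API Reference for accepted ranges and defaults:

curl -X POST "https://<GATEWAY_IP>/llm" \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      { "role": "user", "content": "Tell a robot story." }
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'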

After execution, the Orchestrator processes the request and returns the response to the Gateway:

{
  "id": "chatcmpl-abc123",
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Once upon a time, in a gleaming city of circuits..."
      }
    }
  ]
}

By default, responses are returned as a single JSON object. To stream output token-by-token using Server-Sent Events (SSE), set "stream": true in the request body.
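
As a sketch, a streaming request differs only in the request body. With SSE, the response typically arrives as incremental "data:" lines, so pass curl's -N flag to disable output buffering and print chunks as they arrive:

curl -N -X POST "https://<GATEWAY_IP>/llm" \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      { "role": "user", "content": "Tell a robot story." }
    ],
    "stream": true
  }'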

Orchestrator Configuration

To configure your Orchestrator to serve the llm pipeline, refer to the Orchestrator Configuration guide.

Tuning Environment Variables

The llm pipeline supports several environment variables that can be adjusted to optimize performance based on your hardware and workload. These are particularly helpful for managing memory usage and parallelism when running large models.

  • USE_8BIT (boolean): Enables 8-bit quantization using bitsandbytes for lower memory usage. Set to true to enable. Defaults to false.
  • PIPELINE_PARALLEL_SIZE (integer): Number of pipeline parallel stages. Should not exceed the number of model layers. Defaults to 1.
  • TENSOR_PARALLEL_SIZE (integer): Number of tensor parallel units. Must divide evenly into the number of attention heads in the model. Defaults to 1.
  • MAX_MODEL_LEN (integer): Maximum number of tokens per input sequence. Defaults to 8192.
  • MAX_NUM_BATCHED_TOKENS (integer): Maximum number of tokens processed in a single batch. Should be greater than or equal to MAX_MODEL_LEN. Defaults to 8192.
  • MAX_NUM_SEQS (integer): Maximum number of sequences processed per batch. Defaults to 128.
  • GPU_MEMORY_UTILIZATION (float): Target GPU memory utilization, as a float between 0 and 1. Higher values make fuller use of GPU memory. Defaults to 0.97.
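
How these variables reach the pipeline depends on your deployment. As a minimal sketch, assuming the runner is started as a Docker container (the image name below is a placeholder for the image your Orchestrator actually runs), they can be passed with -e flags at startup:

# Example values only; tune for your GPUs and model.
# TENSOR_PARALLEL_SIZE must divide the model's attention head count.
docker run --gpus all \
  -e USE_8BIT=false \
  -e TENSOR_PARALLEL_SIZE=2 \
  -e MAX_MODEL_LEN=8192 \
  -e GPU_MEMORY_UTILIZATION=0.95 \
  <AI_RUNNER_IMAGE>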

System Requirements

The following system requirements are recommended for optimal performance:

Pricing

The /llm pipeline is currently priced based on the maximum output tokens specified in the request, not actual usage, due to current payment system limitations. We’re actively working to support usage-based pricing to better align with industry standards.

We are planning to simplify pricing in the future so Orchestrators can set one AI price per compute unit and have the system automatically scale based on the model’s compute requirements.

The LLM pricing landscape is highly competitive and rapidly evolving. Orchestrators should set prices based on their infrastructure costs and market positioning. As a reference, inference on llama-3-8b-instruct is currently around 0.08 USD per 1 million output tokens.
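
For example, at that reference rate a request that sets max_tokens to 1,000 would be charged roughly 1,000 × 0.08 / 1,000,000 ≈ 0.00008 USD, regardless of how many tokens the model actually returns.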

API Reference

Explore the llm endpoint and experiment with the API in the Livepeer AI API Reference.