LLM
Overview
The `llm` pipeline provides an OpenAI-compatible interface for text generation, designed to integrate seamlessly into media workflows.
Models
The `llm` pipeline supports any Hugging Face-compatible LLM. Because models evolve quickly, the set of warm (preloaded) models on Orchestrators changes regularly.
To see which models are currently available, check the Network Capabilities dashboard.
At the time of writing, the most commonly available model is `meta-llama/Meta-Llama-3.1-8B-Instruct`.
If you need faster responses with a different LLM model, ask Orchestrators to load it on their GPUs via the ai-video channel in the Discord server.
Basic Usage Instructions
For a detailed understanding of the `llm` endpoint and to experiment with the API, see the Livepeer AI API Reference.
To generate text with the `llm` pipeline, send a `POST` request to the Gateway's `llm` API endpoint.
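The request below is a minimal sketch: it assumes the pipeline is exposed at the `/llm` path on the Gateway and that requests are authenticated with a Bearer token.

```bash
curl -X POST "https://<GATEWAY_IP>/llm" \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Give me a one-sentence summary of the Livepeer network."}
    ]
  }'
```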
In this command:
- `<GATEWAY_IP>` should be replaced with your AI Gateway's IP address.
- `<TOKEN>` should be replaced with your API token.
- `model` is the LLM model to use for generation.
- `messages` is the conversation or prompt input for the model.
For additional optional parameters such as `temperature`, `max_tokens`, or `stream`, refer to the Livepeer AI API Reference.
After execution, the Orchestrator processes the request and returns the response to the Gateway:
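Because the interface is OpenAI-compatible, the response follows the familiar chat-completion shape. The fields below are illustrative and may differ slightly from what your Gateway returns:

```json
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1723000000,
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Livepeer is a decentralized video infrastructure network..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 18,
    "total_tokens": 43
  }
}
```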
By default, responses are returned as a single JSON object. To stream output token-by-token using Server-Sent Events (SSE), set `"stream": true` in the request body.
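A streaming request is a small variation on the sketch above, under the same assumptions about the `/llm` path and Bearer-token authentication; the `-N` flag keeps curl from buffering the stream:

```bash
curl -N -X POST "https://<GATEWAY_IP>/llm" \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Write a haiku about video streaming."}],
    "stream": true
  }'
```

Each SSE event should carry a JSON chunk with the next piece of the generated text.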
Orchestrator Configuration
To configure your Orchestrator to serve the `llm` pipeline, refer to the Orchestrator Configuration guide.
Tuning Environment Variables
The `llm` pipeline supports several environment variables that can be adjusted to optimize performance based on your hardware and workload. These are particularly helpful for managing memory usage and parallelism when running large models.
- Enables 8-bit quantization using `bitsandbytes` for lower memory usage. Set to `true` to enable. Defaults to `false`.
- Number of pipeline parallel stages. Should not exceed the number of model layers. Defaults to `1`.
- Number of tensor parallel units. Must divide evenly into the number of attention heads in the model. Defaults to `1`.
- Maximum number of tokens per input sequence. Defaults to `8192`.
- Maximum number of tokens processed in a single batch. Should be greater than or equal to `MAX_MODEL_LEN`. Defaults to `8192`.
- Maximum number of sequences processed per batch. Defaults to `128`.
- Target GPU memory utilization as a float between `0` and `1`. Higher values make fuller use of GPU memory. Defaults to `0.97`.
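As an illustration, an Orchestrator might export these settings in the AI runner's environment before starting the `llm` pipeline. The variable names below are assumptions inferred from the descriptions above (only `MAX_MODEL_LEN` is named explicitly in this guide), so confirm the exact names against the Orchestrator Configuration guide for your runner version:

```bash
# Sketch of a tuning profile; names other than MAX_MODEL_LEN are assumptions
# and may differ in your runner version.
export USE_8BIT=false               # 8-bit quantization via bitsandbytes
export PIPELINE_PARALLEL_SIZE=1     # pipeline parallel stages
export TENSOR_PARALLEL_SIZE=1       # tensor parallel units; must divide the attention heads
export MAX_MODEL_LEN=8192           # max tokens per input sequence
export MAX_NUM_BATCHED_TOKENS=8192  # must be >= MAX_MODEL_LEN
export MAX_NUM_SEQS=128             # max sequences per batch
export GPU_MEMORY_UTILIZATION=0.97  # fraction of GPU memory to use
```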
System Requirements
The following system requirements are recommended for optimal performance:
- NVIDIA GPU with at least 16GB of VRAM.
Recommended Pipeline Pricing
We are planning to simplify the pricing in the future so orchestrators can set one AI price per compute unit and have the system automatically scale based on the model’s compute requirements.
The `/llm` pipeline is currently priced based on the maximum output tokens specified in the request, not actual usage, due to current payment system limitations. We're actively working to support usage-based pricing to better align with industry standards.
The LLM pricing landscape is highly competitive and rapidly evolving.
Orchestrators should set prices based on their infrastructure costs and
market positioning. As a reference, inference on `llama-3-8b-instruct` is currently around 0.08 USD per 1 million output tokens.
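For example, at that reference rate a request that sets `max_tokens` to 1,000 would be billed for 1,000 output tokens, roughly 0.00008 USD, even if the model generates far fewer tokens in practice.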
API Reference
Explore the `llm` endpoint and experiment with the API in the Livepeer AI API Reference.