The `llm` pipeline provides an OpenAI-compatible interface for text generation, designed to integrate seamlessly into media workflows.
The `llm` pipeline supports any Hugging Face-compatible LLM model. Since models evolve quickly, the set of warm (preloaded) models on Orchestrators changes regularly.
To see which models are currently available, check the Network Capabilities dashboard or ask in the `#ai-research` channel in the Livepeer Discord Server.

For more information on how to use the `llm` endpoint and to experiment with the API, see the Livepeer AI API Reference.
To generate text with the `llm` pipeline, send a POST request to the Gateway’s `llm` API endpoint:
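A minimal sketch of such a request, assuming the Gateway serves the pipeline at the `/llm` path; the model id and prompt are examples only:

```bash
curl -X POST "https://<GATEWAY_IP>/llm" \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      { "role": "user", "content": "Write a two-line poem about video streaming." }
    ]
  }'
```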
In this request:

- `<GATEWAY_IP>` should be replaced with your AI Gateway’s IP address.
- `<TOKEN>` should be replaced with your API token, if required by the AI Gateway.
- `model` is the LLM model to use for generation.
- `messages` is the conversation or prompt input for the model.

For additional optional parameters such as `temperature`, `max_tokens`, or `stream`, refer to the Livepeer AI API Reference.
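For illustration, a request body combining these optional fields might look like the following; the parameter names follow the OpenAI chat-completions convention, and all values here are examples:

```json
{
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "messages": [
    { "role": "system", "content": "You are a concise assistant." },
    { "role": "user", "content": "Summarize what a transcoding pipeline does." }
  ],
  "temperature": 0.7,
  "max_tokens": 256,
  "stream": false
}
```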
After execution, the Orchestrator processes the request and returns the response to the Gateway, which forwards it back to the caller. A partial example of a non-streaming response is shown below:
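The exact payload depends on the model and Gateway version; this is a sketch of the OpenAI-compatible response shape, with illustrative values:

```json
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1723400000,
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "A transcoding pipeline converts incoming video into multiple renditions..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 48,
    "total_tokens": 72
  }
}
```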
"stream": true
in the
request body.
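Given the endpoint’s OpenAI compatibility, the stream can be expected to arrive as server-sent events in the OpenAI chunk format, roughly like the sketch below (illustrative values):

```
data: {"choices":[{"index":0,"delta":{"role":"assistant","content":"A"}}]}

data: {"choices":[{"index":0,"delta":{"content":" transcoding"}}]}

data: [DONE]
```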
To configure your Orchestrator to serve the `llm` pipeline, refer to the Orchestrator Configuration guide.
The `llm` pipeline supports several environment variables that can be adjusted to optimize performance based on your hardware and workload. These are particularly helpful for managing memory usage and parallelism when running large models.
| Variable | Description | Default |
| --- | --- | --- |
| `USE_8BIT` | Enables 8-bit quantization with `bitsandbytes` for lower memory usage. Set to `true` to enable. | `false` |
| `PIPELINE_PARALLEL_SIZE` | Number of pipeline-parallel stages across GPUs. | `1` |
| `TENSOR_PARALLEL_SIZE` | Number of tensor-parallel units per pipeline stage. | `1` |
| `MAX_MODEL_LEN` | Maximum number of tokens per input sequence. | `8192` |
| `MAX_NUM_BATCHED_TOKENS` | Maximum number of tokens processed in a single batch. Should be greater than or equal to `MAX_MODEL_LEN`. | `8192` |
| `MAX_NUM_SEQS` | Maximum number of sequences processed per batch. | `128` |
| `GPU_MEMORY_UTILIZATION` | Fraction of GPU memory to use, between `0` and `1`. Higher values make fuller use of GPU memory. | `0.85` |
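As an example, in a Docker-based deployment these variables could be passed to the runner container. The image name and values below are illustrative, not prescriptive:

```bash
# Illustrative only: adjust the image name/tag and values to your deployment.
docker run --gpus all \
  -e USE_8BIT=false \
  -e PIPELINE_PARALLEL_SIZE=1 \
  -e TENSOR_PARALLEL_SIZE=2 \
  -e MAX_MODEL_LEN=8192 \
  -e MAX_NUM_BATCHED_TOKENS=8192 \
  -e MAX_NUM_SEQS=128 \
  -e GPU_MEMORY_UTILIZATION=0.90 \
  livepeer/ai-runner:llm
```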
The `/llm` pipeline is currently priced based on the maximum output tokens specified in the request, not actual usage, due to current payment system limitations. We’re actively working to support usage-based pricing to better align with industry standards.
The LLM pricing landscape is highly competitive and rapidly evolving. Orchestrators should set prices based on their infrastructure costs and market positioning. As a reference, inference on `llama-3-8b-instruct` currently costs around 0.08 USD per 1 million output tokens. At that rate, a request specifying `max_tokens` of 1,000 would be charged roughly 0.00008 USD, regardless of how many tokens are actually generated.
Explore the `llm` endpoint and experiment with the API in the Livepeer AI API Reference.