Image-to-Text
Overview
The image-to-text pipeline converts images into text captions. It is powered by the latest models in the HuggingFace image-to-text pipeline.
Models
Warm Models
The current warm model requested for the image-to-text pipeline is:
For faster responses with a different image-to-text model, ask Orchestrators to load it on their GPUs via the ai-video channel in the Livepeer Discord server.
On-Demand Models
The following models have been tested and verified for the image-to-text pipeline:
If a specific model you wish to use is not listed, please submit a feature request on GitHub to get the model verified and added to the list.
Basic Usage Instructions
For a detailed understanding of the image-to-text endpoint and to experiment with the API, see the Livepeer AI API Reference.
To create an image caption using the image-to-text pipeline, submit a POST request to the Gateway’s image-to-text API endpoint:
In this command:
- <GATEWAY_IP> should be replaced with your AI Gateway’s IP address.
- model_id is the image-to-text model to use.
- image is the path to the image file to be captioned.
For additional optional parameters, refer to the Livepeer AI API Reference.
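The request described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the documented command: the https scheme, the /image-to-text path, and the multipart form encoding are inferred from the parameter descriptions, and build_caption_request is a hypothetical helper name. Check the Livepeer AI API Reference for the authoritative request shape.

```python
import urllib.request
import uuid


def build_caption_request(gateway_ip, model_id, image_bytes, filename="image.png"):
    """Build (but do not send) a multipart POST request for image captioning.

    Assumptions: the endpoint lives at https://<GATEWAY_IP>/image-to-text and
    accepts the form fields "model_id" and "image" as multipart/form-data.
    """
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Text field carrying the model identifier.
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="model_id"\r\n\r\n',
        model_id.encode() + b"\r\n",
        # File field carrying the raw image bytes.
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="image"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        image_bytes + b"\r\n",
        # Closing boundary.
        f"--{boundary}--\r\n".encode(),
    ])
    url = f"https://{gateway_ip}/image-to-text"  # assumed scheme and path
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

To send the request, read the image bytes from disk (for example with pathlib.Path(path).read_bytes()), pass them to build_caption_request together with your Gateway's IP address and a verified model ID, and hand the result to urllib.request.urlopen. The response format is described in the API reference and is not assumed here.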
Orchestrator Configuration
To configure your Orchestrator to serve the image-to-text pipeline, refer to the Orchestrator Configuration guide.
System Requirements
The following system requirements are recommended for optimal performance:
- NVIDIA GPU with at least 4GB of VRAM.
Recommended Pipeline Pricing
We are planning to simplify the pricing in the future so orchestrators can set one AI price per compute unit and have the system automatically scale based on the model’s compute requirements.
The pricing for the image-to-text pipeline is based on competitor pricing. However, we strongly encourage orchestrators to set their own pricing based on their costs and requirements. Setting a competitive price will help attract more jobs, as Gateways can set their maximum price for a job. The current recommended pricing for this pipeline is 2.5e-10 USD per input pixel (height * width).
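As a rough illustration of the per-pixel pricing, the following sketch estimates the price of a single job from the input image dimensions. The function name estimate_job_price_usd is hypothetical, and the calculation assumes the price is simply height * width multiplied by the per-pixel rate, with no other fees.

```python
# Recommended rate from this page: 2.5e-10 USD per input pixel.
RECOMMENDED_PRICE_PER_PIXEL_USD = 2.5e-10


def estimate_job_price_usd(height, width, price_per_pixel=RECOMMENDED_PRICE_PER_PIXEL_USD):
    """Estimate the price of one captioning job as height * width * rate."""
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    return height * width * price_per_pixel


# A 1024 x 1024 input has 1,048,576 pixels, so the estimate is roughly 0.00026 USD.
price = estimate_job_price_usd(1024, 1024)
```

An orchestrator that sets a different rate can pass it as price_per_pixel to compare against the recommended value.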
API Reference
Explore the image-to-text endpoint and experiment with the API in the Livepeer AI API Reference.