Use this page to instrument the node, collect the metrics that matter, and wire up alerts before failures turn into missed jobs or missed rewards. Logs explain individual incidents; metrics show whether throughput, latency, capacity, and ticket flow are moving in the right direction.

Enabling node metrics

go-livepeer exposes a standard Prometheus endpoint when you pass the -monitor flag at startup.
Enable monitoring flags
livepeer \
  -orchestrator \
  -transcoder \
  -monitor \
  -metricsPerStream \
  -network arbitrum-one-mainnet \
  # ... other flags
Once the node is running, the metrics endpoint is available at http://localhost:7935/metrics. This is the same port as the go-livepeer CLI (7935 by default); the -monitor flag activates the /metrics path on that port, and -metricsPerStream additionally labels metrics per stream.
Split orchestrator and transcoder setups should pass -monitor on both processes when both sides need to be scraped. Each process exposes its own /metrics endpoint on its respective CLI port.
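As a sketch of a split setup (the host and secret values are placeholders; -cliAddr sets each process's CLI/metrics port):

```shell
# Orchestrator process: metrics at http://127.0.0.1:7935/metrics
livepeer -orchestrator -monitor -cliAddr 127.0.0.1:7935 # ... other flags

# Standalone transcoder process: metrics at http://127.0.0.1:7936/metrics
livepeer -transcoder -monitor -cliAddr 127.0.0.1:7936 \
  -orchAddr <orchestrator-host>:8935 -orchSecret <shared-secret>
```

Giving each process a distinct -cliAddr port keeps the two /metrics endpoints from colliding when both run on one machine.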

What metrics are exposed

go-livepeer exposes metrics across multiple categories. The full reference is at Prometheus Metrics Reference. The metrics you will actually act on:

Session metrics

Current active sessions, max session capacity, sessions per GPU. These tell you whether your node is at capacity or idle.

Segment metrics

Segments received, transcoded, and failed. Success rate over time is your core transcoding health signal.

Ticket metrics

Winning tickets received and redeemed. A gap between the two indicates ETH balance or redemption issues.

Latency metrics

Processing time per segment. High latency means your GPU is saturated or a pipeline is slow — both affect gateway scoring.
Other available metrics:
  • GPU utilisation (where reported by the NVIDIA driver)
  • ETH balance and pending fees
  • Round number and reward call status
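A few example PromQL expressions built on these categories. The metric names below are illustrative placeholders; substitute the exact names from your node's /metrics output or the Prometheus Metrics Reference before using them:

```promql
# Segment success rate over the last 5 minutes (core health signal)
sum(rate(livepeer_segments_transcoded_total[5m]))
  / sum(rate(livepeer_segments_received_total[5m]))

# Session headroom: how far the node is from capacity
livepeer_max_sessions - livepeer_current_sessions_total

# Unredeemed winning tickets (a growing gap signals redemption issues)
livepeer_winning_tickets_received_total - livepeer_winning_tickets_redeemed_total
```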

Option A: Docker monitoring stack (fastest setup)

Livepeer maintains a Docker image that bundles Prometheus, Grafana, and starter dashboard templates. This is the quickest path from zero to a working dashboard.
Run the Docker monitoring stack
# Pull the image
docker pull livepeer/monitoring

# Run against a single local node
docker run --net=host \
  --env LP_MODE=standalone \
  --env LP_NODES=localhost:7935 \
  livepeer/monitoring:latest
Then open Grafana at http://localhost:3000 (default credentials: admin / admin).

To monitor multiple nodes, list each one in LP_NODES:
Run the monitoring stack for multiple nodes
docker run --net=host \
  --env LP_MODE=standalone \
  --env LP_NODES=node1.yourdomain.com:7935,node2.yourdomain.com:7935 \
  livepeer/monitoring:latest
The full list of environment variable options is documented in the livepeer/livepeer-monitoring repository. For Kubernetes deployments, pods with the prometheus.io/scrape label are discovered automatically without specifying LP_NODES.
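For reference, a pod-template fragment carrying the discovery hint might look like the following; check prometheus.yml in the livepeer/livepeer-monitoring repository for the exact label or annotation keys its service discovery matches on:

```yaml
# Illustrative Deployment pod-template fragment
metadata:
  labels:
    prometheus.io/scrape: "true"
```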

livepeer/livepeer-monitoring

Source, Dockerfile, prometheus.yml, and Grafana dashboard templates.

Option B: Custom Prometheus and Grafana

Operators already running a monitoring stack should add go-livepeer as a scrape target:
Prometheus scrape target
# prometheus.yml
scrape_configs:
  - job_name: 'livepeer-orchestrator'
    static_configs:
      - targets: ['localhost:7935']
    scrape_interval: 15s
    metrics_path: /metrics
Reload Prometheus after editing: kill -HUP <prometheus-pid> or use the reload API at http://localhost:9090/-/reload. Useful Grafana panels to build from these metrics include active sessions against max capacity, segment success rate, winning tickets received versus redeemed, and per-segment latency.
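Before building panels, it can help to sanity-check what the scrape target actually returns. A minimal Python sketch that parses the Prometheus text exposition format (the sample metric names are illustrative, not exact go-livepeer names):

```python
def parse_metrics(text):
    """Parse Prometheus text exposition into {metric_name: value}, ignoring labels."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_part, _, value = line.rpartition(" ")
        metrics[name_part.split("{", 1)[0]] = float(value)
    return metrics

# Illustrative sample -- real names come from your node's /metrics output
sample = """\
# HELP livepeer_current_sessions_total Currently active sessions
livepeer_current_sessions_total 3
livepeer_max_sessions 10
"""
m = parse_metrics(sample)
print("utilisation:", m["livepeer_current_sessions_total"] / m["livepeer_max_sessions"])
```

Pipe the output of curl http://localhost:7935/metrics into a script like this to confirm the gauges you plan to chart are present and moving.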

Monitoring AI runner containers

AI inference workloads run in the ai-runner Docker container alongside go-livepeer. Monitoring the container separately from the node gives you a faster signal for AI-specific issues. Check container health:
Check AI runner health
# List running containers and their status
docker ps --filter name=livepeer-ai-runner

# Follow container logs live
docker logs -f livepeer-ai-runner

# Check resource usage
docker stats livepeer-ai-runner
Watch the container logs for startup and model-loading errors. After starting your AI runner and go-livepeer, verify that your pipelines are registered and visible to the network:
Query registered capabilities
# Query your node's registered capabilities
curl http://localhost:7935/getNetworkCapabilities | jq
Also check the community tool at tools.livepeer.cloud/ai/network-capabilities — this shows which AI-capable orchestrators are visible network-wide and which models are warm.
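If you script this check, a small parser over the capabilities JSON can flag missing or cold models. The JSON shape below is an assumption for illustration; inspect your node's actual response and adjust the keys:

```python
import json

def warm_models(caps):
    """Return (pipeline, model) pairs marked warm.

    NOTE: the document shape here is assumed for illustration;
    adjust the keys to match your node's real capabilities output.
    """
    pairs = []
    for pipeline, models in caps.get("pipelines", {}).items():
        for model, info in models.items():
            if info.get("warm"):
                pairs.append((pipeline, model))
    return pairs

sample = json.loads(
    '{"pipelines": {"text-to-image": '
    '{"model-a": {"warm": true}, "model-b": {"warm": false}}}}'
)
print(warm_models(sample))  # [('text-to-image', 'model-a')]
```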

Log capture and verbose output

By default, livepeer writes all logs to the console (stdout/stderr) only. For long-running production nodes, capture them to a file:
Capture logs with tee
# Pipe to both terminal and a log file
livepeer \
  -orchestrator \
  -transcoder \
  -monitor \
  ... \
  2>&1 | tee /var/log/livepeer/livepeer.log
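A file captured this way grows without bound, so pair it with log rotation. A logrotate sketch (the /etc/logrotate.d/livepeer path is an assumed convention; copytruncate matters because tee keeps the file handle open):

```text
# /etc/logrotate.d/livepeer
/var/log/livepeer/livepeer.log {
    daily
    rotate 14
    compress
    missingok
    notifempty
    copytruncate
}
```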
Verbose logging for debugging:
Enable verbose logging
# -v 6 adds detailed transcoding activity logs
livepeer \
  -orchestrator \
  -transcoder \
  -v 6 \
  ...
With -v 6 you will see individual segment reception and transcoding activity, which is the fastest way to confirm your node is receiving and processing work without a full Prometheus setup. Useful log search patterns:
Search logs for key events
# Check for reward calls
grep -i "reward" /var/log/livepeer/livepeer.log

# Check for transcoding sessions
grep -i "transcode\|session" /var/log/livepeer/livepeer.log

# Check for errors only
grep -i "error\|fail\|crit" /var/log/livepeer/livepeer.log

# Check for incoming jobs
grep -i "received\|segment" /var/log/livepeer/livepeer.log
For systemd-managed services, use journalctl instead:
Inspect systemd logs
journalctl -u livepeer -f               # follow live
journalctl -u livepeer --since "1 hour ago"   # recent
journalctl -u livepeer -p err           # errors only
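If the service unit does not exist yet, a minimal sketch is below; the binary path, user, and flags are placeholders to adjust for your setup:

```ini
# /etc/systemd/system/livepeer.service
[Unit]
Description=Livepeer orchestrator
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/livepeer -orchestrator -transcoder -monitor
Restart=on-failure
User=livepeer

[Install]
WantedBy=multi-user.target
```

With this in place, logs flow to the journal and the journalctl commands above apply directly.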

Alerting

Prometheus users should add alerting rules for the most impactful failure modes:
Prometheus alert rules
# prometheus-alerts.yml — add to your Prometheus config
groups:
  - name: livepeer-orchestrator
    rules:

      - alert: OrchestratorAtCapacity
        expr: "livepeer_current_sessions_total >= livepeer_max_sessions"
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Orchestrator at session capacity for 5+ minutes"

      - alert: LowETHBalance
        expr: "livepeer_eth_balance < 0.02"
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ETH balance below 0.02 — reward calls may fail"

      - alert: TranscodeFailureRate
        expr: "rate(livepeer_transcode_failed_total[10m]) > 0.1"
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Transcode failure rate above 10% over 10 minutes"
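Before reloading Prometheus, validate the rule file with promtool, which ships with Prometheus:

```shell
promtool check rules prometheus-alerts.yml
```

This catches YAML and PromQL syntax errors that would otherwise silently prevent the rules from loading.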
Last modified on March 16, 2026