Monitoring

Prometheus Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| drop_images_cached_total | Counter | image, node | Total images successfully cached |
| drop_pull_duration_seconds | Histogram | image | Duration of pull operations |
| drop_pull_errors_total | Counter | image, node | Total failed pull attempts |
| drop_discovery_images_found | Gauge | policy, source_type | Images found per discovery source |
| drop_active_pulls | Gauge | | Currently active pull Pods |
| drop_reconcile_total | Counter | controller, result | Reconciliation attempts |

Enable ServiceMonitor

helm install drop oci://ghcr.io/breee/charts/drop \
  --set serviceMonitor.enabled=true

Example Queries

# Rate of successfully cached images (per second, averaged over 1h)
rate(drop_images_cached_total[1h])

# p95 pull duration
histogram_quantile(0.95, rate(drop_pull_duration_seconds_bucket[1h]))

# Error rate by image (summed across nodes)
sum by (image) (rate(drop_pull_errors_total[1h]))

# Active pulls right now
drop_active_pulls
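
To make the p95 query above easier to interpret, here is a minimal Python sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation. It is a simplified model of what `histogram_quantile` does for classic histograms, not Prometheus's actual implementation, and the bucket boundaries used in any call are hypothetical since the real buckets of `drop_pull_duration_seconds` are not listed here.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified model: find the bucket containing the target rank and
    interpolate linearly between its lower and upper bound.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    # At least one finite bucket plus the +Inf bucket is required.
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # Rank lands in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            if count == prev_count:
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan
```

Because the estimate is interpolated inside a bucket, its precision depends on how finely the buckets are spaced around the quantile of interest.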

Kubernetes Events

| Reason | Type | Description |
| --- | --- | --- |
| PullStarted | Normal | Image pull Pod created on a node |
| PullSucceeded | Normal | Image successfully cached on a node |
| PullFailed | Warning | Image pull failed on a node |

kubectl get events --field-selector involvedObject.kind=CachedImage
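
As a rough illustration of working with these events in a script, the Python sketch below filters the JSON produced by adding `-o json` to the command above and keeps only `PullFailed` warnings. The function name `failed_pulls` is hypothetical; the fields it reads (`items`, `type`, `reason`, `message`, `involvedObject.name`) are standard Kubernetes Event fields.

```python
import json


def failed_pulls(events_json):
    """Return (object name, message) pairs for each PullFailed Warning event.

    `events_json` is the text printed by `kubectl get events ... -o json`.
    """
    items = json.loads(events_json).get("items", [])
    return [
        (event.get("involvedObject", {}).get("name", ""), event.get("message", ""))
        for event in items
        if event.get("type") == "Warning" and event.get("reason") == "PullFailed"
    ]
```

One way to use it is to capture the kubectl output in a variable or file and pass the text to `failed_pulls`, then print or alert on the returned pairs.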

Status Conditions

All resources use metav1.Condition with type Ready:

status:
  conditions:
    - type: Ready
      status: "True"
      reason: Cached
      message: "Image cached on all 5 target nodes"
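
A minimal sketch of how a script might evaluate this condition, assuming the resource has already been loaded into a Python dict (for example from `kubectl get ... -o json`). The helper names `ready_condition` and `is_ready` are hypothetical and not part of the project.

```python
def ready_condition(resource):
    """Return the condition dict with type Ready, or None if it is absent."""
    for cond in resource.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond
    return None


def is_ready(resource):
    """True only when a Ready condition exists and its status is the string "True"."""
    cond = ready_condition(resource)
    return cond is not None and cond.get("status") == "True"
```

Note that `status` in a `metav1.Condition` is a string (`"True"`, `"False"`, or `"Unknown"`), so the comparison is against `"True"` rather than a boolean.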

Health Endpoints

| Endpoint | Port | Description |
| --- | --- | --- |
| /healthz | 8081 | Liveness probe |
| /readyz | 8081 | Readiness probe |
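
For checking these endpoints by hand, the following Python sketch treats an HTTP 200 response as healthy and anything else, including a refused connection, as unhealthy. It assumes port 8081 has been made reachable locally (for example through a port-forward); the `localhost` host in the usage note and the `probe` helper itself are assumptions for illustration only.

```python
import urllib.request


def probe(url, timeout=2.0, opener=urllib.request.urlopen):
    """Return True if the endpoint answers with HTTP 200, False otherwise.

    `opener` is injectable so the function can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers URLError/HTTPError (non-2xx responses) and connection failures.
        return False
```

With the port forwarded, `probe("http://localhost:8081/healthz")` and `probe("http://localhost:8081/readyz")` would report liveness and readiness respectively.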