Monitoring

Observability for Agentfield deployments: set up Prometheus metrics, logs, and alerting.

Agentfield exposes Prometheus metrics, structured logs, and health endpoints. This guide covers production observability setup.


Quick Start

The control plane exposes metrics at /metrics and writes structured logs to stdout. That's all you need to get started.

Essentials:

  • Metrics: /metrics endpoint (Prometheus format)
  • Health: /health endpoint (JSON)
  • Logs: Structured JSON to stdout
  • Traces: Execution DAGs in Web UI
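
Verify both endpoints with curl (assuming the control plane is reachable at localhost:8080; adjust the host for your deployment):

# Prometheus exposition format
curl -s http://localhost:8080/metrics | head

# Health status as JSON
curl -s http://localhost:8080/health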

Prometheus Metrics

Exposed Metrics

Control plane exposes these metrics at http://control-plane:8080/metrics:

Execution metrics:

  • agentfield_executions_total - Total executions (counter)
  • agentfield_execution_duration_seconds - Execution latency (histogram)
  • agentfield_execution_failures_total - Failed executions (counter)

Queue metrics:

  • agentfield_queue_depth - Async queue size (gauge)
  • agentfield_queue_workers - Active workers (gauge)
  • agentfield_queue_throughput - Executions/minute (gauge)

Agent metrics:

  • agentfield_active_agents - Registered agents (gauge)
  • agentfield_agent_heartbeat_failures - Missed heartbeats (counter)

System metrics:

  • agentfield_http_requests_total - HTTP traffic (counter)
  • agentfield_http_request_duration_seconds - HTTP latency (histogram)
  • agentfield_database_connections - Active DB connections (gauge)
  • agentfield_database_max_connections - Connection pool size (gauge)
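
A scrape returns standard Prometheus exposition format. Illustrative output (metric names from the list above; the values are made up):

# TYPE agentfield_executions_total counter
agentfield_executions_total 41237

# TYPE agentfield_queue_depth gauge
agentfield_queue_depth 12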

Prometheus Setup

Add Prometheus to your stack:

# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

volumes:
  prometheus-data:

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'agentfield-control-plane'
    static_configs:
      - targets: ['control-plane:8080']
    metrics_path: '/metrics'

Access: http://localhost:9090

On Kubernetes, use a ServiceMonitor (requires the Prometheus Operator):

# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: agentfield-cp
  namespace: agentfield
spec:
  selector:
    matchLabels:
      app: agentfield-control-plane
  endpoints:
  - port: http
    path: /metrics
    interval: 30s

Deploy: kubectl apply -f servicemonitor.yaml
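
Verify the resource was created, then confirm the target shows as UP under Status → Targets in the Prometheus UI:

kubectl get servicemonitor -n agentfield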

For local development, run Prometheus directly (no Docker needed):

# prometheus.yml
scrape_configs:
  - job_name: 'agentfield'
    static_configs:
      - targets: ['localhost:8080']

# Run
prometheus --config.file=prometheus.yml

Access: http://localhost:9090

Key Prometheus Queries

# Request rate
rate(agentfield_http_requests_total[5m])

# 95th percentile execution time
histogram_quantile(0.95, rate(agentfield_execution_duration_seconds_bucket[5m]))

# Queue depth over time
agentfield_queue_depth

# Active agents
agentfield_active_agents

# Error rate
rate(agentfield_execution_failures_total[5m]) / rate(agentfield_executions_total[5m])

# Database connection usage
agentfield_database_connections / agentfield_database_max_connections
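
If you chart the error ratio often, precompute it with a Prometheus recording rule (a sketch; the rule name is our choice):

# prometheus-rules.yml
groups:
  - name: agentfield-recording
    rules:
      - record: agentfield:execution_error_ratio:rate5m
        expr: rate(agentfield_execution_failures_total[5m]) / rate(agentfield_executions_total[5m])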

Grafana Dashboards

Setup Grafana

# docker-compose.yml
services:
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  grafana-data:

Access: http://localhost:3000 (admin/admin)

On Kubernetes, install with Helm:

helm repo add grafana https://grafana.github.io/helm-charts
helm install grafana grafana/grafana -n monitoring

Configure Data Source

  1. Login to Grafana
  2. Configuration → Data Sources → Add data source
  3. Select Prometheus
  4. URL: http://prometheus:9090 (Docker) or http://prometheus-server (K8s)
  5. Save & Test
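
Alternatively, provision the data source declaratively by mounting a file under /etc/grafana/provisioning/datasources (a minimal sketch):

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true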

Create a Dashboard

Create a dashboard with these panels:

Panel 1: Request Rate

rate(agentfield_http_requests_total[5m])

Panel 2: Execution Latency (P50, P95, P99)

histogram_quantile(0.50, rate(agentfield_execution_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(agentfield_execution_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(agentfield_execution_duration_seconds_bucket[5m]))

Panel 3: Queue Depth

agentfield_queue_depth

Panel 4: Active Agents

agentfield_active_agents

Panel 5: Error Rate

rate(agentfield_execution_failures_total[5m])
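
To show failures as a share of all executions instead, reuse the error-rate ratio from the queries above:

rate(agentfield_execution_failures_total[5m]) / rate(agentfield_executions_total[5m])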

Logging

Log Format

Control plane logs structured JSON to stdout:

{
  "level": "info",
  "msg": "Execution completed",
  "execution_id": "exec_abc123",
  "agent": "support-agent",
  "reasoner": "triage_ticket",
  "duration_ms": 1247,
  "status": "completed",
  "timestamp": "2025-01-15T10:23:45Z"
}

Agent logs (Python SDK):

{
  "level": "info",
  "msg": "Reasoner invoked",
  "node_id": "support-agent",
  "reasoner": "triage_ticket",
  "timestamp": "2025-01-15T10:23:45Z"
}
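
Because both log as JSON, you can filter locally with jq (assuming the container is named control-plane):

docker logs control-plane 2>&1 | jq 'select(.level == "error")'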

Log Aggregation

Use Docker's logging driver:

# docker-compose.yml
services:
  control-plane:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

Or ship to external system (Loki, Elasticsearch):

services:
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/lib/docker/containers:/var/lib/docker/containers
      - ./promtail-config.yml:/etc/promtail/config.yml
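
The referenced promtail-config.yml tells Promtail where to read logs and where to push them (a minimal sketch; label names are our choice):

# promtail-config.yml
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: docker
    static_configs:
      - targets: [localhost]
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*.log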

On Kubernetes, logs flow from stdout through the kubelet to your log aggregator:

Option 1: Fluentd/Fluent Bit

# DaemonSet collects from all pods
kubectl apply -f https://raw.githubusercontent.com/fluent/fluent-bit-kubernetes-logging/master/output/elasticsearch/fluent-bit-ds.yaml

Option 2: Loki + Promtail

helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack -n monitoring

Configure Grafana data source → Loki → Query logs by label:

{app="agentfield-control-plane"}

Managed platforms provide built-in log aggregation:

  • Railway: Real-time logs in dashboard
  • Render: Logs tab in service view
  • Heroku: heroku logs --tail
  • Fly.io: fly logs

Or export to an external system (Datadog, Loggly):

# Railway example: stream JSON logs into your log forwarder
railway logs --json | your-log-forwarder

Useful Log Queries

Loki/Grafana:

# All errors
{app="agentfield-control-plane"} |= "error"

# Slow executions (>5s)
{app="agentfield-control-plane"} | json | duration_ms > 5000

# Specific agent errors
{app="agentfield-control-plane"} | json | agent="support-agent" | level="error"

Alerting

Prometheus Alertmanager

# docker-compose.yml
services:
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

# alertmanager.yml
route:
  receiver: 'slack'
receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
        title: 'Agentfield Alert'

On Kubernetes, install with Helm:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install alertmanager prometheus-community/alertmanager -n monitoring

Alert Rules

# prometheus-rules.yml
groups:
  - name: agentfield
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(agentfield_execution_failures_total[5m]) > 0.1
        for: 5m
        annotations:
          summary: "Execution failure rate above 0.1/s for 5 minutes"

      - alert: QueueBacklog
        expr: agentfield_queue_depth > 1000
        for: 5m
        annotations:
          summary: "Queue depth is high"

      - alert: AgentDown
        expr: agentfield_active_agents == 0
        for: 2m
        annotations:
          summary: "No active agents"

      - alert: DatabaseConnectionsHigh
        expr: (agentfield_database_connections / agentfield_database_max_connections) > 0.8
        for: 5m
        annotations:
          summary: "Database connection pool nearing capacity"

Alert Channels

Slack:

slack_configs:
  - api_url: 'https://hooks.slack.com/services/...'
    channel: '#alerts'

PagerDuty:

pagerduty_configs:
  - service_key: 'your-integration-key'

Email:

email_configs:
  - to: 'ops@company.com'
    from: 'alerts@company.com'
    smarthost: 'smtp.gmail.com:587'
    auth_username: 'alerts@company.com'  # most SMTP relays require auth
    auth_password: 'your-app-password'

Health Checks

Control plane: http://control-plane:8080/health

{
  "status": "healthy",
  "database": "connected",
  "uptime_seconds": 1247,
  "version": "v1.0.0"
}

Agents: http://agent:8001/health

{
  "status": "healthy",
  "node_id": "support-agent",
  "control_plane": "connected",
  "last_heartbeat": "2025-01-15T10:23:45Z"
}
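
A scripted check (jq -e sets the exit code based on the result, so this works in cron or CI):

curl -fsS http://agent:8001/health | jq -e '.status == "healthy"'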

Use for:

  • Load balancer health checks
  • Kubernetes liveness/readiness probes
  • Uptime monitoring (Pingdom, UptimeRobot)
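
Example Kubernetes probes (a sketch assuming the control plane container serves /health on port 8080):

# In the container spec
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10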

Execution DAG Visualization

Built into control plane Web UI at http://control-plane:8080:

  1. Navigate to Executions
  2. Click execution ID
  3. View DAG showing:
    • Agent call chain
    • Execution times
    • Input/output for each step
    • Errors (if any)

Example DAG:

support-agent.handle_ticket (1.2s)
  ├─> sentiment-agent.analyze (0.3s)
  ├─> resolution-bot.generate (0.7s)
  └─> escalation-agent.alert (0.2s)

Production Monitoring Checklist

  • Prometheus scraping control plane /metrics
  • Grafana dashboard with key panels
  • Log aggregation configured (Loki, ELK, Datadog)
  • Alerts for high error rate, queue backlog, agent down
  • Health check monitoring (uptime service)
  • Database metrics monitored (connections, slow queries)
  • Retention policies set (logs, metrics, executions)
  • On-call rotation configured for critical alerts


Monitor everything. Agentfield's structured logs, Prometheus metrics, and execution DAGs give you end-to-end visibility into your agent systems.