Monitoring
Observability for Agentfield deployments
Set up Prometheus metrics, logs, and alerting
Agentfield exposes Prometheus metrics, structured logs, and health endpoints. This guide covers production observability setup.
Quick Start
The control plane exposes metrics at /metrics and logs to stdout. That's all you need to get started.
Essentials:
- Metrics: /metrics endpoint (Prometheus format)
- Health: /health endpoint (JSON)
- Logs: Structured JSON to stdout
- Traces: Execution DAGs in Web UI
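If you run the control plane with Docker Compose, the /health endpoint can double as a container health check. A minimal sketch; the service and image names are placeholders, and it assumes curl is available inside the image:

# docker-compose.yml (sketch; service and image names are placeholders)
services:
  control-plane:
    image: your-registry/agentfield-control-plane:latest   # hypothetical image reference
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]   # assumes curl exists in the image
      interval: 30s
      timeout: 5s
      retries: 3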
Prometheus Metrics
Exposed Metrics
The control plane exposes these metrics at http://control-plane:8080/metrics:
Execution metrics:
- agentfield_executions_total - Total executions (counter)
- agentfield_execution_duration_seconds - Execution latency (histogram)
- agentfield_execution_failures_total - Failed executions (counter)
Queue metrics:
- agentfield_queue_depth - Async queue size (gauge)
- agentfield_queue_workers - Active workers (gauge)
- agentfield_queue_throughput - Executions/minute (gauge)
Agent metrics:
- agentfield_active_agents - Registered agents (gauge)
- agentfield_agent_heartbeat_failures - Missed heartbeats (counter)
System metrics:
- agentfield_http_requests_total - HTTP traffic (counter)
- agentfield_http_request_duration_seconds - HTTP latency (histogram)
- agentfield_database_connections - Active DB connections (gauge)
Prometheus Setup
Add Prometheus to your stack:
# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

volumes:
  prometheus-data:

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'agentfield-control-plane'
    static_configs:
      - targets: ['control-plane:8080']
    metrics_path: '/metrics'

Access: http://localhost:9090
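If you also run Alertmanager and alert rules (covered under Alerting below), Prometheus needs to be told about both. A sketch of the additional prometheus.yml stanzas, assuming Alertmanager runs as a Compose service named alertmanager and the rules file is mounted alongside the config:

# prometheus.yml (additional stanzas; assumes the Alerting setup described below)
rule_files:
  - /etc/prometheus/prometheus-rules.yml   # mount the rules file from the Alerting section here

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']   # Compose service name assumed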
On Kubernetes, use a ServiceMonitor (requires the Prometheus Operator):
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: agentfield-cp
  namespace: agentfield
spec:
  selector:
    matchLabels:
      app: agentfield-control-plane
  endpoints:
    - port: http
      path: /metrics
      interval: 30s

Deploy: kubectl apply -f servicemonitor.yaml
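The ServiceMonitor only discovers Services whose labels match spec.selector and whose port name matches the endpoint. A sketch of a matching Service, assuming the control plane pods carry an app: agentfield-control-plane label and listen on 8080:

# service.yaml (sketch; pod labels are assumed)
apiVersion: v1
kind: Service
metadata:
  name: agentfield-control-plane
  namespace: agentfield
  labels:
    app: agentfield-control-plane   # matched by the ServiceMonitor's matchLabels
spec:
  selector:
    app: agentfield-control-plane   # assumed pod label
  ports:
    - name: http                    # must match the ServiceMonitor endpoint port name
      port: 8080
      targetPort: 8080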
Run Prometheus locally:
# prometheus.yml
scrape_configs:
  - job_name: 'agentfield'
    static_configs:
      - targets: ['localhost:8080']

# Run
prometheus --config.file=prometheus.yml

Access: http://localhost:9090
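If you reuse the same ratio and quantile expressions often (see the queries in the next section), Prometheus recording rules can precompute them. A sketch with illustrative rule names; load the file via rule_files in prometheus.yml:

# recording-rules.yml (sketch; rule names are illustrative)
groups:
  - name: agentfield-recording
    interval: 30s
    rules:
      - record: agentfield:execution_error_rate:rate5m
        expr: rate(agentfield_execution_failures_total[5m]) / rate(agentfield_executions_total[5m])
      - record: agentfield:execution_duration_seconds:p95_5m
        expr: histogram_quantile(0.95, rate(agentfield_execution_duration_seconds_bucket[5m]))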
Key Prometheus Queries
# Request rate
rate(agentfield_http_requests_total[5m])
# 95th percentile execution time
histogram_quantile(0.95, rate(agentfield_execution_duration_seconds_bucket[5m]))
# Queue depth over time
agentfield_queue_depth
# Active agents
agentfield_active_agents
# Error rate
rate(agentfield_execution_failures_total[5m]) / rate(agentfield_executions_total[5m])
# Database connection usage
agentfield_database_connections / agentfield_database_max_connections

Grafana Dashboards
Setup Grafana
# docker-compose.yml
services:
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  grafana-data:

Access: http://localhost:3000 (admin/admin)
# Using Helm
helm repo add grafana https://grafana.github.io/helm-charts
helm install grafana grafana/grafana -n monitoring

Configure Data Source
- Log in to Grafana
- Configuration → Data Sources → Add data source
- Select Prometheus
- URL: http://prometheus:9090 (Docker) or http://prometheus-server (K8s)
- Save & Test
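Instead of clicking through the UI, the data source can be provisioned from a file Grafana reads at startup (mounted under /etc/grafana/provisioning/datasources/). A sketch; the Loki entry only applies if you deploy Loki as described under Logging below:

# grafana/provisioning/datasources/datasources.yml (sketch)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090        # or http://prometheus-server on Kubernetes
    isDefault: true
  - name: Loki                         # optional; only if Loki is deployed (see Logging)
    type: loki
    access: proxy
    url: http://loki:3100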
Import Dashboard
Create a dashboard with these panels:
Panel 1: Request Rate
rate(agentfield_http_requests_total[5m])

Panel 2: Execution Latency (P50, P95, P99)
histogram_quantile(0.50, rate(agentfield_execution_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(agentfield_execution_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(agentfield_execution_duration_seconds_bucket[5m]))

Panel 3: Queue Depth
agentfield_queue_depth

Panel 4: Active Agents
agentfield_active_agents

Panel 5: Error Rate
rate(agentfield_execution_failures_total[5m])

Logging
Log Format
The control plane logs structured JSON to stdout:
{
  "level": "info",
  "msg": "Execution completed",
  "execution_id": "exec_abc123",
  "agent": "support-agent",
  "reasoner": "triage_ticket",
  "duration_ms": 1247,
  "status": "completed",
  "timestamp": "2025-01-15T10:23:45Z"
}

Agent logs (Python SDK):
{
  "level": "info",
  "msg": "Reasoner invoked",
  "node_id": "support-agent",
  "reasoner": "triage_ticket",
  "timestamp": "2025-01-15T10:23:45Z"
}

Log Aggregation
Use Docker's logging driver:
# docker-compose.yml
services:
  control-plane:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

Or ship to an external system (Loki, Elasticsearch):
services:
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/lib/docker/containers:/var/lib/docker/containers
      - ./promtail-config.yml:/etc/promtail/config.yml

On Kubernetes, logs go to stdout → kubelet → your log aggregator:
Option 1: Fluentd/Fluent Bit
# DaemonSet collects from all pods
kubectl apply -f https://raw.githubusercontent.com/fluent/fluent-bit-kubernetes-logging/master/output/elasticsearch/fluent-bit-ds.yaml

Option 2: Loki + Promtail
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack -n monitoring

Configure Grafana data source → Loki → Query logs by label:
{app="agentfield-control-plane"}

Managed platforms provide built-in log aggregation:
- Railway: Real-time logs in dashboard
- Render: Logs tab in service view
- Heroku: heroku logs --tail
- Fly.io: fly logs
Or export to an external system (Datadog, Loggly):
# Railway example
railway logs --json | datadog-agent

Useful Log Queries
Loki/Grafana:
# All errors
{app="agentfield-control-plane"} |= "error"
# Slow executions (>5s)
{app="agentfield-control-plane"} | json | duration_ms > 5000
# Specific agent errors
{app="agentfield-control-plane"} | json | agent="support-agent" | level="error"

Alerting
Prometheus Alertmanager
# docker-compose.yml
services:
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

# alertmanager.yml
route:
  receiver: 'slack'

receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
        title: 'Agentfield Alert'

Or install with Helm:

helm install alertmanager prometheus-community/alertmanager -n monitoring

Alert Rules
# prometheus-rules.yml
groups:
  - name: agentfield
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(agentfield_execution_failures_total[5m]) > 0.1
        for: 5m
        annotations:
          summary: "High execution error rate"

      - alert: QueueBacklog
        expr: agentfield_queue_depth > 1000
        for: 5m
        annotations:
          summary: "Queue depth is high"

      - alert: AgentDown
        expr: agentfield_active_agents == 0
        for: 2m
        annotations:
          summary: "No active agents"

      - alert: DatabaseConnectionsHigh
        expr: (agentfield_database_connections / agentfield_database_max_connections) > 0.8
        for: 5m
        annotations:
          summary: "Database connection pool nearing capacity"

Alert Channels
Slack:
slack_configs:
  - api_url: 'https://hooks.slack.com/services/...'
    channel: '#alerts'

PagerDuty:

pagerduty_configs:
  - service_key: 'your-integration-key'

Email:

email_configs:
  - to: 'ops@company.com'
    from: 'alerts@company.com'
    smarthost: 'smtp.gmail.com:587'

Health Checks
Control plane: http://control-plane:8080/health
{
  "status": "healthy",
  "database": "connected",
  "uptime_seconds": 1247,
  "version": "v1.0.0"
}

Agents: http://agent:8001/health

{
  "status": "healthy",
  "node_id": "support-agent",
  "control_plane": "connected",
  "last_heartbeat": "2025-01-15T10:23:45Z"
}

Use for:
- Load balancer health checks
- Kubernetes liveness/readiness probes (see the probe sketch below)
- Uptime monitoring (Pingdom, UptimeRobot)
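For the Kubernetes probes mentioned above, point both the liveness and readiness probes at /health. A sketch of the container spec fragment, assuming the control plane listens on port 8080 as shown earlier; probe timings are illustrative:

# deployment.yaml excerpt (sketch; probe timings are illustrative)
containers:
  - name: control-plane
    ports:
      - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 15
    readinessProbe:
      httpGet:
        path: /health
        port: 8080
      periodSeconds: 5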
Execution DAG Visualization
Built into the control plane Web UI at http://control-plane:8080:
- Navigate to Executions
- Click execution ID
- View DAG showing:
- Agent call chain
- Execution times
- Input/output for each step
- Errors (if any)
Example DAG:
support-agent.handle_ticket (1.2s)
├─> sentiment-agent.analyze (0.3s)
├─> resolution-bot.generate (0.7s)
└─> escalation-agent.alert (0.2s)

Production Monitoring Checklist
- Prometheus scraping control plane /metrics
- Grafana dashboard with key panels
- Log aggregation configured (Loki, ELK, Datadog)
- Alerts for high error rate, queue backlog, agent down
- Health check monitoring (uptime service)
- Database metrics monitored (connections, slow queries)
- Retention policies set (logs, metrics, executions; see the retention sketch below)
- On-call rotation configured for critical alerts
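For the retention item above, Prometheus retention is usually set with a startup flag. A sketch extending the Compose command from the Prometheus Setup section; 15d is an arbitrary example, and log/execution retention is configured in their respective systems:

# docker-compose.yml excerpt (sketch)
services:
  prometheus:
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'   # keep 15 days of metric data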
Related Documentation
- Deployment Overview - Architecture and patterns
- Kubernetes Deployment - K8s monitoring setup
- Environment Variables - Log level configuration
- CLI Reference - Control plane metrics config
Monitor everything. Agentfield's structured logs and Prometheus metrics give you complete visibility into your agent systems.