// AI AGENTS
SRE Agent Swarm
Autonomous incident response platform
Python, Go, Node.js, NATS JetStream, FastAPI, React, Prometheus, Loki, Tempo, ChromaDB, Docker
Overview
SRE Agent Swarm is a self-healing control plane that watches a fleet of microservices and resolves common incidents without paging humans. Six specialised agents coordinate over NATS JetStream, and each owns one stage of the incident lifecycle.
The six agents
- Observer — ingests metrics from Prometheus, logs from Loki, and traces from Tempo. Detects anomalies and emits incident events.
- Diagnoser — runs LLM-assisted root cause analysis with a Gemini → OpenAI fallback chain. Cites relevant traces and log lines in its hypotheses.
- Remediator — selects a YAML runbook and executes it inside a sandboxed environment.
- Safety — gates every action against allowlists, rate limits, and blast-radius checks. Routes risky actions to humans.
- Orchestrator — coordinates the workflow, manages state machines, and persists incident timelines.
- Learner — runs retrieval-augmented generation (RAG) over a ChromaDB store of historical incidents, surfacing the runbooks and post-mortems from similar past incidents.
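To make the Safety agent's role concrete, here is a minimal sketch of how the three checks (allowlist, rate limit, blast radius) could combine into one verdict. It is an illustration under stated assumptions, not the platform's actual code: the names `Action`, `SafetyGate`, and `evaluate`, the three verdict strings, the default thresholds, and the idea of measuring blast radius as a count of targeted services are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
import time


@dataclass(frozen=True)
class Action:
    """A remediation step proposed by the Remediator (hypothetical shape)."""
    name: str       # e.g. "restart_pod"
    targets: tuple  # services the action would touch


@dataclass
class SafetyGate:
    """Illustrative gate: allowlist, sliding-window rate limit, blast radius."""
    allowlist: frozenset
    max_actions_per_window: int = 3   # assumed default
    window_seconds: float = 60.0      # assumed default
    max_blast_radius: int = 2         # assumed: max services per action
    _recent: deque = field(default_factory=deque, init=False, repr=False)

    def evaluate(self, action, now=None):
        """Return "allow", "deny", or "escalate" (route to a human)."""
        now = time.monotonic() if now is None else now

        # 1. Allowlist: unknown actions are rejected outright.
        if action.name not in self.allowlist:
            return "deny"

        # 2. Rate limit: forget approvals that fell out of the window,
        #    then escalate if the window is already full.
        while self._recent and now - self._recent[0] >= self.window_seconds:
            self._recent.popleft()
        if len(self._recent) >= self.max_actions_per_window:
            return "escalate"

        # 3. Blast radius: actions touching too many services need a human.
        if len(action.targets) > self.max_blast_radius:
            return "escalate"

        # Only approved actions count against the rate limit.
        self._recent.append(now)
        return "allow"
```

In this sketch a disallowed action is denied rather than escalated, while rate-limit and blast-radius violations go to a human; a real deployment might draw that line differently.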
Notable engineering decisions
- NATS JetStream over Kafka — lighter operational footprint at the scale we needed, plus built-in message deduplication.
- Polyglot service mesh — Python for ML, Go for the hot-path data plane, Node.js for the React dashboard's backend-for-frontend (BFF).
- MTTD/MTTR scored via chaos engineering — the platform is graded on its own mean time to detect (MTTD) and mean time to resolve (MTTR) against synthetic incidents.
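As a rough illustration of the chaos-based scoring, the sketch below computes MTTD and MTTR from a batch of synthetic incidents. It rests on assumptions the write-up does not spell out: the `ChaosRun` record and `score` function are hypothetical names, timestamps are plain seconds, both metrics are measured from the moment the fault is injected, and runs that were never detected or resolved are left out of the respective mean and reported through a separate rate.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass(frozen=True)
class ChaosRun:
    """One synthetic incident (hypothetical record; times in seconds)."""
    injected_at: float
    detected_at: Optional[float]  # None if the fault was never detected
    resolved_at: Optional[float]  # None if the fault was never resolved


def score(runs: Sequence[ChaosRun]) -> dict:
    """Compute MTTD, MTTR, and the detection rate for a batch of runs."""
    # Time to detect / resolve, both measured from fault injection (assumed).
    detect = [r.detected_at - r.injected_at
              for r in runs if r.detected_at is not None]
    resolve = [r.resolved_at - r.injected_at
               for r in runs if r.resolved_at is not None]
    return {
        "mttd": mean(detect) if detect else None,
        "mttr": mean(resolve) if resolve else None,
        # Missed incidents do not skew the means, so report them separately.
        "detection_rate": len(detect) / len(runs) if runs else 0.0,
    }
```

Reporting the detection rate alongside the means matters in this setup: a platform that only catches the easy faults would otherwise look fast on MTTD while missing most incidents.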
Outcomes
- Reduced MTTD by ~40% in synthetic chaos scenarios
- Common incident classes resolved without human paging
- Full incident timelines surfaced in the React dashboard for retro reviews