// AI AGENTS

SRE Agent Swarm

Autonomous incident response platform

PythonGoNode.jsNATS JetStreamFastAPIReactPrometheusLokiTempoChromaDBDocker

Overview

SRE Agent Swarm is a self-healing control plane that watches a fleet of microservices and resolves incidents without paging humans for the common cases. Six specialised agents coordinate over NATS JetStream — each owns one stage of the lifecycle.

The six agents

Observer — ingests metrics from Prometheus, logs from Loki, and traces from Tempo. Detects anomalies and emits incident events.
Diagnoser — runs LLM-assisted root cause analysis with a Gemini → OpenAI fallback chain. Cites relevant traces and log lines in its hypotheses.
Remediator — selects a YAML runbook and executes it inside a sandboxed environment.
Safety — gates every action against allowlists, rate limits, and blast-radius checks. Routes risky actions to humans.
Orchestrator — coordinates the workflow, manages state machines, and persists incident timelines.
Learner — RAG over ChromaDB of historical incidents. Surfaces relevant runbooks and post-mortems for similar past incidents.

Notable engineering decisions

NATS JetStream over Kafka — lighter operational footprint for the scale we needed; built-in deduplication.
Polyglot service mesh — Python for ML, Go for the hot-path data plane, Node for the React dashboard's BFF.
MTTD/MTTR scored via chaos engineering — the platform is graded on its own performance against synthetic incidents.

Outcomes

Reduced MTTD by ~40% in synthetic chaos scenarios
Common incident classes resolved without human paging
Full incident timelines surfaced in the React dashboard for retro reviews