
// AI AGENTS

SRE Agent Swarm

Autonomous incident response platform

Python · Go · Node.js · NATS JetStream · FastAPI · React · Prometheus · Loki · Tempo · ChromaDB · Docker

Overview

SRE Agent Swarm is a self-healing control plane that watches a fleet of microservices and resolves incidents without paging humans for the common cases. Six specialised agents coordinate over NATS JetStream — each owns one stage of the lifecycle.
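
As a rough illustration of how stage ownership over a message bus can work, the sketch below wires two stages together through a toy in-memory bus. It is a stand-in for JetStream, not its client API: the subject names, the `InMemoryBus` class, and the message-ID scheme are all hypothetical. The duplicate check only imitates the message-ID deduplication that JetStream provides.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set

# Hypothetical subject names, one per lifecycle stage (not taken from the project).
DETECTED = "incident.detected"
DIAGNOSED = "incident.diagnosed"
REMEDIATED = "incident.remediated"

Handler = Callable[[dict], None]


class InMemoryBus:
    """Toy stand-in for a stream: subject routing plus message-ID dedup."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)
        self._seen_ids: Set[str] = set()

    def subscribe(self, subject: str, handler: Handler) -> None:
        self._handlers[subject].append(handler)

    def publish(self, subject: str, msg_id: str, payload: dict) -> bool:
        # Drop any message whose ID was already accepted.
        if msg_id in self._seen_ids:
            return False
        self._seen_ids.add(msg_id)
        for handler in list(self._handlers[subject]):
            handler(payload)
        return True


def run_demo() -> List[str]:
    bus = InMemoryBus()
    timeline: List[str] = []

    def diagnoser(event: dict) -> None:
        # Owns the diagnosis stage, then hands off to the next subject.
        timeline.append(f"diagnosed {event['id']}")
        bus.publish(DIAGNOSED, f"{event['id']}:diagnosed", event)

    def remediator(event: dict) -> None:
        # Owns the remediation stage.
        timeline.append(f"remediated {event['id']}")
        bus.publish(REMEDIATED, f"{event['id']}:remediated", event)

    bus.subscribe(DETECTED, diagnoser)
    bus.subscribe(DIAGNOSED, remediator)

    # The same detection is emitted twice; the second publish is dropped.
    bus.publish(DETECTED, "inc-1:detected", {"id": "inc-1"})
    bus.publish(DETECTED, "inc-1:detected", {"id": "inc-1"})
    return timeline
```

Each handler only knows the subject it consumes and the subject it emits, which is what lets each agent own a single stage.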

The six agents

  • Observer — ingests metrics from Prometheus, logs from Loki, and traces from Tempo. Detects anomalies and emits incident events.
  • Diagnoser — runs LLM-assisted root cause analysis with a Gemini → OpenAI fallback chain. Cites relevant traces and log lines in its hypotheses.
  • Remediator — selects a YAML runbook and executes it inside a sandboxed environment.
  • Safety — gates every action against allowlists, rate limits, and blast-radius checks. Routes risky actions to humans.
  • Orchestrator — coordinates the workflow, manages state machines, and persists incident timelines.
  • Learner — RAG over ChromaDB of historical incidents. Surfaces relevant runbooks and post-mortems for similar past incidents.
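
A minimal sketch of how a Safety-style gate could combine the three checks named above. The class names, thresholds, and the choice to route rate-limited or wide-reaching actions to a human (rather than deny them outright) are assumptions for illustration, not the project's actual policy.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from enum import Enum
from typing import Deque, List, Optional, Set


class Verdict(Enum):
    ALLOW = "allow"
    DENY = "deny"
    NEEDS_HUMAN = "needs_human"


@dataclass
class Action:
    name: str            # hypothetical runbook step name, e.g. "restart_pod"
    targets: List[str]   # services the action would touch


@dataclass
class SafetyGate:
    allowlist: Set[str]
    max_actions_per_window: int = 3   # assumed rate limit
    window_seconds: float = 60.0      # assumed sliding-window length
    max_blast_radius: int = 2         # max services one action may touch unattended
    _recent: Deque[float] = field(default_factory=deque, repr=False)

    def check(self, action: Action, now: Optional[float] = None) -> Verdict:
        now = time.monotonic() if now is None else now
        # 1. Allowlist: unknown actions are rejected outright.
        if action.name not in self.allowlist:
            return Verdict.DENY
        # 2. Blast radius: wide-reaching actions are routed to a human.
        if len(action.targets) > self.max_blast_radius:
            return Verdict.NEEDS_HUMAN
        # 3. Rate limit: sliding window over recently approved actions.
        while self._recent and now - self._recent[0] >= self.window_seconds:
            self._recent.popleft()
        if len(self._recent) >= self.max_actions_per_window:
            return Verdict.NEEDS_HUMAN
        self._recent.append(now)
        return Verdict.ALLOW
```

Only approved actions count toward the rate limit here, so a burst of denied requests cannot lock out a legitimate remediation.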

Notable engineering decisions

  • NATS JetStream over Kafka — lighter operational footprint for the scale we needed; built-in deduplication.
  • Polyglot service mesh — Python for ML, Go for the hot-path data plane, Node for the React dashboard's BFF.
  • MTTD/MTTR scored via chaos engineering — the platform is graded on its own performance against synthetic incidents.
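
To make the scoring idea concrete, here is a small sketch of how MTTD and MTTR could be computed from synthetic chaos runs. The `ChaosRun` fields and helper names are hypothetical. MTTR is measured here from fault injection to recovery, which is one of several common definitions; the source does not say which one the platform uses.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class ChaosRun:
    injected_at: float   # seconds: when the synthetic fault was injected
    detected_at: float   # seconds: when an incident event was raised
    resolved_at: float   # seconds: when the service was healthy again


def mttd(runs: List[ChaosRun]) -> float:
    """Mean time to detect: average of (detected - injected)."""
    return mean(r.detected_at - r.injected_at for r in runs)


def mttr(runs: List[ChaosRun]) -> float:
    """Mean time to resolve, measured from injection to recovery (assumed)."""
    return mean(r.resolved_at - r.injected_at for r in runs)


def percent_reduction(baseline: float, current: float) -> float:
    """Relative improvement of `current` over `baseline`, in percent."""
    return 100.0 * (baseline - current) / baseline
```

With two made-up runs detected 30 s and 50 s after injection, `mttd` returns 40.0. These numbers are illustrative only and are not the project's measurements.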

Outcomes

  • Reduced MTTD by ~40% in synthetic chaos scenarios
  • Common incident classes resolved without human paging
  • Full incident timelines surfaced in the React dashboard for retro reviews