How I Built a Self-Healing AI Agent Swarm
Architecture walkthrough of a multi-agent AI system with intelligent routing, provider failover, credit-based circuit breakers, and self-healing monitoring, running across a local machine and a VPS.
Prerequisites
- A VPS with SSH access (see our beginner VPS guide)
- Basic Linux command line knowledge
- API keys for at least one AI provider
Everyone’s excited about connecting Claude to a notes app. I went a different direction.
I built a multi-agent system with 5 specialized AI agents, 3 cloud providers, automatic failover, credit-based routing, and self-healing monitoring. It handles everything from “what’s for dinner” to “debug this production issue” — and it decides which AI model to use for each.
Here’s the full architecture.
The Problem With “One Agent to Rule Them All”
The Claude + Obsidian approach works. You get persistent memory, some tool access, and a chatbot that remembers your preferences. That’s genuinely useful.
But I kept hitting the same wall: the wrong model was answering the wrong questions.
Asking a frontier model “what’s the weather?” is like hiring a surgeon to put on a band-aid. Asking a cheap model to architect a distributed system is like asking the band-aid to do surgery.
I needed routing.
The Swarm
The platform runs on my local machine and a VPS, connected bidirectionally. Here’s who lives in the swarm:
Hub (The Dispatcher)
Every message I send — across multiple chat channels — hits the Hub first. Hub classifies the message into one of 9 categories and routes it to the right agent on the right model.
Hub itself runs on free/cheap models for trivial stuff. It never touches an expensive model unless it has to.
Main Agent (The Daily Driver)
My primary agent for complex reasoning. Runs on the best reasoning model available, with fallbacks to two other providers. This is who I’m talking to when I need actual thinking.
Code Agent (The Specialist)
Pure code tasks. Runs the best code-specialized model available. Minimal tool access, restricted filesystem scope. It can’t break things it can’t reach.
Security Agent (The Watchdog)
Runs over a dozen automated security checks on a regular schedule, including:
- File permission audits
- Secrets-in-git detection
- Open port scanning
- Process anomaly detection
- Login monitoring
- Disk and log usage tracking
The Security Agent is read-only. It detects and reports. It never modifies files. If it finds something critical, it alerts me immediately. Lower-priority warnings get batched. An agent that can both detect problems AND fix them is an agent that can cause problems.
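To make the detect-and-report split concrete, here is a minimal sketch of one read-only check plus the alert triage described above. The names (`Finding`, `audit_world_writable`, `AlertQueue`) are hypothetical, and treating a world-writable file as critical is an assumption for illustration, not the actual rule set.

```python
import os
import stat
from dataclasses import dataclass, field
from typing import List


@dataclass
class Finding:
    check: str
    detail: str
    critical: bool


def audit_world_writable(root: str) -> List[Finding]:
    """Read-only audit: report world-writable regular files under root.

    The function only calls lstat(); it never changes a file.
    """
    findings = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.lstat(path).st_mode
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(mode) and mode & stat.S_IWOTH:
                # Assumption: world-writable counts as critical.
                findings.append(Finding("file-permissions", path, critical=True))
    return findings


@dataclass
class AlertQueue:
    """Critical findings go out immediately; the rest wait for a batch."""
    sent_now: List[Finding] = field(default_factory=list)
    batched: List[Finding] = field(default_factory=list)

    def report(self, finding: Finding) -> None:
        target = self.sent_now if finding.critical else self.batched
        target.append(finding)

    def flush_batch(self) -> List[Finding]:
        """Return the pending low-priority findings and clear the batch."""
        batch, self.batched = self.batched, []
        return batch
```

In a real setup `sent_now` would be a call to a chat webhook rather than a list; the list keeps the sketch testable.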
Personal Agent
My partner has their own agent on a separate channel. Completely independent from the swarm — own personality, own workspace, own context.
The Routing Table
This is the core insight: match the task to the cheapest model that can handle it.
| Category | Model Tier | Why |
|---|---|---|
| Trivial (“hi”, “thanks”) | Free tier | Why pay for pleasantries? |
| Simple (quick questions) | Fast mid-tier | Good enough, cheap |
| Lookups (facts, definitions) | Fast mid-tier | Retrieval, not reasoning |
| Data (analysis, summaries) | Fast mid-tier | Structured output, bulk work |
| Ops (infra management) | Strong mid-tier | Needs deeper context |
| Code | Code-specialized | Purpose-built |
| Security | Strong mid-tier | Pattern matching at scale |
| Complex (multi-step reasoning) | Top reasoning | Best reasoning available |
| Hardest (architecture, strategy) | Maximum capability | When nothing else will do |
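The table maps directly onto a lookup structure. The sketch below is one way it might look: the tier strings are labels taken from the table, not real model identifiers, and the keyword-based `classify` function is a placeholder assumption, since the post does not describe how Hub actually classifies messages.

```python
from enum import Enum


class Category(Enum):
    TRIVIAL = "trivial"
    SIMPLE = "simple"
    LOOKUP = "lookup"
    DATA = "data"
    OPS = "ops"
    CODE = "code"
    SECURITY = "security"
    COMPLEX = "complex"
    HARDEST = "hardest"


# One entry per table row. Tier names are labels, not model IDs.
TIER_FOR = {
    Category.TRIVIAL: "free",
    Category.SIMPLE: "fast-mid",
    Category.LOOKUP: "fast-mid",
    Category.DATA: "fast-mid",
    Category.OPS: "strong-mid",
    Category.CODE: "code-specialized",
    Category.SECURITY: "strong-mid",
    Category.COMPLEX: "top-reasoning",
    Category.HARDEST: "maximum",
}

# Placeholder rules, checked in order. A real dispatcher would more
# likely ask a cheap model to do the classification.
PLEASANTRIES = {"hi", "hello", "thanks", "thank you"}
KEYWORDS = [
    (Category.CODE, ("traceback", "stack trace", "refactor", "function")),
    (Category.OPS, ("deploy", "restart", "nginx", "docker")),
    (Category.HARDEST, ("architecture", "strategy")),
]


def classify(message: str) -> Category:
    """Assign a category using the placeholder rules above."""
    text = message.strip().lower().rstrip("!.")
    if text in PLEASANTRIES:
        return Category.TRIVIAL
    for category, words in KEYWORDS:
        if any(word in text for word in words):
            return category
    return Category.SIMPLE  # cheap default when nothing matches


def route(message: str) -> str:
    """Return the model tier that should handle this message."""
    return TIER_FOR[classify(message)]
```

Defaulting to a cheap tier keeps costs low when the classifier is unsure; the trade-off is that a misclassified hard question gets a weak first answer.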
Every fallback chain spans all 3 cloud providers, so no single provider outage takes an agent offline.
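A fallback chain can be as simple as an ordered list of provider calls tried in turn. This is a minimal sketch under that assumption; `ProviderError` and `call_with_failover` are hypothetical names, and each provider is represented by a plain callable instead of a real SDK client.

```python
from typing import Callable, List, Sequence, Tuple


class ProviderError(Exception):
    """Raised by a provider callable on outage, rate limit, or timeout."""


def call_with_failover(
    prompt: str,
    chain: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each (name, call) pair in order; return (name, reply) from the first that works."""
    errors: List[str] = []
    for name, call in chain:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")  # remember why, then move on
    raise ProviderError("all providers failed: " + "; ".join(errors))
```

Only `ProviderError` triggers a fallback here, so a bug in the calling code still surfaces as a normal exception instead of silently moving to the next provider.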
The Credit Economy
Each agent gets a daily credit budget, refilled automatically. This isn’t a billing system — it’s a circuit breaker.
If a rogue loop starts hammering the API, it burns through credits and stops. Hub falls back to the free tier and alerts me instantly.
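Here is a rough sketch of a daily credit budget acting as a circuit breaker. The class and function names, the credit costs, and the injected `today` clock are all assumptions made for illustration; the post does not specify how credits are counted.

```python
from datetime import date
from typing import Callable


class CreditBudget:
    """Daily credit allowance that refills when the calendar day changes."""

    def __init__(self, daily_credits: int, today: Callable[[], date] = date.today):
        self.daily_credits = daily_credits
        self._today = today  # injectable clock, handy for tests
        self._day = today()
        self.remaining = daily_credits

    def _refill_if_new_day(self) -> None:
        now = self._today()
        if now != self._day:
            self._day = now
            self.remaining = self.daily_credits

    def try_spend(self, cost: int) -> bool:
        """Deduct cost and return True, or return False if the budget cannot cover it."""
        self._refill_if_new_day()
        if cost > self.remaining:
            return False
        self.remaining -= cost
        return True


def pick_tier(budget: CreditBudget, wanted_tier: str, cost: int,
              alert: Callable[[str], None]) -> str:
    """Use the wanted tier if credits allow; otherwise alert and fall back to free."""
    if budget.try_spend(cost):
        return wanted_tier
    alert(f"credit budget exhausted; {wanted_tier} request falling back to free tier")
    return "free"
```

A runaway loop calling `pick_tier` repeatedly drains `remaining` once and then only ever reaches the free tier until the next refill.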
A guard script validates every state change. It blocks dangerous operations like task deletions, agent removals, and routing wipes. Every write gets auto-backed up before it happens.
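A guard like this could be sketched as a single write function that refuses blocked operations and copies the current state aside before replacing it. The operation names, the JSON state file, and the backup naming scheme below are hypothetical; the actual guard script is not shown in the post.

```python
import json
import shutil
import time
from pathlib import Path
from typing import Optional

# Hypothetical operation names standing in for task deletions,
# agent removals, and routing wipes.
BLOCKED_OPS = {"delete_task", "remove_agent", "wipe_routing"}


class BlockedOperation(Exception):
    """Raised when a state change is on the block list."""


def guarded_write(state_path: Path, op: str, new_state: dict,
                  backup_dir: Path) -> Optional[Path]:
    """Validate op, back up the existing state file, then write the new state.

    Returns the backup path, or None when there was nothing to back up.
    """
    if op in BLOCKED_OPS:
        raise BlockedOperation(f"operation {op!r} is not allowed")

    backup = None
    if state_path.exists():
        backup_dir.mkdir(parents=True, exist_ok=True)
        backup = backup_dir / f"{state_path.name}.{time.time_ns()}.bak"
        shutil.copy2(state_path, backup)  # the backup happens before the write

    # Write to a temp file, then rename, so readers never see a partial file.
    tmp = state_path.with_name(state_path.name + ".tmp")
    tmp.write_text(json.dumps(new_state, indent=2))
    tmp.replace(state_path)
    return backup
```

Because the check runs before anything touches disk, a blocked operation leaves both the state file and the backup directory unchanged.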
I’ve never had a runaway cost incident, and the circuit breaker caps the worst case at one day’s budget per agent. Even if you trust your code, add circuit breakers from day one. The one time you need them, you’ll be very glad they exist.
The Infrastructure
Here’s how the two machines work together:
┌──────────────────────┐ ┌──────────────────────┐
│ Local (Primary) │◄───►│ VPS (Always-On) │
│ │ │ │
│ Gateway │ │ Gateway │
│ AI Agents │ │ Database + Cache │
│ Dev CLI │ │ Workflow Engine │
│ │ │ Monitoring │
│ │ │ Reverse Proxy │
│ │ │ Local LLM │
└──────────────────────┘ └──────────────────────┘
Bidirectional delegation via gateways
Local machine handles the heavy AI work — all agents live here, along with development tools.
VPS handles the always-on work — monitoring, workflow automation, database, cache, and a local LLM for offline/private tasks. Everything sits behind a reverse proxy with automatic HTTPS. No service port is exposed directly; traffic reaches services only through the proxy.
Both machines run a gateway AND connect to the other as a node. Either can delegate work to the other.
The Monitoring Stack
- Health checks every few minutes, auto-discovering services
- Multiple monitoring probes with instant chat alerts
- Workflow automations for usage tracking and cost threshold alerts
- Daily backups with auto-pruning retention
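Two items from the list above, health checks with alerts and backup pruning, can be sketched in a few lines. This assumes each service is represented by a probe callable returning True when healthy, and that backups are `*.bak` files kept by count; both are illustrative choices, not the actual stack.

```python
from pathlib import Path
from typing import Callable, Dict, List


def run_probes(probes: Dict[str, Callable[[], bool]],
               alert: Callable[[str], None]) -> List[str]:
    """Run every probe once; alert on each failure and return the failed names."""
    failed = []
    for name, probe in probes.items():
        try:
            healthy = probe()
        except Exception:
            healthy = False  # a crashing probe counts as an unhealthy service
        if not healthy:
            failed.append(name)
            alert(f"health check failed: {name}")
    return failed


def prune_backups(backup_dir: Path, keep: int) -> List[Path]:
    """Delete all but the `keep` newest *.bak files; return the removed paths."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    backups = sorted(backup_dir.glob("*.bak"),
                     key=lambda p: p.stat().st_mtime, reverse=True)
    removed = backups[keep:]
    for path in removed:
        path.unlink()
    return removed
```

A scheduler such as cron or a workflow engine would call `run_probes` every few minutes and `prune_backups` after each daily backup.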
I added monitoring only after the system had grown complex, and I had a few blind spots early on. If I did this again, I’d set up monitoring first, then add agents. An alert should tell you something broke before you stumble on it yourself.
Security
This isn’t a toy. Real infrastructure needs real security.
I won’t go into implementation specifics (for obvious reasons), but the posture includes:
- Key-only authentication with intrusion detection
- Firewall with strict policies
- All services behind a reverse proxy, with no service port exposed directly
- API keys stored securely with automated rotation on a schedule
- Bot access restricted to allowlists
- A dedicated read-only security agent running continuous automated scans
- Per-agent tool restrictions and execution policies
What It Costs
| Item | Monthly Cost |
|---|---|
| VPS | ~$12 |
| API usage | Variable |
| Free-tier model | $0 |
| Search API | ~$25 |
| Infrastructure total | ~$37 + API |
The routing optimization is the cost control. Most messages — trivial, lookups, simple questions — hit free or near-free models. Only genuinely complex tasks touch expensive ones.
What I’d Build Differently
- Start with monitoring, not agents. Add observability before adding complexity.
- One agent, one job. Separation of concerns applies to AI agents just like it does to microservices.
- Circuit breakers from day one. Even if you trust your code, add them.
- Provider diversity isn’t optional. Every major AI API provider has had outages recently. If your system depends on one provider, your system has a single point of failure.
Getting Started
If you want to build something like this, here’s the progression:
- Week 1: One agent, one channel, one model. Get the basics working.
- Week 2: Add a second agent with a different specialty. Add routing between them.
- Week 3: Add monitoring and alerting. An alert should tell you something broke before you stumble on it yourself.
- Week 4: Add provider failover and credit limits. Now you’re production-ready.
The “Claude + Obsidian” approach is step 0.5. It’s a great place to start. But if you want AI that actually operates — not just assists — you need the full stack.
Source
Full architecture doc and thread draft on GitHub.