BuildInPublic · 9 min · Apr 25, 2026

Day 9: GitHub rate-limited our autonomous business for 3.5 hours. Here's the self-quiet protocol we shipped.

4 Chrome nodes hammering one GitHub gist file every 30 seconds. We hit the secondary rate limit, every retry made it worse, and the system stayed dark for 3.5h. By the end of the night we'd diagnosed it, migrated to a fresh GitHub user, and shipped 4 architectural fixes that mean we never come back. Full storm-and-recovery log.


This morning I asked the AI partner to run an efficiency review of our 60-agent autonomous business. By the end of the day, we'd uncovered the worst silent-failure mode in the system, slammed face-first into GitHub's secondary rate limit, watched it cascade, and shipped four layered architectural fixes plus a runbook that future agents must obey.

If you're building a multi-agent system on top of "free" cloud storage, this post is for you.

The setup

RhinoMoney runs ~60 agents. Their state — registry of fleet nodes, board sessions, customer-success ledger, autopost queues, action board — lives in a single GitHub gist. Every agent that wants to read or write goes through one storage layer.

Reads = GET on the gist (cheap, well-cached).

Writes = PATCH on the gist (the entire file gets re-PATCHed each time).

We had 4 Chrome nodes heartbeating every 30 seconds. That's 4 nodes × 120 heartbeats/hr = 480 PATCHes/hr to one resource. Plus crons. Plus manual diag pings. Plus board sessions writing back state.
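That write pressure is easy to quantify. A quick sketch (the function name is mine, not part of the codebase):

```typescript
// Estimate hourly PATCH volume from heartbeat cadence (illustrative only).
function patchesPerHour(nodes: number, heartbeatSeconds: number): number {
  return nodes * (3600 / heartbeatSeconds);
}

// 4 nodes heartbeating every 30s → 480 PATCHes/hr, before crons and pings.
const baseline = patchesPerHour(4, 30);
```

The same arithmetic makes the later fix legible: at a 10-minute heartbeat, the figure drops to 24/hr.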

The storm

GitHub's rate limit story has two layers:

  • **Primary**: 5,000 requests/hr per authenticated user. Visible at `/rate_limit`.
  • **Secondary** (anti-abuse): undocumented thresholds per user × per resource. Returns 403 with "API rate limit exceeded for user ID X". **Not** visible in /rate_limit.

We hit the secondary on the storage gist. The diagnostic was eerie:

```json
"core": { "limit": 5000, "remaining": 4923 }          ← primary fine
"write_probe": { "ok": false, "ms": 11 }              ← writes fail in 11ms
"gist_direct": { "status": 403, "body": "rate limit exceeded for user ID 274612941" }
```

Worse: every retry extended the block. saveNodes was retrying 3× per failed write. With 4 nodes failing simultaneously, that's up to 12 PATCHes per minute during the block — keeping it perpetually fresh.

We were dark for 3.5 hours before I figured out what was actually happening.
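This is the part worth internalizing: the secondary block is invisible to the one endpoint you'd check first. A minimal illustration of why the payload above was misleading (the interface is my sketch of the relevant slice of `GET /rate_limit`):

```typescript
// Only the slice of GET /rate_limit's response we care about here.
interface RateLimitResponse {
  resources: { core: { limit: number; remaining: number } };
}

// The secondary (anti-abuse) limit never appears in this payload,
// so headroom here can look healthy while every PATCH gets a 403.
function primaryHeadroom(payload: RateLimitResponse): number {
  return payload.resources.core.remaining;
}
```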

What I built so we never come back

Four architectural protections, layered:

1. Storage-layer self-quiet (`gistWrite` → 403/429 → 5min lockout)

The first time GitHub returns 403 with rate-limit markers, the storage module sets `gistBackoffUntil = now + 5min`. All subsequent writes return false instantly without calling the API. The block doesn't get fresher; we don't waste API budget.

```ts
if (Date.now() < gistBackoffUntil) return false;
// otherwise try the API; a 403 sets a fresh backoff
```
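A slightly fuller sketch of the same gate, with the transport injected so it's testable. Names like `Transport` and `QUIET_MS` are my assumptions, not the production module:

```typescript
type Transport = (url: string) => Promise<{ status: number; ok: boolean }>;

const QUIET_MS = 5 * 60 * 1000; // 5-minute self-quiet window
let gistBackoffUntil = 0;       // module-level lockout shared by all writers

async function gistWrite(url: string, send: Transport): Promise<boolean> {
  if (Date.now() < gistBackoffUntil) return false; // quiet: don't freshen the block
  const res = await send(url);
  if (res.status === 403 || res.status === 429) {
    gistBackoffUntil = Date.now() + QUIET_MS; // go quiet, save API budget
    return false;
  }
  return res.ok;
}
```

The key property: during the quiet window, a failed write costs zero API calls, so the secondary block can actually expire.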

2. saveNodes drops the retry loop

Was: 3 attempts with exponential backoff per heartbeat. Each attempt compounded the rate limit during a block.

Now: a single attempt that throws on failure. The agent's natural retry on the next heartbeat picks up after the quiet period clears.
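The new shape, roughly (the signature is my sketch, not the actual function):

```typescript
// One attempt, no retry loop: a failed write surfaces immediately,
// and the next scheduled heartbeat is the retry.
async function saveNodes(write: () => Promise<boolean>): Promise<void> {
  const ok = await write();
  if (!ok) throw new Error("saveNodes: gist write failed; next heartbeat will retry");
}
```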

3. Heartbeat 30s → 10min

With adaptive idle backoff, 4 nodes × 6 writes/hr = 24 writes/hr to the registry file. That's well below any GitHub threshold, even during deploy spikes. We lose 9.5 minutes of liveness-detection precision, which doesn't matter — fleet-watchdog still alerts within 2 minutes of a real outage.

4. Migration to a clean account

The deedeb user account had been heavily used for hours, and secondary blocks tend to escalate in duration the more you hit them. We migrated all 75 storage files to a brand-new GitHub user (`rhinomoneyplatform-ops`) via a single Node.js script:

```bash
OLD_TOKEN=... OLD_GIST=... NEW_TOKEN=... NEW_GIST=... \
node bot/migrate-gist-storage.cjs
```

Output: `✅ Migration complete!` 75 files copied, verified consistent, Vercel env vars swapped, deploy triggered. Within 90 seconds the entire system was running on a clean account with zero rate-limit history.

The original deedeb account is untouched and will recover on its own. We can wire it back in as a dual mirror once the secondary block expires — an instant doubling of capacity.
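The heart of a migration like this is just reshaping one gist's `files` map into the other's PATCH body. A sketch following the Gists API's documented response shape (this is not the actual script):

```typescript
// A gist file as returned by GET /gists/{id} (content slice only).
interface GistFile { content: string }

// Build the PATCH payload for the destination gist from the
// source gist's `files` map.
function buildMigrationPayload(
  srcFiles: Record<string, GistFile>,
): { files: Record<string, { content: string }> } {
  const files: Record<string, { content: string }> = {};
  for (const [name, file] of Object.entries(srcFiles)) {
    // Note: files over ~1MB come back truncated and must be
    // re-fetched via their raw_url before copying.
    files[name] = { content: file.content };
  }
  return { files };
}
```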

What I learned the hard way

Silent writes are worse than loud failures. Multiple agents in this system used to call `writeJson(...)` and ignore the boolean return value. They'd report success while losing the actual write. The customer-success agent crashed with `Cannot read properties of undefined (reading 'map')` for 10 days — the $29 sale we'd already made was invisible the whole time. One defensive line fixed it. Same pattern for `saveNodes`. Future agents must check write success or throw.
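Both halves of that lesson fit in a few lines. A sketch of the pattern (helper names are mine; the `undefined` guard mirrors the `.map` crash described above):

```typescript
// Never ignore a boolean write result: fail loudly instead.
async function mustWrite(write: () => Promise<boolean>, what: string): Promise<void> {
  if (!(await write())) throw new Error(`write failed: ${what}`);
}

// And never .map over a field that a lost write may have left undefined.
function safeEntries<T>(entries: T[] | undefined): T[] {
  return entries ?? []; // the shape of the one-line fix
}
```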

Per-resource limits matter more than per-account limits. The 5,000/hr primary limit was a comfortable lie. The real wall is per-gist-file PATCH frequency, and it's invisible in any monitoring API.

Retries without backoff are weapons turned on yourself. If you're hitting a rate limit, the LAST thing you want is more requests. saveNodes' "helpful" 3× retry made the storm last hours longer than a single attempt would have.

The diagnostic endpoint must not bypass the protection it diagnoses. I wrote `/api/admin/storage-diag` with a "direct gist probe" that hit GitHub directly. Every time anyone called the diag during the block, it extended the rate limit by 5 more minutes. Removed.
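The fix is for diagnostics to report the quiet state rather than probe through it. A minimal sketch (function and field names are mine):

```typescript
// A diag endpoint should read the lockout, not re-hit the API.
function storageDiag(backoffUntil: number, now: number) {
  return {
    quiet: now < backoffUntil,                       // are we in the self-quiet window?
    quietMsRemaining: Math.max(0, backoffUntil - now), // how long until writes resume
  };
}
```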

What's running tonight, while I sleep

  • **Fleet** — 4 nodes, 10 min heartbeat, adaptive 60s→5min idle polling
  • **Anti-storm runbook** in memory: agents will refuse to fleet-restart if writes are failing, refuse to manual-trigger crons during a known block
  • **system-improver agent** — runs the full ASK→ANSWER→EXECUTE→EXAMINE→CHECK loop on the whole stack every 4 hours
  • **agent-forge (Nova)** — proposing new agents; `code-reviewer` and `qa-tester` gate them; `code-deployer` PUTs approved code straight to GitHub via Contents API for autonomous deploys

This morning I planned a 30-minute optimization. The system planned a 12-hour debugging epic. Both happened. The mistake was a beautiful one — the kind that exposes a deep architectural assumption you didn't know you were making.

The receipts

Public real-time agent feed: [rhinomoney-app.vercel.app/live](https://rhinomoney-app.vercel.app/live)

Run a business on $0: [rhinomoney-app.vercel.app/support](https://rhinomoney-app.vercel.app/support)

If you've hit GitHub's secondary rate limit and lost an afternoon to it, reply with your worst storage-layer war story. I want to read every one.

— David & the 60 agents
