BuildInPublic · 9 min · Apr 25, 2026

Day 9: GitHub rate-limited our autonomous business for 3.5 hours. Here's the self-quiet protocol we shipped.

4 Chrome nodes hammering one GitHub gist file every 30 seconds. We hit the secondary rate limit, every retry made it worse, and the system stayed dark for 3.5h. By the end of the night we'd diagnosed it, migrated to a fresh GitHub user, and shipped 4 architectural fixes that mean we never come back. Full storm-and-recovery log.


This morning I asked the AI partner to run an efficiency review of our 60-agent autonomous business. By the end of the day, we'd uncovered the worst silent-failure mode in the system, slammed face-first into GitHub's secondary rate limit, watched it cascade, and shipped four layered architectural fixes plus a runbook that future agents must obey.

If you're building a multi-agent system on top of "free" cloud storage, this post is for you.

The setup

RhinoMoney runs ~60 agents. Their state — registry of fleet nodes, board sessions, customer-success ledger, autopost queues, action board — lives in a single GitHub gist. Every agent that wants to read or write goes through one storage layer.

Reads = GET on the gist (cheap, well-cached).

Writes = PATCH on the gist (the entire file gets re-PATCHed each time).

We had 4 Chrome nodes heartbeating every 30 seconds. That's 4 nodes × 120 heartbeats/hr = 480 PATCHes/hr to one resource. Plus crons. Plus manual diag pings. Plus board sessions writing back state.
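That write pressure is easy to quantify. A quick sketch (the function name is mine, not part of the codebase):

```typescript
// Estimate hourly PATCH volume from heartbeat cadence (illustrative only).
function patchesPerHour(nodes: number, heartbeatSeconds: number): number {
  return nodes * (3600 / heartbeatSeconds);
}

// 4 nodes heartbeating every 30s → 480 PATCHes/hr, before crons and pings.
const baseline = patchesPerHour(4, 30);
```

The same arithmetic makes the later fix legible: at a 10-minute heartbeat, the figure drops to 24/hr.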

The storm

GitHub's rate limit story has two layers:

  • **Primary**: 5,000 requests/hr per authenticated user. Visible at `/rate_limit`.
  • **Secondary** (anti-abuse): undocumented thresholds per user × per resource. Returns 403 with "API rate limit exceeded for user ID X". **Not** visible in /rate_limit.

We hit the secondary on the storage gist. The diagnostic was eerie:

```json
"core": { "limit": 5000, "remaining": 4923 }          ← primary fine
"write_probe": { "ok": false, "ms": 11 }              ← writes fail in 11ms
"gist_direct": { "status": 403, "body": "rate limit exceeded for user ID 274612941" }
```

Worse: every retry extended the block. saveNodes was retrying 3× per failed write. With 4 nodes failing simultaneously, that's up to 12 PATCHes per minute during the block — keeping it perpetually fresh.

We were dark for 3.5 hours before I figured out what was actually happening.
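This is the part worth internalizing: the secondary block is invisible to the one endpoint you'd check first. A minimal illustration of why the payload above was misleading (the interface is my sketch of the relevant slice of `GET /rate_limit`):

```typescript
// Only the slice of GET /rate_limit's response we care about here.
interface RateLimitResponse {
  resources: { core: { limit: number; remaining: number } };
}

// The secondary (anti-abuse) limit never appears in this payload,
// so headroom here can look healthy while every PATCH gets a 403.
function primaryHeadroom(payload: RateLimitResponse): number {
  return payload.resources.core.remaining;
}
```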

What I built so we never come back

Four architectural protections, layered:

1. Storage-layer self-quiet (`gistWrite` → 403/429 → 5min lockout)

The first time GitHub returns 403 with rate-limit markers, the storage module sets `gistBackoffUntil = now + 5min`. All subsequent writes return false instantly without calling the API. The block doesn't get fresher; we don't waste API budget.

```ts
if (Date.now() < gistBackoffUntil) return false;
// otherwise try the API; a 403 sets a fresh backoff
```
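A slightly fuller sketch of the same gate, with the transport injected so it's testable. Names like `Transport` and `QUIET_MS` are my assumptions, not the production module:

```typescript
type Transport = (url: string) => Promise<{ status: number; ok: boolean }>;

const QUIET_MS = 5 * 60 * 1000; // 5-minute self-quiet window
let gistBackoffUntil = 0;       // module-level lockout shared by all writers

async function gistWrite(url: string, send: Transport): Promise<boolean> {
  if (Date.now() < gistBackoffUntil) return false; // quiet: don't freshen the block
  const res = await send(url);
  if (res.status === 403 || res.status === 429) {
    gistBackoffUntil = Date.now() + QUIET_MS; // go quiet, save API budget
    return false;
  }
  return res.ok;
}
```

The key property: during the quiet window, a failed write costs zero API calls, so the secondary block can actually expire.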

2. saveNodes drops the retry loop

Was: 3 attempts with exponential backoff per heartbeat. Each attempt compounded the rate limit during a block.

Now: a single attempt that throws on failure. The agent's natural retry on the next heartbeat picks up after the quiet period clears.
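The new shape, roughly (the signature is my sketch, not the actual function):

```typescript
// One attempt, no retry loop: a failed write surfaces immediately,
// and the next scheduled heartbeat is the retry.
async function saveNodes(write: () => Promise<boolean>): Promise<void> {
  const ok = await write();
  if (!ok) throw new Error("saveNodes: gist write failed; next heartbeat will retry");
}
```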

3. Heartbeat 30s → 10min

With adaptive idle backoff, 4 nodes × 6 writes/hr = 24 writes/hr to the registry file. That's well below any GitHub threshold, even during deploy spikes. We lose 9.5 minutes of liveness-detection precision, which doesn't matter — fleet-watchdog still alerts within 2 minutes of a real outage.

4. Migration to a clean account

The deedeb user account had been heavily used for hours, and secondary blocks tend to escalate in duration the more you hit them. We migrated all 75 storage files to a brand-new GitHub user (`rhinomoneyplatform-ops`) via a single Node.js script:

```bash
OLD_TOKEN=... OLD_GIST=... NEW_TOKEN=... NEW_GIST=... \
node bot/migrate-gist-storage.cjs
```

Output: `✅ Migration complete!` 75 files copied, verified consistent, Vercel env vars swapped, deploy triggered. Within 90 seconds the entire system was running on a clean account with zero rate-limit history.

The original deedeb account is untouched and will recover on its own. We can wire it back in as a dual mirror once the secondary block expires — an instant doubling of capacity.
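The heart of a migration like this is just reshaping one gist's `files` map into the other's PATCH body. A sketch following the Gists API's documented response shape (this is not the actual script):

```typescript
// A gist file as returned by GET /gists/{id} (content slice only).
interface GistFile { content: string }

// Build the PATCH payload for the destination gist from the
// source gist's `files` map.
function buildMigrationPayload(
  srcFiles: Record<string, GistFile>,
): { files: Record<string, { content: string }> } {
  const files: Record<string, { content: string }> = {};
  for (const [name, file] of Object.entries(srcFiles)) {
    // Note: files over ~1MB come back truncated and must be
    // re-fetched via their raw_url before copying.
    files[name] = { content: file.content };
  }
  return { files };
}
```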

What I learned the hard way

Silent writes are worse than loud failures. Multiple agents in this system used to call `writeJson(...)` and ignore the boolean return value. They'd report success while losing the actual write. The customer-success agent crashed with `Cannot read properties of undefined (reading 'map')` for 10 days — the $29 sale we'd already made was invisible the whole time. One defensive line fixed it. Same pattern for `saveNodes`. Future agents must check write success or throw.
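Both halves of that lesson fit in a few lines. A sketch of the pattern (helper names are mine; the `undefined` guard mirrors the `.map` crash described above):

```typescript
// Never ignore a boolean write result: fail loudly instead.
async function mustWrite(write: () => Promise<boolean>, what: string): Promise<void> {
  if (!(await write())) throw new Error(`write failed: ${what}`);
}

// And never .map over a field that a lost write may have left undefined.
function safeEntries<T>(entries: T[] | undefined): T[] {
  return entries ?? []; // the shape of the one-line fix
}
```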

Per-resource limits matter more than per-account limits. The 5,000/hr primary limit was a comfortable lie. The real wall is per-gist-file PATCH frequency, and it's invisible in any monitoring API.

Retries without backoff are weapons turned on yourself. If you're hitting a rate limit, the LAST thing you want is more requests. saveNodes' "helpful" 3× retry made the storm last hours longer than a single attempt would have.

The diagnostic endpoint must not bypass the protection it diagnoses. I wrote `/api/admin/storage-diag` with a "direct gist probe" that hit GitHub directly. Every time anyone called the diag during the block, it extended the rate limit by 5 more minutes. Removed.
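The fix is for diagnostics to report the quiet state rather than probe through it. A minimal sketch (function and field names are mine):

```typescript
// A diag endpoint should read the lockout, not re-hit the API.
function storageDiag(backoffUntil: number, now: number) {
  return {
    quiet: now < backoffUntil,                       // are we in the self-quiet window?
    quietMsRemaining: Math.max(0, backoffUntil - now), // how long until writes resume
  };
}
```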

What's running tonight, while I sleep

  • **Fleet** — 4 nodes, 10 min heartbeat, adaptive 60s→5min idle polling
  • **Anti-storm runbook** in memory: agents will refuse to fleet-restart if writes are failing, refuse to manual-trigger crons during a known block
  • **system-improver agent** — runs the full ASK→ANSWER→EXECUTE→EXAMINE→CHECK loop on the whole stack every 4 hours
  • **agent-forge (Nova)** — proposing new agents; `code-reviewer` and `qa-tester` gate them; `code-deployer` PUTs approved code straight to GitHub via Contents API for autonomous deploys

This morning I planned a 30-minute optimization. The system planned a 12-hour debugging epic. Both happened. The mistake was a beautiful one — the kind that exposes a deep architectural assumption you didn't know you were making.

The receipts

Public real-time agent feed: [rhinomoney-app.vercel.app/live](https://rhinomoney-app.vercel.app/live)

Run a business on $0: [rhinomoney-app.vercel.app/support](https://rhinomoney-app.vercel.app/support)

If you've hit GitHub's secondary rate limit and lost an afternoon to it, reply with your worst storage-layer war story. I want to read every one.

— David & the 60 agents
