Servonaut AI: a hosted gateway with dollar budgets, not token quotas

Zoltan Nagy

May 8, 2026

Servonaut AI is the hosted AI gateway included on Solo and Teams plans. Hit F2 in the TUI, log in, and start chatting with your fleet — no personal API key needed, no provider account to set up. The thing nobody else seems to do: the gate is a real dollar budget, not a per-request token quota that loses meaning the day a vendor changes pricing.

This post is the design rationale: why dollar-budget enforcement, what fails over to what, and what we deliberately gave up.

What it is

The chat panel routes through mcp.servonaut.dev, which proxies the request to the cheapest viable provider in a failover chain:

Anthropic Haiku  →  OpenAI gpt-4o-mini  →  Gemini Flash  →  Ollama Cloud

If a provider returns a 5xx, rate-limits the request, or returns an empty body, the gateway transparently fails over to the next one. Your CLI talks to one logical provider and never has to handle the routing itself.
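A minimal sketch of that failover loop, in Python for illustration only. The gateway's real code isn't shown in this post, so every name here (ProviderResponse, should_fail_over, complete_with_failover) is hypothetical. Two assumptions: a rate limit is signalled by HTTP 429, and a transport exception is treated like a 5xx.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProviderResponse:
    status: int  # HTTP-style status code
    body: str    # raw completion text


# Hypothetical provider callable: takes a prompt, returns a ProviderResponse.
Provider = Callable[[str], ProviderResponse]


def should_fail_over(resp: ProviderResponse) -> bool:
    """True for a 5xx, a rate limit (assumed to be 429), or an empty body."""
    return resp.status >= 500 or resp.status == 429 or not resp.body.strip()


def complete_with_failover(prompt: str, chain: list[tuple[str, Provider]]) -> tuple[str, str]:
    """Try each provider in order; return (model_used, body) from the first good one."""
    for name, call in chain:
        try:
            resp = call(prompt)
        except Exception:
            continue  # assumption: transport errors count as a failed provider
        if not should_fail_over(resp):
            return name, resp.body
    raise RuntimeError("all providers in the chain failed")


# Stub providers standing in for the real chain.
demo_chain = [
    ("haiku", lambda p: ProviderResponse(503, "")),
    ("gpt-4o-mini", lambda p: ProviderResponse(429, "slow down")),
    ("gemini-flash", lambda p: ProviderResponse(200, "pong")),
]
print(complete_with_failover("ping", demo_chain))  # ('gemini-flash', 'pong')
```

Returning the provider name alongside the body is what lets the caller report which model actually answered.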

Each tier has a monthly dollar budget (free=$0, Solo=$N, Teams=$M/seat — see pricing). When you exhaust the budget, you can buy one-time top-up packs that don't expire for 12 months. The CLI shows the remaining balance inline in the chat panel and via servonaut ai quota.

Why dollars, not tokens

Token quotas have one nice property: they're the unit the provider bills in. They have one terrible property: they have no meaning to a human.

If we said "Solo gets 200,000 tokens per month", the questions that immediately follow are:

Is that input tokens or output tokens?
For Claude Haiku or Sonnet?
What about cached prompts?
Does a 20MB log analysis with prompt_caching_beta count the same as a chat exchange?

The answer is all of the above and none of the above. Token semantics drift across providers, across models in the same provider, and across cache states. So we sidestepped the question: AiBudgetService::preflightCheck checks dollars spent against your budget. Pricing tables for each provider are seeded into the database via app:seed-ai-config; admins can update them without redeploying when vendors change prices (which they do, constantly).

The trade-off: we give up the ability to give you an exact "tokens left" count. You get a dollar amount that means the same thing tomorrow as it did yesterday. We think that's the right call.
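To make the conversion concrete, here is a rough Python sketch of turning one request's token usage into dollars from a pricing table. It is not the AiBudgetService code. The table layout, the model names, and every rate below are placeholders made up for illustration, not real vendor prices.

```python
from decimal import Decimal

# Hypothetical pricing table: USD per one million tokens, keyed by model.
# All numbers are placeholders, not real vendor prices.
PRICING = {
    "model-a": {"input": Decimal("1.00"), "output": Decimal("5.00"), "cached_input": Decimal("0.10")},
    "model-b": {"input": Decimal("0.15"), "output": Decimal("0.60"), "cached_input": Decimal("0.075")},
}

MILLION = Decimal(1_000_000)


def request_cost_usd(model: str, input_tokens: int, output_tokens: int,
                     cached_input_tokens: int = 0) -> Decimal:
    """Convert one request's token usage into dollars using the model's rates."""
    rates = PRICING[model]
    return (
        rates["input"] * input_tokens
        + rates["cached_input"] * cached_input_tokens
        + rates["output"] * output_tokens
    ) / MILLION


def remaining_usd(budget: Decimal, spent: Decimal) -> Decimal:
    """Dollars left in the budget; never negative."""
    return max(budget - spent, Decimal("0"))


# The same token counts cost different amounts on different models,
# which is why a bare token quota says little on its own.
print(request_cost_usd("model-a", 2000, 1000))  # 0.007
print(request_cost_usd("model-b", 2000, 1000))  # 0.0009
```

Decimal is used rather than float so that many small charges add up without rounding drift.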

Three enforcement modes

The system has three modes — off, shadow, enforce — controlled by AI_BUDGET_ENFORCEMENT:

off — preflight skipped entirely.
shadow — preflight runs and logs what it would have decided, but always allows. Used for the initial rollout to validate the math.
enforce — the real gate. Hard cap blocks the request; soft-cap downgrades to a cheaper tier (usually Flash) and lets it through with a warning.

We shipped enforce mode in PR #43 after running shadow for a few weeks and watching the logs match what we'd expect. The kill-switch path is one env var: set AI_BUDGET_ENFORCEMENT=shadow and docker compose restart php — done. Total rollback in ~15 seconds without a redeploy.
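The three modes can be sketched in a few lines of Python. This is an illustration of the behaviour described above, not the production preflight. The Decision type, the function names, the idea of expressing soft and hard caps as dollar thresholds, and the fallback to "off" when the variable is unset are all assumptions.

```python
import os
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum


class Mode(Enum):
    OFF = "off"
    SHADOW = "shadow"
    ENFORCE = "enforce"


@dataclass
class Decision:
    allowed: bool
    downgraded: bool = False  # True when the request is moved to a cheaper tier
    note: str = ""


def evaluate(spent: Decimal, soft_cap: Decimal, hard_cap: Decimal) -> Decision:
    """What the gate would decide from the numbers alone."""
    if spent >= hard_cap:
        return Decision(allowed=False, note="hard cap reached")
    if spent >= soft_cap:
        return Decision(allowed=True, downgraded=True, note="soft cap reached")
    return Decision(allowed=True)


def preflight(mode: Mode, spent: Decimal, soft_cap: Decimal, hard_cap: Decimal,
              log=print) -> Decision:
    if mode is Mode.OFF:
        return Decision(allowed=True)  # preflight skipped entirely
    verdict = evaluate(spent, soft_cap, hard_cap)
    if mode is Mode.SHADOW:
        log(f"shadow: would have decided {verdict}")
        return Decision(allowed=True)  # log only, always allow
    return verdict  # enforce: the real gate


def mode_from_env() -> Mode:
    # Assumption: an unset variable falls back to "off".
    return Mode(os.environ.get("AI_BUDGET_ENFORCEMENT", "off"))
```

Keeping evaluate separate from preflight means shadow and enforce run exactly the same math, so shadow logs are a faithful preview of what enforce will do.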

What we don't do

A few things we said no to, all on purpose:

No per-request token cap. A single Claude call against a 12MB log payload is fine; we cap the dollar amount, not the call. (PR #44 lifted the request body cap from 2MB to 12MB precisely because the dollar budget makes the request-size cap redundant.)
No "fair use" rate limit on top of the budget. If you have $5 of Solo budget left and want to spend it all in a 30-minute marathon, you can. Concurrency is capped per user (PR #42 added that; it protects neighbours, not us), but throughput within a session is just whatever the provider lets through.
No silent re-routing. If your call fails over from Anthropic to OpenAI, the SSE stream emits a leading event with the actual model_used. The CLI shows it in the panel header. You'll never wonder which model just answered you.
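For the last point, a client could pick model_used out of the leading event roughly like this. The post doesn't document the wire format, so the event name "meta" and the JSON payload shape are assumptions. The parsing itself follows the standard SSE framing, where "event:" and "data:" fields are grouped until a blank line.

```python
import json
from typing import Iterable, Iterator, Optional


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Group SSE lines into events; a blank line ends each event."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            data.append(value[1:] if value.startswith(" ") else value)
        # Other fields and comment lines (starting with ":") are ignored.
    if data:
        # Lenient: also flush a final event that lacks a trailing blank line.
        yield {"event": event, "data": "\n".join(data)}


def model_used(lines: Iterable[str]) -> Optional[str]:
    """Return model_used from the first event, if that event carries one."""
    first = next(iter_sse_events(lines), None)
    if first is None:
        return None
    try:
        payload = json.loads(first["data"])
    except json.JSONDecodeError:
        return None
    return payload.get("model_used") if isinstance(payload, dict) else None


# Hypothetical stream: a leading metadata event, then the first text chunk.
stream = [
    "event: meta",
    'data: {"model_used": "gpt-4o-mini"}',
    "",
    "data: Hello",
    "",
]
print(model_used(stream))  # gpt-4o-mini
```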

The kill-switch

Production has one master kill: PREMIUM_AI_ENABLED=false in the shared Caddy .env, container restart, and /api/ai/* returns 503 immediately. We've never had to use it in anger; it exists because we want it to exist before we need it.
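The shape of that check is simple enough to sketch. This Python version is illustrative only. The function names are hypothetical, and treating anything other than an explicit "false" as enabled is an assumption about how the flag is parsed.

```python
import os
from typing import Mapping, Optional


def premium_ai_enabled(env: Mapping[str, str] = os.environ) -> bool:
    # Assumption: anything other than an explicit "false" leaves the feature on.
    return env.get("PREMIUM_AI_ENABLED", "true").strip().lower() != "false"


def kill_switch_status(path: str, env: Mapping[str, str] = os.environ) -> Optional[int]:
    """Return 503 for /api/ai/* when the switch is off; None means carry on."""
    if path.startswith("/api/ai/") and not premium_ai_enabled(env):
        return 503
    return None
```

Running the check before any routing or budget logic means a disabled gateway costs nothing per request, and routes outside /api/ai/ are untouched.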

Free tier — by design, not by accident

Free-tier users get $0 of hosted budget. That's intentional, not a bug:

The free CLI is MIT-licensed and unrestricted. You can wire up your own Anthropic / OpenAI / Gemini / Ollama key in config.json and chat to your heart's content.
The hosted gateway is the value-add of the paid tier. It's the one feature where "give it away free" runs us a real, monthly bill we can't claw back.

If you want AI chat without paying, run a local Ollama install and point the CLI at it. That works on the free tier and costs us nothing, which means we're happy to keep recommending it.

Trying it

Run servonaut login from the CLI, hit F2, and type something. The chat-panel header shows your remaining balance. Solo and Teams plans are on the pricing page; free-tier instructions are on the quickstart page.