Cost model

Colibri tracks every token that passes through an agent session and meters cost against a configurable budget. The key insight is that cache-hit tokens on DeepSeek cost roughly one tenth as much as fresh tokens, so the prompt prefix is engineered to be byte-stable across requests to maximize cache hits. Three cost modes (fast, smart, max) represent different points on the speed/cost trade-off, and the daemon auto-escalates to the next mode when a session exhausts the budget of a cheaper one.

Byte-stable prompt prefix → cache-hit metering

The system prompt and early context blocks are byte-for-byte identical across consecutive requests to the same DeepSeek endpoint. DeepSeek’s cache-hit pricing discounts these tokens by ~90%. The colibri-deepseek probe determines the split between cached and fresh tokens, and the cost tracker records both so the session budget reflects the actual discounted cost, not the nominal token count.
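A minimal sketch of what "byte-stable" means in practice, assuming a prompt is rendered as a fixed prefix followed by volatile blocks. The `Prompt` type and both helper functions are hypothetical names for illustration, not Colibri's actual API.

```rust
/// Hypothetical prompt layout: a stable prefix followed by volatile blocks.
pub struct Prompt {
    pub prefix: String,
    pub volatile: Vec<String>,
}

impl Prompt {
    /// Render with the stable prefix first, so it forms the leading bytes
    /// of every request in the session.
    pub fn render(&self) -> String {
        let mut out = String::with_capacity(self.prefix.len());
        out.push_str(&self.prefix);
        for block in &self.volatile {
            out.push('\n');
            out.push_str(block);
        }
        out
    }
}

/// Number of leading bytes that two rendered prompts have in common.
pub fn common_prefix_len(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b.iter()).take_while(|(x, y)| x == y).count()
}

/// True when both requests start with exactly the same prefix bytes.
pub fn prefix_is_stable(prefix: &str, first: &str, second: &str) -> bool {
    first.as_bytes().starts_with(prefix.as_bytes())
        && second.as_bytes().starts_with(prefix.as_bytes())
}
```

Anything that changes between requests (timestamps, request IDs, reordered tool lists) has to stay out of the prefix. A single differing byte near the start cuts the shared prefix short and, under this sketch's assumptions, turns the remaining tokens into cache misses.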

Why not just count tokens: token counting with an offline tokenizer gives you an upper bound but not the real cost. DeepSeek’s API sometimes re-caches and sometimes doesn’t — the probe measures what actually happened. The discount is too large (10×) to leave unmeasured.
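The gap between the nominal upper bound and the metered cost can be sketched as follows. The prices are hypothetical placeholders chosen only to reproduce the 10× ratio; they are not DeepSeek's real rates. The type and method names are also assumptions, and output tokens are ignored for brevity.

```rust
/// Prompt-token usage for one request, split into cache hits and misses.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PromptUsage {
    pub cache_hit_tokens: u64,
    pub cache_miss_tokens: u64,
}

/// Hypothetical prices in micro-units per million tokens.
/// Integers avoid floating-point rounding in budget arithmetic.
#[derive(Debug, Clone, Copy)]
pub struct Prices {
    pub hit_per_mtok: u64,
    pub miss_per_mtok: u64,
}

impl PromptUsage {
    /// Upper bound: every prompt token billed at the fresh (cache-miss) rate,
    /// which is what an offline tokenizer count would imply.
    pub fn nominal_cost(&self, p: Prices) -> u64 {
        (self.cache_hit_tokens + self.cache_miss_tokens) * p.miss_per_mtok / 1_000_000
    }

    /// Metered cost: hits and misses billed at their own rates.
    pub fn metered_cost(&self, p: Prices) -> u64 {
        (self.cache_hit_tokens * p.hit_per_mtok + self.cache_miss_tokens * p.miss_per_mtok)
            / 1_000_000
    }
}
```

With a 90% hit rate on a 10,000-token prompt and these placeholder prices, the nominal cost is 10,000 micro-units and the metered cost is 1,900. A gap that size is why the measured split is recorded alongside the token count.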

headroom-sidecar, COLIBRI-TOKENOMICS-TRIFECTA.md, crates/colibri-deepseek/src/lib.rs

| Mode | Budget (tokens) | Behavior |
| --- | --- | --- |
| Fast | 16K | Maximum cache-hits, minimum fresh tokens. Rejects large expansions early. |
| Smart | 64K | Default. Balances cache reuse with room for follow-up turns. |
| Max | 256K | Almost never hits budget. For one-shot deep tasks where cost is secondary. |

The daemon auto-escalates when a session exhausts its budget in a lower mode: fast → smart → max. Escalation is one-way (never downgrades mid-session).
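The one-way escalation chain can be sketched as a small state machine. This is an illustration under stated assumptions, not the code in crates/colibri-daemon/src/cost.rs: the type names are hypothetical, "K" is taken to mean 1,000 tokens, and usage is assumed to accumulate across modes without resetting on escalation.

```rust
/// The three cost modes, ordered from cheapest to most expensive.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CostMode {
    Fast,
    Smart,
    Max,
}

impl CostMode {
    /// Token budget per mode (assumes "K" means 1,000 tokens).
    pub fn budget_tokens(self) -> u64 {
        match self {
            CostMode::Fast => 16_000,
            CostMode::Smart => 64_000,
            CostMode::Max => 256_000,
        }
    }

    /// Next mode up the chain, or None at the top. There is no way down.
    pub fn escalate(self) -> Option<CostMode> {
        match self {
            CostMode::Fast => Some(CostMode::Smart),
            CostMode::Smart => Some(CostMode::Max),
            CostMode::Max => None,
        }
    }
}

/// Hypothetical per-session cost state.
pub struct Session {
    pub mode: CostMode,
    pub used_tokens: u64,
    pub auto_escalate: bool,
}

impl Session {
    /// Record usage and step up while the current budget is exhausted.
    /// Returns false when the session is over budget and cannot escalate.
    pub fn record(&mut self, tokens: u64) -> bool {
        self.used_tokens += tokens;
        while self.used_tokens > self.mode.budget_tokens() {
            match (self.auto_escalate, self.mode.escalate()) {
                (true, Some(next)) => self.mode = next,
                _ => return false,
            }
        }
        true
    }
}
```

Because `escalate` only ever returns a more expensive mode, a session that has reached smart cannot drop back to fast, which matches the one-way rule above.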

Why three modes, not a continuous slider: simplicity wins here. Three well-understood points cover the space — operators pick by risk appetite, not by fine-tuning a number. The escalation chain means “start cheap, pay more only if it works.”

COLIBRI-TOKENOMICS-TRIFECTA.md, crates/colibri-daemon/src/cost.rs

T14 compaction (budget trim, not truncate)

When a session is about to exceed its budget, Colibri compacts the tool results in the volatile region — it sends them through the headroom sidecar for summarization, then trims the oldest volatile blocks until the prompt fits within budget. The prefix (system prompt, static context) is never trimmed — only the volatile suffix.

If compaction is insufficient and auto-escalation is enabled, the mode steps up before truncating.

Why not just truncate: truncating mid-conversation loses context the agent needs to continue. Compaction preserves the semantic content at lower token cost. The headroom sidecar is optional (off by default); without it, the fallback is simple truncation.
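The compaction order described in this section (summarize, then trim the oldest volatile blocks, never touch the prefix) might look roughly like the sketch below. The `Block` type, the `compact` signature, and the summarizer callback standing in for the headroom sidecar are all assumptions made for illustration.

```rust
use std::collections::VecDeque;

/// A volatile block (for example a tool result) with its token count.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Block {
    pub text: String,
    pub tokens: u64,
}

/// Outcome of a compaction pass.
#[derive(Debug, PartialEq, Eq)]
pub enum Compaction {
    Fits,
    /// Even an empty volatile region does not fit: the caller should
    /// escalate the mode (if enabled) rather than cut into the prefix.
    NeedsEscalation,
}

fn total_tokens(prefix_tokens: u64, volatile: &VecDeque<Block>) -> u64 {
    prefix_tokens + volatile.iter().map(|b| b.tokens).sum::<u64>()
}

/// Compact the volatile region until the prompt fits within `budget`.
/// `summarize` stands in for the optional sidecar; None means it is off.
/// The oldest blocks sit at the front of the queue.
pub fn compact(
    prefix_tokens: u64,
    volatile: &mut VecDeque<Block>,
    budget: u64,
    summarize: Option<&dyn Fn(&Block) -> Block>,
) -> Compaction {
    if total_tokens(prefix_tokens, volatile) <= budget {
        return Compaction::Fits;
    }
    // Step 1: summarize volatile blocks if a summarizer is available.
    if let Some(f) = summarize {
        for b in volatile.iter_mut() {
            let compacted = f(&*b);
            *b = compacted;
        }
    }
    // Step 2: drop the oldest volatile blocks until the prompt fits.
    while total_tokens(prefix_tokens, volatile) > budget && !volatile.is_empty() {
        volatile.pop_front();
    }
    if total_tokens(prefix_tokens, volatile) <= budget {
        Compaction::Fits
    } else {
        Compaction::NeedsEscalation
    }
}
```

The prefix only ever enters the calculation as a token count, so nothing in this function can shorten it. With no summarizer the loop degenerates to oldest-first dropping, which corresponds to the truncation fallback mentioned above.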

headroom-sidecar, crates/colibri-daemon/src/session.rs

The colibri-deepseek crate sends a preflight request with a known prompt to the DeepSeek API and reads the usage object in the response to determine the cache-hit split (prompt_cache_hit_tokens / prompt_cache_miss_tokens). This is provider-specific: the two fields are part of DeepSeek's usage reporting, and other providers report cache usage differently, if at all. The probe runs once per session configuration change, not per request.
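The "once per session configuration change" behavior can be sketched as a small memo keyed on the configuration. Everything here is hypothetical (`SessionConfig`, `ProbeCache`, the closure standing in for the network call); the real crate may key and store its results differently.

```rust
/// Cache-hit split reported by one probe run.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CacheSplit {
    pub hit_tokens: u64,
    pub miss_tokens: u64,
}

/// Hypothetical session configuration; a change to any field triggers a re-probe.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SessionConfig {
    pub model: String,
    pub system_prompt: String,
}

/// Remembers the last probe result together with the config it was taken for.
#[derive(Default)]
pub struct ProbeCache {
    last: Option<(SessionConfig, CacheSplit)>,
}

impl ProbeCache {
    /// Return the stored split if the config is unchanged; otherwise run
    /// `probe` (a stand-in for the preflight request) and store the result.
    pub fn split_for<F>(&mut self, config: &SessionConfig, mut probe: F) -> CacheSplit
    where
        F: FnMut(&SessionConfig) -> CacheSplit,
    {
        if let Some((cached_cfg, split)) = &self.last {
            if cached_cfg == config {
                return *split;
            }
        }
        let split = probe(config);
        self.last = Some((config.clone(), split));
        split
    }
}
```

Repeated calls with an unchanged configuration return the stored split without invoking the probe again, so the preflight cost is paid only when something that affects the prefix changes.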

Why a probe and not a hook: middleware that intercepts every API response would couple cost tracking to the HTTP layer. A probe decouples it — the cost tracker asks “what was the cache ratio?” and the probe answers, independently of how the request was made.
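One way to picture the decoupling is a trait that the cost tracker queries without knowing anything about HTTP. The trait, the `FixedRatio` stand-in, and the unit prices below are assumptions for illustration only. The 10:1 price ratio mirrors the discount described earlier.

```rust
/// Anything that can answer "what was the cache ratio?".
pub trait CacheRatioSource {
    /// Share of prompt tokens served from cache, in basis points (0..=10_000).
    fn cache_hit_bps(&self) -> u32;
}

/// Stand-in source with a fixed ratio, e.g. for tests or a stored probe result.
pub struct FixedRatio(pub u32);

impl CacheRatioSource for FixedRatio {
    fn cache_hit_bps(&self) -> u32 {
        self.0
    }
}

/// Hypothetical unit prices per token, kept at the 10:1 ratio from the text.
const HIT_UNITS: u64 = 1;
const MISS_UNITS: u64 = 10;

/// Cost tracker that depends only on the trait, not on the HTTP layer.
pub struct CostTracker<S: CacheRatioSource> {
    source: S,
    pub spent: u64,
}

impl<S: CacheRatioSource> CostTracker<S> {
    pub fn new(source: S) -> Self {
        Self { source, spent: 0 }
    }

    /// Charge a request's prompt tokens using the ratio the source reports.
    pub fn charge(&mut self, prompt_tokens: u64) -> u64 {
        let bps = u64::from(self.source.cache_hit_bps().min(10_000));
        let hit = prompt_tokens * bps / 10_000;
        let miss = prompt_tokens - hit;
        let cost = hit * HIT_UNITS + miss * MISS_UNITS;
        self.spent += cost;
        cost
    }
}
```

Swapping the source (a live probe, a stored result, a fixed value in tests) requires no change to the tracker, which is the property the probe design is after.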

crates/colibri-deepseek/src/lib.rs

  • task-board — the scheduler that dispatches tasks within session budgets
  • mother-hive — MCP architecture (different cost domain)
  • quality-gates — the gate that validates cost-mode parsing