Cost model

What this is

Colibri tracks every token that passes through an agent session and meters cost against a configurable budget. The key insight: cache-hit tokens cost 10× less than fresh tokens on DeepSeek — so the prompt prefix is engineered to be byte-stable across requests, maximizing cache hits. Three cost modes (fast, smart, max) represent different points on the speed/cost trade-off, and the model auto-escalates when a cheaper mode can’t keep up.

Decisions

Byte-stable prompt prefix → cache-hit metering

The system prompt and early context blocks are byte-for-byte identical across consecutive requests to the same DeepSeek endpoint. DeepSeek’s cache-hit pricing discounts these by ~90%. Colibri’s colibri-deepseek probe determines the exact token-count split between cached and fresh tokens per request, and the cost tracker records both so the session budget reflects the actual discounted cost, not the nominal token count.

Why not just count tokens: token counting with an offline tokenizer gives you an upper bound but not the real cost. DeepSeek’s API sometimes re-caches and sometimes doesn’t — the probe measures what actually happened. The discount is too large (10×) to leave unmeasured.

→ headroom-sidecar, COLIBRI-TOKENOMICS-TRIFECTA.md, crates/colibri-deepseek/src/lib.rs

Three cost modes (fast → smart → max)

Mode	Budget (tokens)	Behavior
Fast	16K	Maximum cache-hits, minimum fresh tokens. Rejects large expansions early.
Smart	64K	Default. Balances cache reuse with room for follow-up turns.
Max	256K	Almost never hits budget. For one-shot deep tasks where cost is secondary.

The daemon auto-escalates when a session exhausts its budget in a lower mode: fast → smart → max. Escalation is one-way (never downgrades mid-session).

Why three modes, not a continuous slider: simplicity wins here. Three well-understood points cover the space — operators pick by risk appetite, not by fine-tuning a number. The escalation chain means “start cheap, pay more only if it works.”

→ COLIBRI-TOKENOMICS-TRIFECTA.md, crates/colibri-daemon/src/cost.rs

T14 compaction (budget trim, not truncate)

When a session is about to exceed its budget, Colibri compacts the tool results in the volatile region — it sends them through the headroom sidecar for summarization, then trims the oldest volatile blocks until the prompt fits within budget. The prefix (system prompt, static context) is never trimmed — only the volatile suffix.

If compaction is insufficient and auto-escalation is enabled, the mode steps up before truncating.

Why not just truncate: truncating mid-conversation loses context the agent needs to continue. Compaction preserves the semantic content at lower token cost. The headroom sidecar is optional (off by default); without it, the fallback is simple truncation.

→ headroom-sidecar, crates/colibri-daemon/src/session.rs

Cache-hit probe (DeepSeek-specific)

The colibri-deepseek crate sends a preflight request with a known prompt to the DeepSeek API and parses the response headers to determine the cache-hit split (prompt_cache_hit_tokens / prompt_cache_miss_tokens). This is provider-specific — DeepSeek is the only provider that exposes this granularity. The probe runs once per session configuration change, not per request.

Why a probe and not a hook: middleware that intercepts every API response would couple cost tracking to the HTTP layer. A probe decouples it — the cost tracker asks “what was the cache ratio?” and the probe answers, independently of how the request was made.

→ crates/colibri-deepseek/src/lib.rs