Batiste is a zero-trust runtime for AI agents. Every tool call passes scope → auth → audit before it reaches a handler, lands in an append-only ledger with per-call token attribution, and work is only done when the gate is green. Runs on your own infra, MCP-native. The ready edge is the signed ledger + orchestration + discipline gates — not token savings, not whole verticals.
What it is
Not another harness that makes an agent run. The layer that makes the agent's work verifiable: governance, observability, audit-by-construction, and gates that actually fail the build. The brain stays external by design — no LLM SDK in the request path — so every action is attestable regardless of which model decided to call it. Three pillars, each shell-verifiable today.
scope (path-scoped deny) → auth (JWT) → audit (SQLite WAL) is the path of every call, not optional middleware. A call that skips it never reaches the handler. A kill switch revokes every node network-wide in microseconds. MCP-native — existing tools plug in without rewrite.
packages/{scope,auth,audit,transport} · self-hosted, zero required cloudAppend-only log records every tool call with timing, result, model, provider and token usage — the action is on the record before it runs. Documents carry sha256 + recomputed sha256_signed, so an artefact traces back to its inputs and its SHA.
packages/audit/src/ledger.ts · 395 tests on real SQLite :memory:"Done = green gate, not memory." repo-hygiene, deploy-lineage, dfam-preflight, deploy-smoke, pitch-integrity. Each returns exit 1 on failure — including a gate whose only job is to fail a demo that fakes numbers.
packages/{repo-hygiene,deploy-lineage,pitch-integrity}The product question
A decision-maker running agents wants to know: what did it cost, can I prove it, was it gated, how fast, and can I reproduce it. Each KPI is real and measurable — and ships with the file or command that proves it.
| KPI | What it measures | How Batiste measures it |
|---|---|---|
| Token consumption observability cost attribution |
Input / output / cached tokens per agent, per task, per workflow, with model_id and provider. Turns a mystery invoice into a per-agent line-item you can attribute, forecast and cap. | audit_log columns input_tokens / output_tokens / cached_tokens / model_id / provider; tokenTotals() SUMs only rows where model_id IS NOT NULL — the measured provider payload. Unmeasured calls cannot dilute or inflate it. Never a chars/4 estimate, never a savings claim. |
| Audit coverage shipped | % of agent actions / outputs written to the signed append-only ledger. The difference between an agent you can put in front of an auditor and one you cannot. | SQLite WAL ledger records every tool call (timestamp / session_id / agent_id / tool / args / result / duration_ms). A compliance report aggregates total / success / denied / error, unique agents, unique sessions, top tools over any window. Integrity chain survives redaction. |
| Gate-discipline shipped | % of deliveries that passed a green gate before being marked done. "Done" is a machine returning green, not a human remembering to check. | Executable scripts returning exit 1 on failure: repo-hygiene.mjs (git-clean / no-dup-hash / canonical), deploy-lineage.mjs (refuses an untracked or dirty source, stamps the commit SHA), dfam-preflight.mjs. pitch-integrity closes the door on theatre. |
| Orchestration efficiency shipped | Agents run in parallel and wall-clock vs the sequential equivalent. The cost isn't just tokens — it's time-to-result. | DAG executor (zero npm deps): toposort + ready-queue + per-tag concurrency pools (concurrency:'auto' = floor(cpus/2); concurrencyByTag reads the API tier). Upstream failure skips only dependents; independent branches keep running, partial results stay usable. 21 tests passing. |
| Provenance / reproducibility in progress | Any output traced back to inputs + model + SHA. Reproducibility is what separates "AI output" from "evidence". | Shipped: artefact → SHA → parent_ref; the router is pure and deterministic (same task in → same routing decision out) and stamps its model×task choice on the ledger; the executor emits per-operation lineage. Roadmap: end-to-end re-run that matches bit-for-bit — labelled intent, not shipped. |
Proof, not promise
// numbers below: meta-metrics measured in this session — counts of agents, tokens, gates and tasks; an example of one day's work, not a published benchmark
The meta-narrative: the runtime turned the lens on the house's own work — every agent metered, every gate enforced, every shipped deploy hash-matched against its source. Nothing was applied without passing its gate. The pitch is the behaviour, not the slide.
Why it's defensible
packages/audit/src/{ledger.ts,redaction.ts}
packages/pitch-integrity/ · node deploy-lineage.mjs
packages/{atrium,memory,router}
The vision
Batistetize
the world.
Every agent you run is metered, attributed, audited and gated — or it isn't done. The harness era proved agents can act. The next era proves they can be trusted. That layer is governance + observability + discipline + audit-by-construction. That layer is Batiste.