Batiste — the runtime that governs AI agents

Why now

Capability is outrunning oversight.

Reported AI incidents are climbing, yet there is still no standard way to measure whether an agent behaved as intended. That gap, between what AI does and what an organisation can verify, is the layer Batiste occupies.

233 → 362

AI incidents, 2024 → 2025
362 incidents recorded in 2025, up from 233 in 2024 (+55%), and that counts only the publicly reported ones.

Stanford HAI · AI Index 2026 · Responsible AI

“sparse”

responsible-AI benchmarking
Reporting on responsible-AI benchmarks “remains sparse” — while capability benchmarks (MMLU, SWE-bench) are near-universal. The field measures what a model can do, not whether its behaviour is auditable.

Stanford HAI · AI Index 2026 · Responsible AI

// market data (Stanford HAI · AI Index 2026) — it sizes the problem; these are not Cachola Tech metrics.

hai.stanford.edu/ai-index

What it is

Operational discipline, by construction.

Not another harness that makes an agent run. The layer that makes the agent's work verifiable: governance, observability, audit-by-construction, and gates that actually fail the build. The brain stays external by design — no LLM SDK in the request path — so every action is attestable regardless of which model decided to call it. Three pillars, each shell-verifiable today.

Zero-trust chain

Every call passes through scope (path-scoped deny) → auth (JWT) → audit (SQLite WAL); the chain is the path itself, not optional middleware. A call that skips it never reaches the handler. A kill switch revokes every node network-wide in microseconds. MCP-native: existing tools plug in without a rewrite.

packages/{scope,auth,audit,transport} · self-hosted, zero required cloud
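A minimal sketch of what "the chain is the path" could look like, assuming a wrapper that is the only way to reach a handler. Every name here (`ToolCall`, `AuditRow`, `ChainDeps`, `guard`) is hypothetical, `verifyToken` stands in for real JWT verification, and an in-memory array stands in for the SQLite WAL ledger. It is an illustration of the ordering, not Batiste's API.

```typescript
// Hypothetical types for illustration; not Batiste's actual API.
type ToolCall = { agentId: string; tool: string; path: string; token: string };
type AuditRow = { agentId: string; tool: string; outcome: "allowed" | "denied"; reason?: string };

interface ChainDeps {
  deniedPrefixes: string[];                // path-scoped deny list
  verifyToken: (token: string) => boolean; // stand-in for real JWT verification
  ledger: AuditRow[];                      // stand-in for the SQLite WAL ledger
}

// Wraps a handler so it can only be reached through scope -> auth -> audit.
function guard(
  deps: ChainDeps,
  handler: (call: ToolCall) => string
): (call: ToolCall) => string {
  // Records the denial, then aborts; the handler is never invoked.
  const deny = (call: ToolCall, reason: string): never => {
    deps.ledger.push({ agentId: call.agentId, tool: call.tool, outcome: "denied", reason });
    throw new Error("denied: " + reason);
  };

  return (call) => {
    // 1. scope: reject any path that starts with a denied prefix
    const outOfScope = deps.deniedPrefixes.some((p) => call.path.slice(0, p.length) === p);
    if (outOfScope) deny(call, "scope");

    // 2. auth: reject calls whose token does not verify
    if (!deps.verifyToken(call.token)) deny(call, "auth");

    // 3. audit: the entry is written before the handler runs
    deps.ledger.push({ agentId: call.agentId, tool: call.tool, outcome: "allowed" });
    return handler(call);
  };
}
```

Because the handler is only ever exposed in its wrapped form, there is no code path that reaches it without leaving a ledger row first, whether the call was allowed or denied.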

Signed audit ledger

An append-only log records every tool call with timing, result, model, provider and token usage; the action is on the record before it runs. Documents carry sha256 + recomputed sha256_signed, so an artefact traces back to its inputs and its SHA.

packages/audit · 67 tests · signed ledger on real SQLite (better-sqlite3)

Gates as code

"Done = green gate, not memory." The gates are repo-hygiene, deploy-lineage, dfam-preflight, deploy-smoke and pitch-integrity. Each returns exit 1 on failure, including a gate whose only job is to fail a demo that fakes numbers.

packages/{repo-hygiene,deploy-lineage,pitch-integrity}
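To make "a gate that fails a demo that fakes numbers" concrete, here is a small sketch under stated assumptions. The patterns and the names `SUSPECT_PATTERNS`, `GateResult` and `runPitchGate` are invented for illustration; the real pitch-integrity rules are not shown on this page. The function returns the exit code instead of exiting, so it stays testable.

```typescript
// Illustrative patterns only; the real gate's rules are not shown on this page.
const SUSPECT_PATTERNS: Array<{ label: string; pattern: RegExp }> = [
  { label: "random generator", pattern: /Math\.random\s*\(/ },
  { label: "hard-coded success", pattern: /result\s*:\s*['"]success['"]/ },
  { label: "unproven production-ready claim", pattern: /production-ready/i },
];

type GateResult = { exitCode: 0 | 1; findings: string[] };

// Scans demo source text and reports which suspect patterns it contains.
// Any finding turns the gate red (exit code 1).
function runPitchGate(source: string): GateResult {
  const findings = SUSPECT_PATTERNS
    .filter(({ pattern }) => pattern.test(source))
    .map(({ label }) => label);
  return { exitCode: findings.length > 0 ? 1 : 0, findings };
}
```

In a script, the returned `exitCode` would be handed to the process exit so that CI treats a finding as a failed build.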

The product question

Five KPIs Batiste suggests to govern AI agents.

A decision-maker running agents wants to know: what did it cost, can I prove it, was it gated, how fast, and can I reproduce it. Each KPI is real and measurable — and ships with the file or command that proves it.

KPI	Status	What it measures	How Batiste measures it
Token consumption / cost attribution	observability	Input / output / cached tokens per agent, per task, per workflow, with model_id and provider. Turns a mystery invoice into a per-agent line-item you can attribute, forecast and cap.	audit_log columns input_tokens / output_tokens / cached_tokens / model_id / provider; tokenTotals() SUMs only rows where model_id IS NOT NULL, that is, the measured provider payload. Unmeasured calls cannot dilute or inflate it. Never a chars/4 estimate, never a savings claim.
Audit coverage	shipped	% of agent actions / outputs written to the signed append-only ledger. The difference between an agent you can put in front of an auditor and one you cannot.	SQLite WAL ledger records every tool call (timestamp / session_id / agent_id / tool / args / result / duration_ms). A compliance report aggregates total / success / denied / error, unique agents, unique sessions, top tools over any window. Integrity chain survives redaction.
Gate-discipline	shipped	% of deliveries that passed a green gate before being marked done. "Done" is a machine returning green, not a human remembering to check.	Executable scripts returning exit 1 on failure: repo-hygiene.mjs (git-clean / no-dup-hash / canonical), deploy-lineage.mjs (refuses an untracked or dirty source, stamps the commit SHA), dfam-preflight.mjs. pitch-integrity closes the door on theatre.
Orchestration efficiency	shipped	Number of agents run in parallel, and wall-clock time versus the sequential equivalent. The cost isn't just tokens; it's time-to-result.	DAG executor (zero npm deps): toposort + ready-queue + per-tag concurrency pools (concurrency:'auto' = floor(cpus/2); concurrencyByTag reads the API tier). Upstream failure skips only dependents; independent branches keep running, partial results stay usable. 21 tests passing.
Provenance / reproducibility	in progress	Any output traced back to inputs + model + SHA. Reproducibility is what separates "AI output" from "evidence".	Shipped: artefact → SHA → parent_ref; the router is pure and deterministic (same task in → same routing decision out) and stamps its model×task choice on the ledger; the executor emits per-operation lineage. Roadmap: end-to-end re-run that matches bit-for-bit (labelled intent, not shipped).

shipped (shell-verifiable) · in progress (roadmap) · observability (not savings)
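The orchestration rule in the table, "upstream failure skips only dependents; independent branches keep running", can be sketched as below. This is a deliberately sequential illustration of the ready-queue and skip logic only; it does not model the concurrency pools, and `DagTask`, `DagOutcome` and `runDag` are hypothetical names, not the executor's API.

```typescript
// Hypothetical task shape for illustration; not Batiste's executor API.
type DagTask = { id: string; deps: string[]; run: () => string };
type DagOutcome =
  | { status: "ok"; value: string }
  | { status: "failed"; error: string }
  | { status: "skipped"; blockedBy: string };

function runDag(tasks: DagTask[]): Record<string, DagOutcome> {
  const outcomes: Record<string, DagOutcome> = {};
  let pending = tasks.slice();

  while (pending.length > 0) {
    // Ready queue: tasks whose dependencies all have an outcome already.
    const ready = pending.filter((t) => t.deps.every((d) => outcomes[d] !== undefined));
    if (ready.length === 0) throw new Error("cycle or unknown dependency");
    pending = pending.filter((t) => ready.indexOf(t) === -1);

    for (const task of ready) {
      // A dependency that failed or was skipped blocks this task only.
      const blockers = task.deps.filter((d) => outcomes[d]?.status !== "ok");
      if (blockers.length > 0) {
        outcomes[task.id] = { status: "skipped", blockedBy: blockers[0] };
        continue;
      }
      try {
        outcomes[task.id] = { status: "ok", value: task.run() };
      } catch (err) {
        // A failure is recorded, not rethrown, so other branches keep running.
        outcomes[task.id] = { status: "failed", error: String(err) };
      }
    }
  }
  return outcomes;
}
```

Every task ends with an explicit outcome, so a partially failed run still returns usable results for the branches that succeeded.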

Proof, not promise

This session is the living proof.

// numbers below: meta-metrics measured in this session — counts of agents, tokens, gates and tasks; an example of one day's work, not a published benchmark

19 agents

one orchestrated workflow · ~1.08M tokens metered · 314 tool-uses

orchestrate_agents · this session

15 agents

a second parallel workflow · ~1.17M tokens metered

orchestrate_agents · this session

5 agents

a third workflow · ~300k tokens metered · plus several builds

orchestrate_agents · this session

74 tasks

durable state via manage_task (44 done / 29 pending / 1 failed) — not a transcript

sqlite3 .batiste/tasks.db 'SELECT status,COUNT(*) …'

exit 1

gates fired live — a repo-hygiene gate blocked, a deploy-lineage gate refused an invalid deploy

node packages/repo-hygiene/hygiene.mjs · deploy-lineage.mjs

1,175

signed lineage entries (sha256 + sha256_signed)

wc -l .audit/document-audit.jsonl

The meta-narrative: the runtime turned the lens on the house's own work — every agent metered, every gate enforced, every shipped deploy hash-matched against its source. Nothing was applied without passing its gate. The pitch is the behaviour, not the slide.

Why it's defensible

The honest edge.

The signed ledger

Append-only SQLite WAL, every tool-call with usage / model / provider. Documents with sha256 + recomputed sha256_signed. The SHA-256 is the integrity witness on redaction (GDPR Art. 17): it preserves the chain when a payload is erased, rather than deleting the row.

packages/audit/src/{ledger.ts,redaction.ts}

Gates as code

Discipline that fails the build for real. pitch-integrity is the standout: a gate that rejects a demo with random generators, fixed multipliers, hard-coded result:'success', or "production-ready" without proof.

packages/pitch-integrity/ · node deploy-lineage.mjs

Firm Memory, on your infra

Air-gapped firm memory; every mutation is audit-emitted before commit. The IP lives in the deployment, never a hosted service. The router is local-first and deterministic: it picks the model and stamps the choice, but does not itself call a model; the host returns measured usage.

packages/{atrium,memory,router}
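One way a hash can act as "the integrity witness on redaction" is sketched below. This is an assumed design, not the schema in packages/audit: each row keeps a hash of its payload and chains to the previous row through that hash, so erasing the payload leaves the chain verifiable. `LedgerRow`, `appendEntry`, `redactEntry` and `verifyChain` are hypothetical names, and the hash function is passed in to keep the sketch dependency-free; in Node it could be backed by a SHA-256 hex digest from `createHash("sha256")` in `node:crypto`.

```typescript
// Hypothetical shapes for illustration; not the schema in packages/audit.
type HashFn = (input: string) => string; // e.g. a SHA-256 hex digest

type LedgerRow = {
  payload: string | null; // null once redacted
  payloadHash: string;    // hash of the original payload, kept after redaction
  prevHash: string;       // entryHash of the previous row ("" for the first row)
  entryHash: string;      // hash(prevHash + payloadHash)
};

function appendEntry(ledger: LedgerRow[], payload: string, hash: HashFn): void {
  const prevHash = ledger.length > 0 ? ledger[ledger.length - 1].entryHash : "";
  const payloadHash = hash(payload);
  ledger.push({ payload, payloadHash, prevHash, entryHash: hash(prevHash + payloadHash) });
}

// Erases the payload but keeps the row and both hashes in place.
function redactEntry(ledger: LedgerRow[], index: number): void {
  ledger[index].payload = null;
}

function verifyChain(ledger: LedgerRow[], hash: HashFn): boolean {
  let prevHash = "";
  for (const row of ledger) {
    if (row.prevHash !== prevHash) return false;
    if (row.entryHash !== hash(row.prevHash + row.payloadHash)) return false;
    // A payload that is still present must match its recorded hash.
    if (row.payload !== null && hash(row.payload) !== row.payloadHash) return false;
    prevHash = row.entryHash;
  }
  return true;
}
```

The chain links through `payloadHash` rather than the payload itself, which is why a redacted row still verifies while a deleted or relinked row does not.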

The vision

Batistetize the world.

Every agent you run is metered, attributed, audited and gated — or it isn't done. The harness era proved agents can act. The next era proves they can be trusted. That layer is governance + observability + discipline + audit-by-construction. That layer is Batiste.