The thesis — provable engineering in the age of agents

1. The problem

Pipelines don't produce proof

A modern delivery pipeline is, in practice, a coalition of loosely coupled tools: CI runners that clone the repo and execute scripts, pull-request workflows that leave a checkmark in a database, static analysis scanners that write reports to object storage, artifact registries that timestamp a push event, and ticketing systems that hold the requirement the change was supposed to address. Each of these tools is authoritative about its own fragment of the story. None of them can see the others. None of them compose.

When an auditor — or a security incident, or a supply-chain disclosure — asks you to show exactly what happened to a specific version, you reconstruct the answer from scattered, expiring, editable sources. The CI log from three weeks ago may already be gone. The PR approval is a green checkmark and a timestamp — it doesn't record which revision of the code the reviewer actually looked at, and it can be overwritten. The SBOM sits in a bucket, unsigned, with no cryptographic binding to the commit tree that produced it. You may have followed every prescribed step. The record cannot prove it.

This is not a tooling gap; most teams have abundant tooling. It is a structural gap: no single, tamper-evident record is ever bound to a specific commit that says "this exact version was tested by these checks, reviewed by these people, scanned by these tools, and approved for release." The practice may be sound; the proof is missing.

Agents break the human assumptions

Most current practice (the code review, the merge gate, the deploy approval) quietly assumes a human in the loop. Nobody designed that assumption in; it is there because the controls were designed by humans, for a pace of work that humans set. A trusted engineer reviewed the diff. A known colleague clicked approve. "Everyone knows" what the system does and what the change is for. That ambient, informal context is load-bearing, and it is almost entirely invisible to the formal process.

Autonomous coding agents break those assumptions in a specific and compounding way. Volume explodes: an agent working overnight can produce more diffs in six hours than a developer produces in a sprint. "Who did what" blurs: the agent acts under a human's identity, or under its own service account, or under both at once, depending on which tool invoked it. The familiar social cues that let a reviewer quickly calibrate trust — "oh, this is the kind of change Jonas makes" — evaporate. And the agent has no fatigue, no intuition to pause when something feels off, and no authority to deviate from its instruction.

Controls that worked because a trusted human was doing the work must become explicit, machine-checkable, and bound to evidence — or they will not hold. The question is not whether to admit agents into the development process; the question is whether your controls are expressed in a form that means anything when the agent is the actor.

CRA makes it law

The EU Cyber Resilience Act extends the logic of product safety regulation to software. Every manufacturer of a product with digital elements, which covers most software placed on the EU market commercially, carries obligations that run across the full product lifecycle: secure-by-design requirements, vulnerability handling with defined response windows, and the obligation to demonstrate that those requirements were met. "Demonstrate" is doing significant work in that sentence: the CRA does not accept attestation by assertion.

Annex I of the CRA enumerates the security requirements. For each release, you must be able to show — as durable, verifiable evidence, not a screenshot of a green check — who did what and which quality controls ran. A conformity assessment body will ask for documentation of your development process; more importantly, it will ask for evidence that the process was followed for the specific version in question. A scattered manual pipeline cannot satisfy that. An agent-driven future, with its higher velocity and diffuse authorship, makes the gap urgent.

Taken together, provability, agent-safety, and regulatory compliance are not three separate problems requiring three separate investments. They are the same underlying need — a tamper-evident record of what happened, bound to the code, available for verification by any party with the public key — expressed in three different registers: operational, architectural, and legal.

2. The hypothesis

The route to safe, successful agent-driven development runs through sound engineering practice, not around it. The three disciplines that define sound practice — executable specs, trunk-based development, and continuous delivery — are not just good engineering hygiene. They are, it turns out, the exact structural prerequisites for producing a coherent, tamper-evident record of what happened to a piece of software on its way to production.

Executable specs / TDD

If a requirement cannot be written as a test, it is not a good requirement yet. That claim sounds like a methodological preference, but it is actually a claim about information quality. A requirement precise enough to be expressed as a test is a requirement that admits a binary answer: does the system do this or not? Testability is a proxy for specification clarity — and specification clarity is the foundation on which any meaningful evidence record rests.

For a human developer, the discipline of writing the test first forces a conversation with the spec before a line of implementation exists, when ambiguity is cheap to surface and cheaper still to correct. For an autonomous agent, the stakes are higher: an agent without unambiguous, machine-checkable acceptance criteria can produce code that is locally coherent but systematically wrong. The test is not a safety net you add after the work; it is the specification the agent is executing against.
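As a minimal illustration of a requirement written as a test, assume a hypothetical requirement: "an API token is rejected once it is 24 hours old." The function name token_is_valid and the 24-hour lifetime are invented for this sketch; the only point is that the criterion admits a binary answer an agent can execute against.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical requirement: "An API token is rejected once it is 24 hours old."
TOKEN_LIFETIME = timedelta(hours=24)


def token_is_valid(issued_at: datetime, now: datetime) -> bool:
    """Return True while the token is strictly younger than TOKEN_LIFETIME."""
    return now - issued_at < TOKEN_LIFETIME


def test_token_accepted_before_expiry() -> None:
    issued = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    assert token_is_valid(issued, issued + timedelta(hours=23, minutes=59))


def test_token_rejected_at_expiry() -> None:
    issued = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    assert not token_is_valid(issued, issued + timedelta(hours=24))


if __name__ == "__main__":
    test_token_accepted_before_expiry()
    test_token_rejected_at_expiry()
    print("acceptance criteria green")
```

Written this way, the boundary case (exactly 24 hours) has to be decided before any implementation exists, which is the conversation with the spec described above.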

The evidence consequence is direct: "done" means acceptance evidence green, bound to the commit. The record can say not just "tests passed" but "these specific acceptance criteria passed against this exact commit tree, at this timestamp, via this check." That is a verifiable statement — not a vibe.

Trunk-based development

The structural premise of trunk-based development is simple: main is always deployable, and every change is integrated continuously — in small, complete increments, gated by evidence. The branch exists only for the duration of a single focused change; it merges back before the day is out, or before the context has decayed.

Long-lived branches are, among other things, provenance graveyards. A change that lives on a branch for two weeks accumulates context that is never recorded anywhere: why that decision was made, which intermediate states were tried and discarded, which security consideration was weighed and accepted. When the branch finally merges, the evidence is reconstructed — typically from a squashed commit message and a PR description that was written in a hurry. That reconstructed record is not evidence; it is a story told after the fact.

Long-lived branches are also where the blast radius of a supply-chain attack or a zero-day disclosure becomes genuinely hard to bound. If main is always deployable and every change has been integrated and gated, you can answer "which deployed versions are affected?" in minutes. If your codebase is split across seventeen long-lived feature branches at various stages of merge conflict resolution, the answer is "we don't know yet" — and that answer has a cost.
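On a single integrated trunk, "which deployed versions are affected?" reduces to an ancestry query: a release is affected if the commit that introduced the problem is reachable from the release commit. The sketch below uses a small in-memory history with made-up commit ids and release tags; against a real repository the same question can be asked with git merge-base --is-ancestor.

```python
from typing import Dict, List, Set

# Hypothetical history: commit id -> parent commit ids (the trunk is linear here).
PARENTS: Dict[str, List[str]] = {
    "c1": [],
    "c2": ["c1"],
    "c3": ["c2"],  # assume this commit introduced the vulnerable dependency
    "c4": ["c3"],
    "c5": ["c4"],
}

# Hypothetical release tags pointing at trunk commits.
RELEASES: Dict[str, str] = {"1.4.0": "c2", "1.4.1": "c4", "1.4.2": "c5"}


def ancestors(commit: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return the commit itself plus every commit reachable through its parents."""
    seen: Set[str] = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def affected_releases(bad_commit: str) -> List[str]:
    """Releases whose history contains the commit that introduced the issue."""
    return sorted(
        version
        for version, head in RELEASES.items()
        if bad_commit in ancestors(head, PARENTS)
    )


if __name__ == "__main__":
    print(affected_releases("c3"))  # prints ['1.4.1', '1.4.2']
```

The sketch deliberately ignores later fix commits; it only shows why the query is cheap when every release sits on one line of history.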

Continuous delivery

Continuous delivery means one path to production: the pipeline. Not "a pipeline that most changes go through, plus a few known exceptions for emergencies, plus the occasional out-of-band deploy that someone will document later." One path. Every change, every time, via the same mechanism, which runs the same checks, which produces the same evidence shape.

The value of that invariant is not primarily velocity, though velocity follows. The value is that every release has the same evidence shape. When an auditor asks for the evidence that release 1.4.2 went through the required controls, you do not need to know the history of that particular release to know where to look. The pipeline ran; the record exists; the attestation is signed.

For agents specifically, the single-path invariant is a containment boundary. An agent that can deploy out of band — by pushing directly to a production system, by invoking a cloud API with a privileged credential, by any mechanism other than the pipeline — is an agent that can bypass every control you have. The pipeline is not a performance artifact; it is the gate.

GitOps and infrastructure as code

There is a variable the first three disciplines quietly assume away: the target. Tests pass, the trunk is green, the pipeline runs, but against what? A test result is only reproducible if the environment it ran in is itself defined, versioned, and reconstructable. If the infrastructure, the configuration, and the network were assembled by hand, or have drifted since they were declared, then "the tests passed" attests to a state nobody can reproduce, and the evidence describes a system that no longer exists.

So the infrastructure layer has to be treated exactly like the application: declared as code, changed only through the pipeline, and reconciled continuously against git as the single source of truth. This is GitOps — git is not just where the code lives, it is where the running system is operated from. The deployed state is expected to match what git declares; a manual change to a live environment is not a shortcut, it is drift, and drift is an error to be reconciled away. Infrastructure is tested before it ships the same way application code is, against the same conditions it will run in — which closes the most common reproducibility hole of all: validating one version while a different one reaches production.
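The reconciliation idea can be sketched as a comparison between the state git declares and the state observed in the environment. This is a toy model under stated assumptions: both states are flat dictionaries, and the keys (replicas, image) are invented for illustration. Real GitOps controllers work on much richer resource models.

```python
from typing import Any, Dict, List, Tuple

Drift = Tuple[str, Any, Any]  # (key, declared value, live value)
_MISSING = "<absent>"


def detect_drift(declared: Dict[str, Any], live: Dict[str, Any]) -> List[Drift]:
    """List every key whose live value differs from what git declares."""
    drift: List[Drift] = []
    for key in sorted(set(declared) | set(live)):
        want = declared.get(key, _MISSING)
        have = live.get(key, _MISSING)
        if want != have:
            drift.append((key, want, have))
    return drift


def reconcile(declared: Dict[str, Any], live: Dict[str, Any]) -> Dict[str, Any]:
    """Git wins: the reconciled state is exactly the declared state."""
    return dict(declared)


if __name__ == "__main__":
    declared = {"replicas": 3, "image": "app:1.4.2"}
    live = {"replicas": 5, "image": "app:1.4.2", "debug": True}  # manual changes
    for key, want, have in detect_drift(declared, live):
        print(f"drift in {key}: declared={want!r} live={have!r}")
    live = reconcile(declared, live)
    print(detect_drift(declared, live))  # prints []
```

Treating the manual change as an error to be reported and reverted, rather than as a new fact to be absorbed, is what keeps the evidence describing the system that actually runs.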

The compliance payoff is direct. When everything — infrastructure, configuration, policy, controls — is code under merge-request control, the same machinery that proves your application also proves the ground it runs on: who changed the environment, when, under whose approval, and exactly which declared state produced a given release. Reproducibility stops being an aspiration and becomes a property of the system: any version, rebuilt from git, yields the same result. For agents, this is decisive — an agent operating on a git-declared, continuously-reconciled system can only change it by proposing a change to git, where every control already applies. The blast radius of an autonomous actor is bounded by the same evidence-gated path as everyone else.

The hypothesis stated plainly

Make good engineering the path of least resistance, and you simultaneously achieve two things that previously looked like separate investments: you make agent-driven development safe, because the controls are explicit, machine-checkable, and bound to evidence; and you produce the provable, tamper-evident record that your auditors and regulators require — as a byproduct of working well, not as a separate compliance effort bolted on afterward.

These disciplines are not constraints on productivity. They are the architecture that makes productivity trustworthy, at human scale today and at agent scale tomorrow.

3. The solution

The zegit-zoo platform is three composable tools — meerkat, mongoose, and zegit — that together close the loop between the work and the proof. Each stands alone and composes with the others. Together, they implement the hypothesis: know, work, prove.

meerkat knows

meerkat is a knowledge base served over MCP — the Model Context Protocol — as a single Go binary that bundles curated organizational knowledge and makes it queryable by any MCP-compatible client. Guides, systems diagrams, architectural decisions, security policies, onboarding runbooks: whatever your organization has decided is true and worth preserving is embedded at build time and available offline.

The problem meerkat addresses is not retrieval — it is grounding. An agent that starts from a fresh context window starts from ignorance. It will rediscover things your organization already knows, make decisions that have already been made and reversed, and ask questions that have already been answered. meerkat gives every agent — and every developer — a shared, curated starting point. Work begins from what is actually true at your org, not from what the model happens to have encoded in its weights.

mongoose works

mongoose is an agent harness — an OpenCode-style terminal UI — where developers and their agents do the work. Multi-provider model routing means the harness is not coupled to any single LLM provider; the developer chooses the model appropriate to the task. meerkat is wired in by default, so the agent always has access to the organizational knowledge base over MCP without additional configuration.

The evidence dimension is central to mongoose's design, not peripheral. Every meaningful step — a tool call, a file edit, an acceptance test run, a diff committed — emits a structured evidence record bound to the git state it ran against. The record captures what happened, when it happened, under what identity, and against which exact commit tree. When zegit is present in the environment, those records flow into the governance gateway. When it is not, they accumulate as a local, unsigned log — useful for the developer, and ready to be elevated the moment governance is needed.
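To make the shape of such a record concrete, here is a minimal sketch. The field names and the EvidenceRecord class are assumptions for illustration, not mongoose's actual schema; they simply mirror the four things the text says a record captures. The tree id is passed in as a string; in a real repository it could be obtained with git rev-parse "HEAD^{tree}".

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass(frozen=True)
class EvidenceRecord:
    """Hypothetical record shape; every field name here is an assumption."""

    step: str       # what happened, e.g. "acceptance_test_run"
    identity: str   # who or what acted
    tree: str       # git tree id the step ran against
    timestamp: str  # when it happened, as an ISO 8601 UTC string
    detail: Dict[str, str] = field(default_factory=dict)

    def canonical_bytes(self) -> bytes:
        """Serialize deterministically so that equal records hash equally."""
        return json.dumps(
            asdict(self), sort_keys=True, separators=(",", ":")
        ).encode("utf-8")

    def digest(self) -> str:
        """SHA-256 over the canonical form, usable as a stable record id."""
        return hashlib.sha256(self.canonical_bytes()).hexdigest()


if __name__ == "__main__":
    record = EvidenceRecord(
        step="acceptance_test_run",
        identity="agent:example-service-account",  # made-up identity
        tree="4b825dc642cb6eb9a060e54bf8d69288fbee4904",  # git's empty-tree id
        timestamp="2025-01-01T12:00:00Z",
        detail={"result": "passed"},
    )
    print(record.digest())
```

Because the tree id is part of the hashed content, the same step run against different code yields a different record, which is the binding the text describes.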

zegit proves

zegit is the governance gateway. It sits in front of your existing Git host — no migration required — and processes every record that arrives at a push or tag event. The policy engine runs a decision against the record: ALLOW if the evidence satisfies the configured policy, REQUIRE_REVIEW if a quorum of human approvals is needed before the push completes, BLOCK if a required control is absent or failed.
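The three outcomes can be sketched as a small decision function. Only the decision names ALLOW, REQUIRE_REVIEW, and BLOCK come from the text; the Policy fields, the decide signature, and the check names are hypothetical, chosen to show the ordering: a missing or failed control blocks before any approval quorum is considered.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet


class Decision(Enum):
    ALLOW = "ALLOW"
    REQUIRE_REVIEW = "REQUIRE_REVIEW"
    BLOCK = "BLOCK"


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy shape for this sketch."""

    required_checks: FrozenSet[str]  # checks that must be present and passing
    approvals_needed: int            # human approval quorum


def decide(policy: Policy, check_results: Dict[str, bool], approvals: int) -> Decision:
    """Evaluate one evidence record against the policy."""
    for check in policy.required_checks:
        # An absent check counts the same as a failed one.
        if not check_results.get(check, False):
            return Decision.BLOCK
    if approvals < policy.approvals_needed:
        return Decision.REQUIRE_REVIEW
    return Decision.ALLOW


if __name__ == "__main__":
    policy = Policy(required_checks=frozenset({"unit", "sast"}), approvals_needed=1)
    print(decide(policy, {"unit": True}, approvals=2).value)                # BLOCK
    print(decide(policy, {"unit": True, "sast": True}, approvals=0).value)  # REQUIRE_REVIEW
    print(decide(policy, {"unit": True, "sast": True}, approvals=1).value)  # ALLOW
```

Treating an absent check as a failure is the design choice that matters here: evidence that was never produced cannot satisfy the policy by omission.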

Passing records are signed into DSSE attestations — Dead Simple Signing Envelopes, a standard format for tamper-evident provenance — and published into a refs/zegit/aov/ namespace in the repository itself. They travel with the code, in the same store, verifiable by anyone with the public key, without requiring access to a central service. Tamper with the code after the fact and the attestation goes stale — by construction, not by policy.
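For readers unfamiliar with DSSE, the core of the format is small: the signature covers a pre-authentication encoding (PAE) of the payload type and payload, and the envelope carries the base64 payload alongside the signatures. The sketch below follows the published DSSE layout, with one loud caveat: it signs with an HMAC from the standard library as a stand-in, so sign_demo, DEMO_KEY, and the verify-by-recomputing approach are assumptions for illustration. A real deployment would use an asymmetric scheme so that anyone holding the public key can verify.

```python
import base64
import hashlib
import hmac
from typing import Any, Callable, Dict


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: the exact bytes that get signed."""
    type_bytes = payload_type.encode("utf-8")
    return b" ".join([
        b"DSSEv1",
        str(len(type_bytes)).encode("ascii"),
        type_bytes,
        str(len(payload)).encode("ascii"),
        payload,
    ])


def envelope(payload_type: str, payload: bytes,
             sign: Callable[[bytes], bytes], keyid: str) -> Dict[str, Any]:
    """Build a DSSE envelope around an already-serialized payload."""
    signature = sign(pae(payload_type, payload))
    return {
        "payload": base64.b64encode(payload).decode("ascii"),
        "payloadType": payload_type,
        "signatures": [
            {"keyid": keyid, "sig": base64.b64encode(signature).decode("ascii")}
        ],
    }


DEMO_KEY = b"not-a-real-key"  # stand-in secret, for illustration only


def sign_demo(message: bytes) -> bytes:
    """HMAC-SHA256 stand-in; NOT the asymmetric signature a real system needs."""
    return hmac.new(DEMO_KEY, message, hashlib.sha256).digest()


def verify(env: Dict[str, Any], sign: Callable[[bytes], bytes]) -> bool:
    """Recompute the MAC over the PAE (only valid for the HMAC stand-in)."""
    payload = base64.b64decode(env["payload"])
    expected = sign(pae(env["payloadType"], payload))
    return any(
        hmac.compare_digest(base64.b64decode(s["sig"]), expected)
        for s in env["signatures"]
    )


if __name__ == "__main__":
    env = envelope("application/vnd.in-toto+json", b'{"demo":true}', sign_demo, "demo-key")
    print(verify(env, sign_demo))  # prints True
```

Signing the PAE rather than the raw payload means the payload type is authenticated too, so an envelope cannot be reinterpreted as a different kind of document without invalidating the signature.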

These attestations are in-toto compliant. Each one is an in-toto Statement (predicateType https://zegit.dev/attestation/aov/v0.1) whose subject binds it to the exact git commit, carried in a DSSE envelope with the standard application/vnd.in-toto+json payload type. This matters because the proof is not a zegit-specific format you have to trust us to read: it is the same open standard the wider supply-chain-security ecosystem already speaks (SLSA provenance, cosign, and other in-toto tooling). An auditor, a customer, or a future tool can verify a zegit attestation with off-the-shelf software, with no dependency on zegit being around. Open, verifiable formats are the difference between evidence and a vendor's say-so.
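A Statement of this kind might look roughly like the structure built below. The predicateType and payload type are taken from the text; the rest is assumption. In particular, the sketch assumes the in-toto Statement v1 layout with a gitCommit digest on the subject, and the subject name and predicate fields are invented placeholders, since the actual aov predicate schema is not described here.

```python
import json
from typing import Any, Dict

STATEMENT_TYPE = "https://in-toto.io/Statement/v1"         # assumed Statement version
PREDICATE_TYPE = "https://zegit.dev/attestation/aov/v0.1"  # from the text
PAYLOAD_TYPE = "application/vnd.in-toto+json"              # from the text


def build_statement(commit: str, predicate: Dict[str, Any]) -> Dict[str, Any]:
    """Bind a predicate to one git commit via the Statement's subject."""
    return {
        "_type": STATEMENT_TYPE,
        "subject": [
            # The subject name is a hypothetical label; the digest does the binding.
            {"name": "git-commit", "digest": {"gitCommit": commit}}
        ],
        "predicateType": PREDICATE_TYPE,
        "predicate": predicate,
    }


def is_stale(statement: Dict[str, Any], current_commit: str) -> bool:
    """True when no subject of the statement names the commit being checked."""
    return all(
        subject["digest"].get("gitCommit") != current_commit
        for subject in statement["subject"]
    )


if __name__ == "__main__":
    commit = "0123456789abcdef0123456789abcdef01234567"  # made-up commit id
    statement = build_statement(
        commit,
        {"decision": "ALLOW", "checks": ["unit", "sast"]},  # hypothetical predicate
    )
    payload = json.dumps(statement, sort_keys=True).encode("utf-8")
    print(len(payload) > 0, is_stale(statement, commit))  # prints True False
```

The serialized statement is what would travel as the DSSE payload. Because the subject names a commit id, any change to the code produces a different commit, and the existing statement no longer applies to it.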

At release time, zegit bundles the per-commit attestations into a CRA Evidence Bundle — a signed, structured artifact mapped to the requirements of CRA Annex I. The bundle records who did what and which quality controls ran, for the specific version being released, in a format that a conformity assessment body can verify offline. The scattered manual pipeline cannot produce this. The pipeline, run through zegit, produces it as a byproduct of every release.

One record, three trust levels

The same evidence record travels through three trust levels across the platform — a continuum that corresponds directly to the value delivered at each stage:

unsigned: raw local log (mongoose, free / OSS)

validated: policy decision carried (zegit gateway)

signed: DSSE attestation (audit grade, offline-verifiable)

The unsigned record is generated by mongoose at no cost — it is the open-source entry point. The validated record carries a policy decision from the zegit gateway; it knows whether the evidence satisfied the configured policy for that push. The signed record is a DSSE attestation, bound to the commit, verifiable years later without access to any live system. That escalation is the spine of the platform — and the path from open-source adoption to commercial governance.
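One way to picture the escalation is as a classification over what a record carries. The sketch below is an assumption-heavy illustration: the record is modelled as a plain dictionary, and the decision and signatures keys are invented names, not the platform's actual schema.

```python
from enum import IntEnum
from typing import Any, Dict


class TrustLevel(IntEnum):
    UNSIGNED = 1   # raw local log
    VALIDATED = 2  # carries a policy decision
    SIGNED = 3     # carries a decision and at least one signature


def trust_level(record: Dict[str, Any]) -> TrustLevel:
    """Classify a record by the evidence of governance it carries."""
    has_decision = record.get("decision") is not None
    has_signature = bool(record.get("signatures"))
    if has_decision and has_signature:
        return TrustLevel.SIGNED
    if has_decision:
        return TrustLevel.VALIDATED
    return TrustLevel.UNSIGNED


if __name__ == "__main__":
    local = {"step": "test_run"}
    gated = {**local, "decision": "ALLOW"}
    attested = {**gated, "signatures": [{"keyid": "demo", "sig": "..."}]}
    print([trust_level(r).name for r in (local, gated, attested)])
    # prints ['UNSIGNED', 'VALIDATED', 'SIGNED']
```

Using an ordered enum makes the continuum explicit: a consumer can require "at least VALIDATED" with a simple comparison, and the same underlying record moves up the scale without changing shape.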

How it fits together

The flow is linear and intentional: know → work → prove. meerkat grounds the agent in organizational truth before a line of code is written. mongoose is where the work happens — where the developer and the agent write, test, iterate, and commit — and where the evidence is born, bound to git state at every meaningful step. zegit makes that evidence provable and gates delivery on it; nothing reaches production through any path other than the pipeline, and the pipeline cannot complete without the evidence satisfying the policy.

Each tool is useful alone. mongoose without zegit gives you a multi-provider agent harness with a local evidence log — a better terminal session, with a useful audit trail. zegit without mongoose can govern any git workflow; it is agnostic about how the evidence was produced. meerkat can serve any MCP client. But together, they close a loop that no combination of point tools currently closes: the agent is grounded, the work is evidenced, and the evidence is provable — end to end, in a single coherent record, without a separate compliance effort.

Compliance is the floor; good engineering is the point.