Introducing Project Aegis: Toward Provably Safe Multi-Agent Coordination

This post introduces Project Aegis, the flagship of our safety research program. The short version is this: we believe that learned policies, on their own, will not be safe enough to deploy at scale into the physical and economic substrates that the upper layers of the Apik Civilization Stack require. We also believe that purely verified systems will not be capable enough to be useful in the open worlds those substrates inhabit. The interesting technical and operational territory is in the composition: a learned policy enclosed in a verified safety envelope, with a runtime monitor that holds formally specified invariants over the joint behavior of many such policies acting at once. Project Aegis is our attempt to make that composition load-bearing.

Problem statement

The hard problem in multi-agent safety is not the single-agent problem. It is the joint problem. A single agent in a controlled environment can be made to behave well using existing alignment and constrained-optimization techniques. The failure modes that worry us, and that should worry the field, are the ones that emerge when many such agents share a resource, an environment, or a population of users, and their individually safe behaviors interact to produce jointly unsafe outcomes. These include benign-looking patterns that nonetheless concentrate risk, subtle convergent strategies for resource acquisition, and cascading failures in which a fault in one agent’s policy is amplified through the actions of agents downstream of it. The single-agent literature addresses very little of this directly.

The second hard problem is open-world deployment. Most existing safety frameworks were developed for closed environments — a robot arm in a factory, a trading bot in a sandboxed market. The interesting deployments for us are not closed. A humanoid in a hospital, an agent fleet executing supply-chain logistics, a coordination substrate running across institutions — these inhabit environments where the state space cannot be enumerated, the action space cannot be tightly bounded, and the contract with the environment is itself uncertain. Verification techniques that assume a closed transition system do not apply. We need verification that composes with learned components and gracefully degrades under epistemic uncertainty.

Prior work

We are inheriting from a deep tradition. The formal-methods lineage — TLA+, SMT solvers including Z3 and CVC5, model checkers, and the verified-systems work that has produced compilers, distributed systems, and cryptographic protocols with mathematically meaningful guarantees — is the substrate we build on. The safe reinforcement learning literature, including work on shielding, constrained MDPs, and Lagrangian methods, shapes how we think about the policy-envelope interface. The behavioral evaluation literature — Apollo’s scheming evaluations and the ongoing work on deceptive alignment, METR’s task-horizon evaluations, and the broader ecosystem of agent benchmarks — gives us the empirical lens through which we measure whether the envelope is doing what we want it to do. Anthropic’s interpretability program and the Transformer Circuits research thread feed directly into the runtime monitor design.

We are not the first laboratory to attempt the policy-plus-envelope architecture. We think we may be the first to attempt it specifically for multi-agent open-world deployment with formally composable invariants, and to commit to it as our core safety thesis rather than as a secondary research direction.

The Aegis approach

Aegis has three architectural commitments. The first is that every Apik agent shipped to a non-research environment runs inside an Aegis envelope. The envelope is a runtime monitor that has read access to the agent’s perception, its proposed action, and a constrained subset of its internal state. The monitor evaluates a set of formally specified invariants on every action proposal. If the action satisfies all invariants, it passes through. If it violates any invariant, the action is replaced with a safe default — typically a no-op or a return-to-known-state action — and the violation is logged with full provenance for offline review. The envelope is itself a piece of verified software; we are using a combination of dependent-typed specifications and SMT-discharged verification conditions, and we are publishing the spec language separately.
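
As a rough illustration of the control flow described above, here is a minimal Python sketch of an envelope that checks every invariant on each action proposal and substitutes a safe default on violation. It is not the Aegis implementation and it is not verified software: the names Invariant, Envelope, filter, and violation_log are assumptions made up for this example, and the torque invariant and its numbers are toy values.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Action = Dict[str, Any]


@dataclass(frozen=True)
class Invariant:
    """A named predicate over (state, proposed action). Hypothetical type."""
    name: str
    holds: Callable[[State, Action], bool]


@dataclass
class Envelope:
    """Toy runtime monitor: pass the action through or replace it."""
    invariants: List[Invariant]
    safe_default: Action
    violation_log: List[Dict[str, Any]] = field(default_factory=list)

    def filter(self, state: State, proposed: Action) -> Action:
        # Evaluate every invariant so the log records all violations, not just the first.
        violated = [inv.name for inv in self.invariants
                    if not inv.holds(state, proposed)]
        if not violated:
            return proposed
        # Keep enough provenance for offline review.
        self.violation_log.append(
            {"state": dict(state), "proposed": dict(proposed), "violated": violated}
        )
        return self.safe_default


# Toy usage: a single-agent torque limit with a no-op as the safe default.
NO_OP = {"kind": "no_op"}
torque_limit = Invariant(
    "joint_torque_limit",
    lambda s, a: abs(a.get("torque", 0.0)) <= s["max_torque"],
)
envelope = Envelope([torque_limit], NO_OP)
state = {"max_torque": 10.0}
passed = envelope.filter(state, {"kind": "move", "torque": 4.0})
blocked = envelope.filter(state, {"kind": "move", "torque": 25.0})
```

In this sketch the first proposal passes through unchanged, while the second is replaced by the no-op and leaves one entry in violation_log naming the violated invariant.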

The second commitment is that invariants compose. A single-agent invariant — for example, “do not exceed the joint torque limit” — is local. A multi-agent invariant — for example, “no more than k agents may simultaneously hold the same shared resource” — must be evaluable from the joint state. We have been developing a small algebra for invariant composition: invariants are typed, they compose under a small set of operators, and the composition rules preserve the verification conditions that make individual invariants tractable. The result is that a multi-agent invariant set can be checked at runtime by a coordinator that does not need to understand the agents’ learned policies, only their action proposals and the relevant slices of their state.
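
To make the idea of composition concrete, the following Python sketch shows one way local and joint invariants could be combined and checked by a coordinator that sees only state slices. The operators here (a conjunction via `&`, a lift called for_all_agents, and a counting invariant called at_most_k_holders) are assumptions for illustration; the post does not say which operators the Aegis algebra actually contains, and nothing in this sketch carries verification conditions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Joint state: agent id -> the slice of that agent's state the coordinator may read.
JointState = Dict[str, Dict[str, Any]]


@dataclass(frozen=True)
class JointInvariant:
    """A named predicate over the joint state. Hypothetical type."""
    name: str
    check: Callable[[JointState], bool]

    def __and__(self, other: "JointInvariant") -> "JointInvariant":
        # Conjunction: the composite holds only if both parts hold.
        return JointInvariant(
            f"({self.name} and {other.name})",
            lambda js: self.check(js) and other.check(js),
        )


def for_all_agents(name: str, local: Callable[[Dict[str, Any]], bool]) -> JointInvariant:
    """Lift a single-agent predicate to a joint invariant over every agent."""
    return JointInvariant(f"forall({name})",
                          lambda js: all(local(s) for s in js.values()))


def at_most_k_holders(resource: str, k: int) -> JointInvariant:
    """No more than k agents may hold `resource` at the same time."""
    return JointInvariant(
        f"at_most_{k}({resource})",
        lambda js: sum(1 for s in js.values() if resource in s.get("holding", ())) <= k,
    )


# Toy usage with made-up agents and values.
torque_ok = for_all_agents("torque_limit", lambda s: abs(s["torque"]) <= s["max_torque"])
dock_ok = at_most_k_holders("dock_A", 2)
combined = torque_ok & dock_ok

joint = {
    "a1": {"torque": 3.0, "max_torque": 10.0, "holding": ["dock_A"]},
    "a2": {"torque": 5.0, "max_torque": 10.0, "holding": ["dock_A"]},
    "a3": {"torque": 1.0, "max_torque": 10.0, "holding": []},
}
crowded = {**joint, "a3": {"torque": 1.0, "max_torque": 10.0, "holding": ["dock_A"]}}

ok_result = combined.check(joint)
crowded_result = combined.check(crowded)
```

The coordinator in this sketch never inspects a policy. It reads only the per-agent slices, which is the property the paragraph above describes.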

The third commitment is that the envelope itself is interpretable. When an action is rejected, the envelope returns a reason — which invariant was violated, which slice of state caused the violation, and where in the specification the invariant lives. This is not a debugging convenience. It is a load-bearing property of the system: the operator of any Aegis-enclosed agent can audit, in human-readable form, every action that was rejected and why. We treat this as a precondition for legitimate deployment.
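
One way to picture the rejection reason is as a small structured record carrying the three pieces of information listed above. The Python sketch below is an assumption about shape only: the field names (invariant, spec_ref, state_slice), the reads tuple that declares which state keys an invariant may see, and the example spec reference string are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Invariant:
    name: str
    spec_ref: str                      # hypothetical pointer into the specification
    reads: Tuple[str, ...]             # state keys this invariant is allowed to read
    holds: Callable[[Dict[str, Any]], bool]


@dataclass(frozen=True)
class Rejection:
    invariant: str                     # which invariant was violated
    spec_ref: str                      # where that invariant lives in the spec
    state_slice: Dict[str, Any]        # the slice of state that caused the violation

    def render(self) -> str:
        """Human-readable line for an operator's audit view."""
        fields = ", ".join(f"{k}={v!r}" for k, v in sorted(self.state_slice.items()))
        return f"rejected by {self.invariant} ({self.spec_ref}): {fields}"


def explain(invariants: List[Invariant], state: Dict[str, Any]) -> List[Rejection]:
    """Return one Rejection per violated invariant, with only the keys it reads."""
    rejections = []
    for inv in invariants:
        view = {k: state[k] for k in inv.reads}
        if not inv.holds(view):
            rejections.append(Rejection(inv.name, inv.spec_ref, view))
    return rejections


# Toy usage: the spec reference below is a placeholder, not a real file.
exclusion = Invariant(
    name="resource_exclusion",
    spec_ref="example.spec:section-2",
    reads=("holders", "max_holders"),
    holds=lambda v: len(v["holders"]) <= v["max_holders"],
)
state = {"holders": ["a1", "a2", "a3"], "max_holders": 2, "battery": 0.8}
reasons = explain([exclusion], state)
message = reasons[0].render()
```

Restricting each invariant to a declared set of keys keeps the reported slice small: the unrelated battery reading never appears in the audit record.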

Preliminary results

We have run Aegis on three internal testbeds. The first is a multi-agent warehouse simulation with twenty heterogeneous agents sharing a fixed pool of physical resources; the invariant set captures collision avoidance, resource exclusion, and a global throughput constraint. The second is a synthetic supply-chain coordination problem with eight agents acting under partial observability and adversarial demand patterns. The third is a single-agent embodied task that we intentionally set up with a specification that an unenveloped policy can be made to violate through prompt-injected adversarial inputs.

We do not want to overclaim. The qualitative findings are: the envelope catches the violations we constructed it to catch; the composed multi-agent invariants are tractable to check at the action rates we care about (sub-millisecond on the warehouse simulation; tens of milliseconds on the supply chain); and the offline review surface produced by the envelope has materially shortened our internal incident-response loop. The quantitative claims we are willing to make in public are limited, because the testbeds are constructed and the threat models are partial. We will publish detailed numbers as part of the project’s ongoing technical reports.

We want to be explicit about what we are not claiming. Aegis does not solve alignment. It does not detect goal-misgeneralization in the underlying policy. It does not, on its own, provide guarantees against deceptive behavior that hides intent inside actions that individually satisfy every invariant. It is a containment architecture, not a values architecture. We treat it as the floor of safety, not the ceiling. The ceiling work is in interpretability, training-time alignment, and evaluation, and we are pursuing those threads in parallel.

Open questions

Five questions are currently the most active.

First, invariant elicitation. Writing good invariants is hard. The literature has spent decades on this for closed systems and has produced useful but incomplete answers. We are interested in semi-automated invariant synthesis — using the interpretability program to surface candidate invariants from the model’s own internal structure, then verifying them.

Second, graceful degradation under spec uncertainty. What does the envelope do when an invariant is itself imprecisely specified? We are exploring conservative defaulting — refuse the action and surface to a human reviewer — but the latency consequences are significant.
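
The conservative-defaulting option can be sketched with a three-valued check. In the Python below, an imprecisely specified bound is modeled as a range of plausible limits, which is an assumption chosen for illustration rather than anything the post specifies; Verdict, check_upper_bound, decide, and the speed numbers are likewise hypothetical.

```python
from enum import Enum
from typing import Any, Dict, List, Tuple


class Verdict(Enum):
    HOLDS = "holds"
    VIOLATED = "violated"
    UNKNOWN = "unknown"    # the spec is too imprecise to decide


def check_upper_bound(value: float, limit_range: Tuple[float, float]) -> Verdict:
    """Check value <= limit when the limit is only known to lie in [lo, hi]."""
    lo, hi = limit_range
    if value <= lo:
        return Verdict.HOLDS       # safe under every plausible limit
    if value > hi:
        return Verdict.VIOLATED    # unsafe under every plausible limit
    return Verdict.UNKNOWN         # depends on which limit is the true one


def decide(verdict: Verdict, proposed: Dict[str, Any], safe_default: Dict[str, Any],
           review_queue: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Conservative defaulting: anything short of HOLDS is refused."""
    if verdict is Verdict.HOLDS:
        return proposed
    if verdict is Verdict.UNKNOWN:
        # Surface the undecided case to a human reviewer.
        review_queue.append({"proposed": dict(proposed), "reason": "invariant undecided"})
    return safe_default


# Toy usage: a speed limit known only to lie between 1.0 and 1.5 (made-up units).
NO_OP = {"kind": "no_op"}
queue: List[Dict[str, Any]] = []
limit_range = (1.0, 1.5)
results = []
for speed in (0.8, 1.2, 2.0):
    action = {"kind": "move", "speed": speed}
    verdict = check_upper_bound(speed, limit_range)
    results.append((verdict, decide(verdict, action, NO_OP, queue)))
```

Only the middle proposal lands in the review queue here, which hints at the latency concern: every undecided action waits on a human.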

Third, multi-agent equilibria under envelope constraints. Adding the envelope changes the strategic environment the agents inhabit. We have not yet characterized whether the equilibrium properties under the constrained game are well-behaved.

Fourth, adversarial robustness of the envelope itself. The monitor is software. It can be attacked. We are working through the threat model with red-team support; the results will inform how aggressively we constrain the agent’s read access to the monitor’s state.

Fifth, evaluation methodology. How do we measure that an envelope is good? We have early proposals — counterfactual replay against historical incidents, structured red-team probes against constructed threat models, behavioral parity testing — but we do not consider any of them complete.
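
Of the proposals in the fifth question, counterfactual replay is the easiest to sketch. The Python below runs logged (state, action) records back through a set of invariants and reports how many past incidents would have been blocked and how many benign actions would have been blocked by mistake. The record format, the two metric names, and the four records are invented toy data for illustration, not results.

```python
from typing import Any, Callable, Dict, List, Optional

Record = Dict[str, Any]   # {"state": ..., "action": ..., "incident": bool}
Check = Callable[[Dict[str, Any], Dict[str, Any]], bool]


def replay(records: List[Record], invariants: List[Check]) -> Dict[str, Optional[float]]:
    """Replay logged actions through the invariants and summarize the outcome."""
    caught = incidents = blocked_benign = benign = 0
    for rec in records:
        # The envelope would block if any invariant fails on this record.
        would_block = any(not inv(rec["state"], rec["action"]) for inv in invariants)
        if rec["incident"]:
            incidents += 1
            caught += would_block
        else:
            benign += 1
            blocked_benign += would_block
    return {
        "incident_catch_rate": caught / incidents if incidents else None,
        "benign_block_rate": blocked_benign / benign if benign else None,
    }


# Toy invariant and toy log; every value here is made up.
def speed_ok(state: Dict[str, Any], action: Dict[str, Any]) -> bool:
    return action["speed"] <= state["speed_limit"]


history = [
    {"state": {"speed_limit": 1.0}, "action": {"speed": 2.0}, "incident": True},
    {"state": {"speed_limit": 1.0}, "action": {"speed": 0.9}, "incident": True},
    {"state": {"speed_limit": 1.0}, "action": {"speed": 0.5}, "incident": False},
    {"state": {"speed_limit": 1.0}, "action": {"speed": 1.2}, "incident": False},
]
summary = replay(history, [speed_ok])
```

The second toy incident slips through because nothing in the invariant set covers its cause, which illustrates why replay alone is incomplete: it can only score the envelope against failures that were already logged.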

What comes next

Over the next several months we will publish three artifacts. The first is the invariant specification language and its verification toolchain, with a worked example covering the warehouse testbed. The second is a public eval suite for multi-agent envelope properties, calibrated against external benchmarks where possible. The third is a deployment guide for operators: how to write an invariant set for a real production environment, how to review violations, and how to evolve the spec safely as the underlying policy improves.

The substantive ongoing work lives at research / AI safety and research / autonomous agents. The principles that govern how we deploy any of this live at safety / principles. We will update this page as the public artifacts ship.

— Rehan Temkar, Co-founder, Apik Systems