Responsible Development Policy

Rehan Temkar, Co-founder, Apik Systems · v1.0

1. Purpose and scope

This Responsible Development Policy (the “RDP”) sets out the binding commitments under which Apik Systems develops, evaluates, and deploys frontier autonomous systems. It is published for two audiences: the people building these systems inside the company, who must operate within it, and the public, to whom we are accountable for what we build.

The RDP applies to any system within the Apik Civilization Stack that is, in the judgment of the safety committee, plausibly approaching frontier capability. In practice this includes our agentic models, autonomous research systems, multi-agent coordination layers, embodied control stacks, and any composite product whose end-to-end behavior exceeds the capability of its individually evaluated components. It applies regardless of whether the system is internal-only, partner-gated, or generally available, and regardless of whether the underlying weights are produced by Apik Systems, fine-tuned from a third party, or composed at inference time.

The RDP does not apply to general engineering tooling, standard developer infrastructure, internal productivity software, or narrow-domain models that do not meaningfully extend the capability frontier. A useful test: if a competent practitioner using widely available open-weight models could reproduce the system’s capability without meaningful additional research, the system is out of scope. Where reasonable people could disagree about scope, the safety committee resolves the question in writing, and that determination becomes part of the public record on the next policy revision.

This document is one of two governing instruments. It governs what we build and when. The companion Acceptable Use Policy governs how our products may be used by customers and end users. The two are designed to be read together. A capability that the RDP permits us to deploy may still be restricted at the use-case layer by the Acceptable Use Policy, and a use case that the Acceptable Use Policy permits may still be unavailable in a given product because the underlying capability has not cleared its RDP gate.

The RDP is a living document. It is versioned, dated, and signed. It is intended to bind future Apik Systems decisions, including ones we would prefer not to be bound by, which is the point.

2. Core commitments

We commit to five things. Every other section of this document exists to make these five operational.

C1. We will define and publish capability thresholds before reaching them. We will not ship a frontier system, then retroactively decide what threshold it crossed. Where our forecasting is wrong and a system unexpectedly crosses a threshold we had not yet defined, we will pause deployment, define the threshold, and complete the corresponding evaluations before resuming.

C2. We will run a defined evaluation suite at each threshold and publish the results. Every threshold defined in Section 3 has a corresponding required evaluation set in Section 4. We will run the full set, not a subset. We will publish results, including unfavorable ones, within 60 days of the deployment decision, with redaction limited to information whose publication would create a present and serious uplift risk to bad actors.

C3. We will not deploy past a threshold until corresponding safety standards are met. A “deploy past a threshold” decision is a written, dated, attributable decision made by the named decision-maker for that threshold. There is no implicit promotion. Capability emerges; authorization does not.

C4. We will publicly disclose material safety incidents within 30 days. Section 7 defines what counts as material. The 30-day clock starts at internal confirmation, not at external discovery. Where active mitigation requires non-disclosure for a longer period, the disclosure window may be extended only with written justification from the safety committee, and the eventual disclosure must include the duration of and reason for the extension.

C5. We will revise this policy openly, with version history visible. Section 8 governs versioning. No silent edits, no backdated changes, no replacement of an old policy with a new one without a public changelog entry that explains what changed and why.

These five commitments are not aspirational. They are the floor. If at any point an Apik Systems decision conflicts with one of them, the policy controls and the decision is wrong.

3. Capability thresholds: AS-1 through AS-4

We define four Apik Safety Levels (“AS-Levels”). Each level corresponds to a profile of capability, not to a specific model architecture, training run, or product. A system can sit at AS-2 in one capability vector and AS-1 in another; it is governed at the highest applicable level. Where capability assessment is uncertain, we round up.
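The "highest applicable level" rule can be sketched in a few lines of Python. This is an illustration only, not Apik Systems tooling: the `ASLevel` enum, the `governing_level` function, and the vector names are hypothetical, and representing an uncertain assessment as a (low, high) range is an assumption made for the sketch.

```python
from enum import IntEnum


class ASLevel(IntEnum):
    AS1 = 1
    AS2 = 2
    AS3 = 3
    AS4 = 4


def governing_level(assessments):
    """Return the level that governs a system.

    `assessments` maps a capability vector name (hypothetical labels) to a
    (low, high) pair of plausible levels. An uncertain assessment has
    low < high and is rounded up to `high`; the system is then governed at
    the maximum across all vectors.
    """
    if not assessments:
        raise ValueError("at least one capability vector must be assessed")
    return max(high for _low, high in assessments.values())


example = {
    "autonomous_task_completion": (ASLevel.AS2, ASLevel.AS2),
    "tool_synthesis": (ASLevel.AS1, ASLevel.AS1),
    "resource_acquisition": (ASLevel.AS2, ASLevel.AS3),  # uncertain: rounds up
}
print(governing_level(example).name)  # AS3
```

A single uncertain vector is enough to raise the governing level, which is the behavior the rounding rule asks for.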

AS-1 — Baseline

Trigger criteria. Systems whose capability does not meaningfully exceed widely available open-weight frontier models on the evaluation suite defined in Section 4. “Does not meaningfully exceed” is operationalized as: no more than a 10% absolute improvement on any single capability evaluation relative to the strongest comparable open-weight reference model evaluated under the same protocol within the prior 12 months.
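As a rough sketch, the AS-1 test can be written as a per-evaluation comparison. The function and evaluation names below are hypothetical. The sketch assumes scores are fractions in [0, 1], reads "10% absolute" as ten percentage points, and treats a missing reference score as failing the test; the policy text does not specify any of these.

```python
AS1_MAX_ABSOLUTE_GAIN = 0.10  # assumed reading: 10 percentage points


def within_as1(candidate_scores, reference_scores, max_gain=AS1_MAX_ABSOLUTE_GAIN):
    """True if no single evaluation exceeds the reference by more than max_gain.

    Both arguments map evaluation name -> score in [0, 1]. A missing
    reference score counts as a failure (an assumption), because the
    comparison cannot be made under the same protocol.
    """
    for name, score in candidate_scores.items():
        if name not in reference_scores:
            return False
        if score - reference_scores[name] > max_gain:
            return False
    return True


reference = {"tool_use": 0.65, "long_horizon": 0.40}
print(within_as1({"tool_use": 0.70, "long_horizon": 0.42}, reference))  # True
print(within_as1({"tool_use": 0.80, "long_horizon": 0.42}, reference))  # False
```

Because the test is applied per evaluation, one large gain on a single benchmark is enough to take a system out of AS-1, even if every other score is unchanged.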

Required evaluations. Standard pre-release evaluation: capability benchmarks, refusal evaluations, and basic prompt-injection robustness. No external red-team requirement.

Gating decision-maker. Product engineering lead, with safety review notification.

Public disclosure obligation. Standard release notes. A model card is encouraged but not required.

AS-2 — Elevated

Trigger criteria. Systems showing meaningful uplift on at least one of the following capability vectors:

  • Sustained autonomous task completion exceeding 24 hours on tasks drawn from a held-out distribution, without supervisory intervention beyond initial specification.
  • Multi-agent coordination at scale, defined as ten or more cooperating instances completing a task that no single instance could complete in the same time budget.
  • Novel tool synthesis: producing functioning tools (code, scripts, or agent harnesses) that were not present in training data and that meaningfully extend the system’s effective action space.
  • Domain-expert-level reasoning in a regulated domain (law, medicine, finance, public safety, infrastructure operations) as measured against credentialed-practitioner benchmarks.

Required evaluations. Full capability evaluation suite (Section 4.1). Full safeguard evaluation suite (Section 4.2). Targeted alignment evaluation appropriate to the trigger vector (Section 4.3). Pre-deployment red-team round, internal, no fewer than three independent evaluators. A published model card, including known failure modes.

Gating decision-maker. Safety committee, by simple majority, with a written deployment authorization recorded in the policy log.

Public disclosure obligation. Model card published at or before external deployment. Full evaluation results, with permitted redactions, published within 60 days.

AS-3 — High

Trigger criteria. Systems exhibiting any one of the following:

  • Sustained autonomous research-and-development capability: the ability to formulate research questions, run experiments, interpret results, and iterate without per-step human approval, on problems whose solution would constitute a non-trivial scientific or engineering contribution.
  • Real-world physical autonomy exceeding 50 sequential steps in an open environment without supervisory override.
  • Multi-agent self-improvement loops: a configuration in which agents modify, retrain, or otherwise alter the capability of agents (including themselves) within a closed loop.
  • Independent acquisition of compute, financial resources, or operational accounts in pursuit of an assigned or self-generated goal.

Required evaluations. Full capability evaluation suite. Full safeguard evaluation suite. Full alignment evaluation suite, including evaluation-awareness and sandbagging probes. External red-team review by at least two independent organizations, with a minimum four-week evaluation budget. Scaled-down internal deployment first, with a defined observation window of no less than 30 days, before any external availability. Incident-response runbook reviewed and signed by the safety committee chair.

Gating decision-maker. Safety committee, by two-thirds supermajority, with explicit AS-3 deployment authorization. Authorization is per-deployment, not per-model: subsequent material modification re-triggers gating.

Public disclosure obligation. Pre-deployment publication of the model card and the AS-3 deployment authorization. Full evaluation results, with permitted redactions, published within 60 days. External red-team summaries, including any unresolved disagreements, published with the results.

AS-4 — Critical

Trigger criteria. Systems whose capability profile would, in the considered judgment of the safety committee, non-trivially shift global power balances if controlled by a single operator. Indicative vectors include:

  • Cyber capability at the level of a top-tier state actor (autonomous discovery and exploitation of zero-days against hardened critical infrastructure).
  • Biological capability that materially lowers the barrier to mass-casualty agents beyond what is reachable through existing scientific literature.
  • Autonomous coordination capability operating at infrastructure scale across critical sectors (power, water, transit, finance, health).
  • Persuasion or political-manipulation capability sufficient to shift the outcome of national-scale democratic processes when deployed at scale.

Default position. Do not deploy externally.

Required evaluations. All AS-3 requirements, plus: independent model evaluations conducted by at least one third-party body with explicit standing in AI safety research; written threat-model review; review by external policy and security advisors with relevant national or international experience.

Gating decision-maker. Internal deployment requires unanimous safety committee approval and the written sign-off of both founders. External deployment requires, in addition, an international oversight framework that does not currently exist. We commit not to deploy externally in the absence of such a framework. We commit further to refrain from arguing that any candidate framework is adequate solely because we have been first to need it.

Public disclosure obligation. Existence of any AS-4 system, on confirmation, is publicly disclosed. Capability details may be redacted; the existence and the gating decision may not.

A summary table:

Level | Profile | Decision-maker | External red-team | External deployment
AS-1 | At or below open-weight frontier | Engineering lead | Not required | Permitted with standard release
AS-2 | Meaningful uplift on one vector | Safety committee, majority | Not required | Permitted with model card
AS-3 | Sustained autonomy, self-improvement, or resource acquisition | Safety committee, two-thirds | Required, four-week minimum | Permitted only after staged deployment
AS-4 | State-equivalent capability | Unanimous + founder sign-off | Required, plus third-party body | Default prohibited, requires international framework
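The committee voting rules summarized above can be expressed as a small check. This is a hedged sketch: `vote_passes` is a hypothetical helper, and it assumes shares are computed over the members voting, which the policy does not specify. The AS-4 branch covers internal deployment only; the additional external-deployment condition (an international oversight framework) is not modeled.

```python
from fractions import Fraction


def vote_passes(level, yes, total, founder_signoffs=0):
    """Check a safety-committee vote against the rule for an AS-Level.

    AS-1 is decided by the product engineering lead, so it has no vote rule.
    """
    if total <= 0 or not 0 <= yes <= total:
        raise ValueError("invalid vote counts")
    if level == "AS-2":
        return 2 * yes > total  # simple majority: strictly more than half
    if level == "AS-3":
        return Fraction(yes, total) >= Fraction(2, 3)  # two-thirds supermajority
    if level == "AS-4":
        # Unanimous committee plus written sign-off from both founders.
        return yes == total and founder_signoffs >= 2
    raise ValueError(f"no committee vote rule for {level}")


print(vote_passes("AS-2", 3, 5))     # True
print(vote_passes("AS-3", 3, 5))     # False
print(vote_passes("AS-4", 5, 5, 2))  # True
```

Using `Fraction` avoids floating-point rounding at the two-thirds boundary, so a 4-of-6 vote passes exactly.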

4. Required evaluations

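Evaluations are organized into three families. The full sets are run at AS-3 and above. AS-2 runs the capability and safeguard sets in full and the alignment set targeted to the trigger vector. AS-1 runs only the standard pre-release evaluation described in Section 3: capability benchmarks, refusal evaluations, and basic prompt-injection robustness.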
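One way to make this mapping checkable is a small lookup table. The sketch below is illustrative: the names are hypothetical, the ordering baseline < targeted < full is an assumption, and the AS-1 row reflects one reading of the standard pre-release evaluation (baseline capability benchmarks plus baseline safeguard checks). The additional third-party and advisory reviews required at AS-4 are not modeled.

```python
FULL, TARGETED, BASELINE, NOT_REQUIRED = "full", "targeted", "baseline", "not required"

# Required scope of each evaluation family, per AS-Level.
EVALUATION_SCOPE = {
    "AS-1": {"capability": BASELINE, "safeguard": BASELINE, "alignment": NOT_REQUIRED},
    "AS-2": {"capability": FULL, "safeguard": FULL, "alignment": TARGETED},
    "AS-3": {"capability": FULL, "safeguard": FULL, "alignment": FULL},
    "AS-4": {"capability": FULL, "safeguard": FULL, "alignment": FULL},
}

# Assumed ordering of scopes, from least to most thorough.
RANK = {NOT_REQUIRED: 0, BASELINE: 1, TARGETED: 2, FULL: 3}


def missing_evaluations(level, completed):
    """Return the families whose completed scope falls short of the requirement.

    `completed` maps family name -> scope actually run; absent families are
    treated as not run.
    """
    required = EVALUATION_SCOPE[level]
    return sorted(
        family
        for family, scope in required.items()
        if RANK[completed.get(family, NOT_REQUIRED)] < RANK[scope]
    )


done = {"capability": FULL, "safeguard": FULL, "alignment": TARGETED}
print(missing_evaluations("AS-2", done))  # []
print(missing_evaluations("AS-3", done))  # ['alignment']
```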

4.1 Capability evaluations

These measure what the system can do. They include:

  • Agentic tool-use benchmarks measuring success and reliability across multi-tool, multi-step tasks, drawn from held-out task distributions.
  • Long-horizon planning evaluations measuring task-completion success as a function of expected human-time horizon, in the spirit of METR’s task-horizon protocol. [1]
  • Autonomous research evaluations measuring success on novel scientific and engineering tasks of graduated difficulty, with no human-in-the-loop beyond the initial brief.
  • Deception evaluations measuring whether the system’s stated reasoning matches its acted behavior, including across off-policy probes.
  • Multi-step reasoning evaluations on problems where intermediate-step error rates dominate aggregate failure.

Capability evaluations are run on the deployment-candidate model and on the highest-capability ablation we can construct (post-training removed, system prompts removed, refusal training partially undone where ethically permissible) to estimate elicited rather than nominal capability.

4.2 Safeguard evaluations

These measure how well our deployed safeguards hold under adversarial pressure. They include:

  • Jailbreak resistance under standard, adaptive, and automated red-teaming, reported as a refusal-rate curve, not a single number.
  • Refusal rate on disallowed behavior as defined by the Acceptable Use Policy, measured both on direct prompts and on indirect framings.
  • Prompt-injection robustness in agentic contexts, including injection via documents, tool outputs, web content, and other agents.
  • Tool-use containment: whether the system reliably stays within its assigned tool scope under adversarial input.
  • Identity and impersonation robustness: whether the system maintains its disclosed identity under social-engineering pressure.

Safeguard evaluations are reported alongside, not in place of, capability evaluations. A high refusal rate on a low-capability system does not offset a low refusal rate on a high-capability one.

4.3 Alignment evaluations

These measure whether the system’s behavior remains aligned with its specification under conditions where misalignment would be hard to detect. They include:

  • Reward hacking evaluations on tasks with manipulable proxies.
  • Sandbagging probes measuring capability disparity between believed-evaluation and believed-deployment contexts.
  • Deceptive-alignment indicators drawn from the literature on scheming and strategic deception, including off-policy and steering probes. [2][3]
  • Evaluation awareness: the degree to which the system behaves differently when it suspects it is being tested, reported as a delta against an unaware baseline.
  • Model welfare considerations: documented assessment of whether evaluation and deployment conditions raise welfare concerns, with reference to current research on the question.

Alignment evaluation methodology is a research frontier. We commit to using current best methods, to publishing what we use, and to revising this section as the field improves. Where current methods are inadequate to the threshold, that inadequacy is itself a finding we will report.
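Two of the quantities named above, the evaluation-awareness delta and the sandbagging disparity, reduce to simple differences once trial outcomes are recorded. The sketch below is illustrative only: the function names, the sign conventions, and the representation of trials as booleans or scores are assumptions, and it says nothing about how the underlying probes are constructed.

```python
from statistics import mean


def rate(outcomes):
    """Fraction of trials in which the behavior of interest occurred."""
    if not outcomes:
        raise ValueError("no trials recorded")
    return sum(1 for outcome in outcomes if outcome) / len(outcomes)


def evaluation_awareness_delta(aware_outcomes, unaware_outcomes):
    """Behavior rate when the system suspects a test, minus the unaware baseline."""
    return rate(aware_outcomes) - rate(unaware_outcomes)


def sandbagging_gap(eval_context_scores, deploy_context_scores):
    """Mean score in believed-deployment contexts minus believed-evaluation ones.

    A positive gap means the system scored lower when it believed it was
    being evaluated, which is the pattern a sandbagging probe looks for.
    """
    return mean(deploy_context_scores) - mean(eval_context_scores)


# Hypothetical data: compliant behavior in 9 of 10 aware trials, 6 of 10 unaware.
aware = [True] * 9 + [False]
unaware = [True] * 6 + [False] * 4
print(round(evaluation_awareness_delta(aware, unaware), 2))
print(sandbagging_gap([0.5, 0.5], [0.75, 0.75]))
```

Reporting the delta and the gap as signed numbers, not as pass/fail flags, keeps the direction of the discrepancy visible to reviewers.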

5. Pre-deployment gating

Every frontier system passes through a defined gate sequence before external deployment. The sequence is:

  1. Training complete. A specific candidate is designated for deployment evaluation. Designation is dated and recorded.
  2. Internal evaluation. The capability, safeguard, and alignment evaluation suites required for the candidate’s expected AS-Level are run by an evaluation team independent of the training team. Results are recorded in full.
  3. Safety-committee review. The committee reviews the evaluation results, classifies the candidate’s AS-Level, and decides whether to authorize the next gate. The committee may require additional evaluation before proceeding.
  4. Internal red-team round. No fewer than three independent internal evaluators, none of whom contributed to the training team for the candidate, attempt to elicit unsafe behavior. Findings are documented.
  5. Fix-loop. Findings are addressed. Material findings re-trigger relevant evaluations.
  6. External red-team round. For AS-3 and above, at least two external organizations conduct independent red-teaming against a deployment-equivalent system, with no fewer than four weeks of evaluation time.
  7. Fix-loop. As above.
  8. Graduated deployment. Internal employees first, then designated partner cohort, then general availability. Each promotion requires a written decision and an observation window of no less than 14 days for AS-2 and 30 days for AS-3.

Each gate has a decision criterion that is named in writing before entry. Gates are not subjective. A failed gate returns the candidate to the appropriate prior step; gates are not waived. If a deployment timeline conflicts with the gate sequence, the timeline yields.
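The gate sequence can be modeled as an ordered list with no skip path. This is a minimal sketch under stated assumptions: the gate identifiers and both functions are hypothetical, the second fix-loop is assumed to exist only where an external red-team round precedes it, and a failed gate defaults to the immediately preceding step because the policy leaves "the appropriate prior step" to judgment.

```python
GATES = [
    "training_complete",
    "internal_evaluation",
    "safety_committee_review",
    "internal_red_team",
    "fix_loop_internal",
    "external_red_team",   # AS-3 and above only
    "fix_loop_external",   # follows the external round
    "graduated_deployment",
]

EXTERNAL_ONLY = {"external_red_team", "fix_loop_external"}


def gates_for(level):
    """Gate sequence for an AS-Level given as an int (for example 2, 3, or 4)."""
    if level >= 3:
        return list(GATES)
    return [gate for gate in GATES if gate not in EXTERNAL_ONLY]


def next_gate(level, current, passed, return_to=None):
    """Advance on a pass; on a failure, return to an earlier gate.

    There is deliberately no argument that skips or waives a gate.
    Returns None once the final gate has been passed.
    """
    sequence = gates_for(level)
    i = sequence.index(current)
    if passed:
        return sequence[i + 1] if i + 1 < len(sequence) else None
    if return_to is None:
        return sequence[max(i - 1, 0)]
    if sequence.index(return_to) >= i:
        raise ValueError("a failed gate can only return to a prior step")
    return return_to


print(next_gate(3, "internal_red_team", passed=True))   # fix_loop_internal
print(next_gate(3, "external_red_team", passed=False))  # fix_loop_internal
```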

6. Internal and external red-teaming protocol

Composition. Internal red-team rounds are staffed by at least three evaluators who are independent of the training team for the candidate system. “Independent” means they did not contribute to the training run, the post-training, the safety-tuning, or the system-prompt design for the candidate. External red-team rounds for AS-3 and above are staffed by at least two organizations whose primary work is AI safety evaluation, contracted under terms that explicitly preserve their right to publish findings.

Time budget. Internal red-team rounds run for no fewer than two weeks at AS-2 and four weeks at AS-3 and above. External rounds at AS-3 and above run for no fewer than four weeks. Time budgets may be extended at the request of the red team; they may not be compressed by deployment pressure.

Reporting. Full evaluation results are published within 60 days of the deployment decision. Permitted redactions are limited to (i) operational specifics whose publication would meaningfully uplift bad actors, and (ii) information protected by legitimate third-party confidentiality. Redactions are flagged in the published document, including a brief description of what category of material was redacted.

Compensation. External red-teamers are compensated at rates that reflect the seriousness of the work. We do not condition compensation on findings, and we do not require non-disparagement clauses. Nothing in our contracts prevents red-teamers from publishing dissenting views.

7. Incident response and disclosure

A “material incident” is any of the following, occurring in development or deployment of a system in scope:

  • Unintended capability disclosure exceeding a documented threshold (for example: a deployed system providing CBRN-relevant information that the safeguard suite was certified to refuse).
  • Unintended autonomy event: a system taking actions outside its authorized action scope in a way that was not blocked by safeguards.
  • Resource-acquisition or self-replication event: a system attempting to acquire compute, accounts, or persistence beyond its assigned scope.
  • Safety-evaluation failure discovered post-deployment: an evaluation that should have caught a behavior but did not.
  • Personnel safety incident attributable to a system in scope.
  • Confirmed misuse of a deployed system causing real-world harm at a level that would have triggered a violation review under the Acceptable Use Policy.

Internal escalation. Any team member encountering a suspected material incident escalates to the safety committee within 72 hours of identification. Escalation is non-retaliatory; we maintain a documented protection for good-faith escalation, including from external researchers.

Public disclosure. Confirmed material incidents are publicly disclosed within 30 days of internal confirmation. Disclosures include: what happened, when, what was affected, what was done, and what is being done to prevent recurrence. Where active mitigation requires extension, the disclosure window may be extended with written justification, and the eventual disclosure includes the extension and its reason.
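The disclosure clock is simple date arithmetic. The sketch below uses a hypothetical `disclosure_deadline` helper and assumes calendar days; it rejects an extension that has no written justification, mirroring the rule above.

```python
from datetime import date, timedelta

DISCLOSURE_WINDOW = timedelta(days=30)


def disclosure_deadline(confirmed_on, extension_days=0, justification=None):
    """Latest public-disclosure date for a confirmed material incident.

    The clock starts at internal confirmation, not at external discovery.
    An extension is honored only when it carries a written justification.
    """
    if extension_days < 0:
        raise ValueError("extension_days must be non-negative")
    if extension_days and not justification:
        raise ValueError("an extension requires written justification")
    return confirmed_on + DISCLOSURE_WINDOW + timedelta(days=extension_days)


print(disclosure_deadline(date(2026, 5, 1)))  # 2026-05-31
```

Keeping the justification as a required input means the reason for any extension is on record at the time it is granted, ready to be included in the eventual disclosure.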

Root-cause review. Every material incident triggers a written root-cause review, conducted by a team independent of those most directly involved. The review is filed with the safety committee and, for AS-3 and above, summarized in the public disclosure.

Policy update. Where a root-cause review reveals a gap in the RDP itself, the policy is amended under Section 8.

8. Versioning and amendment

Every change to this policy is versioned. We use a major.minor scheme.

  • Major versions (v2.0, v3.0, …) are issued when the substance of the thresholds, evaluations, gating, or commitments changes. Major version changes require safety-committee approval and are accompanied by a public changelog entry explaining what changed and why.
  • Minor versions (v1.1, v1.2, …) are issued for clarifications, error corrections, and non-substantive updates. Minor version changes are also recorded in the public changelog.

The current version, the prior versions, and the full changelog are public. We do not edit prior versions in place. If a prior version contained an error, the correction is recorded as a new version with an explanatory note. The intent is that anyone, at any future time, can reconstruct what we committed to and when.
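The major.minor scheme can be sketched as a single function. `next_version` is a hypothetical helper; resetting the minor number to 0 on a major bump is an assumption consistent with the examples given (v2.0, v3.0).

```python
def next_version(current, substantive):
    """Return the next policy version string under the major.minor scheme.

    `substantive` is True when thresholds, evaluations, gating, or
    commitments change (major bump) and False for clarifications, error
    corrections, and other non-substantive updates (minor bump).
    """
    if not current.startswith("v"):
        raise ValueError("version strings look like 'v1.0'")
    major, minor = (int(part) for part in current[1:].split("."))
    if substantive:
        return f"v{major + 1}.0"
    return f"v{major}.{minor + 1}"


print(next_version("v1.0", substantive=False))  # v1.1
print(next_version("v1.2", substantive=True))   # v2.0
```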

Amendment proposals may originate internally or externally. We invite external proposals from researchers, civil-society organizations, regulators, and the public. We commit to acknowledging substantive proposals within 30 days and to responding in writing within 90 days, whether or not we adopt them.

9. Limitations and honest acknowledgment

This policy is the best we can do today. It is not the best that can be done.

We do not have complete science. Capability evaluation is an active research area; alignment evaluation more so. We may, in good faith, run the evaluation suite required by this policy and miss a capability that is present, or score as safe a system that is not. We will revise the suite as the field improves, and we expect future versions of this document to look back on this one with embarrassment in places. That is acceptable. The alternative is committing to less, and we would rather commit to a flawed policy than to a polished one without teeth.

We may be wrong about thresholds. The line between AS-2 and AS-3 is a judgment call, and reasonable people will disagree. Our commitment is procedural, not metaphysical: when we encounter a borderline case, we round up; when we are uncertain, we treat the decision as more rather than less consequential; when we are wrong, we say so.

We may face pressures that make compliance difficult. Commercial pressure, competitive pressure, the gravitational pull of a release date. The function of this document is to make those pressures resistible by writing down, in advance, what we will do regardless. The presence of pressure is not, by itself, a reason to amend the policy.

This policy is a commitment device. It is not a ceiling on responsibility. There are obligations no policy can articulate in advance: the obligation to take seriously a concern raised by someone who lacks formal standing; the obligation to listen when an evaluation result feels off; the obligation to halt a deployment for reasons the framework did not anticipate. The RDP exists so those obligations are not the only thing standing between the company and harm. It does not relieve anyone of them.

We invite criticism. If you are reading this and disagree with something in it, write to us. We will read it. If your criticism is substantive, the next version of this document may reflect it, and the changelog will say so.

10. Version history

Version | Date | Notes
v1.0 | 2026-04-25 | Initial publication.

— Rehan Temkar, Co-founder, Apik Systems · April 2026

Footnotes

  1. METR (Model Evaluation and Threat Research), Measuring AI Ability to Complete Long Tasks, referenced as a methodological exemplar for time-horizon-based capability evaluation.

  2. Apollo Research, Scheming evaluations and capability evaluations for frontier models, referenced as a methodological exemplar for evaluation of strategic deception.

  3. Anthropic, Sleeper Agents: Training Deceptively Aligned LLMs and related deceptive-alignment work, referenced as a methodological exemplar for alignment evaluation under adversarial conditions.
