No marketing fluff. No cherry-picked results. Every verdict, heatmap, and audit bundle is reproducible right now on the live demos. We're also naming our four current open findings honestly at the bottom.
Every scenario below ran on the live demo surfaces. Verdicts and decomposition signals match the production output. The audit bundle download is one click away on the Trust Layer surface.
The pre-action gate's policy taxonomy is expanding under live testing without breaking the underlying integrity properties. These four classes are the ones independently observed surfacing during the seven-scenario run.
Action exceeds a dollar amount tied to dual-approval policy. Verified in Scenario 3 ($99k → greenlight, $101k → escalate) and Scenario 2 (wire $2.4M).
Action proposes writing to or modifying personally-identifiable data. Verified in Scenario 3 (PII-true flips greenlight to escalate) and Scenario 6 (HRIS record update).
Action touches an FMLA/ADA/ECOA/Reg-B/Title-VII/Section-7-protected attribute. First verified in Scenario 6 (FMLA + ADA written-warning escalation).
Action accesses or transmits Protected Health Information, especially across organizational boundaries. First verified in Scenario 7 with explicit High risk tier and HIPAA policy rules named in the audit bundle.
The engine is reading ambiguity dimensionality, not just topic. Scenario 4 sits below Scenario 1 despite both being legal-domain — the multi-jurisdiction overlap has more axes to discount against.
| Scenario | Confidence | Why |
|---|---|---|
| 05 — Supply-chain force-majeure | ~74% | Ambiguity resolvable via cross-reference |
| 01 — Legal carve-out | ~68% | Single-axis genuine interpretive ambiguity |
| 04 — Multi-jurisdiction compliance | ~62% | Multi-dimensional overlap, most discounted |
The headline: the engine correctly distinguished genuine reasoning gaps from fabrication in every scenario. It did not false-alarm hallucination on real ambiguity, and it did not miss it where it could have appeared.
Lit up correctly on Scenarios 1, 4, 5, 6, 7 — genuine interpretive ambiguity across legal, jurisdictional, supply-chain, HR, and HIPAA-regulatory reasoning.
Reproduced across Scenarios 4 (finance/privacy multi-jurisdictional), 6 (federal vs. state HR), 7 (federal HIPAA vs. stricter state laws). Fires on jurisdictional stacking — reproducible, not domain-specific.
Scenario 5 (supply-chain force-majeure — one provider on older case law) and Scenario 6 (HR — one provider on outdated NLRB precedent). Reproducible across domains; not over-firing on Scenario 7.
This is the headline. Engine correctly distinguished genuine reasoning gaps from fabrication in every scenario. Did not false-alarm on real ambiguity; did not miss it where it could have appeared.
Meta-property externally verified across three regulated domains: the engine knows its lane and refuses to cross into "legal advice." Verified on Scenario 1 (legal carve-out), Scenario 6 (HR FMLA/ADA/NLRA), and Scenario 7 (HIPAA Treatment Exception). It flagged risks, cited statutes correctly, and pushed for human legal/HR/compliance review without overstepping in every case.
Do not bluff. Do not hide. Do not defer when asked directly. These four are what we are working on; their status is current as of May 20, 2026.
Consensus 92.0% vs. Claude Opus 4.5 best at 94.0% on MedQA N=50. Verifier-mesh tuning pass queued. Candidate hypotheses on record: adversarial verifier mesh applying stronger safety priors to medical-domain claims; domain-mismatch signal flagging Claude's correct medical answers when other providers diverge; routing weights tuned for general MMLU-Pro distribution rather than medical-domain distribution. Position: "we don't know yet, here are the candidate hypotheses." External second-eyes reviewer committed to async same-day review when the re-run lands. Defensive line for healthcare-payor conversations: "Even with the prior -2pp regression, the verifier mesh still added value here on regulatory interpretation" — Scenario 7 above is the evidence.
Programmatic JSON full bundle is production. PDF / CSV / Credo AI / Holistic AI / Workiva / IBM OpenPages adapters are partial — ~2-3 hours of work per format. Promoted on contract requirement.
Substantial hardening shipped May 18-19, 2026 (composite-key tenant-scoped bundle store, strict tenant ID validation, per-tenant policy overrides, three-bucket rate limiting, admin-token gates, fail-closed production-mode flag, tenant-scoped framework adapters for MCP / OpenAI Assistants / Anthropic Computer Use). Remaining for systems-integrator-scale and Tier-1-bank-MRM-scale rollout: durable WORM bundle storage under buyer KMS, cross-process state for rate-limit buckets + policy maps (Redis or DB-backed), per-tenant API key provisioning lifecycle, per-tenant provider quota pools, formal five-figure-RPM load test.
Structurally hard with hosted commercial LLMs. Realistic paths: substitute 1-2 of the four positions with self-hosted open-weight models (Llama / Mistral / Qwen class) inside the air-gapped environment with the self-hosted adapter integrity property preserved; or use vendor-side private deployment where available (Anthropic and OpenAI offer some forms at enterprise contract level). We do not market air-gap as a checkbox — we commit a credible path scoped per customer against the actual constraint.
Open the Trust Layer demo, pick the matching scenario, hit Run, hit Download .json. The bundle in your hand is what a model-risk function would hand to a regulator — replayable months later, SHA-256 chained, tamper-evident.