Founding RFC
AI Neurosurgery
How Kenshiki reads inference-time signals inside the model instead of treating generation as an opaque black box.
Status: Founding RFC — Approved
The black box is a choice
Current generative AI architectures rely on probabilistic self-governance and single-pass streaming, prioritizing low latency over epistemic rigor. Kenshiki starts from a different premise: consequential model behavior should be observed, scored, and constrained at inference time — not trusted at face value and reviewed after the fact.
The system treats every generation as an observable process. Token probabilities, entailment signals, stability analysis, and causal attribution all expose whether a model answer is grounded or merely fluent.
The bounded-synthesis pipeline
Kenshiki does not ask the model to evaluate its own output. The truth boundary is external. Evidence is defined in Kura. The model generates from bounded context. The Claim Ledger evaluates what comes back. The Boundary Gate decides what is allowed to leave.
The pipeline runs in stages:
- Kura establishes the evidence boundary — source material with provenance, structure, and retrieval boundaries
- Prompt Compiler rewrites the request into a governed query using CFPO (Content–Format–Policy–Output), positioning evidence where attention mechanisms will weight it
- Generation produces a response from bounded context — the model sees only what Kura and the Crosswalk authorize
- Claim Ledger decomposes the response into atomic claims and evaluates each one through multiple layers of verification
- Boundary Gate makes the final emission decision based on Ledger output
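The staged flow above can be sketched as a single composition in which each stage sees only the previous stage's output. All five callables are hypothetical stand-ins for Kura, the Prompt Compiler, generation, the Claim Ledger, and the Boundary Gate, not the actual Kenshiki interfaces:

```python
def bounded_synthesis(request, retrieve, compile_prompt, generate, evaluate, decide):
    """Illustrative pipeline sketch: every stage's output is the next
    stage's only input, so the model never sees unauthorized context and
    never judges its own answer."""
    evidence = retrieve(request)                # Kura: evidence boundary with provenance
    prompt = compile_prompt(request, evidence)  # CFPO-governed query
    response = generate(prompt)                 # model sees only bounded context
    claims = evaluate(response, evidence)       # Claim Ledger: per-claim verification
    return decide(claims)                       # Boundary Gate: emission decision
```

The point of the shape is the one-way data flow: the Boundary Gate decides from Ledger output alone, so no stage can shortcut around the external truth boundary.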
How the Claim Ledger reads the model
The Claim Ledger does not score responses as monoliths. It decomposes every response into atomic factual assertions, then evaluates each claim through a hierarchy of signals — from deterministic extraction through causal attribution.
Layer 1: Calibrated confidence
Token-level logprob distributions reveal where the model is certain and where it is guessing. Raw logprobs are converted into calibrated correctness probabilities, with separate scoring for critical tokens (numbers, entities, dates, citations) versus connective language.
Low-confidence spans are flagged. Brittle completions — where small perturbations collapse the answer — are detected before they reach the caller.
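A minimal sketch of the L1 idea, with made-up numbers throughout: raw logprobs tend to be overconfident, so they are softened with a temperature fitted on held-out data (proper temperature scaling renormalizes over the full vocabulary; dividing a single logprob is a crude stand-in), and critical tokens are held to a stricter threshold than connective language:

```python
import math

def calibrated_confidence(tokens, temperature=1.8, critical_threshold=0.85,
                          connective_threshold=0.5):
    """Toy L1 sketch. `tokens` is a list of (text, logprob, is_critical).
    Temperature and thresholds are illustrative placeholders, not fitted
    values. Returns the spans that fall below their threshold."""
    flagged = []
    for text, logprob, is_critical in tokens:
        p = math.exp(logprob / temperature)  # softened probability estimate
        threshold = critical_threshold if is_critical else connective_threshold
        if p < threshold:
            flagged.append((text, round(p, 3)))
    return flagged
```

The same raw probability can pass as connective tissue and fail as a date or entity, which is the separation the layer exists to make.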
Layer 2: Source entailment
Each extracted claim is scored against the governed evidence in Kura using embedding similarity and natural language inference (NLI). The system checks whether the evidence actually entails the claim, not just whether the claim appears near the evidence.
Tier-weighted aggregation accounts for source authority. A primary authoritative source carries more weight than a supplementary reference. The Crosswalk’s SIRE metadata (Subject, Included, Relevant, Excluded) enforces the authority boundary at the retrieval level.
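Tier-weighted aggregation can be sketched as a weighted mean over per-source entailment scores. The weights below are illustrative; in Kenshiki the authority boundary itself is defined by the Crosswalk's SIRE metadata, not by this function:

```python
def tier_weighted_entailment(claim_scores, tier_weights=None):
    """Aggregate per-source entailment scores for one claim, weighting by
    source authority. `claim_scores` is a list of (tier, nli_score) pairs;
    the default weights are made-up placeholders."""
    if tier_weights is None:
        tier_weights = {"primary": 1.0, "secondary": 0.6, "supplementary": 0.3}
    num = sum(tier_weights[tier] * score for tier, score in claim_scores)
    den = sum(tier_weights[tier] for tier, _ in claim_scores)
    return num / den if den else 0.0
```

A weak score from a supplementary reference drags the aggregate far less than the same score from a primary source would.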
Layer 3: Stability analysis
Multi-draw regeneration produces multiple responses from the same bounded context. Semantic clustering identifies which claims are reproducible across draws and which are stochastic artifacts — fluent-sounding but not stable under repetition.
A claim that appears in 4 of 5 draws with consistent semantics is meaningfully different from one that appears once with high confidence. Stability separates signal from noise in the model’s output distribution.
Tier availability: L3 requires multiple inference passes from the same bounded context. In Workshop (external model APIs), this is available where the API supports deterministic sampling. In Refinery and Clean Room (self-hosted inference), multi-draw is a native capability with full control over sampling parameters.
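The reproducibility measure can be sketched as a per-claim draw count. Here `normalize` is a stand-in for real semantic clustering (for instance, embedding-based grouping); string lowercasing in the test is only a toy:

```python
from collections import Counter

def stability_scores(draws, normalize):
    """Cross-draw reproducibility sketch. `draws` is a list of claim lists,
    one per regeneration; `normalize` maps a claim to its semantic key.
    Returns, for each key, the fraction of draws it appeared in."""
    counts = Counter()
    for claims in draws:
        for key in {normalize(c) for c in claims}:  # count once per draw
            counts[key] += 1
    return {key: n / len(draws) for key, n in counts.items()}
```

A claim scoring 0.8 across five draws is the "4 of 5" case above; a 0.2 score marks a likely stochastic artifact regardless of how confident any single draw looked.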
Layer 4: Representation uncertainty
Hidden-state probes detect internal volatility that is not visible in token-level confidence. A model can produce a high-confidence token while its internal representations are unstable — the surface looks certain but the computation that produced it is not.
This layer catches the failure mode where calibrated confidence and entailment both pass, but the model’s internal state reveals that the answer was arrived at through an unstable reasoning path.
Tier availability: L4 requires access to model internals (hidden states, activation patterns). This layer is available only in Refinery and Clean Room where the inference engine is self-hosted. In Workshop, the Claim Ledger operates with L1–L3 signals only — calibrated confidence, entailment, and stability — without hidden-state probes.
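As a toy illustration of what an L4 probe measures, assuming self-hosted inference exposes per-layer hidden states as vectors: how much a token's representation moves between consecutive layers. A production probe would be a trained classifier over activations, not a raw distance:

```python
import statistics

def representation_volatility(hidden_states, claim_positions):
    """Toy L4 sketch. `hidden_states[layer][pos]` is the hidden vector for
    one token at one layer. Returns, per claim position, the mean Euclidean
    distance between consecutive layer representations: large values suggest
    the computation behind the token was still unsettled."""
    scores = {}
    for pos in claim_positions:
        deltas = [
            sum((a - b) ** 2 for a, b in zip(hidden_states[i][pos],
                                             hidden_states[i + 1][pos])) ** 0.5
            for i in range(len(hidden_states) - 1)
        ]
        scores[pos] = statistics.mean(deltas)
    return scores
```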
Contrastive causal attribution
The foundational question the Ledger answers is not “did the model see the evidence?” but “did the evidence cause this specific claim?”
The contrastive information gain measures the causal shift in model certainty when governed context is present versus absent:

Δ(f) = log p(f | P, C′) − log p(f | P)

where f is the claim token, P is the compiled prompt, and C′ is the governed evidence set.

When Δ(f) is significantly positive, the Ledger can mathematically prove that the injected context exerted a direct causal influence on the generation of that specific factual token. This separates grounded claims from the model's pre-training priors.
This is the primary causal proof in the system. Token confidence tells you the model is certain. Entailment tells you the evidence is relevant. Contrastive attribution tells you the evidence actually caused the answer.
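The gain and its classification can be sketched directly from two logprobs. The ±1.0 cutoffs below are illustrative placeholders for a proper significance test:

```python
import math

def contrastive_gain(logp_with_context, logp_without_context,
                     threshold=1.0):
    """Sketch of contrastive attribution for one claim token:
    delta = log p(f | P, C') - log p(f | P). The threshold is a made-up
    stand-in for a calibrated significance bound."""
    delta = logp_with_context - logp_without_context
    if delta > threshold:
        return delta, "evidence-caused"
    if delta < -threshold:
        return delta, "prior-contradicted"
    return delta, "prior-consistent"
```

A token the model would emit at p = 0.05 without evidence but p = 0.9 with it yields Δ ≈ 2.9, the evidence-caused case; a token whose probability barely moves was coming from the priors either way.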
What this can and cannot prove
This architecture provides positive attribution guarantees. It mathematically proves what external evidence influenced a specific claim.
It does not provide complete exclusion proofs of pre-training priors. The system cannot cryptographically guarantee that a parameter deep within the model’s weights did not subtly shape the grammatical connective tissue surrounding the verified claims. The Claim Ledger bounds the extracted facts, not the latent reasoning that connected them.
This is a deliberate design boundary, not a limitation to be solved later. The Kenshiki platform is precise about what it proves and transparent about what it does not. Workshop, Refinery, and Clean Room each increase the depth of proof available at the model boundary — but none claim epistemic omniscience.
Why this matters operationally
In high-stakes systems, post-hoc review arrives too late. Observability only matters if it feeds a deterministic control path that can block, label, or escalate before unsupported output reaches an operator, customer, regulator, or patient.
- Unsupported claims are stopped before emission, not explained after failure
- Operators can inspect why an answer passed, failed, or drifted under scrutiny
- Governance is measurable at the claim level, not aspirational at the policy level
- Failure analysis moves from guesswork to signal-backed diagnosis
- Every evaluation produces a structured record — what was asked, what evidence was in scope, what claims were made, what was supported, and why the output received the state it did
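A structured record of that kind might look like the following. Every field name and value here is illustrative, not the actual Kenshiki record schema:

```python
# Hypothetical evaluation record: one request, one claim, one emission decision.
record = {
    "request": "What is the contract renewal date?",
    "evidence_in_scope": ["kura://contracts/acme-2025#s3"],  # invented URI scheme
    "claims": [
        {
            "text": "The renewal date is 2025-03-01.",
            "l1_confidence": 0.94,      # calibrated token confidence
            "l2_entailment": 0.88,      # tier-weighted NLI score
            "l3_stability": 0.8,        # cross-draw reproducibility
            "contrastive_gain": 2.7,    # delta log-likelihood from evidence
            "status": "VERIFIED",
        }
    ],
    "state": "AUTHORIZED",
}
```

The record carries enough signal to answer, after the fact, why the output received the state it did.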
The output states — AUTHORIZED, PARTIAL, REQUIRES_SPEC, NARRATIVE_ONLY, BLOCKED — are the visible surface of this machinery. Each state is not a label applied by heuristic. It is the conclusion of a multi-layer evaluation pipeline that can show its work.
Algorithm: Inference-Time Observability Pipeline
Input: Response R from the model, compiled prompt P, governed evidence set C’, deployment tier T. Output: Per-claim observability record O, output state S.
Steps:
1. Claim decomposition. Decompose R into atomic claim spans F using out-of-band extraction (dependency parsing, entity recognition, rule-based matchers). Each span has exact token coordinates and a claim type.
2. L1: Calibrated confidence. For each claim span f in F, extract token-level logprob distributions. Convert to calibrated correctness probabilities. Score critical tokens (numbers, entities, dates, citations) separately from connective language.
3. L2: Source entailment. For each claim span f, score against governed evidence C′ using embedding similarity and NLI. Apply tier-weighted aggregation based on SIRE subject match (direct vs. relevant).
4. L3: Stability (where multi-draw is available). If tier T supports multi-draw regeneration: produce N responses from the same bounded context P. Cluster claims semantically across draws. Score each claim by cross-draw reproducibility.
5. L4: Representation uncertainty (Refinery/Clean Room only). If tier T provides access to model hidden states: probe hidden-state stability at claim token positions. Detect internal volatility not visible in token-level confidence.
6. Contrastive attribution. For each claim span f: compute Δ(f) = log p(f | P, C′) − log p(f | P). Classify as evidence-caused (Δ significantly positive), prior-consistent (Δ near zero), or prior-contradicted (Δ significantly negative).
7. Composite verification. For each claim, compute verification status from available layer scores and deployment tier: VERIFIED, PARTIALLY_VERIFIED, UNVERIFIED, or REFUSED.
8. Output state assignment. Map composite verification across all claims to output state: AUTHORIZED (all verified), PARTIAL (mixed), REQUIRES_SPEC (missing context), NARRATIVE_ONLY (advisory), BLOCKED (contradicted or failed).
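The final two steps reduce to a deterministic mapping. This is a simplified sketch: the real function also weighs layer scores and deployment tier, and REQUIRES_SPEC and NARRATIVE_ONLY are triggered by upstream conditions not modeled here:

```python
def output_state(statuses):
    """Deterministic sketch of steps 7-8: map per-claim verification
    statuses to an output state. Contradicted or refused claims dominate;
    an output is AUTHORIZED only if every claim is verified."""
    if any(s == "REFUSED" for s in statuses):
        return "BLOCKED"
    if statuses and all(s == "VERIFIED" for s in statuses):
        return "AUTHORIZED"
    return "PARTIAL"
```

Because the mapping is a pure function of claim statuses, identical Ledger output always yields an identical state, which is the determinism property claimed below.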
Determinism: Given identical (R, P, C’, T, model state), the pipeline produces identical (O, S). Every layer produces deterministic scores given identical inputs. The composite verification function is a deterministic mapping.
Tier-Dependent Proof Depth
| Layer | Workshop | Refinery | Clean Room |
|---|---|---|---|
| L1: Calibrated confidence | Available | Available | Available |
| L2: Source entailment | Available | Available | Available |
| L3: Stability | Where API supports deterministic sampling | Full control | Full control |
| L4: Representation uncertainty | Not available | Available | Available |
| Contrastive attribution | Where API supports logprobs | Full logprob access | Full logprob + hardware attestation |
Each tier increases proof depth. No tier claims epistemic omniscience.
Invariants
- The model never evaluates its own output. The truth boundary is external. The Claim Ledger is an independent evaluation system.
- Claim decomposition is deterministic. Identical response text always produces identical claim spans with identical token coordinates.
- Each layer (L1–L4) produces an independent signal. No layer’s score depends on another layer’s output.
- The composite verification decision is a deterministic function of available layer scores and deployment tier.
- Output states are assigned by the Boundary Gate based on Claim Ledger output, not by heuristic or model self-assessment.
- Contrastive attribution proves what evidence caused a claim. It does not prove absence of pre-training influence.
- Every evaluation produces a structured record sufficient to reproduce the verification decision.
- Layer availability is determined by deployment tier, not by configuration. Workshop cannot access L4 regardless of settings.
- The observability pipeline runs after generation, before emission. Unsupported claims are blocked before they reach the caller.
- The system is precise about what it proves (positive attribution) and transparent about what it does not (exclusion of pre-training priors).
© 2026 Kenshiki Labs · kenshikilabs.com · All rights reserved.
This document may be shared for evaluation purposes. Redistribution requires written permission.
https://kenshikilabs.com/articles/ai-neurosurgery
Further reading
- Workshop (Deployment Tier): the governed public-model overlay where Kenshiki wraps external models in the full control plane.
- Platform Architecture (Architecture Brief): the bounded-synthesis pipeline, trust boundaries, and enforcement model behind the Kura/Kadai contract.
- Claim Ledger (Tool): the verification engine that decomposes responses into claims and checks each one against governed evidence.
- The Ingestion Pipeline (Founding RFC): how raw documents become the Kura evidence boundary, the Phase 0 that feeds everything described in this article.