Kenshiki

Founding RFC

SIRE: Deterministic Identity for Governed Evidence

How Kenshiki decides what each source is, what it covers, what it relates to, and what it must never be used to answer — before the model sees anything.

2,435 words · ~12 min

Status: Founding RFC — Approved

Abstract

Standard RAG systems retrieve whatever is nearest in embedding space and hand it to the model. There is no authority boundary, no source provenance, and no way to enforce what evidence the model is allowed to see for a given question.

SIRE (Subject, Included, Relevant, Excluded) is the deterministic identity system that solves this. Every source document in Kura carries four SIRE fields stamped in its frontmatter during ingestion. These fields are not model-generated. They are not probabilistic. They are structural metadata that controls what the model can access — enforced at the database level, not at the prompt level.

The critical design decision: only Excluded enforces. Subject, Included, and Relevant inform discovery. Excluded gates retrieval.

The Problem SIRE Solves

Vector similarity cannot distinguish “physical access controls” from “logical access controls.” The words overlap, but the legal obligations do not. A SOC 2 auditor asking about Trust Service Criteria should never receive evidence from a HIPAA source, even if the embeddings are close — because the regulatory authority, compliance obligations, and evidentiary standards are different.

Without SIRE:

  • The model sees whatever cosine similarity surfaces, regardless of jurisdictional authority
  • A question about EU AI Act conformity assessment might retrieve NIST AI RMF guidance — related but legally distinct
  • There is no mechanism to prove what evidence the model was allowed to see
  • Post-generation scoring cannot fix what was never in scope

With SIRE:

  • Every chunk carries its source identity as structural metadata
  • Retrieval is scoped by SIRE subject grouping before ranking
  • The exclusion gate purges out-of-scope chunks before they reach the model
  • The Claim Ledger can trace every claim back to a specific, authorized evidence source

Formal SIRE Identity Definition

The SIRE identity of a versioned document d^v is a four-tuple:

SIRE(d^v) = (S, I, R, E)

Where:

  • S (Subject): a single normalized identifier anchoring the document to a regulatory or domain authority. Derived from the document’s oracle_id. Cardinality: exactly 1 per document version.
  • I (Included): an ordered set of terminology tokens that the document covers. Maximum cardinality: 24. Generated by multi-signal extraction (Algorithm 1) and human-approved.
  • R (Relevant): an ordered set of cross-domain references. Maximum cardinality: 12. Seeded from framework metadata, enriched by extraction, human-approved.
  • E (Excluded): an ordered set of boundary enforcement terms. Maximum cardinality: 8. Seeded from a global cross-framework exclusion set, filtered to remove self-referential terms, human-approved.

SIRE(d^v) is immutable once the document version enters the ACTIVE state. Updating any SIRE field requires creating a new document version d^{v+1} with its own SIRE tuple. The retrieval algorithm uses the SIRE tuple associated with the exact version whose content is stored in the index. This guarantees that any ex-post audit of a claim can reconstruct exactly which SIRE identity governed the evidence at the time, even if SIRE changed later.
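A minimal sketch of the tuple as an immutable record. The class name, field names, and cap checks are illustrative, not the actual Kura schema; the frozen dataclass mirrors the per-version immutability described above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: mirrors SIRE immutability per document version
class SireIdentity:
    subject: str                 # S: exactly one normalized identifier
    included: tuple[str, ...]    # I: terminology tokens, max 24
    relevant: tuple[str, ...]    # R: cross-domain references, max 12
    excluded: tuple[str, ...]    # E: boundary enforcement terms, max 8

    def __post_init__(self):
        # Enforce the cardinality caps from the formal definition.
        assert len(self.included) <= 24, "Included exceeds cap"
        assert len(self.relevant) <= 12, "Relevant exceeds cap"
        assert len(self.excluded) <= 8, "Excluded exceeds cap"
```

Updating any field means constructing a new SireIdentity for the new document version, never mutating the old one.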

The Four Fields

Subject

Anchors the source to a specific regulatory or domain identity. One subject per source. This is the primary key for evidence grouping.

Examples:

  • soc_2_trust_services_criteria
  • eu_ai_act
  • hipaa_privacy_security
  • nist_csf_2_0
  • iso_27001_2022

Subject is derived from the source’s oracle_id during ingestion, normalized to lowercase with underscores. It is not inferred from content — it is declared by the corpus engineer who curates the source.
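The normalization step described above fits in a few lines; the function name is illustrative.

```python
import re

def normalize_subject(oracle_id: str) -> str:
    """Lowercase, replace runs of non-alphanumeric characters with a single
    underscore, and trim leading/trailing underscores."""
    return re.sub(r"[^a-z0-9]+", "_", oracle_id.lower()).strip("_")

# e.g. normalize_subject("SOC 2 Trust Services Criteria")
#      -> "soc_2_trust_services_criteria"
```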

Included

The terminology vocabulary that this source actually covers. Enriches hybrid search (pgvector + tsvector) by surfacing domain-specific terms that cosine similarity alone might miss.

Included terms are inferred from the source body using a multi-signal extraction pipeline (Algorithm 1). The pipeline caps at 24 included terms per source. Terms are deduplicated and ordered by frequency.

Included does not enforce. It informs retrieval ranking — chunks from sources whose included terms match the query get a relevance boost.
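As an illustration of a non-enforcing boost, one possible shape is an additive bump per overlapping term. The additive form and the 0.1 weight are assumptions here, not Kura's actual ranking formula.

```python
def included_boost(query_terms: set[str], included_terms: set[str],
                   hybrid_score: float, weight: float = 0.1) -> float:
    """Add a small bump per Included term that overlaps the query.
    Included informs ranking; it never gates a chunk out."""
    return hybrid_score + weight * len(query_terms & included_terms)
```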

Relevant

Maps cross-source topology. Declares which other frameworks, standards, or domains are structurally related to this source.

Examples for a SOC 2 source:

  • ISO 27001 (overlapping control domains)
  • NIST CSF (framework crosswalk)
  • PCI DSS (shared security controls)

Relevant terms are seeded from the source’s frameworks metadata and enriched with capitalized terms and acronyms from the body. Capped at 12 terms. Terms already in Included are removed to avoid duplication.

Relevant does not enforce. It enables the Crosswalk’s relationship discovery — when the system needs to answer a question that spans multiple frameworks, Relevant tells it which sources are structurally related.

Excluded

The hard boundary. Excluded terms define what this source must never be used to answer. This is the only SIRE field that enforces at retrieval time.

At retrieval, the exclusion gate runs after semantic + lexical ranking:

  • For each candidate chunk, check if any excluded term appears in the chunk content
  • Match is case-insensitive, word-boundary aligned
  • If a match is found, the chunk is purged from the candidate set before it reaches the model

The default exclusion list ensures cross-framework boundary integrity:

  • hipaa, gdpr, pci dss, eu ai act, nist ai rmf, nist csf, iso 27001, iso 42001, iso 23894, soc 2, sox

During SIRE inference, excluded terms that match the source’s own identity (subject, title, oracle_id, or framework values) are automatically removed — a SOC 2 source does not exclude “soc 2” from itself.
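That self-referential filter amounts to a token-subset check against the source's own identity text; the names below are illustrative.

```python
import re

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def drop_self_terms(candidates: list[str], subject: str, oracle_id: str,
                    title: str, frameworks: list[str]) -> list[str]:
    """Remove excluded-term candidates whose tokens all appear in the
    source's own identity, so a source never excludes itself."""
    self_toks = tokens(" ".join([oracle_id, subject, title] + frameworks))
    return [t for t in candidates if not tokens(t) <= self_toks]
```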

Algorithm 1: SIRE Identity Inference

Input: Document frontmatter F (oracle_id, title, frameworks), document body text T, configuration parameters P (domain phrase dictionary, frequency thresholds, cardinality caps).

Output: SIRE proposal tuple (S, I_proposed, R_proposed, E_proposed) in PROPOSED state.

Steps:

  1. Subject derivation. S = normalize(F.oracle_id): lowercase, replace non-alphanumeric with underscore, collapse consecutive underscores.

  2. Multi-signal term extraction. Four independent extractors run over T:

    • Phrase extractor: Scan T against P.domain_phrases (curated dictionary). Count occurrences. Emit terms with count >= 1.
    • Acronym extractor: Find all uppercase sequences matching [A-Z]{2,8} in T. Filter noise set (THE, AND, FOR, etc.). Emit terms with count >= 2.
    • Significant word extractor: Find all lowercase words of 4+ characters in T. Filter stop words. Emit terms with count >= 4.
    • Capitalized term extractor: Find multi-word proper nouns (1–4 capitalized words) in T. Emit terms with count >= 2.
  3. Included assembly. Merge phrase terms (priority 1), acronym terms (priority 2), and significant words (priority 3, excluding tokens already covered by phrases). Deduplicate by normalized form. If the document is classified as reference_document, filter low-signal terms. Truncate to P.max_included (default 24).

  4. Relevant assembly. Seed from F.frameworks. Enrich with capitalized terms and acronyms. Remove any term already in I_proposed. Truncate to P.max_relevant (default 12).

  5. Excluded assembly. Start from P.default_sire_excludes (global cross-framework set). For each candidate term, compute self_text = normalize(F.oracle_id + S + F.title + F.frameworks). Remove any term whose normalized tokens are a subset of self_text tokens, or whose normalized form matches S. Truncate to P.max_excluded (default 8).

  6. Emit proposal. Return (S, I_proposed, R_proposed, E_proposed) with state=PROPOSED and a JSON evidence report containing top phrases, acronyms, words, and capitalized terms with their counts.

Determinism guarantee: Given identical (F, T, P), the algorithm always produces identical output. No randomness, no model inference, no external API calls.
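To make the determinism claim concrete, the acronym extractor from step 2 might look like this. The noise set shown is a partial stand-in for the real filter, and the function name is illustrative.

```python
import re
from collections import Counter

NOISE = {"THE", "AND", "FOR", "NOT", "ALL"}  # partial stand-in for the real noise set

def extract_acronyms(text: str, min_count: int = 2) -> list[str]:
    """Uppercase runs of 2-8 letters, noise-filtered, emitted when seen at
    least min_count times, ordered by descending frequency. Pure regex and
    counting: identical input always yields identical output."""
    counts = Counter(re.findall(r"\b[A-Z]{2,8}\b", text))
    return [t for t, c in counts.most_common() if c >= min_count and t not in NOISE]
```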

State transitions: PROPOSED -> APPROVED (human corpus engineer reviews and accepts/modifies) -> ACTIVE (written to source frontmatter via atomic write). The pipeline runs in dry-run mode by default. The --apply flag transitions from APPROVED to ACTIVE.

Algorithm 2: SIRE-Governed Retrieval

Input:

  • Query q
  • Candidate chunk set C from hybrid search (pgvector semantic + tsvector lexical)
  • For each chunk c in C: the SIRE identity of its source document, SIRE(d_c) = (S_c, I_c, R_c, E_c)
  • Caller identity K and ReBAC policy graph G (via OpenFGA)

Output: Ordered evidence set C’ passed to the model.

Steps:

  1. Hybrid ranking. Compute hybrid score h(c, q) for each chunk c in C using weighted combination of cosine similarity (pgvector) and lexical match score (tsvector).

  2. Subject grouping. Group chunks by subject S_c. For each subject group, compute group_score = mean(h(c, q)) across chunks in the group. Rank subjects by group_score descending.

  3. Exclusion gate. For each chunk c from document d:

    • For each term e in E_d:
      • Compute match = case_insensitive_word_boundary_match(e, c.text)
      • If match: remove c from C, log exclusion event (chunk_id, excluded_term, source_subject)
    • This step is deterministic: same (C, SIRE state) always produces same exclusions.
  4. ReBAC authorization gate. For each remaining chunk c:

    • Query G to determine if caller K is authorized to access document d_c
    • If not authorized: remove c from C, log authorization denial (chunk_id, caller_id, document_id)
    • SIRE exclusion and ReBAC authorization are independent gates evaluated in sequence. A chunk must pass both.
  5. Final ordering. Rerank surviving chunks by h(c, q) within their subject groups. Emit ordered set C’.

Determinism guarantee: Given identical (q, corpus version, SIRE state, caller identity K, ReBAC graph G), the algorithm always produces identical C’. The composition of the two gates (SIRE exclusion intersected with ReBAC authorization) is invariant regardless of underlying storage implementation.
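Steps 3 and 4 compose as two sequential, independent filters. A sketch under stated assumptions: the Chunk shape and the two lookup callables are placeholders for the real chunk storage and OpenFGA calls.

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str
    doc_id: str
    text: str
    score: float  # hybrid score h(c, q)

def apply_gates(candidates, excluded_for, is_authorized, caller):
    """excluded_for(doc_id) -> E set for the chunk's source document;
    is_authorized(caller, doc_id) -> ReBAC decision from the policy graph.
    A chunk must pass both gates; survivors are reranked by hybrid score."""
    survivors = []
    for c in candidates:
        if any(re.search(r"\b" + re.escape(e) + r"\b", c.text, re.IGNORECASE)
               for e in excluded_for(c.doc_id)):
            continue  # purged by the SIRE exclusion gate (step 3)
        if not is_authorized(caller, c.doc_id):
            continue  # purged by the ReBAC gate (step 4)
        survivors.append(c)
    return sorted(survivors, key=lambda c: c.score, reverse=True)
```

Note that reordering the two gates cannot change the surviving set, only the order in which purge events are logged.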

Two-Layer Policy Composition

SIRE and ReBAC operate as independent, composable gates:

  • SIRE defines regulatory and semantic eligibility. It answers: “Is this evidence authorized to be used for this type of question?” This is a property of the evidence itself, independent of who is asking.
  • ReBAC defines subject-object authorization. It answers: “Is this caller authorized to access this specific evidence?” This is a property of the relationship between the caller and the evidence, independent of the question.

The system evaluates both gates and takes the intersection of allowed chunks. A chunk that passes SIRE exclusion but fails ReBAC authorization is dropped. A chunk that passes ReBAC authorization but fails SIRE exclusion is also dropped. Neither gate can override the other.

This composition means:

  • Adding a new regulatory framework to Kura (new SIRE subjects) does not require updating ReBAC policies
  • Granting a new user access (new ReBAC relationships) does not weaken SIRE exclusion boundaries
  • The two systems can be managed, audited, and tested independently

Claim Ledger Integration and Audit Guarantees

For each generated answer, the system records:

  • The query q
  • The exact SIRE tuples for all source documents that contributed chunks to the final evidence set C’
  • The IDs of all candidate chunks before exclusion gating (set C), after exclusion gating, and after ReBAC gating (set C’)
  • The exclusion gate decisions: which chunks were purged and which excluded term triggered each purge
  • The ReBAC gate decisions: which chunks were purged and which authorization check failed
  • The final ordered evidence set actually passed to the model

Evidence-bound audit invariant: For any claim in a response, there exists a recorded evidence set such that every supporting chunk’s document identity satisfies both the SIRE exclusion rules and the caller’s ReBAC authorization at the time of the query. This invariant is verifiable: a verification function can re-execute the SIRE exclusion gate and ReBAC authorization check against the logged state to confirm that the recorded evidence set is consistent with the policies that were active at query time.

Reproducibility guarantee: Given the same (q, corpus version with SIRE state, caller identity K, ReBAC graph G), the retrieval algorithm produces the same C’. The Claim Ledger stores sufficient state to reproduce the retrieval decision for any historical query.
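A verification function along these lines replays both gates against the logged state. The argument shapes are illustrative stand-ins for the Claim Ledger's stored records, not the actual schema.

```python
import re

def verify_evidence_set(logged_candidates, logged_excluded, logged_acl,
                        caller, recorded_final_ids):
    """logged_candidates: [(chunk_id, doc_id, text)] - set C before gating;
    logged_excluded: doc_id -> excluded terms at query time;
    logged_acl: set of (caller, doc_id) pairs authorized at query time.
    Returns the binary pass/fail described above."""
    recomputed = set()
    for chunk_id, doc_id, text in logged_candidates:
        if any(re.search(r"\b" + re.escape(e) + r"\b", text, re.IGNORECASE)
               for e in logged_excluded[doc_id]):
            continue  # SIRE exclusion replayed
        if (caller, doc_id) not in logged_acl:
            continue  # ReBAC authorization replayed
        recomputed.add(chunk_id)
    return recomputed == set(recorded_final_ids)
```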

SIRE Boundary Compliance Scoring

For each query/answer pair, the system computes a SIRE boundary compliance score:

  • 1.0 (fully compliant): Every supporting chunk’s subject is in the expected subject set for the query domain, no excluded terms were present in any surviving chunk, and all chunks passed ReBAC authorization.
  • Degraded (0.5–0.99): Evidence comes from multiple subjects where cross-subject relevance is established only through the Relevant graph (not direct subject match). The answer is supported but draws from adjacent domains.
  • Non-compliant (< 0.5): Evidence gaps — the query domain has insufficient coverage in Kura, requiring the system to surface chunks from weakly related subjects or to refuse the query entirely.

This score is used:

  • To determine output state (COMPLIANT, DEGRADED_BOUNDARY, or REFUSED) before the answer reaches the caller
  • To label answers in audit logs for regulatory review
  • To trigger DEGRADED_BOUNDARY annotations in the Claim Ledger when cross-subject evidence is used
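The band-to-state mapping can be sketched as a simple threshold function. The cutoffs follow the bands above; the exact policy thresholds are an assumption.

```python
def output_state(compliance_score: float) -> str:
    """Map a SIRE boundary compliance score to an output state
    before the answer reaches the caller."""
    if compliance_score >= 1.0:
        return "COMPLIANT"
    if compliance_score >= 0.5:
        return "DEGRADED_BOUNDARY"
    return "REFUSED"
```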

Determinism Properties

SIRE is a deterministic governance mechanism layered on top of stochastic models. Specifically:

Inference determinism. Given the same document text T, frontmatter F, and configuration parameters P (domain phrase dictionary, frequency thresholds, cardinality caps, default exclusion set), the SIRE proposal pipeline always produces the same proposal. No randomness. No model inference. No external API calls. The extraction is purely algorithmic — regex pattern matching, frequency counting, set operations.

Retrieval determinism. Given the same approved SIRE state, query q, corpus version, caller identity K, and ReBAC graph G, the retrieval algorithm always produces the same evidence set C’. The exclusion gate is a deterministic function of (chunk text, excluded terms). The ReBAC gate is a deterministic function of (caller identity, document identity, policy graph).

Audit determinism. Given the logged state at query time (SIRE tuples, ReBAC graph snapshot, corpus version), the retrieval decision can be exactly reproduced. The verification function produces a binary pass/fail — either the logged evidence set matches the recomputed evidence set, or it does not.

This three-layer determinism (inference, retrieval, audit) means that SIRE provides governance guarantees that do not depend on the behavior of the underlying language model. The model is treated as an untrusted synthesizer operating within a deterministic evidence boundary.

Invariants

  1. Every source in Kura must have all four SIRE fields before it enters the evidence boundary.
  2. Excluded is the only field that enforces. Subject, Included, and Relevant inform but never gate.
  3. SIRE proposals are generated by automated extraction but must be human-approved before application. State transitions: PROPOSED -> APPROVED -> ACTIVE.
  4. A source’s excluded list must never contain terms that match its own identity.
  5. The exclusion gate runs at retrieval time, not at ingestion time — changes to excluded terms take effect immediately without re-indexing.
  6. SIRE fields are immutable per document version. Updating SIRE requires a new version d^{v+1} of the source document.
  7. The retrieval algorithm uses the SIRE tuple associated with the exact version whose content is stored in the index. Chunks from different versions of the same document are never mixed.
  8. SIRE exclusion and ReBAC authorization are independent gates. The system takes the intersection. Neither can override the other.
  9. The Claim Ledger records sufficient state to reproduce any historical retrieval decision.
  10. The SIRE boundary compliance score is computed for every query/answer pair and recorded in the audit log.