Founding RFC
Air-Gapped Ephemeral Ingestion Pipeline
The architecture design for the data ingestion phase preceding the bounded-synthesis pipeline — converting unstructured documents into mathematically stable, ephemeral vector boundaries inside Kura.
Status: Founding RFC — Approved
Context: Architecture design for the data ingestion phase preceding the Kenshiki bounded-synthesis pipeline.
Objective: Establish a high-throughput, air-gapped data stream to convert unstructured documents into mathematically stable, SIRE-tagged, ephemeral vector boundaries inside Kura.
1. Abstract
The Kenshiki bounded-synthesis pipeline requires a mathematically bounded evidence corpus to act as the absolute ground truth. Standard ingestion relies on slow Python scripts, external APIs, and persistent indexes — creating leaky abstractions and I/O bottlenecks. This RFC defines a highly concurrent, local-only ingestion DAG (Directed Acyclic Graph) utilizing GPU-accelerated parsing (Docling), streaming embeddings, and bulk pgvector operations. It establishes strict hardware partitioning and fault-tolerance mechanisms to materialize SIRE-tagged geometric boundaries without starving the downstream inference engine.
2. Problem Statement
- Leaky Abstractions: Parsers like Docling frequently default to external vision-language models to understand complex layouts, violating air-gapped security.
- Database Locking and Index Bloat: Individual SQL INSERT statements will lock pgvector tables. Building persistent vector indexes (HNSW) on ephemeral data wastes compute.
- The Poison Pill: Standard pipelines crash completely if a single corrupt PDF cannot be parsed, destroying the entire intelligence run.
- Memory Contention: Running embedding models and LLM inference on the same unpartitioned GPU guarantees Out-Of-Memory errors and cache evictions.
3. Formal Pipeline Definition
The ingestion pipeline is defined as an ordered sequence of five stages with typed inputs and outputs:

Ingest = Lock ∘ Load ∘ Embed ∘ Transform ∘ Extract : (D, P) → (K, B, M)

Where:
- D: set of raw source documents (PDF, DOCX, JSON, Markdown, YAML, CSV)
- P: pipeline configuration (chunk size, overlap, embedding model, SIRE config)
- K: set of SIRE-tagged, embedded chunks in Kura
- B: geometric boundary (centroid μ, shrunk covariance matrix Σ)
- M: pipeline metadata (DLQ quarantine list, telemetry, DEGRADED_BOUNDARY flag)
Determinism guarantee: Given identical (D, P, embedding model version), the pipeline produces identical (K, B). The chunking, embedding, SIRE tagging, and boundary calculation are deterministic. The only non-determinism is in GPU floating-point rounding for embeddings, which is bounded to machine epsilon.
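The stage contract above can be sketched as typed signatures. All names here (RunConfig, Boundary, RunMetadata, ingest) are illustrative, not the real Kenshiki API:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class RunConfig:
    """P: pipeline configuration (values are placeholders)."""
    chunk_size: int = 512
    overlap: int = 50
    embedding_model: str = "text-embedding-3-large@512"

@dataclass
class Boundary:
    """B: geometric boundary (centroid mu, shrunk covariance Sigma)."""
    centroid: list
    covariance: list

@dataclass
class RunMetadata:
    """M: DLQ quarantine list, telemetry, DEGRADED_BOUNDARY flag."""
    quarantined: list = field(default_factory=list)
    degraded_boundary: bool = False

def ingest(documents: list, config: RunConfig):
    """Ingest(D, P) -> (K, B, M): Extract -> Transform -> Embed -> Load -> Lock."""
    ...
```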
4. Algorithm: The Ingestion DAG
Stage 1: Extract (Air-Gapped Parser)
Input: Raw source documents D.
Output: Structured Markdown M_d for each document d in D, or a quarantine event.
Execution: GPU-accelerated layout analysis (DocLayNet), table structure extraction (TableFormer), and OCR (EasyOCR) via Docling. All network calls to external APIs are blocked at the infrastructure level — the parser operates in a complete air-gap.
Fault tolerance: If parsing fails after 3 retries, the document is quarantined in the Dead Letter Queue (DLQ). A FATAL_PARSE trace is emitted. The pipeline continues processing the remaining documents.
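The RFC places air-gap enforcement at the infrastructure level; as a defense-in-depth sketch, a process-level guard can also refuse socket creation while the parser runs. The `air_gap` helper below is hypothetical, not part of Docling:

```python
import socket
from contextlib import contextmanager

@contextmanager
def air_gap():
    """Fail fast if any code path tries to open a network socket.

    Process-level guard only: the enforcement described in the RFC happens
    at the infrastructure level. This is belt-and-suspenders, not a substitute.
    """
    real_socket = socket.socket

    def blocked(*args, **kwargs):
        raise RuntimeError("air-gap violation: network access attempted during parsing")

    socket.socket = blocked
    try:
        yield
    finally:
        socket.socket = real_socket  # restore the real socket class
```

Usage: wrap the Stage 1 call, e.g. `with air_gap(): parse(document)`.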
Stage 2: Transform (Deterministic Chunking)
Input: Structured Markdown M_d.
Output: Set of chunks C_d with metadata (source hash, section boundaries, token coordinates).
Algorithm: Section-aware chunking on heading boundaries. Chunks are split at heading transitions, and undersized chunks (below a minimum token threshold) are merged. An overlap of 50 tokens between adjacent chunks preserves cross-boundary context.
Enrichment (deterministic, applied before embedding):
- Clause ID extraction for regulatory citations (e.g., “DFARS 252.204-7012” as a single entity)
- Normative language detection (SHALL/MUST/REQUIRED flags)
- Cross-reference resolution between sections
- Quality gate rejects OCR garbage, TOC entries, and low-density chunks
- SIRE identity tags stamped on every chunk during enrichment (see SIRE Identity System specification)
Determinism guarantee: Given identical (M_d, P), chunking produces identical C_d. The chunking algorithm is purely positional — heading detection, token counting, and boundary merging are deterministic string operations.
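A minimal sketch of the deterministic, section-aware chunking described above, using whitespace tokenization as a stand-in for the real tokenizer. The real algorithm also splits oversized sections, which is omitted here:

```python
def chunk_markdown(md: str, min_tokens: int = 32, overlap: int = 50):
    """Section-aware chunking: split on heading boundaries, merge undersized
    sections into their predecessor, and prepend a fixed token overlap from
    the previous chunk. Purely positional string operations: deterministic."""
    # Split on heading transitions (lines starting with '#').
    sections, current = [], []
    for line in md.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    # Merge undersized sections into the preceding chunk (deterministic order).
    merged = []
    for sec in sections:
        if merged and len(sec.split()) < min_tokens:
            merged[-1] += "\n" + sec
        else:
            merged.append(sec)

    # Carry the last `overlap` tokens of the previous chunk into each chunk.
    chunks, prev_tokens = [], []
    for sec in merged:
        tokens = sec.split()
        chunks.append(" ".join(prev_tokens[-overlap:] + tokens))
        prev_tokens = tokens
    return chunks
```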
Stage 3: Embed (Streaming)
Input: Set of chunks C_d.
Output: Set of embedded chunks E_d, each carrying a dense vector representation.
Execution: Concurrent streaming to a local embedding server (text-embedding-3-large, 512-dimension Matryoshka). Dynamic batching at the server level maximizes GPU throughput.
Provenance: Each embedded chunk carries a SHA-256 source hash for idempotent upsert and version-aware change detection. HMAC-SHA-256 watermarks per chunk enable self-contained verification without database access.
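The provenance stamping can be sketched with Python's standard hashlib and hmac modules; field names here are illustrative:

```python
import hashlib
import hmac

def stamp_chunk(chunk_text: str, source_bytes: bytes, watermark_key: bytes) -> dict:
    """Attach provenance: SHA-256 of the source document for idempotent upsert,
    plus an HMAC-SHA-256 watermark over the chunk text so any holder of the
    key can verify the chunk without touching the database."""
    return {
        "text": chunk_text,
        "source_sha256": hashlib.sha256(source_bytes).hexdigest(),
        "watermark": hmac.new(watermark_key, chunk_text.encode(), hashlib.sha256).hexdigest(),
    }

def verify_chunk(chunk: dict, watermark_key: bytes) -> bool:
    """Self-contained verification: recompute the watermark and compare in
    constant time. No database access required."""
    expected = hmac.new(watermark_key, chunk["text"].encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, chunk["watermark"])
```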
Stage 4: Load (Bulk Sink)
Input: Embedded chunks E_d.
Output: Chunks written to a run-specific table in PostgreSQL with pgvector.
Execution: Vectors are batched in worker memory and written using the COPY command. No HNSW or IVFFlat index is built — the ephemeral table is queried with exact k-nearest-neighbor (KNN) sequential scans. For corpora under 10,000 chunks, a sequential scan is faster than index construction and guarantees exact recall.
Tenant isolation: Every row carries tenant provenance enforced by database CHECK constraints and row-level security (RLS).
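A sketch of the bulk-sink serialization: rows are rendered into a COPY-ready buffer (pgvector accepts the `[x,y,...]` vector literal). The actual COPY call is hedged in a comment, since it assumes a live psycopg2 connection and a run-specific table name; escaping is simplified for the sketch:

```python
import io

def to_copy_buffer(rows):
    """Serialize (tenant_id, text, embedding) rows into a tab-separated
    buffer suitable for COPY ... FROM STDIN. Tabs and newlines inside the
    text are flattened to spaces (a real sink would escape them properly)."""
    buf = io.StringIO()
    for tenant_id, text, embedding in rows:
        vec = "[" + ",".join(f"{x:.6f}" for x in embedding) + "]"
        clean = text.replace("\t", " ").replace("\n", " ")
        buf.write(f"{tenant_id}\t{clean}\t{vec}\n")
    buf.seek(0)
    return buf

# Hedged write path (assumes a psycopg2 connection `conn` and an ephemeral
# run-specific table named `run_chunks`):
# with conn.cursor() as cur:
#     cur.copy_expert(
#         "COPY run_chunks (tenant_id, text, embedding) FROM STDIN",
#         to_copy_buffer(rows),
#     )
```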
Stage 5: Lock (Boundary Calculation)
Input: All embedded chunks in the run-specific table.
Output: Geometric boundary B = (centroid μ, Ledoit-Wolf shrunk covariance matrix Σ).
Execution: A single procedural sweep computes the payload centroid μ and the shrunk, invertible covariance matrix Σ. The Ledoit-Wolf estimator handles the anisotropic, non-Gaussian clustering typical of embedding spaces and remains stable at small payload sizes.
Purpose: The boundary establishes an ellipsoidal control limit using the χ² distribution. This acts as a geometric plausibility check for the downstream Claim Ledger — a sanity boundary, not semantic proof. Mahalanobis distance against (μ, Σ) is the gating metric.
Determinism guarantee: Given identical embedded chunks, the boundary calculation produces identical (μ, Σ). Ledoit-Wolf shrinkage is a deterministic closed-form estimator.
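A numpy-only sketch of the Lock stage. The shrinkage coefficient here is a fixed placeholder: the pipeline's Ledoit-Wolf estimator computes it in closed form (for example via sklearn.covariance.LedoitWolf), and the χ² threshold in the gate is caller-supplied:

```python
import numpy as np

def lock_boundary(X: np.ndarray, shrinkage: float = 0.1):
    """Compute the boundary (centroid, shrunk covariance) over the payload.

    `shrinkage` is a placeholder; Ledoit-Wolf derives this coefficient
    deterministically from the data, guaranteeing an invertible estimate."""
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False)                         # sample covariance
    target = np.trace(S) / S.shape[0] * np.eye(S.shape[0])
    sigma = (1 - shrinkage) * S + shrinkage * target    # shrink toward scaled identity
    return mu, sigma

def mahalanobis_gate(x, mu, sigma, threshold: float):
    """Gate a candidate vector: inside the ellipsoidal control limit iff its
    squared Mahalanobis distance is at or below the chosen chi-squared cutoff."""
    d = x - mu
    d2 = float(d @ np.linalg.solve(sigma, d))
    return d2 <= threshold, d2
```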
5. The No-Index Rule
Engineering instinct dictates building HNSW or IVFFlat indexes on vector tables. For the Kenshiki ingestion pipeline, this is strictly forbidden:
- The Kura evidence boundary for a given run rarely exceeds 10,000 chunks. Building an HNSW graph on 10,000 vectors takes longer than scanning them.
- Ephemeral tables remain unindexed. Retrieval uses cosine distance for candidate ranking via exact KNN sequential scans. This eliminates indexing overhead and guarantees exact recall.
- The actual boundary gate downstream uses Mahalanobis distance against (μ, Σ) — cosine is a retrieval mechanism, not the gating decision.
- The table is automatically truncated or dropped upon artifact delivery.
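The retrieval path can be illustrated with pgvector's cosine-distance operator `<=>`: with no index on the ephemeral table, the planner falls back to a sequential scan, which is exact by construction. Table and column names below are illustrative:

```python
def knn_query(table: str, k: int = 12) -> str:
    """Build an exact-KNN retrieval query over an unindexed ephemeral table.

    `<=>` is pgvector's cosine-distance operator; without an HNSW/IVFFlat
    index, ORDER BY on it forces a full sequential scan with exact recall.
    The query vector is bound at execution time as the %(q)s parameter."""
    return (
        f"SELECT chunk_id, text, embedding <=> %(q)s AS cosine_distance "
        f"FROM {table} "
        f"ORDER BY embedding <=> %(q)s "
        f"LIMIT {k}"
    )
```

At execution time the query vector would be passed as `%(q)s` through a psycopg parameter dict.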
6. Fault Tolerance (The Poison Pill DLQ)
If a user submits 40 documents and Document 12 is a corrupted scan, the pipeline must not panic and destroy the run, nor silently ignore the failure.
The state machine:
- Parsing layer fails after 3 retries.
- Worker quarantines the file and emits FATAL_PARSE trace.
- DAG continues processing remaining 39 documents.
- Boundary calculation executes on the 39 successful documents.
- Final Claim Ledger is flagged with DEGRADED_BOUNDARY, listing the exact file name excluded from the evidence boundary.
DEGRADED_BOUNDARY invariant: When one or more documents are quarantined, the output states (AUTHORIZED, PARTIAL, REQUIRES_SPEC, NARRATIVE_ONLY, BLOCKED) still apply normally. The DEGRADED_BOUNDARY annotation does not change the output state — it adds provenance metadata so reviewers know the evidence scope was narrower than intended.
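The state machine above can be sketched as a retry-and-quarantine loop; function and field names are illustrative:

```python
def run_dag(documents, parse, max_retries: int = 3):
    """Poison-pill handling: retry each document up to `max_retries` times,
    quarantine hard failures in the DLQ, keep processing the rest, and flag
    the run DEGRADED_BOUNDARY instead of halting. `parse` is the
    caller-supplied Stage 1 parser."""
    parsed, dlq = [], []
    for doc in documents:
        for attempt in range(1, max_retries + 1):
            try:
                parsed.append(parse(doc))
                break
            except Exception as exc:
                if attempt == max_retries:
                    # Emit FATAL_PARSE trace here; the pipeline does not halt.
                    dlq.append({"doc": doc, "reason": str(exc), "retries": attempt})
    degraded_boundary = bool(dlq)
    return parsed, dlq, degraded_boundary
```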
7. Compute Isolation
The ingestion pipeline must be physically or logically isolated from the inference engine. TEI/embedding on the same unpartitioned GPU as vLLM fragments memory and cripples continuous batching.
- Hardware split (production): Embedding and parsing run on a dedicated GPU instance (e.g., g6.xlarge with L4), keeping the inference engine isolated on its own GPU hardware.
- Logical split (development): NVIDIA MIG or strict CUDA memory fractions fence VRAM for embedding, leaving the rest untouched for vLLM.
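Both splits can be expressed as short configuration sketches. The MIG profile name and the memory fraction below are illustrative and depend on the GPU; `--gpu-memory-utilization` is vLLM's standard flag for capping VRAM:

```shell
# Option A: NVIDIA MIG (development): carve a dedicated slice for embedding.
sudo nvidia-smi -i 0 -mig 1                 # enable MIG mode on GPU 0
sudo nvidia-smi mig -i 0 -cgi 1g.10gb -C    # create a small instance (profile varies by GPU)

# Option B: fractional VRAM: cap vLLM so embedding headroom survives.
vllm serve <model> --gpu-memory-utilization 0.75
```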
8. Audit Guarantees
For every ingestion run, the pipeline records:
- Source document hashes (SHA-256) and file metadata
- DLQ quarantine events with failure reasons and retry counts
- Chunk boundaries with token coordinates and section provenance
- SIRE identity tuples applied to each chunk
- Embedding model version and configuration
- Geometric boundary parameters with Ledoit-Wolf shrinkage coefficient
- Total chunk count, embedding throughput, and pipeline latency
Reproducibility invariant: Given identical (D, P, embedding model version), the pipeline produces identical chunks with identical SIRE tags and identical geometric boundaries. The audit log stores sufficient state to verify that any historical ingestion run produced the expected output.
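The reproducibility invariant suggests a canonical run fingerprint over (D, P, embedding model version); the sketch below is illustrative, not the pipeline's actual audit schema:

```python
import hashlib
import json

def run_fingerprint(doc_hashes, config: dict, model_version: str) -> str:
    """Deterministic fingerprint of (D, P, embedding model version).

    By the reproducibility invariant, two ingestion runs with the same
    fingerprint must yield identical chunks, SIRE tags, and boundary
    parameters. Field names are illustrative."""
    payload = {
        "documents": sorted(doc_hashes),   # order-independent over D
        "config": config,
        "model_version": model_version,
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()
```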
9. Telemetry Primitives
Structured telemetry traces:
- Worker extraction latency (time-to-Markdown per document)
- Embedding throughput (tokens embedded per second)
- Bulk load size (chunks written per COPY batch)
- DLQ quarantine events (the Poison Pill)
- Boundary calculation time and shrinkage coefficient
- SIRE tagging completeness (percentage of chunks with all four SIRE fields)
10. Invariants
- Every chunk in Kura must carry all four SIRE identity fields before entering the evidence boundary.
- Every chunk carries SHA-256 source hash and HMAC-SHA-256 watermark for self-contained verification.
- Tenant provenance is enforced by database CHECK constraints on every row — not application logic.
- The ingestion pipeline operates in complete air-gap. No network calls to external APIs during parsing.
- Ephemeral tables are never indexed. Retrieval uses exact KNN sequential scans.
- The geometric boundary is computed once per ingestion run and is immutable for that run.
- A quarantined document triggers DEGRADED_BOUNDARY on the Claim Ledger but does not halt the pipeline.
- Chunking is deterministic: identical input documents with identical pipeline configuration always produce identical chunks.
- The embedding model version is recorded with every chunk. Model version changes require re-ingestion.
- The pipeline is idempotent: re-ingesting the same document set produces the same chunks (modulo GPU floating-point rounding at machine epsilon).
© 2026 Kenshiki Labs · kenshikilabs.com · All rights reserved.
This document may be shared for evaluation purposes. Redistribution requires written permission.
https://kenshikilabs.com/articles/ingestion-pipeline
Further reading
- The HAIC Framework (Founding RFC): the Tri-Pass architecture that consumes the evidence boundaries this pipeline produces.
- The SIRE Identity System (Founding RFC): the deterministic tagging methodology applied during ingestion to define what each source is and what it must never answer.
- Kura Index (Tool): the evidence store that holds the indexed, embedded, provenance-stamped output of this pipeline.
- Platform Architecture (Current Architecture): how the ingestion pipeline fits into the full Kenshiki bounded-synthesis system.