Oracle Pipeline - Overview
An oracle is a ground truth source that the AI system must respect at runtime. Platform oracles encode regulatory frameworks, industry standards, and governance boundaries as structured markdown documents. They are the authoritative reference for Goober's grounded responses and the GATE's enforcement decisions.
Oracles are not training data, prompt templates, or cached API responses. They are versioned, hashed, embedded, and retrievable — with full provenance from source document to individual claim.
There are three oracle content types: prose (regulatory/standards content as paragraphs, e.g., HIPAA, GDPR, SOX, NIST AI RMF — used for grounding), boundary (topics requiring professional redirect, e.g., legal advice, medical practice — used for redirect), and structured (row-level lookup data, reserved and not yet built, e.g., FDA nutrition, ICD codes — used for field match).
There are three tiers: tier_1 (regulatory or legal mandate with enforcement risk — HIPAA, GDPR, EU AI Act, SOX), tier_2 (industry standard that may be contractually required — NIST AI RMF, ISO 42001), and tier_3 (best practice or advisory, not mandated — gating ontology, boundaries).
Oracle Pipeline - Authoring
Oracle source documents live in packages/content/oracles/. One file per oracle. Filenames should match the oracle_id, but the frontmatter oracle_id field is canonical.
Every oracle markdown file has YAML frontmatter validated by the loader. Required fields: oracle_id (unique, lowercase, [a-z][a-z0-9_-]*), title (display title), tier (tier_1, tier_2, or tier_3), frameworks (non-empty array of framework identifiers), industries (wizard taxonomy slugs or ["*"]), segments (wizard taxonomy slugs or ["*"]), source_url (primary source URL), source_publisher (publisher organization), last_verified (ISO date YYYY-MM-DD), version (monotonic; bump triggers re-embedding). Optional: content_type (defaults to "prose").
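A minimal frontmatter block satisfying these rules might look like the following (all values are illustrative, not copied from a real oracle file):

```yaml
---
oracle_id: hipaa
title: "HIPAA - Health Insurance Portability and Accountability Act"
tier: tier_1
content_type: prose
frameworks:
  - hipaa
industries:
  - healthcare
segments:
  - "*"
source_url: https://www.hhs.gov/hipaa/index.html
source_publisher: "U.S. Department of Health and Human Services"
last_verified: 2026-02-01
version: 1
---
```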
Validation rules are enforced in packages/content/src/oracles.server.ts:validateFrontmatter(). Gray-matter parses YAML; the validator coerces last_verified from Date objects and version from numbers.
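The validator's behavior can be sketched as follows. This is a simplified illustration of the rules listed above, not the actual implementation in oracles.server.ts; the coercion steps mirror the documented handling of gray-matter's Date and number parsing.

```typescript
// Sketch of frontmatter validation -- an illustrative approximation,
// not the real validateFrontmatter().
const ORACLE_ID_RE = /^[a-z][a-z0-9_-]*$/;
const ISO_DATE_RE = /^\d{4}-\d{2}-\d{2}$/;
const TIERS = ["tier_1", "tier_2", "tier_3"];

function validateFrontmatter(raw: Record<string, unknown>): string[] {
  const errors: string[] = [];
  // Coerce values gray-matter may have parsed as richer types.
  if (raw.last_verified instanceof Date) {
    raw.last_verified = raw.last_verified.toISOString().slice(0, 10);
  }
  if (typeof raw.version === "number") raw.version = String(raw.version);

  if (typeof raw.oracle_id !== "string" || !ORACLE_ID_RE.test(raw.oracle_id)) {
    errors.push("oracle_id must match [a-z][a-z0-9_-]*");
  }
  if (typeof raw.title !== "string" || raw.title.length === 0) {
    errors.push("title is required");
  }
  if (!TIERS.includes(raw.tier as string)) {
    errors.push("tier must be tier_1, tier_2, or tier_3");
  }
  if (!Array.isArray(raw.frameworks) || raw.frameworks.length === 0) {
    errors.push("frameworks must be a non-empty array");
  }
  if (typeof raw.last_verified !== "string" || !ISO_DATE_RE.test(raw.last_verified)) {
    errors.push("last_verified must be an ISO date (YYYY-MM-DD)");
  }
  return errors; // empty array means the frontmatter is valid
}
```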
Each ## heading becomes one chunk. Sub-headings (### and below) stay within the parent chunk. Prose oracles follow required sections per spec-oracle-authoring-schema.md §2: Overview, What It Is, Who It Applies To, What It Requires (repeatable), Governance Implications, Enforcement Penalties, Intersection With Other Frameworks, Recent Updates. Boundary oracles use a different internal structure per spec-oracle-authoring-schema.md §6: each ## section is one boundary topic with trigger, frameworks, rationale, and scripted redirect.
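The ##-level chunking rule can be sketched as a pure function (a simplified stand-in for chunkOracle(), assuming only the behavior described above):

```typescript
import { createHash } from "node:crypto";

interface ChunkSketch {
  sequence: number;
  section_title: string;
  content: string;
  content_hash: string;
  token_count: number;
}

// Split a markdown body on ## headings; ### and deeper headings stay
// inside their parent chunk. Anything before the first ## is dropped.
function chunkBody(body: string): ChunkSketch[] {
  // "^## " does not match "### ..." because the third character there is "#".
  const parts = body.split(/^## /m).slice(1);
  return parts.map((part, sequence) => {
    const newline = part.indexOf("\n");
    const section_title = (newline === -1 ? part : part.slice(0, newline)).trim();
    const content = "## " + part.trim();
    return {
      sequence,
      section_title,
      content,
      content_hash: createHash("sha256").update(content).digest("hex"),
      // Rough token estimate: ~0.75 words per token.
      token_count: Math.ceil(content.split(/\s+/).length / 0.75),
    };
  });
}
```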
Oracle Pipeline - TypeScript Layer
Types are defined in packages/content/src/types.ts. The OracleTier type is "tier_1" | "tier_2" | "tier_3". The OracleContentType is "prose" | "boundary" | "structured". The OracleFrontmatter interface includes all frontmatter fields. The Oracle interface extends OracleFrontmatter with content (markdown body without frontmatter) and filePath (absolute path to source file). The OracleChunkRaw interface includes sequence (0-based order), section_title (the ## heading text), heading_level (always 2), content (full text under heading), content_hash (SHA-256), token_count (estimated tokens via words/0.75), and heading_path (currently single-element).
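Restated as code, the shapes described above look roughly like this (field lists taken from the description; treat it as a paraphrase of types.ts, not a verbatim copy):

```typescript
type OracleTier = "tier_1" | "tier_2" | "tier_3";
type OracleContentType = "prose" | "boundary" | "structured";

interface OracleFrontmatter {
  oracle_id: string;
  title: string;
  tier: OracleTier;
  frameworks: string[];
  industries: string[];   // wizard taxonomy slugs or ["*"]
  segments: string[];     // wizard taxonomy slugs or ["*"]
  source_url: string;
  source_publisher: string;
  last_verified: string;  // ISO date YYYY-MM-DD
  version: string;        // coerced from number by the loader
  content_type?: OracleContentType; // defaults to "prose"
}

interface Oracle extends OracleFrontmatter {
  content: string;   // markdown body without frontmatter
  filePath: string;  // absolute path to the source file
}

interface OracleChunkRaw {
  sequence: number;        // 0-based order within the document
  section_title: string;   // the ## heading text
  heading_level: number;   // always 2 today
  content: string;         // full text under the heading
  content_hash: string;    // SHA-256 of content
  token_count: number;     // estimated via words / 0.75
  heading_path: string[];  // currently single-element
}
```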
The TS types are stricter than the DB schema: source_url, source_publisher, and last_verified are required strings in TS but nullable in the SQL columns. The loader enforces non-null; the DB allows nullability for future flexibility.
The loader module packages/content/src/oracles.server.ts is Node.js server-only. Exports: getOracles() (load all .md files, validate frontmatter), getOracle(id) (single oracle by oracle_id), getOracleIds() (all oracle IDs), chunkOracle(o) (split on ## headings, hash, estimate tokens), hashOracleContent(o) (SHA-256 of full body for change detection). Path resolution via resolveContentDir("oracles") handles monorepo, Next.js, and standalone script contexts.
The public import path is @ontic/content/oracles, exported via packages/content/package.json exports map. The barrel (src/oracles.ts) re-exports all functions from oracles.server.ts and all types from types.ts.
Oracle Pipeline - Database Schema
Three tables in the public schema, created by 20260217000000_oracle_content_store.sql.
The oracle_documents table stores one row per ingested oracle version. Key columns: id (UUID PK), oracle_id (TEXT, matches frontmatter), version (TEXT, matches frontmatter), title, tier (oracle_tier enum), content_type (oracle_content_type enum), frameworks/industries/segments (TEXT arrays, denormalized for filter pushdown), source_url, source_publisher, last_verified, content_hash (SHA-256 of full markdown body), chunk_count, is_active (BOOLEAN, only one active version per oracle_id), ingested_at, ingested_by. Constraints: UNIQUE(oracle_id, version) for idempotent upsert key; partial unique index on oracle_id WHERE is_active = true to enforce one active version.
The oracle_chunks table stores one row per ## section. Key columns: id (UUID PK), document_id (FK → oracle_documents, CASCADE delete), oracle_id (denormalized), sequence, section_title, content, content_hash (SHA-256 for RFC-0007 evidence binding), token_count, tier/content_type/frameworks/industries/segments (all denormalized from document), embedding (vector(1536) for pgvector), embedding_status (chunk_embedding_status enum), embedded_at, content_tsv (TSVECTOR auto-generated from content). Indexes: HNSW on embedding (m=24, ef_construction=200, cosine distance), GIN on content_tsv for full-text search, GIN on frameworks/industries/segments for array overlap filters, B-tree on oracle_id/tier/content_type/embedding_status, composite (document_id, sequence) unique.
The oracle_indexes table tracks provenance per RFC-0007: one row per embedding run, with oracle_id, oracle_version, document_id, embedding_model, chunk_count, index_hash (integrity hash of all chunks), freshness_policy_days (default 90).
Custom types: oracle_tier enum (tier_1, tier_2, tier_3), oracle_content_type enum (prose, boundary, structured), chunk_embedding_status enum (pending, processing, complete, failed, stale).
Views: oracle_documents_summary (document rollup with embedded_chunks, pending_chunks, failed_chunks, latest_embedding_model) and oracle_chunks_detail (the 30 most recent chunks with metadata but not content, for admin dashboards).
Oracle Pipeline - Ingestion
The ingestion script is scripts/oracles/ingest.ts, run via npx tsx or bun run oracle:ingest. It supports --dry-run (validate only, no writes) and --only <id> (single oracle). Environment: reads SUPABASE_URL (or NEXT_PUBLIC_SUPABASE_URL) and SUPABASE_SERVICE_ROLE_KEY from apps/website/.env.local.
The ingestion flow: (1) getOracles() loads all .md files and validates frontmatter, (2) chunkOracle() splits each on ## headings, (3) hashOracleContent() computes SHA-256 of full body, (4) upsert_oracle_document() RPC call per oracle (idempotent — returns unchanged if same id+version+hash, updated if same id+version but new hash with cascade delete of old chunks, inserted if new id or version with deactivation of previous versions), (5) batch insert chunks 50 at a time with denormalized metadata (tier, content_type, frameworks, industries, segments copied from document), (6) report inserted/updated/unchanged/failed counts.
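The upsert decision and chunk batching can be sketched as pure functions (illustrative only; the real logic lives in the upsert_oracle_document() RPC and ingest.ts):

```typescript
type UpsertOutcome = "unchanged" | "updated" | "inserted";

interface DocKey {
  oracle_id: string;
  version: string;
  content_hash: string;
}

// Decide what the idempotent upsert should do for an incoming document,
// given the currently active row (if any). Mirrors the RPC's three outcomes.
function upsertOutcome(incoming: DocKey, existing: DocKey | null): UpsertOutcome {
  if (!existing || existing.oracle_id !== incoming.oracle_id) return "inserted";
  if (existing.version !== incoming.version) return "inserted"; // new version deactivates the old one
  if (existing.content_hash === incoming.content_hash) return "unchanged";
  return "updated"; // same id+version, new hash: old chunks cascade-deleted
}

// Chunks are written in batches of 50.
function batches<T>(items: T[], size = 50): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}
```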
Chunks are inserted with embedding_status: 'pending'. Embeddings are generated separately by the embedding pipeline.
Oracle Pipeline - Embedding
The embedding model is text-embedding-3-small (OpenAI) producing 1536-dimension vectors stored in a pgvector vector(1536) column. The HNSW index uses cosine distance (vector_cosine_ops) with m=24, ef_construction=200.
The embedding status lifecycle: pending → processing → complete (success path) or pending → processing → failed (error path). Chunks transition to stale when freshness_policy_days is exceeded. The chunk_embedding_status enum tracks where each chunk is in the pipeline.
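The lifecycle can be expressed as an allowed-transition table. This is a sketch of the states described above, not code from the repo; the failed → pending and stale → pending retry edges are assumptions about how re-queueing works, labeled as such below.

```typescript
type EmbeddingStatus = "pending" | "processing" | "complete" | "failed" | "stale";

// Allowed transitions for chunk_embedding_status, per the lifecycle above.
const TRANSITIONS: Record<EmbeddingStatus, EmbeddingStatus[]> = {
  pending: ["processing"],
  processing: ["complete", "failed"],
  complete: ["stale"],   // freshness_policy_days exceeded
  failed: ["pending"],   // assumed: retry re-queues the chunk
  stale: ["pending"],    // assumed: re-embedding re-queues the chunk
};

function canTransition(from: EmbeddingStatus, to: EmbeddingStatus): boolean {
  return TRANSITIONS[from].includes(to);
}
```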
In addition to vector embeddings, each chunk has a content_tsv tsvector column auto-maintained by a PostgreSQL trigger (trg_oracle_chunks_tsv) using tsvector_update_trigger() with pg_catalog.english text search configuration. This enables the hybrid search path.
The embedding script is scripts/oracles/backfill-embeddings.ts, run via bun run oracle:embed. Supports --dry-run and --limit <n>.
Oracle Pipeline - Retrieval Functions
The semantic search function match_oracle_chunks(query_embedding, match_count, match_threshold, filters...) performs pure vector similarity search using cosine distance. Returns chunks ordered by similarity score. Filters: filter_industries, filter_segments, filter_tier, filter_frameworks, filter_content_type, filter_oracle_ids. All optional — NULL means no filter. Wildcard arrays (["*"]) match any filter value.
The hybrid search function match_oracle_chunks_hybrid(query_embedding, query_text, match_count, match_threshold, semantic_weight, filters...) combines vector similarity and PostgreSQL full-text search (ranked tsvector matching) via Reciprocal Rank Fusion: score = (semantic_weight × 1/(60 + sem_rank)) + (text_weight × 1/(60 + ft_rank)). This is necessary because regulatory queries mix conceptual intent (e.g., "what does HIPAA say about data encryption") with precise legal terms (e.g., "45 CFR § 164.312"). Pure semantic search struggles with exact regulatory identifiers; pure keyword search misses conceptual similarity. RRF blends both signals.
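The RRF blend is simple enough to restate in a few lines (a sketch of the scoring formula, using the constant 60 from the function above; ranks are 1-based positions in each result list):

```typescript
// Reciprocal Rank Fusion as described for match_oracle_chunks_hybrid:
// score = w * 1/(60 + semRank) + (1 - w) * 1/(60 + ftRank).
function rrfScore(semRank: number, ftRank: number, semanticWeight = 0.7): number {
  const k = 60;
  const textWeight = 1 - semanticWeight;
  return semanticWeight / (k + semRank) + textWeight / (k + ftRank);
}

// Fuse two ranked id lists into a single RRF-ordered list.
function fuse(semIds: string[], ftIds: string[], w = 0.7): string[] {
  const score = new Map<string, number>();
  semIds.forEach((id, i) => score.set(id, (score.get(id) ?? 0) + w / (60 + i + 1)));
  ftIds.forEach((id, i) => score.set(id, (score.get(id) ?? 0) + (1 - w) / (60 + i + 1)));
  return [...score.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}
```

Note how a chunk that appears in both lists ("b" below, mid-ranked in each) outscores a chunk that tops only one list: this is exactly the blending behavior that rescues exact-identifier matches.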
Input validation: match_count clamped to 1–100, semantic_weight must be 0.0–1.0 (default 0.7), match_threshold must be 0.0–1.0 (default 0.005), empty query_text gracefully disables the full-text branch.
Array filters use PostgreSQL array overlap (&&) with wildcard passthrough: filter_industries IS NULL OR c.industries && filter_industries OR c.industries @> ARRAY['*']. This means an oracle with industries: ["*"] matches any industry filter — cross-industry frameworks like GDPR are always retrievable.
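The same filter semantics, restated in TypeScript terms (an illustration of the SQL predicate above, not code that exists in the repo):

```typescript
// Emulates: filter IS NULL OR c.industries && filter OR c.industries @> ARRAY['*']
function matchesFilter(chunkValues: string[], filter: string[] | null): boolean {
  if (filter === null) return true;                    // no filter: match everything
  if (chunkValues.includes("*")) return true;          // wildcard oracle matches any filter
  return chunkValues.some((v) => filter.includes(v));  // array overlap (&&)
}
```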
Both functions are SECURITY DEFINER with explicit search_path = public, extensions. Execute grants: authenticated and anon can execute both match functions; service_role (which bypasses row-level security) can additionally execute upsert_oracle_document. Row-level security on the underlying tables restricts authenticated/anon to is_active = true documents.
Oracle Pipeline - Content Hashing and Evidence Binding
Every layer uses SHA-256 for integrity. Document-level: full markdown body is hashed for change detection during idempotent upsert. Chunk-level: section content is hashed for RFC-0007 evidence binding (claim → source). Index-level: all chunk hashes are combined for embedding index integrity verification.
The content_hash on each chunk is the binding point for RFC-0007: when Goober cites a claim, the evidence record includes the chunk's content_hash, proving which exact content informed the response. If the content changes (new version), the hash changes, and stale evidence is detectable.
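Evidence verification then reduces to recomputing the hash. The sketch below follows the RFC-0007 description above; the evidence record shape is hypothetical.

```typescript
import { createHash } from "node:crypto";

// Hypothetical evidence record shape: a claim bound to a chunk by content_hash.
interface EvidenceRecord {
  chunk_id: string;
  content_hash: string; // SHA-256 hex of the chunk content at citation time
}

function sha256(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}

// Stale evidence is detectable: if the chunk content changed since the
// citation (new version), the recomputed hash no longer matches the bound hash.
function evidenceIsCurrent(evidence: EvidenceRecord, currentChunkContent: string): boolean {
  return sha256(currentChunkContent) === evidence.content_hash;
}
```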
Oracle Pipeline - Admin Console
The admin oracles page (/admin/oracles) is a read-only observability surface for the pipeline. It reads from five data sources: oracle_documents_summary (SQL view for the document rollup with embedding coverage stats), oracle_chunks_detail (SQL view of recent chunk activity, 30 most recent), the oracle_chunks table (status distribution: complete/pending/failed/stale), getOracles() (the TypeScript loader, for inspecting markdown sources), and gating-ontology.yaml (file, the active deterministic gate policy).
KPIs: embedding coverage (embedded_chunks / total_chunks, healthy ≥ 95%), queue pressure (pending + processing + stale, healthy = 0 outside ingestion), failure count (failed chunks, healthy = 0), ingestion drift (markdown file count ≠ DB doc count, healthy = 0 difference).
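These KPIs are simple aggregates over the status counts (a sketch of the formulas above; the "healthy" thresholds are the documented targets, not values read from the repo):

```typescript
interface StatusCounts {
  complete: number;
  pending: number;
  processing: number;
  failed: number;
  stale: number;
}

// Embedding coverage = embedded_chunks / total_chunks (healthy >= 0.95).
function embeddingCoverage(c: StatusCounts): number {
  const total = c.complete + c.pending + c.processing + c.failed + c.stale;
  return total === 0 ? 1 : c.complete / total;
}

// Queue pressure = pending + processing + stale (healthy = 0 outside ingestion).
function queuePressure(c: StatusCounts): number {
  return c.pending + c.processing + c.stale;
}

// Ingestion drift = |markdown file count - DB doc count| (healthy = 0).
function ingestionDrift(markdownFileCount: number, dbDocCount: number): number {
  return Math.abs(markdownFileCount - dbDocCount);
}
```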
Operator workflows: post-ingestion verification (confirm new/updated documents, validate chunk counts and versions), embedding backfill monitoring (watch status distribution, detect failed/stale chunks), source inspection (read raw markdown, review gating ontology), operator handoff (live state reference during incidents).
Access is restricted to authenticated users with the admin role. The page is read-only — it does not trigger ingestion, embedding, or remediation actions.
Oracle Pipeline - Platform vs Customer Oracles
The pipeline described here handles platform oracles — compliance frameworks maintained by Ontic (HIPAA, GDPR, SOX, etc.). These live in the oracle_documents / oracle_chunks tables.
A separate schema handles customer-authored oracles — per-organization knowledge bases created through the oracle builder. Customer oracles use oracle_domains and oracle_sources tables, are authored via the oracle builder UI (not markdown in repo), are ingested via API (not CLI script), are scoped per-organization (not global), use a separate retrieval path, and live in migration 20260209222845_* (vs 20260217000000_* for platform oracles).
The two systems share conceptual vocabulary (tiers, content types, evidence binding) but have separate storage and retrieval paths.
Oracle Pipeline - Migrations
Applied in order: 20260217000000_oracle_content_store.sql (base tables, functions, indexes, RLS, views), 20260217000001_oracle_content_store_hardening.sql (active version uniqueness, input validation, grants), 20260217000002_hybrid_search_add_filter_frameworks.sql (add filter_frameworks to hybrid search).
Migrations are append-only. Already-applied migrations are never edited. Forward-fixes go in new migrations with explicit DROP FUNCTION for signature changes.
Oracle Pipeline - Related Specifications
spec-oracle-authoring-schema defines the authoring format, section rules, and writing guidelines for oracle markdown files. spec-oracle-retrieval defines the retrieval architecture, claim verification, and staleness policies. spec-goober-handoff defines how Goober receives oracle-grounded context at chat time, including the profile → oracle retrieval → system prompt injection flow.