RFC-0002: Ground Truth Infrastructure

Purpose

Define the authoritative data layer and verification protocol that enables CAA to reference external reality.

This RFC implements Doctrine III: Ontic Turbulence — the constraint that simulators cannot self-authorize. Without a specified ground truth infrastructure, governance is built on faith.

Part I: Authoritative Data Layer

The Ground Truth Problem

CAA governance requires comparing LLM proposals against reality. But "reality" must be:

Sourced — from authoritative origins
Ingested — into a queryable format
Versioned — with temporal validity
Provenance-tracked — with chain of custody from source to query

Without these guarantees, oracle verification is theater.

Source Registry

Every oracle MUST be registered in a Source Registry before it can be queried.

interface SourceRegistryEntry {
  // Identity
  oracle_id: string; // Stable identifier (e.g., "usda-fdc-v2")
  oracle_name: string; // Human-readable name
  oracle_tier: OracleTier; // Trust level

  // Origin
  upstream_authority: string; // Canonical source (e.g., "USDA Agricultural Research Service")
  upstream_url: string; // Official source location
  data_license: string; // License governing use (e.g., "Public Domain", "CC-BY-4.0")

  // Coverage
  domain: string; // Primary domain (e.g., "nutrition", "legal", "medical")
  axes_provided: string[]; // Ontology axes this oracle can populate
  geographic_scope?: string; // Jurisdiction (e.g., "US", "EU", "global")

  // Versioning
  current_version: string; // Active dataset/API version
  version_history: VersionRecord[];
  update_frequency: UpdateFrequency;

  // Integration
  adapter_id: string; // Which adapter handles queries
  query_contract: string; // Reference to query interface spec

  // Governance
  registered_at: string; // RFC 3339
  registered_by: string; // Human/system that registered
  review_status: "approved" | "pending" | "deprecated";
  deprecation_date?: string; // When this oracle should stop being used
}

type OracleTier = "primary" | "secondary" | "cross_domain" | "unverified";

type UpdateFrequency =
  | "realtime" // Streaming updates
  | "daily" // Daily batch refresh
  | "weekly" // Weekly release
  | "monthly" // Monthly release
  | "quarterly" // Quarterly release
  | "annual" // Annual release
  | "static" // No planned updates
  | "on_demand"; // Updated when explicitly triggered

interface VersionRecord {
  version: string;
  released_at: string;
  ingested_at: string;
  record_count?: number;
  checksum: string; // SHA-256 of ingested dataset
  changelog_url?: string;
}

Registry Invariants:

oracle_id MUST be globally unique and immutable once assigned
Multiple versions of the same oracle share the same oracle_id
Deprecation MUST provide at least 90 days notice before removal
axes_provided MUST reference valid ontology axes (RFC-0001)
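
The registry invariants above lend themselves to an automated check at registration time. The following is a minimal sketch, not a normative implementation: `RegistryEntryLite`, `validateRegistry`, `hasSufficientDeprecationNotice`, and the `announcedAt` parameter are hypothetical names introduced for illustration, and the set of valid axes is assumed to be supplied by the RFC-0001 ontology.

```typescript
// Hypothetical, trimmed-down view of SourceRegistryEntry for this sketch.
interface RegistryEntryLite {
  oracle_id: string;
  axes_provided: string[];
}

// Returns human-readable violations; an empty array means both checks passed.
function validateRegistry(
  entries: RegistryEntryLite[],
  knownAxes: Set<string> // assumed to come from the RFC-0001 ontology
): string[] {
  const violations: string[] = [];
  const seen = new Set<string>();
  for (const entry of entries) {
    // oracle_id MUST be globally unique (versions live in version_history).
    if (seen.has(entry.oracle_id)) {
      violations.push(`duplicate oracle_id: ${entry.oracle_id}`);
    }
    seen.add(entry.oracle_id);
    // axes_provided MUST reference valid ontology axes.
    for (const axis of entry.axes_provided) {
      if (!knownAxes.has(axis)) {
        violations.push(`${entry.oracle_id}: unknown axis ${axis}`);
      }
    }
  }
  return violations;
}

const MIN_NOTICE_MS = 90 * 24 * 60 * 60 * 1000; // 90 days

// announcedAt is an assumption: the RFC does not name a field for it.
function hasSufficientDeprecationNotice(
  announcedAt: string, // RFC 3339
  deprecationDate: string // RFC 3339
): boolean {
  return Date.parse(deprecationDate) - Date.parse(announcedAt) >= MIN_NOTICE_MS;
}
```

Immutability of oracle_id cannot be checked from a single snapshot; it would need a comparison against the previous registry state, which this sketch leaves out.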

Data Ingestion Pipeline

Authoritative data MUST flow through a controlled ingestion pipeline that preserves provenance.

interface IngestionPipeline {
  pipeline_id: string;
  oracle_id: string;

  // Source retrieval
  retrieval: {
    method: "api_pull" | "file_download" | "streaming" | "manual_upload";
    endpoint?: string;
    authentication?: "api_key" | "oauth" | "mtls" | "none";
    schedule?: string; // Cron expression for automated pulls
  };

  // Validation
  validation: {
    schema_id: string; // JSON Schema or similar for input validation
    required_fields: string[];
    reject_on_schema_fail: boolean;
    deduplication_key?: string[]; // Fields that define uniqueness
  };

  // Transformation
  transformation: {
    normalizer_id: string; // Reference to normalization function
    output_schema_id: string; // Normalized schema
    field_mappings: FieldMapping[];
  };

  // Storage
  storage: {
    destination: string; // Where normalized data is stored
    partition_key?: string; // For time-series or sharded storage
    retention_days: number; // How long to keep historical versions
  };

  // Audit
  audit: {
    log_raw_input: boolean; // Store original source data
    log_transformations: boolean; // Store transformation steps
    provenance_chain: boolean; // Generate provenance records
  };
}

interface FieldMapping {
  source_path: string; // JSONPath or similar
  target_axis: string; // Ontology axis
  transform?: string; // Optional transformation function
  unit_conversion?: {
    from_unit: string;
    to_unit: string;
  };
}

Ingestion Invariants:

Raw source data MUST be retrievable for audit (if log_raw_input: true)
Every transformation step MUST be deterministic and reproducible
Schema validation failure MUST NOT silently drop records
Ingestion MUST be idempotent — re-running produces identical output

Schema Normalization

Source data arrives in heterogeneous formats. Normalization maps source schemas to ontology axes.

interface NormalizationSpec {
  normalizer_id: string;
  source_schema_id: string; // Schema of incoming data
  target_ontology_id: string; // Target ontology (RFC-0001)

  // Field mappings
  axis_mappings: AxisMapping[];

  // Value normalization
  value_normalizers: ValueNormalizer[];

  // Validation
  post_normalization_checks: Check[];
}

interface AxisMapping {
  source_field: string;
  target_axis: string;
  cardinality: "one_to_one" | "one_to_many" | "many_to_one";
  required: boolean;
  default_value?: unknown;
}

interface ValueNormalizer {
  axis: string;
  normalizer_type:
    | "unit_conversion" // Convert between units
    | "enum_mapping" // Map source values to ontology enums
    | "date_parsing" // Parse dates to RFC 3339
    | "numeric_precision" // Round/truncate to specified precision
    | "string_normalization" // Case folding, whitespace trimming
    | "identity"; // Pass through unchanged
  config: Record<string, unknown>;
}

interface Check {
  check_type: "range" | "enum" | "regex" | "custom";
  axis: string;
  constraint: unknown;
  on_failure: "reject" | "warn" | "nullify";
}

Normalization Example (USDA FDC → Nutrition Ontology):

const usdaFdcNormalizer: NormalizationSpec = {
  normalizer_id: "usda-fdc-to-nutrition-v1",
  source_schema_id: "usda-fdc-food-v2",
  target_ontology_id: "nutrition-v1",

  axis_mappings: [
    {
      source_field: "fdcId",
      target_axis: "food_id",
      cardinality: "one_to_one",
      required: true,
    },
    {
      source_field: "description",
      target_axis: "food_name",
      cardinality: "one_to_one",
      required: true,
    },
    {
      source_field: "foodNutrients",
      target_axis: "nutrients",
      cardinality: "one_to_many",
      required: true,
    },
  ],

  value_normalizers: [
    {
      axis: "nutrients.*.amount",
      normalizer_type: "unit_conversion",
      config: {
        from_unit_field: "nutrients.*.unitName",
        to_unit: "g", // Normalize mass units to grams
        conversion_table: "usda-unit-conversions",
      },
    },
  ],

  post_normalization_checks: [
    {
      check_type: "range",
      axis: "nutrients.*.amount",
      constraint: { min: 0 },
      on_failure: "reject",
    },
  ],
};
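
A unit_conversion normalizer like the one configured above could be backed by a lookup table. The sketch below is illustrative only: `UNITS_PER_GRAM` and `toGrams` are hypothetical, the table contents are an assumption and do not represent the actual "usda-unit-conversions" table, and non-mass units return null so the caller can decide how to handle them.

```typescript
// Hypothetical conversion table: how many of each unit make up one gram.
const UNITS_PER_GRAM: Record<string, number> = {
  g: 1,
  mg: 1000,
  ug: 1000000,
};

// Converts a mass amount to grams. Returns null when the unit is not a
// known mass unit (for example an energy unit), so nothing is guessed.
function toGrams(amount: number, unitName: string): number | null {
  const unitsPerGram = UNITS_PER_GRAM[unitName.toLowerCase()];
  return unitsPerGram === undefined ? null : amount / unitsPerGram;
}

const sodiumGrams = toGrams(250, "MG"); // 0.25
const energy = toGrams(52, "KCAL"); // null: not convertible to grams
```

Returning null instead of throwing keeps the normalizer deterministic and lets a post-normalization check decide between reject, warn, and nullify.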

Provenance Chain

Every record in the authoritative data layer MUST have a provenance chain linking it to its source.

interface ProvenanceRecord {
  record_id: string; // ID of the normalized record

  // Source linkage
  oracle_id: string;
  source_version: string;
  source_record_id: string; // ID in original source (if available)
  source_url?: string; // Direct link to source record

  // Ingestion metadata
  ingested_at: string; // RFC 3339
  ingestion_pipeline_id: string;
  ingestion_run_id: string; // Specific execution instance

  // Transformation audit
  raw_hash: string; // SHA-256 of raw source record
  normalized_hash: string; // SHA-256 of normalized record
  transformations_applied: string[]; // List of normalizer IDs applied

  // Validity
  valid_from: string; // RFC 3339 - when this version became active
  valid_until?: string; // RFC 3339 - when superseded (omitted if current)
  superseded_by?: string; // record_id of newer version

  // Verification
  verification_status: "unverified" | "auto_verified" | "human_verified";
  verified_by?: string;
  verified_at?: string;
}

Provenance Invariants:

Provenance MUST be queryable for any record returned by an oracle
raw_hash MUST be verifiable against archived raw data (if retained)
valid_from MUST be populated, and valid_until MUST be set once a record is superseded, so that point-in-time queries can be answered
Supersession chains MUST be acyclic
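
The acyclicity invariant can be checked by walking each superseded_by chain and watching for a repeated record. This is a simple sketch under the assumption that all provenance records for an oracle fit in memory; `SupersessionLink` and `findSupersessionCycle` are hypothetical names.

```typescript
// Only the two fields from ProvenanceRecord that matter for this check.
interface SupersessionLink {
  record_id: string;
  superseded_by?: string;
}

// Returns the record_id at which a cycle is detected, or null if every
// chain terminates. Worst case is quadratic, which is fine for a sketch.
function findSupersessionCycle(records: SupersessionLink[]): string | null {
  const next = new Map<string, string | undefined>();
  for (const record of records) {
    next.set(record.record_id, record.superseded_by);
  }
  for (const start of Array.from(next.keys())) {
    const visited = new Set<string>();
    let current: string | undefined = start;
    while (current !== undefined) {
      if (visited.has(current)) {
        return current; // we came back to a record already on this chain
      }
      visited.add(current);
      current = next.get(current);
    }
  }
  return null;
}
```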

Query Contract

Oracle adapters query the authoritative data layer through a standardized contract.

interface OracleQueryContract {
  // Query specification
  query_id: string;
  oracle_id: string;

  // Request
  request: {
    axes_requested: string[]; // Which ontology axes to return
    filters: QueryFilter[]; // Constraints on returned records
    point_in_time?: string; // RFC 3339 - query as of this timestamp
    include_provenance: boolean; // Return provenance with results
    max_results?: number;
  };

  // Response
  response: {
    records: NormalizedRecord[];
    provenance?: ProvenanceRecord[];
    query_metadata: QueryMetadata;
  };
}

interface QueryFilter {
  axis: string;
  operator:
    | "eq"
    | "ne"
    | "gt"
    | "gte"
    | "lt"
    | "lte"
    | "in"
    | "contains"
    | "regex";
  value: unknown;
}

interface QueryMetadata {
  query_id: string;
  executed_at: string;
  latency_ms: number;
  records_scanned: number;
  records_returned: number;
  cache_hit: boolean;
  data_version: string; // Version of dataset queried
  freshness: {
    data_as_of: string; // Most recent update in result set
    staleness_seconds: number; // Age of oldest record in result
  };
}

Query Invariants:

Queries MUST be deterministic — same input produces same output
point_in_time queries MUST return data as it existed at that timestamp
Query results MUST include data_version for reproducibility
Freshness metadata MUST be accurate to enable staleness-aware decisions
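
A point_in_time query can be answered by filtering on the validity window recorded in provenance. The sketch below assumes a half-open interval (valid_from inclusive, valid_until exclusive), which the RFC does not state explicitly; `ValidityWindow` and `asOf` are hypothetical names.

```typescript
interface ValidityWindow {
  record_id: string;
  valid_from: string; // RFC 3339
  valid_until?: string; // RFC 3339, omitted while the record is current
}

// Returns the records that were active at the given timestamp.
function asOf<T extends ValidityWindow>(records: T[], pointInTime: string): T[] {
  const t = Date.parse(pointInTime);
  return records.filter((record) => {
    const from = Date.parse(record.valid_from);
    const until =
      record.valid_until === undefined ? Infinity : Date.parse(record.valid_until);
    return from <= t && t < until; // assumption: half-open interval
  });
}
```

With a half-open interval, a record and its successor never both match the same timestamp, which keeps point-in-time results deterministic.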

Freshness & Invalidation

Authoritative data has temporal validity. Systems MUST track and enforce freshness.

interface FreshnessPolicy {
  oracle_id: string;

  // Freshness requirements per axis
  axis_freshness: Record<string, AxisFreshnessRequirement>;

  // Global defaults
  default_max_age_seconds: number;
  default_stale_behavior: StaleBehavior;

  // Invalidation triggers
  invalidation_triggers: InvalidationTrigger[];
}

interface AxisFreshnessRequirement {
  axis: string;
  max_age_seconds: number;
  stale_behavior: StaleBehavior;
  refresh_on_query: boolean; // Trigger refresh if stale when queried
}

type StaleBehavior =
  | "serve_stale_with_warning" // Return data but flag as stale
  | "serve_stale_silent" // Return data without warning (use carefully)
  | "block_until_fresh" // Wait for refresh before returning
  | "fail_closed" // Return error, require manual intervention
  | "degrade_tier"; // Treat as lower-tier oracle

interface InvalidationTrigger {
  trigger_type:
    | "time_based" // Invalidate after TTL
    | "version_change" // Invalidate when source version changes
    | "upstream_signal" // Invalidate on webhook/notification
    | "manual"; // Human-triggered invalidation
  config: Record<string, unknown>;
}

Freshness Example (Medical Lab Values):

const labFreshness: FreshnessPolicy = {
  oracle_id: "ehr-lab-results",

  axis_freshness: {
    inr_value: {
      axis: "inr_value",
      max_age_seconds: 86400, // 24 hours
      stale_behavior: "fail_closed",
      refresh_on_query: false, // Cannot refresh without new lab draw
    },
    medication_list: {
      axis: "medication_list",
      max_age_seconds: 3600, // 1 hour
      stale_behavior: "serve_stale_with_warning",
      refresh_on_query: true,
    },
  },

  default_max_age_seconds: 86400,
  default_stale_behavior: "serve_stale_with_warning",

  invalidation_triggers: [
    {
      trigger_type: "upstream_signal",
      config: { webhook: "/api/ehr/invalidate" },
    },
  ],
};
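
Given a policy like the one above, enforcement reduces to comparing a record's age against the applicable limit and returning the configured behavior. This is a minimal sketch: `evaluateFreshness` and the trimmed-down types are hypothetical, and how the caller acts on each behavior is left out.

```typescript
type StaleBehavior =
  | "serve_stale_with_warning"
  | "serve_stale_silent"
  | "block_until_fresh"
  | "fail_closed"
  | "degrade_tier";

interface AxisRule {
  max_age_seconds: number;
  stale_behavior: StaleBehavior;
}

// Trimmed-down FreshnessPolicy with only the fields this sketch needs.
interface FreshnessPolicyLite {
  axis_freshness: Record<string, AxisRule>;
  default_max_age_seconds: number;
  default_stale_behavior: StaleBehavior;
}

// Returns "fresh" or the behavior the policy prescribes for stale data.
function evaluateFreshness(
  policy: FreshnessPolicyLite,
  axis: string,
  ageSeconds: number
): "fresh" | StaleBehavior {
  const rule = policy.axis_freshness[axis];
  // Fall back to the global defaults when the axis has no specific rule.
  const maxAge = rule ? rule.max_age_seconds : policy.default_max_age_seconds;
  const behavior = rule ? rule.stale_behavior : policy.default_stale_behavior;
  return ageSeconds <= maxAge ? "fresh" : behavior;
}
```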

Acceptance Criteria (Part I)

A system is compliant with Part I if:

All oracles are registered in a Source Registry with required fields
Data flows through documented ingestion pipelines
Normalization is deterministic and reproducible
Provenance chains are queryable for all records
Query contracts return freshness metadata
Staleness is detected and handled per policy

Part II: Oracle Verification Protocol

Oracle Definition

An oracle is any externally referenceable, auditable source, including:

• Databases
• APIs
• Standards documents
• Policy artifacts
• Signed internal references
• Human locks

Oracle Trust Hierarchy

Oracles are classified into trust tiers that determine precedence during conflict resolution:

Tier	Role	Examples
`primary`	Authoritative for domain	USDA FDC for nutrition, State Bar for attorney status
`secondary`	Supplementary, lower precedence	Third-party nutrition databases, legal analytics services
`cross_domain`	Valid for related domain only	Medical database used for nutrition interactions
`unverified`	Untrusted until validation	User-provided data, scraped sources

Trust tier is declared in the oracle source registry, not inferred.

Verification

Each authoritative output must declare:

• Oracle reference
• Verification method
• Resolution layer

Oracle Integrity Requirements (Normative)

The verification declaration MUST be sufficiently specific to support audit, replay detection, and dispute resolution. At minimum, each authoritative output MUST be attributable to a concrete oracle artifact and verification event.

Minimum fields (conceptual schema):

interface OracleVerificationRecord {
  oracle_id: string; // Stable registry identifier (not a free-text name)
  oracle_tier: "primary" | "secondary" | "cross_domain" | "unverified";
  source_locator: string; // URL/URN/path/query template sufficient to retrieve the artifact
  source_version?: string; // Doc version / dataset release / API version / commit hash
  retrieved_at: string; // RFC 3339 timestamp
  verified_at: string; // RFC 3339 timestamp
  verification_method: string; // e.g., signature verification, checksum, controlled ingestion
  verifier_id: string; // System/human reviewer identity
  evidence_hash?: string; // Hash of retrieved artifact or normalized evidence bundle
  signature_info?: {
    signed_by: string;
    signature_alg: string;
    signature_valid: boolean;
  };
  ttl_seconds?: number; // Cache lifetime; expiry MUST be enforced for state-sensitive axes
  chain_of_custody?: string[]; // Optional; required for regulated/audit-heavy domains
}

Integrity constraints:

• If an oracle supports signatures or immutable versioning, systems MUST verify and record it.
• For state-sensitive axes, ttl_seconds MUST be bounded and expiry MUST force re-verification.
• If evidence cannot be versioned or hashed, the system MUST treat the result as lower-trust and MAY require human review depending on domain policy.
• Systems MUST be resilient to oracle poisoning: registries MUST pin oracle identities, and unverified/user-supplied sources MUST NOT outrank verified registries.
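
The TTL constraint can be enforced with a check run before any cached verification is reused. The sketch below is one possible reading: `needsReverification` is a hypothetical helper, and treating a missing ttl_seconds on a state-sensitive axis as "always re-verify" is an assumption derived from the requirement that the TTL be bounded.

```typescript
// Only the OracleVerificationRecord fields this check needs.
interface VerificationLite {
  verified_at: string; // RFC 3339
  ttl_seconds?: number;
}

function needsReverification(
  record: VerificationLite,
  now: Date,
  stateSensitive: boolean
): boolean {
  if (record.ttl_seconds === undefined) {
    // Assumption: an unbounded TTL is never acceptable on a
    // state-sensitive axis, so such records are always re-verified.
    return stateSensitive;
  }
  const expiresAtMs = Date.parse(record.verified_at) + record.ttl_seconds * 1000;
  return now.getTime() >= expiresAtMs;
}
```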

Oracle Conflict Decision Procedure

When multiple oracles provide values for the same axis:

Different-tier oracles: Higher-tier oracle wins by default
- Exception: If value delta exceeds escalation_delta_threshold, escalate regardless
Same-tier oracles: Apply same_tier_strategy:
- require_human: Always escalate to human review (default)
- most_recent: Use most recently verified value
- use_conservative: Use more restrictive value (for numeric/safety-critical)
- dispute_summary: Emit dispute_summary envelope, do not decide
Axes in always_human_axes: Always require human review regardless of tier
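
The decision procedure above can be sketched for the two-oracle, numeric-value case. This is an illustration under stated assumptions, not the normative algorithm: `resolveConflict`, `OracleValue`, and `Decision` are hypothetical, "more restrictive" is taken to mean the lower value, and unverified oracles are ranked below every other tier.

```typescript
type Tier = "primary" | "secondary" | "cross_domain" | "unverified";
type SameTierStrategy = "require_human" | "most_recent" | "use_conservative" | "dispute_summary";

interface OracleValue { oracle_id: string; tier: Tier; value: number; verified_at: string; }

interface ConflictConfigLite {
  same_tier_strategy: SameTierStrategy;
  escalation_delta_threshold?: number;
  always_human_axes?: string[];
}

type Decision =
  | { outcome: "resolved"; winner: OracleValue }
  | { outcome: "human_review" }
  | { outcome: "dispute_summary" };

const TIER_RANK: Record<Tier, number> = { primary: 3, secondary: 2, cross_domain: 1, unverified: 0 };

function resolveConflict(axis: string, a: OracleValue, b: OracleValue, cfg: ConflictConfigLite): Decision {
  // Axes in always_human_axes are escalated regardless of tier.
  if ((cfg.always_human_axes ?? []).indexOf(axis) !== -1) return { outcome: "human_review" };

  if (TIER_RANK[a.tier] !== TIER_RANK[b.tier]) {
    // Different tiers: escalate on a large delta, otherwise higher tier wins.
    const delta = Math.abs(a.value - b.value);
    if (cfg.escalation_delta_threshold !== undefined && delta > cfg.escalation_delta_threshold) {
      return { outcome: "human_review" };
    }
    return { outcome: "resolved", winner: TIER_RANK[a.tier] > TIER_RANK[b.tier] ? a : b };
  }

  // Same tier: apply same_tier_strategy.
  switch (cfg.same_tier_strategy) {
    case "most_recent":
      return { outcome: "resolved", winner: Date.parse(a.verified_at) >= Date.parse(b.verified_at) ? a : b };
    case "use_conservative":
      return { outcome: "resolved", winner: a.value <= b.value ? a : b }; // assumption: lower is more restrictive
    case "dispute_summary":
      return { outcome: "dispute_summary" };
    default:
      return { outcome: "human_review" }; // require_human
  }
}
```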

Conflict Resolution Strategies

Strategy	When to Use	Behavior
`higher_tier_wins`	Default for cross-tier conflicts	Primary beats secondary beats cross_domain
`most_recent`	Time-sensitive data	Most recently verified value wins
`require_human`	Safety-critical domains	Always escalate, never auto-resolve
`use_conservative`	Risk mitigation	Use lower/more restrictive value
`weighted_average`	Numeric values with confidence	Weight by tier: primary=4, secondary=2, cross_domain=1
`dispute_summary`	Audit-heavy domains	Emit conflict to client, don't decide
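
As a worked example of the weighted_average strategy, a primary oracle reporting 10 and a secondary oracle reporting 16 combine to (4 × 10 + 2 × 16) / (4 + 2) = 12. The sketch below uses the tier weights from the table; `weightedAverage` is a hypothetical helper, and excluding unverified oracles is an assumption, since the table assigns them no weight.

```typescript
// Tier weights as given in the strategy table.
const TIER_WEIGHT = { primary: 4, secondary: 2, cross_domain: 1 } as const;
type WeightedTier = keyof typeof TIER_WEIGHT;

function weightedAverage(values: { tier: WeightedTier; value: number }[]): number {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const entry of values) {
    const weight = TIER_WEIGHT[entry.tier];
    weightedSum += weight * entry.value;
    totalWeight += weight;
  }
  if (totalWeight === 0) {
    throw new Error("no values to average");
  }
  return weightedSum / totalWeight;
}

const combined = weightedAverage([
  { tier: "primary", value: 10 },
  { tier: "secondary", value: 16 },
]); // 12
```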

Conflict Configuration

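// Strategy names as listed under Conflict Resolution Strategies
type ConflictResolutionStrategy =
  | "higher_tier_wins"
  | "most_recent"
  | "require_human"
  | "use_conservative"
  | "weighted_average"
  | "dispute_summary";

interface ConflictResolutionConfig {
  default_strategy: ConflictResolutionStrategy;
  same_tier_strategy: ConflictResolutionStrategy;
  escalation_delta_threshold?: number; // Numeric difference that forces escalation
  always_human_axes?: string[]; // Axes that always require human review
  recency_window_seconds?: number; // Time window for recency comparison

  // Edge case handling (Amendment)
  scope: ConflictScope;
}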

interface ConflictScope {
  granularity: "oracle" | "axis"; // Conflict resolution at oracle level or per-axis
  cascade_limit: number; // Max resolution depth before escalation (default: 3)
  partial_conflict_behavior: "isolate" | "invalidate_all";
}

Edge Case: Cascading Conflicts

When Oracle A conflicts with Oracle B and resolution picks A, but A also conflicts with Oracle C:

cascade_limit defines maximum resolution depth
If limit exceeded → escalate to human review
Each cascade step is logged for audit

Edge Case: Partial Conflicts

When Oracle A provides axes X and Y, Oracle B provides axes Y and Z, and they conflict only on Y:

`granularity`	Behavior
`"axis"`	Resolve Y conflict; X from A and Z from B remain valid
`"oracle"`	Conflict on Y invalidates both oracles; escalate

`partial_conflict_behavior`	Behavior
`"isolate"`	Only the conflicting axis requires resolution
`"invalidate_all"`	All axes from conflicting oracles require re-verification

Conflicting oracles result in:

• dispute_summary output type (if strategy is dispute_summary)
• Automatic resolution with audit log (if strategy is deterministic)
• Human escalation (if require_human or threshold exceeded)
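
The partial-conflict case (Oracle A provides X and Y, Oracle B provides Y and Z, and they disagree only on Y) can be sketched as a function that lists which axes need attention under each partial_conflict_behavior. `axesNeedingAttention` is a hypothetical helper, and values are simplified to numbers for the example.

```typescript
type PartialConflictBehavior = "isolate" | "invalidate_all";

// Returns the axes that require resolution or re-verification, sorted.
function axesNeedingAttention(
  a: Record<string, number>,
  b: Record<string, number>,
  behavior: PartialConflictBehavior
): string[] {
  // An axis conflicts when both oracles provide it with different values.
  const conflicting = Object.keys(a).filter(
    (axis) => axis in b && a[axis] !== b[axis]
  );
  if (conflicting.length === 0 || behavior === "isolate") {
    return conflicting.sort();
  }
  // invalidate_all: every axis from both conflicting oracles is affected.
  const all = new Set<string>(Object.keys(a).concat(Object.keys(b)));
  return Array.from(all).sort();
}

const oracleA = { X: 1, Y: 2 };
const oracleB = { Y: 3, Z: 4 };
const isolated = axesNeedingAttention(oracleA, oracleB, "isolate"); // ["Y"]
const invalidated = axesNeedingAttention(oracleA, oracleB, "invalidate_all"); // ["X", "Y", "Z"]
```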

Oracle Latency Budgets (Amendment)

When oracles are slow or unreachable, the system must have defined behavior:

interface OracleLatencyPolicy {
  soft_timeout_ms: number; // Begin fallback preparation after this long (e.g., 2000ms)
  hard_timeout_ms: number; // Abort the oracle call and apply on_timeout (e.g., 5000ms)
  on_timeout: "FAIL_OPEN_NARRATIVE" | "FAIL_CLOSED_BLOCK" | "SERVE_STALE";
  stale_tolerance_seconds?: number; // Max age for cached oracle data
}

Timeout Behaviors:

`on_timeout`	Behavior	Use When
`FAIL_OPEN_NARRATIVE`	Return NARRATIVE_ONLY with disclaimer	Low-risk domains, user experience priority
`FAIL_CLOSED_BLOCK`	Return BLOCKED with retry guidance	High-stakes domains (medicine, law, finance)
`SERVE_STALE`	Use cached data if within tolerance	Time-insensitive data with reliable cache

Stale Data Policy:

If stale_tolerance_seconds is set and cached data exists within tolerance, use it
Cached data MUST include original observed_at timestamp in provenance
Stale data triggers warning in audit log
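
The timeout behaviors and stale data policy can be combined into one decision taken when the hard timeout fires. This sketch is an assumption-laden illustration: `outcomeOnTimeout` is hypothetical, and falling back to BLOCKED when SERVE_STALE has no usable cache is a choice the RFC does not specify.

```typescript
type OnTimeout = "FAIL_OPEN_NARRATIVE" | "FAIL_CLOSED_BLOCK" | "SERVE_STALE";
type TimeoutOutcome = "NARRATIVE_ONLY" | "BLOCKED" | "STALE_DATA";

// Trimmed-down OracleLatencyPolicy with only the fields used here.
interface LatencyPolicyLite {
  on_timeout: OnTimeout;
  stale_tolerance_seconds?: number;
}

// cachedAgeSeconds is null when no cached value exists for the query.
function outcomeOnTimeout(
  policy: LatencyPolicyLite,
  cachedAgeSeconds: number | null
): TimeoutOutcome {
  if (policy.on_timeout === "FAIL_OPEN_NARRATIVE") {
    return "NARRATIVE_ONLY";
  }
  if (policy.on_timeout === "SERVE_STALE") {
    const tolerance = policy.stale_tolerance_seconds;
    if (cachedAgeSeconds !== null && tolerance !== undefined && cachedAgeSeconds <= tolerance) {
      return "STALE_DATA"; // caller should also log a staleness warning
    }
  }
  // FAIL_CLOSED_BLOCK, or SERVE_STALE without a usable cache (assumption).
  return "BLOCKED";
}
```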

Oracle Reliability Requirements

External oracles are critical infrastructure. Systems MUST implement reliability measures to handle oracle failure modes.

Failure Modes

Mode	Description	Detection
Unavailability	Oracle unreachable or returning 5xx errors	Connection timeout, HTTP status
Latency	Response time exceeds acceptable thresholds	Timer comparison to policy
Conflict	Multiple oracles disagree on same axis	Value comparison during resolution
Poisoning	Oracle returns manipulated or invalid data	Signature verification, schema validation

Timeout Defaults

Threshold	Default Value	Purpose
Soft timeout	2,000 ms	Begin fallback preparation
Hard timeout	5,000 ms	Abort oracle call, execute fallback
Retry backoff	Exponential (100ms base, 2x multiplier)	Prevent thundering herd
Max retries	3	Bound retry attempts before escalation
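
With the defaults in the table (100 ms base, 2x multiplier, 3 retries), the waits before each retry are 100 ms, 200 ms, and 400 ms. A minimal sketch of that schedule follows; `backoffDelaysMs` is a hypothetical helper, and it omits random jitter, which deployments often add on top of exponential backoff and which the table does not mention.

```typescript
// Computes the wait before each retry attempt, in milliseconds.
function backoffDelaysMs(
  baseMs: number = 100,
  multiplier: number = 2,
  maxRetries: number = 3
): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    delays.push(baseMs * Math.pow(multiplier, attempt));
  }
  return delays;
}

const defaultSchedule = backoffDelaysMs(); // [100, 200, 400]
```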

Authority Latency Policy

Workflows MUST declare their Authority Latency Tolerance (ALT) and corresponding enforcement mode:

interface AuthorityLatencyPolicy {
  authority_latency_budget_ms: number; // Maximum acceptable latency for authoritative decision
  enforcement_mode: "INLINE" | "ASYNC" | "NON_AUTHORITATIVE_ONLY";
  on_timeout_behavior: "ESCALATE" | "DEGRADE_TO_NARRATIVE" | "FAIL_CLOSED";
}

Enforcement Modes:

Mode	When to Use	Behavior
`INLINE`	ALT > oracle soft_timeout	Oracle verification runs inline; output waits for verification
`ASYNC`	ALT < oracle soft_timeout, but authoritative output required	Queue for human review or delayed decision; return provisional status
`NON_AUTHORITATIVE_ONLY`	Latency-critical or assistive workflows	No oracle verification; outputs restricted to narrative mode with forbidden marker enforcement

Binding Rules:

If workflow is authoritative AND enforcement_mode: "INLINE", oracle verification is required per this RFC
If workflow is authoritative BUT authority_latency_budget_ms is less than the oracle's soft_timeout_ms, enforcement MUST route to:
- ASYNC: Queue human review or delayed decision, return REQUIRES_SPECIFICATION or provisional envelope
- NON_AUTHORITATIVE_ONLY: Return narrative-only envelope with explicit "unverified" status
Workflows with NON_AUTHORITATIVE_ONLY MUST NOT emit measurement, classification, or action envelopes
Systems MUST validate that declared authority_latency_budget_ms is compatible with enforcement_mode
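
The binding rules can be checked mechanically when a workflow is configured. The sketch below covers two of them, under the assumption that envelope types are identified by the lowercase names used in the rules above; `validateLatencyPolicy` and `isEnvelopeAllowed` are hypothetical helpers.

```typescript
type EnforcementMode = "INLINE" | "ASYNC" | "NON_AUTHORITATIVE_ONLY";

// Trimmed-down AuthorityLatencyPolicy with only the fields checked here.
interface AuthorityPolicyLite {
  authority_latency_budget_ms: number;
  enforcement_mode: EnforcementMode;
}

// Returns configuration errors; an empty array means the policy is compatible.
function validateLatencyPolicy(
  policy: AuthorityPolicyLite,
  oracleSoftTimeoutMs: number
): string[] {
  const errors: string[] = [];
  if (
    policy.enforcement_mode === "INLINE" &&
    policy.authority_latency_budget_ms < oracleSoftTimeoutMs
  ) {
    errors.push(
      "budget is below the oracle soft timeout; use ASYNC or NON_AUTHORITATIVE_ONLY"
    );
  }
  return errors;
}

// NON_AUTHORITATIVE_ONLY workflows may not emit these envelope types.
const FORBIDDEN_WHEN_NON_AUTHORITATIVE = ["measurement", "classification", "action"];

function isEnvelopeAllowed(mode: EnforcementMode, envelopeType: string): boolean {
  if (mode !== "NON_AUTHORITATIVE_ONLY") {
    return true;
  }
  return FORBIDDEN_WHEN_NON_AUTHORITATIVE.indexOf(envelopeType) === -1;
}
```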

Example Configuration:

// Nutrition calculation (latency-tolerant)
const nutritionCalculationPolicy: AuthorityLatencyPolicy = {
  authority_latency_budget_ms: 3000,
  enforcement_mode: "INLINE",
  on_timeout_behavior: "ESCALATE",
};

// Real-time chat guidance (latency-critical)
const chatGuidancePolicy: AuthorityLatencyPolicy = {
  authority_latency_budget_ms: 150,
  enforcement_mode: "NON_AUTHORITATIVE_ONLY",
  on_timeout_behavior: "DEGRADE_TO_NARRATIVE",
};

// Time-critical adjudication (authoritative but low-latency)
const adjudicationPolicy: AuthorityLatencyPolicy = {
  authority_latency_budget_ms: 800,
  enforcement_mode: "ASYNC",
  on_timeout_behavior: "ESCALATE",
};

This policy makes the latency-authority tradeoff explicit and auditable.

Conflict Resolution Hierarchy

When multiple sources provide conflicting values, resolve according to axis type:

Self-Reported Axes (address, preferences, stated intent):

Human confirmation — Explicit user confirmation takes precedence
Deterministic lookup — Primary-tier oracles with cryptographic verification
Cryptographic proof — Signed, timestamped evidence with valid chain of custody
Recency — Most recently verified value (only when explicitly configured)

Measured/Regulated Axes (lab values, legal status, financial balances):

Deterministic lookup — Primary-tier oracles (EHR, court records, bank APIs)
Cryptographic proof — Signed, timestamped evidence with valid chain of custody
Human confirmation — Treated as unverified unless co-signed by sensor/oracle
Conservative default — Most restrictive value for safety-critical domains

Warning: For measured/regulated axes, human confirmation alone is insufficient. A user stating "my INR is 2.5" when the EHR shows 4.8 MUST route to HumanReview with explicit oracle-vs-user conflict flagged. The system MUST NOT authorize based on user self-report for axes where sensor data is available.
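
The two hierarchies and the warning above can be sketched as a precedence lookup keyed by axis kind. This is illustrative only: all names are hypothetical, values are simplified to numbers, the fourth-ranked fallbacks (recency, conservative default) are omitted, and routing a measured axis with only self-reported evidence to human review is an assumption.

```typescript
type AxisKind = "self_reported" | "measured_regulated";
type SourceKind = "human_confirmation" | "deterministic_lookup" | "cryptographic_proof";

interface Claim { source: SourceKind; value: number; }

type Resolution =
  | { outcome: "accepted"; claim: Claim }
  | { outcome: "human_review"; reason: string };

// Precedence order per axis kind, highest first.
const PRECEDENCE: Record<AxisKind, SourceKind[]> = {
  self_reported: ["human_confirmation", "deterministic_lookup", "cryptographic_proof"],
  measured_regulated: ["deterministic_lookup", "cryptographic_proof", "human_confirmation"],
};

function resolveByAxisKind(kind: AxisKind, claims: Claim[]): Resolution {
  if (kind === "measured_regulated") {
    const human = claims.filter((c) => c.source === "human_confirmation");
    const oracle = claims.filter((c) => c.source !== "human_confirmation");
    if (oracle.length === 0) {
      // Assumption: self-report alone is never enough on these axes.
      return { outcome: "human_review", reason: "no oracle evidence" };
    }
    const disagrees = human.some((h) => oracle.some((o) => o.value !== h.value));
    if (disagrees) {
      return { outcome: "human_review", reason: "oracle-vs-user conflict" };
    }
  }
  for (const source of PRECEDENCE[kind]) {
    const match = claims.filter((c) => c.source === source)[0];
    if (match !== undefined) {
      return { outcome: "accepted", claim: match };
    }
  }
  return { outcome: "human_review", reason: "no claims" };
}
```

For the INR example, a self-reported 2.5 against an EHR value of 4.8 yields human_review with the conflict reason rather than an accepted value.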

Oracle Health Monitoring

Systems MUST implement oracle health monitoring:

interface OracleHealthConfig {
  health_check_interval_seconds: number; // Default: 60
  failure_threshold: number; // Consecutive failures before unhealthy (default: 3)
  recovery_threshold: number; // Consecutive successes to restore (default: 2)
  circuit_breaker_enabled: boolean; // Stop calling failed oracles temporarily
  circuit_breaker_reset_seconds: number; // Time before retry after circuit opens
}

interface OracleHealthStatus {
  oracle_id: string;
  status: "healthy" | "degraded" | "unhealthy" | "circuit_open";
  last_success_at: string | null;
  last_failure_at: string | null;
  consecutive_failures: number;
  latency_p50_ms: number;
  latency_p99_ms: number;
}
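
The failure_threshold and recovery_threshold settings imply a small state machine over health-check results. The sketch below models only the healthy/unhealthy transition; the degraded and circuit_open states are left out, and `recordProbe` and the `consecutive_successes` counter are hypothetical additions not present in OracleHealthStatus.

```typescript
interface HealthThresholds {
  failure_threshold: number;
  recovery_threshold: number;
}

interface HealthState {
  status: "healthy" | "unhealthy";
  consecutive_failures: number;
  consecutive_successes: number; // hypothetical counter for recovery
}

// Returns the next state after one health-check result.
function recordProbe(
  state: HealthState,
  success: boolean,
  cfg: HealthThresholds
): HealthState {
  if (success) {
    const successes = state.consecutive_successes + 1;
    const recovered =
      state.status === "unhealthy" && successes >= cfg.recovery_threshold;
    return {
      status: recovered ? "healthy" : state.status,
      consecutive_failures: 0,
      consecutive_successes: successes,
    };
  }
  const failures = state.consecutive_failures + 1;
  return {
    status: failures >= cfg.failure_threshold ? "unhealthy" : state.status,
    consecutive_failures: failures,
    consecutive_successes: 0,
  };
}
```

With the documented defaults (3 failures, 2 successes), three consecutive failed probes mark the oracle unhealthy, and it takes two consecutive successes to restore it.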

Versioned Oracle Logging

All oracle interactions MUST be logged with sufficient detail for audit reconstruction:

interface OracleInteractionLog {
  execution_id: string;
  oracle_id: string;
  oracle_version: string; // API version, dataset version, or schema version
  request_timestamp: string; // RFC 3339
  response_timestamp: string | null;
  latency_ms: number;
  status: "success" | "timeout" | "error" | "conflict";
  error_code?: string;
  response_hash?: string; // SHA-256 of response body
  cache_hit: boolean;
  stale_data_used: boolean;
}

Acceptance Criteria (Combined)

A system is compliant with RFC-0002 if:

Part I — Authoritative Data Layer:

1. All oracles are registered in a Source Registry with required fields
2. Data flows through documented ingestion pipelines with provenance
3. Normalization is deterministic and reproducible
4. Provenance chains are queryable for all records
5. Query contracts return freshness metadata
6. Staleness is detected and handled per policy

Part II — Oracle Verification Protocol:

7. Every authoritative output includes OracleVerificationRecord
8. Conflict resolution follows declared strategy
9. Timeout behavior matches declared policy
10. Health monitoring detects and handles failure modes
11. All oracle interactions are logged with sufficient detail for audit

Relationship to Other RFCs

RFC	Relationship
RFC-0001	Ontology defines axes that oracles populate
RFC-0004	State extraction queries oracles for verification
RFC-0006	Evidence binding references oracle provenance
RFC-0007	Oracle query logic is opaque to LLM
RFC-0009	Authorization envelopes include oracle verification records
Appendix-Audit	Oracle interactions are logged per telemetry requirements

⸻