RFC-0002: Ground Truth Infrastructure
Purpose
Define the authoritative data layer and verification protocol that enables CAA to reference external reality.
This RFC implements Doctrine III: Ontic Turbulence — the constraint that simulators cannot self-authorize. Without a specified ground truth infrastructure, governance is built on faith.
Part I: Authoritative Data Layer
The Ground Truth Problem
CAA governance requires comparing LLM proposals against reality. But "reality" must be:
- Sourced — from authoritative origins
- Ingested — into a queryable format
- Versioned — with temporal validity
- Provenance-tracked — with chain of custody from source to query
Without these guarantees, oracle verification is theater.
Source Registry
Every oracle MUST be registered in a Source Registry before it can be queried.
interface SourceRegistryEntry {
// Identity
oracle_id: string; // Stable identifier (e.g., "usda-fdc-v2")
oracle_name: string; // Human-readable name
oracle_tier: OracleTier; // Trust level
// Origin
upstream_authority: string; // Canonical source (e.g., "USDA Agricultural Research Service")
upstream_url: string; // Official source location
data_license: string; // License governing use (e.g., "Public Domain", "CC-BY-4.0")
// Coverage
domain: string; // Primary domain (e.g., "nutrition", "legal", "medical")
axes_provided: string[]; // Ontology axes this oracle can populate
geographic_scope?: string; // Jurisdiction (e.g., "US", "EU", "global")
// Versioning
current_version: string; // Active dataset/API version
version_history: VersionRecord[];
update_frequency: UpdateFrequency;
// Integration
adapter_id: string; // Which adapter handles queries
query_contract: string; // Reference to query interface spec
// Governance
registered_at: string; // RFC 3339
registered_by: string; // Human/system that registered
review_status: "approved" | "pending" | "deprecated";
deprecation_date?: string; // When this oracle should stop being used
}
type OracleTier = "primary" | "secondary" | "cross_domain" | "unverified";
type UpdateFrequency =
| "realtime" // Streaming updates
| "daily" // Daily batch refresh
| "weekly" // Weekly release
| "monthly" // Monthly release
| "quarterly" // Quarterly release
| "annual" // Annual release
| "static" // No planned updates
| "on_demand"; // Updated when explicitly triggered
interface VersionRecord {
version: string;
released_at: string;
ingested_at: string;
record_count?: number;
checksum: string; // SHA-256 of ingested dataset
changelog_url?: string;
}
Registry Invariants:
- oracle_id MUST be globally unique and immutable once assigned
- Multiple versions of the same oracle share the same oracle_id
- Deprecation MUST provide at least 90 days' notice before removal
- axes_provided MUST reference valid ontology axes (RFC-0001)
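The registry invariants can be enforced at write time. The following is a minimal sketch, assuming an in-memory store; `SourceRegistrySketch`, `RegistryEntrySketch`, and the method names are hypothetical, and the entry type is trimmed to the fields the invariants touch.

```typescript
// Trimmed entry: only the fields the invariants touch (assumption for brevity).
interface RegistryEntrySketch {
  oracle_id: string;
  current_version: string;
  axes_provided: string[];
  deprecation_date?: string; // RFC 3339
}

const MIN_NOTICE_MS = 90 * 24 * 60 * 60 * 1000; // 90 days in milliseconds

class SourceRegistrySketch {
  private entries = new Map<string, RegistryEntrySketch>();
  private validAxes: Set<string>;

  constructor(validAxes: string[]) {
    this.validAxes = new Set(validAxes);
  }

  // Rejects duplicate IDs and axes that are not part of the ontology.
  register(entry: RegistryEntrySketch): void {
    if (this.entries.has(entry.oracle_id)) {
      throw new Error(`oracle_id already registered: ${entry.oracle_id}`);
    }
    const invalid = entry.axes_provided.filter((axis) => !this.validAxes.has(axis));
    if (invalid.length > 0) {
      throw new Error(`unknown ontology axes: ${invalid.join(", ")}`);
    }
    this.entries.set(entry.oracle_id, { ...entry });
  }

  // A new version updates the existing entry; the oracle_id never changes.
  publishVersion(oracleId: string, version: string): void {
    const entry = this.entries.get(oracleId);
    if (!entry) throw new Error(`unknown oracle_id: ${oracleId}`);
    entry.current_version = version;
  }

  // Deprecation must be announced at least 90 days before removal.
  deprecate(oracleId: string, removalDate: string, now: Date): void {
    const entry = this.entries.get(oracleId);
    if (!entry) throw new Error(`unknown oracle_id: ${oracleId}`);
    if (Date.parse(removalDate) - now.getTime() < MIN_NOTICE_MS) {
      throw new Error("deprecation requires at least 90 days notice");
    }
    entry.deprecation_date = removalDate;
  }

  get(oracleId: string): RegistryEntrySketch | undefined {
    return this.entries.get(oracleId);
  }
}
```

A production registry would also persist version_history and review_status; those are omitted here to keep the invariants visible.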
Data Ingestion Pipeline
Authoritative data MUST flow through a controlled ingestion pipeline that preserves provenance.
interface IngestionPipeline {
pipeline_id: string;
oracle_id: string;
// Source retrieval
retrieval: {
method: "api_pull" | "file_download" | "streaming" | "manual_upload";
endpoint?: string;
authentication?: "api_key" | "oauth" | "mtls" | "none";
schedule?: string; // Cron expression for automated pulls
};
// Validation
validation: {
schema_id: string; // JSON Schema or similar for input validation
required_fields: string[];
reject_on_schema_fail: boolean;
deduplication_key?: string[]; // Fields that define uniqueness
};
// Transformation
transformation: {
normalizer_id: string; // Reference to normalization function
output_schema_id: string; // Normalized schema
field_mappings: FieldMapping[];
};
// Storage
storage: {
destination: string; // Where normalized data is stored
partition_key?: string; // For time-series or sharded storage
retention_days: number; // How long to keep historical versions
};
// Audit
audit: {
log_raw_input: boolean; // Store original source data
log_transformations: boolean; // Store transformation steps
provenance_chain: boolean; // Generate provenance records
};
}
interface FieldMapping {
source_path: string; // JSONPath or similar
target_axis: string; // Ontology axis
transform?: string; // Optional transformation function
unit_conversion?: {
from_unit: string;
to_unit: string;
};
}
Ingestion Invariants:
- Raw source data MUST be retrievable for audit (if log_raw_input: true)
- Every transformation step MUST be deterministic and reproducible
- Schema validation failure MUST NOT silently drop records
- Ingestion MUST be idempotent — re-running produces identical output
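Two of these invariants, no silent drops and idempotent re-runs, can be illustrated together. This is a sketch under stated assumptions: `ingest`, `RawRow`, and `IngestionResult` are hypothetical names, validation is reduced to a required-field check, and the first occurrence of a deduplication key wins.

```typescript
type RawRow = { [field: string]: unknown };

interface IngestionResult {
  accepted: RawRow[];
  rejected: { row: RawRow; reason: string }[];
}

// Hypothetical helper: validates, de-duplicates, and orders rows deterministically.
function ingest(rows: RawRow[], requiredFields: string[], dedupKey: string[]): IngestionResult {
  const accepted = new Map<string, RawRow>();
  const rejected: { row: RawRow; reason: string }[] = [];

  for (const row of rows) {
    const missing = requiredFields.filter((f) => row[f] === undefined || row[f] === null);
    if (missing.length > 0) {
      // Failures are recorded with a reason, never silently dropped.
      rejected.push({ row, reason: `missing required fields: ${missing.join(", ")}` });
      continue;
    }
    const key = JSON.stringify(dedupKey.map((f) => row[f]));
    if (accepted.has(key)) {
      rejected.push({ row, reason: `duplicate of key ${key}` });
      continue;
    }
    accepted.set(key, row);
  }

  // Sorting by key gives a stable output order for the accepted rows.
  const ordered = Array.from(accepted.entries())
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([, row]) => row);
  return { accepted: ordered, rejected };
}
```

Feeding the accepted rows back into `ingest` returns the same accepted rows and an empty rejection list, which is the property the idempotency invariant asks for.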
Schema Normalization
Source data arrives in heterogeneous formats. Normalization maps source schemas to ontology axes.
interface NormalizationSpec {
normalizer_id: string;
source_schema_id: string; // Schema of incoming data
target_ontology_id: string; // Target ontology (RFC-0001)
// Field mappings
axis_mappings: AxisMapping[];
// Value normalization
value_normalizers: ValueNormalizer[];
// Validation
post_normalization_checks: Check[];
}
interface AxisMapping {
source_field: string;
target_axis: string;
cardinality: "one_to_one" | "one_to_many" | "many_to_one";
required: boolean;
default_value?: unknown;
}
interface ValueNormalizer {
axis: string;
normalizer_type:
| "unit_conversion" // Convert between units
| "enum_mapping" // Map source values to ontology enums
| "date_parsing" // Parse dates to RFC 3339
| "numeric_precision" // Round/truncate to specified precision
| "string_normalization" // Case folding, whitespace trimming
| "identity"; // Pass through unchanged
config: Record<string, unknown>;
}
interface Check {
check_type: "range" | "enum" | "regex" | "custom";
axis: string;
constraint: unknown;
on_failure: "reject" | "warn" | "nullify";
}
Normalization Example (USDA FDC → Nutrition Ontology):
const usdaFdcNormalizer: NormalizationSpec = {
normalizer_id: "usda-fdc-to-nutrition-v1",
source_schema_id: "usda-fdc-food-v2",
target_ontology_id: "nutrition-v1",
axis_mappings: [
{
source_field: "fdcId",
target_axis: "food_id",
cardinality: "one_to_one",
required: true,
},
{
source_field: "description",
target_axis: "food_name",
cardinality: "one_to_one",
required: true,
},
{
source_field: "foodNutrients",
target_axis: "nutrients",
cardinality: "one_to_many",
required: true,
},
],
value_normalizers: [
{
axis: "nutrients.*.amount",
normalizer_type: "unit_conversion",
config: {
from_unit_field: "nutrients.*.unitName",
to_unit: "g", // Normalize all to grams
conversion_table: "usda-unit-conversions",
},
},
],
post_normalization_checks: [
{
check_type: "range",
axis: "nutrients.*.amount",
constraint: { min: 0 },
on_failure: "reject",
},
],
};
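The unit_conversion normalizer in this example only makes sense for mass units. A sketch of that step follows, assuming illustrative unit labels (`G`, `MG`, `UG`) and a hypothetical `toGrams` helper; values in non-mass units such as energy are rejected rather than coerced, and negative amounts fail in the same way as the range check above.

```typescript
// Divisors that turn a mass unit into grams. Unit labels are illustrative assumptions.
const GRAMS_DIVISOR: { [unit: string]: number } = { G: 1, MG: 1000, UG: 1000000 };

type ConversionResult = { ok: true; grams: number } | { ok: false; reason: string };

// Hypothetical normalizer step: converts a nutrient amount to grams or explains why it cannot.
function toGrams(amount: number, unitName: string): ConversionResult {
  const key = unitName.toUpperCase();
  if (!Object.prototype.hasOwnProperty.call(GRAMS_DIVISOR, key)) {
    return { ok: false, reason: `unit ${unitName} is not a mass unit` };
  }
  const grams = amount / GRAMS_DIVISOR[key];
  if (grams < 0) {
    // Mirrors the post-normalization range check ({ min: 0 }, on_failure: "reject").
    return { ok: false, reason: "amount must be >= 0" };
  }
  return { ok: true, grams };
}
```

Returning a reason instead of throwing keeps rejected records visible to the ingestion audit log.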
Provenance Chain
Every record in the authoritative data layer MUST have a provenance chain linking it to its source.
interface ProvenanceRecord {
record_id: string; // ID of the normalized record
// Source linkage
oracle_id: string;
source_version: string;
source_record_id: string; // ID in original source (if available)
source_url?: string; // Direct link to source record
// Ingestion metadata
ingested_at: string; // RFC 3339
ingestion_pipeline_id: string;
ingestion_run_id: string; // Specific execution instance
// Transformation audit
raw_hash: string; // SHA-256 of raw source record
normalized_hash: string; // SHA-256 of normalized record
transformations_applied: string[]; // List of normalizer IDs applied
// Validity
valid_from: string; // RFC 3339 - when this version became active
valid_until?: string; // RFC 3339 - when superseded (omitted if current)
superseded_by?: string; // record_id of newer version
// Verification
verification_status: "unverified" | "auto_verified" | "human_verified";
verified_by?: string;
verified_at?: string;
}
Provenance Invariants:
- Provenance MUST be queryable for any record returned by an oracle
- raw_hash MUST be verifiable against archived raw data (if retained)
- valid_from/valid_until enable point-in-time queries
- Supersession chains MUST be acyclic
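The acyclicity and point-in-time invariants can be checked mechanically. A minimal sketch, assuming a trimmed `ProvenanceLink` type and two hypothetical helpers, `hasSupersessionCycle` and `recordAsOf`; the validity interval is treated as inclusive at valid_from and exclusive at valid_until, which is an assumption the RFC does not state.

```typescript
interface ProvenanceLink {
  record_id: string;
  valid_from: string; // RFC 3339
  valid_until?: string; // omitted while the record is current
  superseded_by?: string;
}

// Follows superseded_by pointers and reports whether any chain loops back on itself.
function hasSupersessionCycle(records: ProvenanceLink[]): boolean {
  const next = new Map<string, string | undefined>();
  for (const r of records) next.set(r.record_id, r.superseded_by);

  for (const start of records) {
    const seen = new Set<string>();
    let cursor: string | undefined = start.record_id;
    while (cursor !== undefined) {
      if (seen.has(cursor)) return true;
      seen.add(cursor);
      cursor = next.get(cursor);
    }
  }
  return false;
}

// Returns the version that was active at the given instant, if any.
function recordAsOf(records: ProvenanceLink[], pointInTime: string): ProvenanceLink | undefined {
  const t = Date.parse(pointInTime);
  return records.find(
    (r) =>
      Date.parse(r.valid_from) <= t &&
      (r.valid_until === undefined || t < Date.parse(r.valid_until)),
  );
}
```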
Query Contract
Oracle adapters query the authoritative data layer through a standardized contract.
interface OracleQueryContract {
// Query specification
query_id: string;
oracle_id: string;
// Request
request: {
axes_requested: string[]; // Which ontology axes to return
filters: QueryFilter[]; // Constraints on returned records
point_in_time?: string; // RFC 3339 - query as of this timestamp
include_provenance: boolean; // Return provenance with results
max_results?: number;
};
// Response
response: {
records: NormalizedRecord[];
provenance?: ProvenanceRecord[];
query_metadata: QueryMetadata;
};
}
interface QueryFilter {
axis: string;
operator:
| "eq"
| "ne"
| "gt"
| "gte"
| "lt"
| "lte"
| "in"
| "contains"
| "regex";
value: unknown;
}
interface QueryMetadata {
query_id: string;
executed_at: string;
latency_ms: number;
records_scanned: number;
records_returned: number;
cache_hit: boolean;
data_version: string; // Version of dataset queried
freshness: {
data_as_of: string; // Most recent update in result set
staleness_seconds: number; // Age of oldest record in result
};
}
Query Invariants:
- Queries MUST be deterministic — same input produces same output
- point_in_time queries MUST return data as it existed at that timestamp
- Query results MUST include data_version for reproducibility
- Freshness metadata MUST be accurate to enable staleness-aware decisions
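The filter operators in the contract can be evaluated with a small pure function, which keeps queries deterministic. The sketch below assumes flat records keyed by axis name and restricts ordering operators to numbers; `matchesFilter`, `runQuery`, `FilterSketch`, and `FlatRow` are hypothetical names.

```typescript
type FilterOperator = "eq" | "ne" | "gt" | "gte" | "lt" | "lte" | "in" | "contains" | "regex";

interface FilterSketch {
  axis: string;
  operator: FilterOperator;
  value: unknown;
}

type FlatRow = { [axis: string]: unknown };

// Evaluates one filter against one row. Ordering operators only compare numbers (assumption).
function matchesFilter(row: FlatRow, filter: FilterSketch): boolean {
  const actual = row[filter.axis];
  const expected = filter.value;
  switch (filter.operator) {
    case "eq":
      return actual === expected;
    case "ne":
      return actual !== expected;
    case "gt":
    case "gte":
    case "lt":
    case "lte": {
      if (typeof actual !== "number" || typeof expected !== "number") return false;
      if (filter.operator === "gt") return actual > expected;
      if (filter.operator === "gte") return actual >= expected;
      if (filter.operator === "lt") return actual < expected;
      return actual <= expected;
    }
    case "in":
      return Array.isArray(expected) && expected.indexOf(actual) !== -1;
    case "contains":
      return typeof actual === "string" && typeof expected === "string" && actual.indexOf(expected) !== -1;
    case "regex":
      return typeof actual === "string" && typeof expected === "string" && new RegExp(expected).test(actual);
  }
}

// Applies all filters (logical AND) and preserves input order, so equal inputs give equal outputs.
function runQuery(rows: FlatRow[], filters: FilterSketch[], maxResults?: number): FlatRow[] {
  const matched = rows.filter((row) => filters.every((f) => matchesFilter(row, f)));
  return maxResults === undefined ? matched : matched.slice(0, maxResults);
}
```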
Freshness & Invalidation
Authoritative data has temporal validity. Systems MUST track and enforce freshness.
interface FreshnessPolicy {
oracle_id: string;
// Freshness requirements per axis
axis_freshness: Record<string, AxisFreshnessRequirement>;
// Global defaults
default_max_age_seconds: number;
default_stale_behavior: StaleBehavior;
// Invalidation triggers
invalidation_triggers: InvalidationTrigger[];
}
interface AxisFreshnessRequirement {
axis: string;
max_age_seconds: number;
stale_behavior: StaleBehavior;
refresh_on_query: boolean; // Trigger refresh if stale when queried
}
type StaleBehavior =
| "serve_stale_with_warning" // Return data but flag as stale
| "serve_stale_silent" // Return data without warning (use carefully)
| "block_until_fresh" // Wait for refresh before returning
| "fail_closed" // Return error, require manual intervention
| "degrade_tier"; // Treat as lower-tier oracle
interface InvalidationTrigger {
trigger_type:
| "time_based" // Invalidate after TTL
| "version_change" // Invalidate when source version changes
| "upstream_signal" // Invalidate on webhook/notification
| "manual"; // Human-triggered invalidation
config: Record<string, unknown>;
}
Freshness Example (Medical Lab Values):
const labFreshness: FreshnessPolicy = {
oracle_id: "ehr-lab-results",
axis_freshness: {
inr_value: {
axis: "inr_value",
max_age_seconds: 86400, // 24 hours
stale_behavior: "fail_closed",
refresh_on_query: false, // Cannot refresh without new lab draw
},
medication_list: {
axis: "medication_list",
max_age_seconds: 3600, // 1 hour
stale_behavior: "serve_stale_with_warning",
refresh_on_query: true,
},
},
default_max_age_seconds: 86400,
default_stale_behavior: "serve_stale_with_warning",
invalidation_triggers: [
{
trigger_type: "upstream_signal",
config: { webhook: "/api/ehr/invalidate" },
},
],
};
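At query time the policy reduces to a per-axis decision. The following sketch shows one way to evaluate it, assuming a trimmed policy type; `evaluateFreshness` and the `FreshnessDecision` shape are hypothetical and not part of the contract.

```typescript
type StaleBehaviorSketch =
  | "serve_stale_with_warning"
  | "serve_stale_silent"
  | "block_until_fresh"
  | "fail_closed"
  | "degrade_tier";

interface AxisRule {
  max_age_seconds: number;
  stale_behavior: StaleBehaviorSketch;
}

interface FreshnessPolicySketch {
  axis_freshness: { [axis: string]: AxisRule | undefined };
  default_max_age_seconds: number;
  default_stale_behavior: StaleBehaviorSketch;
}

type FreshnessDecision =
  | { fresh: true }
  | { fresh: false; behavior: StaleBehaviorSketch; age_seconds: number };

// Picks the axis-specific rule when one exists, otherwise falls back to the policy defaults.
function evaluateFreshness(
  policy: FreshnessPolicySketch,
  axis: string,
  dataAsOf: string,
  now: Date,
): FreshnessDecision {
  const rule = policy.axis_freshness[axis];
  const maxAge = rule ? rule.max_age_seconds : policy.default_max_age_seconds;
  const behavior = rule ? rule.stale_behavior : policy.default_stale_behavior;
  const ageSeconds = Math.floor((now.getTime() - Date.parse(dataAsOf)) / 1000);
  if (ageSeconds <= maxAge) return { fresh: true };
  return { fresh: false, behavior, age_seconds: ageSeconds };
}
```

With the lab policy above, an INR value drawn 25 hours ago would come back as stale with `fail_closed`, while a medication list fetched 30 minutes ago would still be fresh.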
Acceptance Criteria (Part I)
A system is compliant with Part I if:
- All oracles are registered in a Source Registry with required fields
- Data flows through documented ingestion pipelines
- Normalization is deterministic and reproducible
- Provenance chains are queryable for all records
- Query contracts return freshness metadata
- Staleness is detected and handled per policy
Part II: Oracle Verification Protocol
Oracle Definition
An oracle is any externally referenceable, auditable source, including:
• Databases
• APIs
• Standards documents
• Policy artifacts
• Signed internal references
• Human locks
Oracle Trust Hierarchy
Oracles are classified into trust tiers that determine precedence during conflict resolution:
| Tier | Role | Examples |
|---|---|---|
| primary | Authoritative for domain | USDA FDC for nutrition, State Bar for attorney status |
| secondary | Supplementary, lower precedence | Third-party nutrition databases, legal analytics services |
| cross_domain | Valid for related domain only | Medical database used for nutrition interactions |
| unverified | Untrusted until validation | User-provided data, scraped sources |
Trust tier is declared in the oracle source registry, not inferred.
Verification
Each authoritative output must declare:
• Oracle reference
• Verification method
• Resolution layer
Oracle Integrity Requirements (Normative)
The verification declaration MUST be sufficiently specific to support audit, replay detection, and dispute resolution. At minimum, each authoritative output MUST be attributable to a concrete oracle artifact and verification event.
Minimum fields (conceptual schema):
interface OracleVerificationRecord {
oracle_id: string; // Stable registry identifier (not a free-text name)
oracle_tier: "primary" | "secondary" | "cross_domain" | "unverified";
source_locator: string; // URL/URN/path/query template sufficient to retrieve the artifact
source_version?: string; // Doc version / dataset release / API version / commit hash
retrieved_at: string; // RFC 3339 timestamp
verified_at: string; // RFC 3339 timestamp
verification_method: string; // e.g., signature verification, checksum, controlled ingestion
verifier_id: string; // System/human reviewer identity
evidence_hash?: string; // Hash of retrieved artifact or normalized evidence bundle
signature_info?: {
signed_by: string;
signature_alg: string;
signature_valid: boolean;
};
ttl_seconds?: number; // Cache lifetime; expiry MUST be enforced for state-sensitive axes
chain_of_custody?: string[]; // Optional; required for regulated/audit-heavy domains
}
Integrity constraints:
• If an oracle supports signatures or immutable versioning, systems MUST verify and record it.
• For state-sensitive axes, ttl_seconds MUST be bounded and expiry MUST force re-verification.
• If evidence cannot be versioned or hashed, the system MUST treat the result as lower-trust and MAY require human review depending on domain policy.
• Systems MUST be resilient to oracle poisoning: registries MUST pin oracle identities, and unverified/user-supplied sources MUST NOT outrank verified registries.
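Two of these constraints lend themselves to a short sketch: TTL expiry forcing re-verification, and the trust downgrade for evidence that can be neither versioned nor hashed. The one-step downgrade and the rule that a state-sensitive axis without a TTL always needs re-verification are assumptions made for illustration; `effectiveTier` and `needsReverification` are hypothetical names.

```typescript
type TrustTier = "primary" | "secondary" | "cross_domain" | "unverified";

// Assumption: "lower-trust" is modelled as a one-step downgrade in the tier order.
const DOWNGRADE: { [K in TrustTier]: TrustTier } = {
  primary: "secondary",
  secondary: "cross_domain",
  cross_domain: "unverified",
  unverified: "unverified",
};

interface VerificationSketch {
  oracle_tier: TrustTier;
  verified_at: string; // RFC 3339
  ttl_seconds?: number;
  source_version?: string;
  evidence_hash?: string;
}

// Evidence with neither a version nor a hash is treated as one tier lower.
function effectiveTier(rec: VerificationSketch): TrustTier {
  if (rec.source_version !== undefined || rec.evidence_hash !== undefined) {
    return rec.oracle_tier;
  }
  return DOWNGRADE[rec.oracle_tier];
}

// Expired records must be re-verified; state-sensitive axes may not have an unbounded TTL.
function needsReverification(rec: VerificationSketch, stateSensitive: boolean, now: Date): boolean {
  if (rec.ttl_seconds === undefined) return stateSensitive;
  const expiresAtMs = Date.parse(rec.verified_at) + rec.ttl_seconds * 1000;
  return now.getTime() >= expiresAtMs;
}
```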
Oracle Conflict Decision Procedure
When multiple oracles provide values for the same axis:
- Different-tier oracles: Higher-tier oracle wins by default
  - Exception: If value delta exceeds escalation_delta_threshold, escalate regardless
- Same-tier oracles: Apply same_tier_strategy:
  - require_human: Always escalate to human review (default)
  - most_recent: Use most recently verified value
  - use_conservative: Use more restrictive value (for numeric/safety-critical)
  - dispute_summary: Emit dispute_summary envelope, do not decide
- Axes in always_human_axes: Always require human review regardless of tier
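The procedure can be sketched as a single function over two numeric readings. This is an illustration under assumptions, not a normative implementation: the always_human_axes check runs first because it applies regardless of tier, "more restrictive" is taken to mean the lower number, and `resolveConflict`, `OracleValue`, and `ConflictOutcome` are hypothetical names.

```typescript
type TierSketch = "primary" | "secondary" | "cross_domain" | "unverified";
type SameTierStrategy = "require_human" | "most_recent" | "use_conservative" | "dispute_summary";

interface OracleValue {
  oracle_id: string;
  tier: TierSketch;
  value: number;
  verified_at: string; // RFC 3339
}

interface ConflictConfigSketch {
  same_tier_strategy: SameTierStrategy;
  escalation_delta_threshold?: number;
  always_human_axes?: string[];
}

type ConflictOutcome =
  | { outcome: "resolved"; winner: OracleValue }
  | { outcome: "human_review"; reason: string }
  | { outcome: "dispute_summary"; candidates: OracleValue[] };

// Lower rank means higher trust.
const TIER_RANK: { [K in TierSketch]: number } = { primary: 0, secondary: 1, cross_domain: 2, unverified: 3 };

function resolveConflict(axis: string, a: OracleValue, b: OracleValue, config: ConflictConfigSketch): ConflictOutcome {
  const alwaysHuman = config.always_human_axes || [];
  if (alwaysHuman.indexOf(axis) !== -1) {
    return { outcome: "human_review", reason: "axis is in always_human_axes" };
  }
  if (a.tier !== b.tier) {
    const threshold = config.escalation_delta_threshold;
    if (threshold !== undefined && Math.abs(a.value - b.value) > threshold) {
      return { outcome: "human_review", reason: "value delta exceeds escalation_delta_threshold" };
    }
    return { outcome: "resolved", winner: TIER_RANK[a.tier] < TIER_RANK[b.tier] ? a : b };
  }
  switch (config.same_tier_strategy) {
    case "require_human":
      return { outcome: "human_review", reason: "same-tier conflict" };
    case "most_recent":
      return { outcome: "resolved", winner: Date.parse(a.verified_at) >= Date.parse(b.verified_at) ? a : b };
    case "use_conservative":
      // Assumption: the lower number is the more restrictive one.
      return { outcome: "resolved", winner: a.value <= b.value ? a : b };
    case "dispute_summary":
      return { outcome: "dispute_summary", candidates: [a, b] };
  }
}
```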
Conflict Resolution Strategies
| Strategy | When to Use | Behavior |
|---|---|---|
| higher_tier_wins | Default for cross-tier conflicts | Primary beats secondary beats cross_domain |
| most_recent | Time-sensitive data | Most recently verified value wins |
| require_human | Safety-critical domains | Always escalate, never auto-resolve |
| use_conservative | Risk mitigation | Use lower/more restrictive value |
| weighted_average | Numeric values with confidence | Weight by tier: primary=4, secondary=2, cross=1 |
| dispute_summary | Audit-heavy domains | Emit conflict to client, don't decide |
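The weighted_average row is the only strategy that involves arithmetic, so a worked sketch may help. The table gives no weight for unverified sources; the code below assumes they are excluded. `weightedAverage` is a hypothetical helper name.

```typescript
type WeightedTier = "primary" | "secondary" | "cross_domain" | "unverified";

// Weights from the strategy table. Assumption: unverified readings carry no weight.
const TIER_WEIGHT: { [K in WeightedTier]: number } = {
  primary: 4,
  secondary: 2,
  cross_domain: 1,
  unverified: 0,
};

// Returns the tier-weighted mean, or undefined when no reading carries weight.
function weightedAverage(readings: { tier: WeightedTier; value: number }[]): number | undefined {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const reading of readings) {
    const weight = TIER_WEIGHT[reading.tier];
    weightedSum += weight * reading.value;
    totalWeight += weight;
  }
  return totalWeight === 0 ? undefined : weightedSum / totalWeight;
}
```

For a primary reading of 10 and a secondary reading of 16, the result is (4 × 10 + 2 × 16) / (4 + 2) = 12.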
Conflict Configuration
interface ConflictResolutionConfig {
default_strategy: ConflictResolutionStrategy;
same_tier_strategy: ConflictResolutionStrategy;
escalation_delta_threshold?: number; // Numeric difference that forces escalation
always_human_axes?: string[]; // Axes that always require human review
recency_window_seconds?: number; // Time window for recency comparison
// Edge case handling (Amendment)
scope: ConflictScope;
}
interface ConflictScope {
granularity: "oracle" | "axis"; // Conflict resolution at oracle level or per-axis
cascade_limit: number; // Max resolution depth before escalation (default: 3)
partial_conflict_behavior: "isolate" | "invalidate_all";
}
Edge Case: Cascading Conflicts
When Oracle A conflicts with Oracle B and resolution picks A, but A also conflicts with Oracle C:
- cascade_limit defines maximum resolution depth
- If limit exceeded → escalate to human review
- Each cascade step is logged for audit
Edge Case: Partial Conflicts
When Oracle A provides axes X and Y, Oracle B provides axes Y and Z, and they conflict only on Y:
| granularity | Behavior |
|---|---|
| "axis" | Resolve Y conflict; X from A and Z from B remain valid |
| "oracle" | Conflict on Y invalidates both oracles; escalate |

| partial_conflict_behavior | Behavior |
|---|---|
| "isolate" | Only the conflicting axis requires resolution |
| "invalidate_all" | All axes from conflicting oracles require re-verification |
Conflicting oracles result in:
• dispute_summary output type (if strategy is dispute_summary)
• Automatic resolution with audit log (if strategy is deterministic)
• Human escalation (if require_human or threshold exceeded)
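For the partial-conflict case, the effect of partial_conflict_behavior can be shown by computing which axes need attention. A minimal sketch, assuming numeric axis values and a hypothetical `axesNeedingResolution` helper:

```typescript
type AxisValues = { [axis: string]: number };

// Lists the axes that must be resolved or re-verified when two oracles overlap.
function axesNeedingResolution(
  a: AxisValues,
  b: AxisValues,
  behavior: "isolate" | "invalidate_all",
): string[] {
  const conflicting = Object.keys(a).filter((axis) => axis in b && a[axis] !== b[axis]);
  if (conflicting.length === 0) return [];
  if (behavior === "isolate") return conflicting.sort();
  // invalidate_all: every axis from either oracle is affected.
  const all = new Set<string>(Object.keys(a).concat(Object.keys(b)));
  return Array.from(all).sort();
}
```

With Oracle A providing X and Y and Oracle B providing Y and Z, a disagreement on Y yields only Y under "isolate" and X, Y, Z under "invalidate_all".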
Oracle Latency Budgets (Amendment)
When oracles are slow or unreachable, the system must have defined behavior:
interface OracleLatencyPolicy {
soft_timeout_ms: number; // Attempt retrieval (e.g., 2000ms)
hard_timeout_ms: number; // Abort (e.g., 5000ms)
on_timeout: "FAIL_OPEN_NARRATIVE" | "FAIL_CLOSED_BLOCK" | "SERVE_STALE";
stale_tolerance_seconds?: number; // Max age for cached oracle data
}
Timeout Behaviors:
| on_timeout | Behavior | Use When |
|---|---|---|
| FAIL_OPEN_NARRATIVE | Return NARRATIVE_ONLY with disclaimer | Low-risk domains, user experience priority |
| FAIL_CLOSED_BLOCK | Return BLOCKED with retry guidance | High-stakes domains (medicine, law, finance) |
| SERVE_STALE | Use cached data if within tolerance | Time-insensitive data with reliable cache |
Stale Data Policy:
- If stale_tolerance_seconds is set and cached data exists within tolerance, use it
- Cached data MUST include original observed_at timestamp in provenance
- Stale data triggers warning in audit log
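The timeout behaviors and the stale data policy combine into one decision at the moment the hard timeout fires. The sketch below is illustrative: the outcome shapes and the name `handleTimeout` are hypothetical, and falling back to BLOCKED when SERVE_STALE has no usable cache is an assumption, since the policy does not specify that case.

```typescript
type OnTimeout = "FAIL_OPEN_NARRATIVE" | "FAIL_CLOSED_BLOCK" | "SERVE_STALE";

interface LatencyPolicySketch {
  on_timeout: OnTimeout;
  stale_tolerance_seconds?: number;
}

interface CachedOracleValue {
  value: unknown;
  observed_at: string; // RFC 3339, original observation time
}

type TimeoutOutcome =
  | { kind: "NARRATIVE_ONLY"; disclaimer: string }
  | { kind: "BLOCKED"; retry_guidance: string }
  | { kind: "STALE"; value: unknown; observed_at: string; audit_warning: string };

function handleTimeout(
  policy: LatencyPolicySketch,
  cached: CachedOracleValue | undefined,
  now: Date,
): TimeoutOutcome {
  switch (policy.on_timeout) {
    case "FAIL_OPEN_NARRATIVE":
      return { kind: "NARRATIVE_ONLY", disclaimer: "oracle verification unavailable" };
    case "FAIL_CLOSED_BLOCK":
      return { kind: "BLOCKED", retry_guidance: "retry after oracle recovers" };
    case "SERVE_STALE": {
      const tolerance = policy.stale_tolerance_seconds;
      if (cached !== undefined && tolerance !== undefined) {
        const ageSeconds = (now.getTime() - Date.parse(cached.observed_at)) / 1000;
        if (ageSeconds <= tolerance) {
          // The original observed_at travels with the value; the warning goes to the audit log.
          return {
            kind: "STALE",
            value: cached.value,
            observed_at: cached.observed_at,
            audit_warning: `served stale data aged ${ageSeconds}s`,
          };
        }
      }
      // Assumption: without a usable cache entry, fail closed.
      return { kind: "BLOCKED", retry_guidance: "no cached data within tolerance" };
    }
  }
}
```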
Oracle Reliability Requirements
External oracles are critical infrastructure. Systems MUST implement reliability measures to handle oracle failure modes.
Failure Modes
| Mode | Description | Detection |
|---|---|---|
| Unavailability | Oracle unreachable or returning 5xx errors | Connection timeout, HTTP status |
| Latency | Response time exceeds acceptable thresholds | Timer comparison to policy |
| Conflict | Multiple oracles disagree on same axis | Value comparison during resolution |
| Poisoning | Oracle returns manipulated or invalid data | Signature verification, schema validation |
Timeout Defaults
| Threshold | Default Value | Purpose |
|---|---|---|
| Soft timeout | 2,000 ms | Begin fallback preparation |
| Hard timeout | 5,000 ms | Abort oracle call, execute fallback |
| Retry backoff | Exponential (100ms base, 2x multiplier) | Prevent thundering herd |
| Max retries | 3 | Bound retry attempts before escalation |
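The retry defaults translate into a fixed delay schedule. A small sketch, assuming no jitter is applied (real deployments often add some) and using a hypothetical helper name `backoffDelaysMs`:

```typescript
// Computes the wait before each retry: base * multiplier^attempt.
function backoffDelaysMs(baseMs: number = 100, multiplier: number = 2, maxRetries: number = 3): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    delays.push(baseMs * Math.pow(multiplier, attempt));
  }
  return delays;
}

// With the defaults from the table: [100, 200, 400]
const defaultDelays = backoffDelaysMs();
```

The three default delays add up to 700 ms of waiting, which leaves room inside the 5,000 ms hard timeout only if each individual attempt is itself short.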
Authority Latency Policy
Workflows MUST declare their Authority Latency Tolerance (ALT) and corresponding enforcement mode:
interface AuthorityLatencyPolicy {
authority_latency_budget_ms: number; // Maximum acceptable latency for authoritative decision
enforcement_mode: "INLINE" | "ASYNC" | "NON_AUTHORITATIVE_ONLY";
on_timeout_behavior: "ESCALATE" | "DEGRADE_TO_NARRATIVE" | "FAIL_CLOSED";
}
Enforcement Modes:
| Mode | When to Use | Behavior |
|---|---|---|
| INLINE | ALT > oracle soft_timeout | Oracle verification runs inline; output waits for verification |
| ASYNC | ALT < oracle soft_timeout, but authoritative output required | Queue for human review or delayed decision; return provisional status |
| NON_AUTHORITATIVE_ONLY | Latency-critical or assistive workflows | No oracle verification; outputs restricted to narrative mode with forbidden marker enforcement |
Binding Rules:
- If workflow is authoritative AND enforcement_mode: "INLINE", oracle verification is required per current RFC
- If workflow is authoritative BUT authority_latency_budget_ms < oracle_soft_timeout_ms, enforcement MUST route to:
  - ASYNC: Queue human review or delayed decision, return REQUIRES_SPECIFICATION or provisional envelope
  - NON_AUTHORITATIVE_ONLY: Return narrative-only envelope with explicit "unverified" status
- Workflows with NON_AUTHORITATIVE_ONLY MUST NOT emit measurement, classification, or action envelopes
- Systems MUST validate that declared authority_latency_budget_ms is compatible with enforcement_mode
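The compatibility check and the envelope restriction in the binding rules can be sketched as two small functions. The envelope kind "narrative" and the names `policyProblems` and `mayEmit` are assumptions made for this illustration; the case where the budget exactly equals the soft timeout is not covered by the rules and is accepted here.

```typescript
type EnforcementMode = "INLINE" | "ASYNC" | "NON_AUTHORITATIVE_ONLY";
type EnvelopeKind = "measurement" | "classification" | "action" | "narrative";

interface AltPolicySketch {
  authority_latency_budget_ms: number;
  enforcement_mode: EnforcementMode;
}

// Returns a list of violations; an empty list means the declared policy is compatible.
function policyProblems(policy: AltPolicySketch, oracleSoftTimeoutMs: number): string[] {
  const problems: string[] = [];
  if (
    policy.enforcement_mode === "INLINE" &&
    policy.authority_latency_budget_ms < oracleSoftTimeoutMs
  ) {
    problems.push("budget is below the oracle soft timeout; route to ASYNC or NON_AUTHORITATIVE_ONLY");
  }
  return problems;
}

// NON_AUTHORITATIVE_ONLY workflows may only emit narrative envelopes.
function mayEmit(mode: EnforcementMode, envelope: EnvelopeKind): boolean {
  return mode !== "NON_AUTHORITATIVE_ONLY" || envelope === "narrative";
}
```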
Example Configuration:
// Nutrition calculation (latency-tolerant)
const nutritionCalculationPolicy: AuthorityLatencyPolicy = {
  authority_latency_budget_ms: 3000,
  enforcement_mode: "INLINE",
  on_timeout_behavior: "ESCALATE",
};
// Real-time chat guidance (latency-critical)
const realtimeChatPolicy: AuthorityLatencyPolicy = {
  authority_latency_budget_ms: 150,
  enforcement_mode: "NON_AUTHORITATIVE_ONLY",
  on_timeout_behavior: "DEGRADE_TO_NARRATIVE",
};
// Time-critical adjudication (authoritative but low-latency)
const timeCriticalAdjudicationPolicy: AuthorityLatencyPolicy = {
  authority_latency_budget_ms: 800,
  enforcement_mode: "ASYNC",
  on_timeout_behavior: "ESCALATE",
};
This policy makes the latency-authority tradeoff explicit and auditable.
Conflict Resolution Hierarchy
When multiple sources provide conflicting values, resolve according to axis type:
Self-Reported Axes (address, preferences, stated intent):
- Human confirmation — Explicit user confirmation takes precedence
- Deterministic lookup — Primary-tier oracles with cryptographic verification
- Cryptographic proof — Signed, timestamped evidence with valid chain of custody
- Recency — Most recently verified value (only when explicitly configured)
Measured/Regulated Axes (lab values, legal status, financial balances):
- Deterministic lookup — Primary-tier oracles (EHR, court records, bank APIs)
- Cryptographic proof — Signed, timestamped evidence with valid chain of custody
- Human confirmation — Treated as unverified unless co-signed by sensor/oracle
- Conservative default — Most restrictive value for safety-critical domains
Warning: For measured/regulated axes, human confirmation alone is insufficient. A user stating "my INR is 2.5" when the EHR shows 4.8 MUST route to HumanReview with explicit oracle-vs-user conflict flagged. The system MUST NOT authorize based on user self-report for axes where sensor data is available.
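The difference between the two hierarchies shows up most clearly when a user statement and an oracle disagree. The sketch below covers only that pairing and is illustrative: `resolveClaim` and the `ClaimDecision` shape are hypothetical, and cryptographic proof and recency are left out for brevity.

```typescript
type AxisKind = "self_reported" | "measured";

type ClaimDecision =
  | { route: "accept"; value: number; basis: "user" | "oracle" }
  | { route: "HumanReview"; flag: "oracle_vs_user_conflict"; user_value: number; oracle_value: number }
  | { route: "unverified"; value: number };

function resolveClaim(
  kind: AxisKind,
  userValue: number | undefined,
  oracleValue: number | undefined,
): ClaimDecision | undefined {
  if (kind === "self_reported") {
    // Explicit user confirmation takes precedence on self-reported axes.
    if (userValue !== undefined) return { route: "accept", value: userValue, basis: "user" };
    if (oracleValue !== undefined) return { route: "accept", value: oracleValue, basis: "oracle" };
    return undefined;
  }
  if (oracleValue !== undefined) {
    if (userValue !== undefined && userValue !== oracleValue) {
      // Never authorize on self-report when the oracle disagrees.
      return {
        route: "HumanReview",
        flag: "oracle_vs_user_conflict",
        user_value: userValue,
        oracle_value: oracleValue,
      };
    }
    return { route: "accept", value: oracleValue, basis: "oracle" };
  }
  // Measured axis with only a user statement: keep the value but mark it unverified.
  if (userValue !== undefined) return { route: "unverified", value: userValue };
  return undefined;
}
```

For the INR case in the warning, `resolveClaim("measured", 2.5, 4.8)` routes to HumanReview with both values attached.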
Oracle Health Monitoring
Systems MUST implement oracle health monitoring:
interface OracleHealthConfig {
health_check_interval_seconds: number; // Default: 60
failure_threshold: number; // Consecutive failures before unhealthy (default: 3)
recovery_threshold: number; // Consecutive successes to restore (default: 2)
circuit_breaker_enabled: boolean; // Stop calling failed oracles temporarily
circuit_breaker_reset_seconds: number; // Time before retry after circuit opens
}
interface OracleHealthStatus {
oracle_id: string;
status: "healthy" | "degraded" | "unhealthy" | "circuit_open";
last_success_at: string | null;
last_failure_at: string | null;
consecutive_failures: number;
latency_p50_ms: number;
latency_p99_ms: number;
}
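A small state machine is enough to tie the thresholds and the circuit breaker together. This sketch makes assumptions the RFC leaves open: a single failure moves a healthy oracle to "degraded", and after the reset period calls are allowed again as probes until recovery_threshold consecutive successes restore "healthy". `OracleHealthTracker` is a hypothetical name.

```typescript
interface HealthConfigSketch {
  failure_threshold: number;
  recovery_threshold: number;
  circuit_breaker_enabled: boolean;
  circuit_breaker_reset_seconds: number;
}

type HealthState = "healthy" | "degraded" | "unhealthy" | "circuit_open";

class OracleHealthTracker {
  private config: HealthConfigSketch;
  private failures = 0;
  private successes = 0;
  private openedAtMs = 0;
  state: HealthState = "healthy";

  constructor(config: HealthConfigSketch) {
    this.config = config;
  }

  // Records the result of one oracle call or health check and returns the new state.
  record(success: boolean, nowMs: number): HealthState {
    if (success) {
      this.failures = 0;
      this.successes += 1;
      if (this.state !== "healthy" && this.successes >= this.config.recovery_threshold) {
        this.state = "healthy";
      }
    } else {
      this.successes = 0;
      this.failures += 1;
      if (this.failures >= this.config.failure_threshold) {
        this.state = this.config.circuit_breaker_enabled ? "circuit_open" : "unhealthy";
        this.openedAtMs = nowMs;
      } else if (this.state === "healthy") {
        this.state = "degraded";
      }
    }
    return this.state;
  }

  // While the circuit is open, calls are refused until the reset period has elapsed.
  canCall(nowMs: number): boolean {
    if (this.state !== "circuit_open") return true;
    return nowMs - this.openedAtMs >= this.config.circuit_breaker_reset_seconds * 1000;
  }
}
```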
Versioned Oracle Logging
All oracle interactions MUST be logged with sufficient detail for audit reconstruction:
interface OracleInteractionLog {
execution_id: string;
oracle_id: string;
oracle_version: string; // API version, dataset version, or schema version
request_timestamp: string; // RFC 3339
response_timestamp: string | null;
latency_ms: number;
status: "success" | "timeout" | "error" | "conflict";
error_code?: string;
response_hash?: string; // SHA-256 of response body
cache_hit: boolean;
stale_data_used: boolean;
}
Acceptance Criteria (Combined)
A system is compliant with RFC-0002 if:
Part I — Authoritative Data Layer:
1. All oracles are registered in a Source Registry with required fields
2. Data flows through documented ingestion pipelines with provenance
3. Normalization is deterministic and reproducible
4. Provenance chains are queryable for all records
5. Query contracts return freshness metadata
6. Staleness is detected and handled per policy
Part II — Oracle Verification Protocol:
7. Every authoritative output includes OracleVerificationRecord
8. Conflict resolution follows declared strategy
9. Timeout behavior matches declared policy
10. Health monitoring detects and handles failure modes
11. All oracle interactions are logged with sufficient detail for audit
Relationship to Other RFCs
| RFC | Relationship |
|---|---|
| RFC-0001 | Ontology defines axes that oracles populate |
| RFC-0004 | State extraction queries oracles for verification |
| RFC-0006 | Evidence binding references oracle provenance |
| RFC-0007 | Oracle query logic is opaque to LLM |
| RFC-0009 | Authorization envelopes include oracle verification records |
| Appendix-Audit | Oracle interactions are logged per telemetry requirements |
⸻