CrucibleIR

Intermediate Representation for the Crucible ML reliability ecosystem. Full docs: https://hexdocs.pm/crucible_ir

Overview

CrucibleIR provides shared data structures for defining ML reliability experiments across the Crucible ecosystem. It serves as the common language for experiment configuration, enabling consistency across all Crucible tools and components.

Requirements

Elixir ~> 1.14 with a compatible Erlang/OTP release
jason for JSON encoding (already listed in the library's deps)

Features

Experiment Definition: Complete experiment specifications with backends, pipelines, and datasets
Reliability Configurations: Ensemble voting, hedging, statistical testing, fairness, and guardrails
Type Safety: Full type specifications for all structs
JSON Serialization: All structs derive Jason.Encoder for easy serialization
Comprehensive Documentation: 100% documentation coverage with examples

Installation

Add crucible_ir to your list of dependencies in mix.exs:

def deps do
  [
    {:crucible_ir, "~> 0.1.0"}
  ]
end

Fetch dependencies:

mix deps.get

Quick Start

alias CrucibleIR.{Experiment, BackendRef, StageDef, DatasetRef}
alias CrucibleIR.Reliability.{Config, Ensemble, Stats}

# Define a simple experiment
experiment = CrucibleIR.new_experiment(
  id: :gpt4_benchmark,
  backend: %BackendRef{id: :openai_gpt4},
  pipeline: [
    %StageDef{name: :preprocessing},
    %StageDef{name: :inference},
    %StageDef{name: :evaluation}
  ],
  dataset: %DatasetRef{name: :mmlu, split: :test}
)

# Add reliability mechanisms
experiment = %{experiment |
  reliability: %Config{
    ensemble: %Ensemble{
      strategy: :majority,
      models: [:gpt4, :claude, :gemini],
      execution_mode: :parallel
    },
    stats: %Stats{
      tests: [:ttest, :bootstrap],
      alpha: 0.05
    }
  }
}

# Serialize to JSON
{:ok, json} = Jason.encode(experiment)
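
A consumer on the receiving side gets JSON, not structs. The sketch below assumes the crucible_ir dependency from Installation is available and relies only on standard Jason behaviour: Jason.decode/1 returns plain maps with string keys, and atoms are encoded as JSON strings. No decoding helper is described in this README, so rebuilding structs is left to the caller.

```elixir
alias CrucibleIR.{BackendRef, StageDef}

# A minimal experiment, built the same way as in the Quick Start.
experiment =
  CrucibleIR.new_experiment(
    id: :gpt4_benchmark,
    backend: %BackendRef{id: :openai_gpt4},
    pipeline: [%StageDef{name: :inference}]
  )

{:ok, json} = Jason.encode(experiment)

# Decoding yields string-keyed maps; struct types and atoms are not restored.
{:ok, decoded} = Jason.decode(json)

# Atom values such as the experiment id are read back as strings.
IO.inspect(decoded["id"], label: "id")
IO.inspect(get_in(decoded, ["backend", "id"]), label: "backend id")
```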

Usage Workflow

Define an Experiment with id, backend, and pipeline stages.
Add a DatasetRef if the experiment targets a dataset.
Attach Reliability.Config options (ensemble, hedging, stats, fairness, guardrails).
Add OutputSpec entries to describe where and how to emit results.
Serialize with Jason.encode/1 to pass the IR into other Crucible services.
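
The workflow above can be sketched end to end as follows. This is an illustration under assumptions, not a canonical recipe: the ids and the OutputSpec name :summary are hypothetical, and the OutputSpec simply restates the documented defaults for formats and sink.

```elixir
alias CrucibleIR.{BackendRef, DatasetRef, OutputSpec, StageDef}
alias CrucibleIR.Reliability.{Config, Stats}

# Define the experiment with id, backend, pipeline stages, and a dataset.
experiment =
  CrucibleIR.new_experiment(
    id: :workflow_demo,
    backend: %BackendRef{id: :gpt4},
    pipeline: [%StageDef{name: :inference}, %StageDef{name: :evaluation}],
    dataset: %DatasetRef{name: :mmlu, split: :test}
  )

# Attach reliability options and describe where results should go.
experiment = %{
  experiment
  | reliability: %Config{stats: %Stats{tests: [:ttest], alpha: 0.05}},
    outputs: [%OutputSpec{name: :summary, formats: [:markdown], sink: :file}]
}

# Serialize the IR so it can be handed to another Crucible service.
{:ok, json} = Jason.encode(experiment)
IO.puts(json)
```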

Core Components

Experiment Definition

Experiment - Top-level experiment definition
BackendRef - Reference to an LLM backend
DatasetRef - Reference to a dataset
StageDef - Processing stage definition
OutputSpec - Output specification

Reliability Mechanisms

Reliability.Config - Container for all reliability configurations
Reliability.Ensemble - Multi-model ensemble voting
Reliability.Hedging - Request hedging for tail latency reduction
Reliability.Stats - Statistical testing configuration
Reliability.Fairness - Fairness and bias detection
Reliability.Guardrail - Security guardrails (prompt injection, PII, etc.)

Struct Field Reference

Experiment: required id, backend, pipeline; optional description, owner, tags, metadata, dataset, reliability, outputs, created_at, updated_at.
BackendRef: required id; optional profile (default :default), options.
DatasetRef: required name; optional provider (default :crucible_datasets), split (default :train), options.
StageDef: required name; optional module, options, enabled (default true).
OutputSpec: required name; optional formats (default [:markdown]), sink (default :file), options.
Reliability.Config: optional ensemble, hedging, stats, fairness, guardrails.
- Ensemble: strategy (default :none), execution_mode (default :parallel), models, weights, min_agreement, timeout_ms, options.
- Hedging: strategy (default :off), delay_ms, percentile, max_hedges, budget_percent, options.
- Stats: tests (default [:ttest, :bootstrap]), alpha (default 0.05), confidence_level, effect_size_type, multiple_testing_correction, bootstrap_iterations, options.
- Fairness: enabled (default false), metrics, group_by, threshold, fail_on_violation, options.
- Guardrail: profiles (default [:default]), prompt_injection_detection, jailbreak_detection, pii_detection, pii_redaction, content_moderation, fail_on_detection, options.
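
To see the documented defaults in practice, the sketch below builds each struct with only its required field and inspects the rest. The expected values in the comments are the defaults listed above, not independently verified output; the OutputSpec name :report is a hypothetical placeholder.

```elixir
alias CrucibleIR.{BackendRef, DatasetRef, OutputSpec, StageDef}
alias CrucibleIR.Reliability.{Ensemble, Hedging, Stats}

# Only the required field is given; everything else falls back to its default.
backend = %BackendRef{id: :gpt4}
dataset = %DatasetRef{name: :mmlu}
stage = %StageDef{name: :inference}
output = %OutputSpec{name: :report}

IO.inspect(backend.profile, label: "BackendRef.profile")    # documented default :default
IO.inspect(dataset.provider, label: "DatasetRef.provider")  # documented default :crucible_datasets
IO.inspect(dataset.split, label: "DatasetRef.split")        # documented default :train
IO.inspect(stage.enabled, label: "StageDef.enabled")        # documented default true
IO.inspect(output.formats, label: "OutputSpec.formats")     # documented default [:markdown]
IO.inspect(output.sink, label: "OutputSpec.sink")           # documented default :file

# Reliability structs list no required fields, so they can be built empty.
IO.inspect(%Ensemble{}.strategy, label: "Ensemble.strategy")  # documented default :none
IO.inspect(%Hedging{}.strategy, label: "Hedging.strategy")    # documented default :off
IO.inspect(%Stats{}.alpha, label: "Stats.alpha")              # documented default 0.05
```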

Examples

Ensemble Voting Experiment

experiment = CrucibleIR.new_experiment(
  id: :ensemble_exp,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :inference}],
  reliability: %Config{
    ensemble: %Ensemble{
      strategy: :weighted,
      models: [:gpt4, :claude, :gemini],
      weights: %{gpt4: 0.5, claude: 0.3, gemini: 0.2},
      execution_mode: :parallel
    }
  }
)
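
Because CrucibleIR only carries the configuration, it can be useful to sanity-check a weighted ensemble before handing it to a consumer. The helper below is a hypothetical sketch, not part of CrucibleIR: it assumes every model should have a weight and that weights should sum to 1.0, neither of which this README states as a requirement.

```elixir
defmodule WeightCheck do
  @moduledoc "Hypothetical helper for illustration; not part of CrucibleIR."

  # Returns :ok when every model has a weight and the weights sum to ~1.0.
  def validate(models, weights, tolerance \\ 1.0e-9) do
    missing = Enum.reject(models, &Map.has_key?(weights, &1))
    total = weights |> Map.values() |> Enum.sum()

    cond do
      missing != [] -> {:error, {:missing_weights, missing}}
      abs(total - 1.0) > tolerance -> {:error, {:weights_sum, total}}
      true -> :ok
    end
  end
end

# The models and weights from the example above.
models = [:gpt4, :claude, :gemini]
weights = %{gpt4: 0.5, claude: 0.3, gemini: 0.2}

IO.inspect(WeightCheck.validate(models, weights))
```

The check works on plain lists and maps, so it can be applied to the models and weights fields of an Ensemble struct without depending on how a consumer such as crucible_ensemble interprets them.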

Hedging for Low Latency

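# Hedging is not among the Quick Start aliases, so alias it here
alias CrucibleIR.Reliability.Hedging

experiment = CrucibleIR.new_experiment(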
  id: :low_latency_exp,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :inference}],
  reliability: %Config{
    hedging: %Hedging{
      strategy: :percentile,
      percentile: 0.95,
      max_hedges: 2,
      budget_percent: 15
    }
  }
)

Statistical Testing

experiment = CrucibleIR.new_experiment(
  id: :stats_exp,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :inference}],
  dataset: %DatasetRef{name: :mmlu},
  reliability: %Config{
    stats: %Stats{
      tests: [:ttest, :mannwhitney, :bootstrap],
      alpha: 0.01,
      effect_size_type: :cohens_d,
      bootstrap_iterations: 10000
    }
  }
)

Fairness Checking

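# Fairness is not among the Quick Start aliases, so alias it here
alias CrucibleIR.Reliability.Fairness

experiment = CrucibleIR.new_experiment(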
  id: :fairness_exp,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :inference}],
  reliability: %Config{
    fairness: %Fairness{
      enabled: true,
      metrics: [:demographic_parity, :equalized_odds],
      group_by: :gender,
      threshold: 0.8,
      fail_on_violation: true
    }
  }
)

Security Guardrails

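# Guardrail is not among the Quick Start aliases, so alias it here
alias CrucibleIR.Reliability.Guardrail

experiment = CrucibleIR.new_experiment(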
  id: :secure_exp,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :inference}],
  reliability: %Config{
    guardrails: %Guardrail{
      profiles: [:strict],
      prompt_injection_detection: true,
      jailbreak_detection: true,
      pii_detection: true,
      pii_redaction: true,
      fail_on_detection: true
    }
  }
)

Architecture

CrucibleIR follows a hierarchical structure:

Experiment (top-level)
├── BackendRef (which LLM to use)
├── Pipeline (list of StageDef)
├── DatasetRef (what data to evaluate)
├── Reliability.Config
│   ├── Ensemble (multi-model voting)
│   ├── Hedging (latency optimization)
│   ├── Stats (statistical testing)
│   ├── Fairness (bias detection)
│   └── Guardrails (security)
└── Outputs (list of OutputSpec)
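
Since the hierarchy is made of nested structs, a consumer can walk it with ordinary pattern matching. The module below is a hypothetical sketch (ExperimentInspect is not part of CrucibleIR) that assumes the crucible_ir dependency is available; the fallback clause covers experiments with no ensemble configured, whatever the unset value happens to be.

```elixir
defmodule ExperimentInspect do
  @moduledoc "Hypothetical consumer-side helpers; not part of CrucibleIR."

  alias CrucibleIR.Experiment
  alias CrucibleIR.Reliability.{Config, Ensemble}

  # Matches Experiment -> Reliability.Config -> Ensemble and extracts the models.
  def ensemble_models(%Experiment{
        reliability: %Config{ensemble: %Ensemble{models: models}}
      })
      when is_list(models) do
    models
  end

  # Any other experiment (no reliability config, no ensemble, no models).
  def ensemble_models(%Experiment{}), do: []

  # Pipeline is a list of StageDef structs; collect their names in order.
  def stage_names(%Experiment{pipeline: pipeline}) do
    Enum.map(pipeline, & &1.name)
  end
end

alias CrucibleIR.{BackendRef, StageDef}
alias CrucibleIR.Reliability.{Config, Ensemble}

experiment =
  CrucibleIR.new_experiment(
    id: :tree_demo,
    backend: %BackendRef{id: :gpt4},
    pipeline: [%StageDef{name: :inference}],
    reliability: %Config{ensemble: %Ensemble{strategy: :majority, models: [:gpt4, :claude]}}
  )

IO.inspect(ExperimentInspect.stage_names(experiment), label: "stages")
IO.inspect(ExperimentInspect.ensemble_models(experiment), label: "ensemble models")
```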

Testing

Every module is covered by tests. Run the suite with:

mix test

Current test stats: 78 tests, 0 failures (3 doctests, 75 unit tests)

Documentation

Generate HTML documentation:

mix docs

Integration with Crucible Ecosystem

CrucibleIR is used by:

crucible_harness - Experiment orchestration
crucible_ensemble - Ensemble voting implementation
crucible_hedging - Request hedging implementation
crucible_bench - Statistical testing
crucible_telemetry - Metrics and instrumentation
crucible_trace - Causal transparency

Design Principles

Immutable Data Structures: All structs are immutable
Type Safety: Full type specifications with @type and @spec
JSON-First: All structs support JSON serialization
Documentation: Every module and public function is documented
Test Coverage: High test coverage with property-based testing

Contributing

This library is part of the North-Shore-AI organization. Contributions welcome!

License

MIT License - See LICENSE file for details