# CrucibleIR

Intermediate Representation for the Crucible ML reliability ecosystem. Full docs: https://hexdocs.pm/crucible_ir

## Overview

CrucibleIR provides shared data structures for defining ML reliability experiments across the Crucible ecosystem. It serves as the common language for experiment configuration, enabling consistency across all Crucible tools and components.

## Requirements

- Elixir ~> 1.14 (and matching Erlang/OTP)
- `jason` for JSON encoding (included in deps)
## Features

- **Experiment Definition**: Complete experiment specifications with backends, pipelines, and datasets
- **Backend Contracts**: Prompt/Completion IR with capabilities and options for backend calls
- **Reliability Configurations**: Ensemble voting, hedging, statistical testing, fairness, and guardrails
- **Validation**: Structural validation for IR structs with detailed error messages (no stage option validation)
- **JSON Serialization**: Bidirectional JSON conversion with automatic type handling
- **Fluent Builder API**: Chainable, ergonomic experiment construction
- **Type Safety**: Full type specifications for all structs
- **Comprehensive Documentation**: 100% documentation coverage with examples
- **Boundary Contract**: Data-only IR with no execution or orchestration logic
## Installation

Add `crucible_ir` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:crucible_ir, "~> 0.3.0"}
  ]
end
```

Fetch dependencies:

```shell
mix deps.get
```

## Quick Start
```elixir
alias CrucibleIR.{Experiment, BackendRef, StageDef, DatasetRef}
alias CrucibleIR.Reliability.{Config, Ensemble, Stats}

# Define a simple experiment
experiment = CrucibleIR.new_experiment(
  id: :gpt4_benchmark,
  backend: %BackendRef{id: :openai_gpt4},
  pipeline: [
    %StageDef{name: :preprocessing},
    %StageDef{name: :inference},
    %StageDef{name: :evaluation}
  ],
  dataset: %DatasetRef{name: :mmlu, split: :test}
)

# Add reliability mechanisms
experiment = %{experiment |
  reliability: %Config{
    ensemble: %Ensemble{
      strategy: :majority,
      models: [:gpt4, :claude, :gemini],
      execution_mode: :parallel
    },
    stats: %Stats{
      tests: [:ttest, :bootstrap],
      alpha: 0.05
    }
  }
}

# Serialize to JSON
{:ok, json} = Jason.encode(experiment)
```

## Backend IR Quick Start
```elixir
alias CrucibleIR.Backend.{Prompt, Options, Completion, Capabilities}

prompt = %Prompt{
  messages: [%{role: :user, content: "Summarize this text."}],
  options: %Options{model: "gpt-4o", temperature: 0.2, response_format: :text}
}

completion = %Completion{
  model: "gpt-4o",
  choices: [
    %{index: 0, message: %{role: :assistant, content: "Summary..."}, finish_reason: :stop}
  ]
}

caps = %Capabilities{backend_id: :openai, provider: "openai", models: ["gpt-4o"]}

{:ok, json} = Jason.encode(prompt)
```

## Examples Directory
See `examples/README.md` for a full set of API integration examples and setup notes for accounts and keys.
## Usage Workflow

1. Define an `Experiment` with `id`, `backend`, and `pipeline` stages.
2. Add a `DatasetRef` if the experiment targets a dataset.
3. Attach `Reliability.Config` options (ensemble, hedging, stats, fairness, guardrails).
4. Add `OutputSpec` entries to describe where and how to emit results.
5. Serialize with `Jason.encode/1` to pass the IR into other Crucible services.
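The steps above can be sketched end to end. This is a minimal illustration built from the structs documented in this README; the experiment id, model, and dataset names are placeholders:

```elixir
alias CrucibleIR.{Experiment, BackendRef, StageDef, DatasetRef, OutputSpec}
alias CrucibleIR.Reliability.{Config, Ensemble}

# Steps 1-4: experiment with backend, pipeline, dataset,
# reliability config, and output specs
experiment = %Experiment{
  id: :workflow_demo,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :inference}],
  dataset: %DatasetRef{name: :mmlu, split: :test},
  reliability: %Config{
    ensemble: %Ensemble{strategy: :majority, models: [:gpt4, :claude]}
  },
  outputs: [%OutputSpec{name: :results, formats: [:json]}]
}

# Step 5: serialize the IR for other Crucible services
{:ok, json} = Jason.encode(experiment)
```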
## Core Components

### Experiment Definition

- `Experiment` - Top-level experiment definition
- `BackendRef` - Reference to an LLM backend
- `DatasetRef` - Reference to a dataset
- `StageDef` - Processing stage definition
- `OutputSpec` - Output specification

### Backend IR

- `Backend.Prompt` - Backend input contract
- `Backend.Options` - Backend generation options
- `Backend.Completion` - Backend output contract
- `Backend.Capabilities` - Backend feature discovery

### Reliability Mechanisms

- `Reliability.Config` - Container for all reliability configurations
- `Reliability.Ensemble` - Multi-model ensemble voting
- `Reliability.Hedging` - Request hedging for tail latency reduction
- `Reliability.Stats` - Statistical testing configuration
- `Reliability.Fairness` - Fairness and bias detection
- `Reliability.Guardrail` - Security guardrails (prompt injection, PII, etc.)
## Struct Field Reference

- **Experiment**: required `id`, `backend`, `pipeline`; optional `description`, `owner`, `tags`, `metadata`, `dataset`, `reliability`, `outputs`, `created_at`, `updated_at`.
- **BackendRef**: required `id`; optional `profile` (default `:default`), `options`.
- **DatasetRef**: required `name`; optional `provider` (default `:crucible_datasets`), `split` (default `:train`), `options`.
- **StageDef**: required `name`; optional `module`, `options`, `enabled` (default `true`).
- **OutputSpec**: required `name`; optional `formats` (default `[:markdown]`), `sink` (default `:file`), `options`.
- **Backend.Prompt**: optional `messages`, `system`, `tools`, `tool_choice`, `options`, `request_id`, `trace_id`, `metadata`.
- **Backend.Options**: optional `model`, `temperature`, `max_tokens`, `top_p`, `top_k`, `frequency_penalty`, `presence_penalty`, `stop`, `response_format`, `json_schema`, `stream`, `cache_control`, `extended_thinking`, `thinking_budget_tokens`, `seed`, `timeout_ms`, `extra`.
- **Backend.Completion**: optional `choices`, `model`, `usage`, `latency_ms`, `time_to_first_token_ms`, `request_id`, `trace_id`, `raw_response`, `metadata`.
- **Backend.Capabilities**: required `backend_id`, `provider`; optional `models`, `default_model`, `supports_streaming`, `supports_tools`, `supports_vision`, `supports_audio`, `supports_json_mode`, `supports_extended_thinking`, `supports_caching`, `max_tokens`, `max_context_length`, `max_images_per_request`, `requests_per_minute`, `tokens_per_minute`, `cost_per_million_input`, `cost_per_million_output`, `metadata`.
- **Reliability.Config**: optional `ensemble`, `hedging`, `stats`, `fairness`, `guardrails`.
- **Ensemble**: `strategy` (default `:none`), `execution_mode` (default `:parallel`), `models`, `weights`, `min_agreement`, `timeout_ms`, `options`.
- **Hedging**: `strategy` (default `:off`), `delay_ms`, `percentile`, `max_hedges`, `budget_percent`, `options`.
- **Stats**: `tests` (default `[:ttest, :bootstrap]`), `alpha` (default `0.05`), `confidence_level`, `effect_size_type`, `multiple_testing_correction`, `bootstrap_iterations`, `options`.
- **Fairness**: `enabled` (default `false`), `metrics`, `group_by`, `threshold`, `fail_on_violation`, `options`.
- **Guardrail**: `profiles` (default `[:default]`), `prompt_injection_detection`, `jailbreak_detection`, `pii_detection`, `pii_redaction`, `content_moderation`, `fail_on_detection`, `options`.
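The defaults listed above apply automatically when a struct is built with only its required fields. A small sketch, assuming the default values documented in this reference:

```elixir
alias CrucibleIR.{BackendRef, DatasetRef, StageDef}

# DatasetRef with only the required :name field
ref = %DatasetRef{name: :mmlu}
ref.provider  # :crucible_datasets (default)
ref.split     # :train (default)

# Other structs behave the same way
%BackendRef{id: :gpt4}.profile       # :default
%StageDef{name: :inference}.enabled  # true
```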
## Validation

*New in v0.1.1.* Validate experiments before execution:

```elixir
alias CrucibleIR.{Experiment, BackendRef, StageDef}

# Valid experiment
exp = %Experiment{
  id: :test,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :run}]
}

{:ok, ^exp} = CrucibleIR.validate(exp)
true = CrucibleIR.valid?(exp)

# Invalid experiment
invalid = %Experiment{id: :test, backend: nil, pipeline: nil}
{:error, errors} = CrucibleIR.validate(invalid)
# errors: ["backend is required", "pipeline must be a list"]
```

## JSON Serialization
Serialize to/from JSON with automatic type conversion:

```elixir
alias CrucibleIR.{Experiment, BackendRef, StageDef}

# Create experiment
exp = %Experiment{
  id: :test,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :inference}]
}

# Serialize to JSON
json = CrucibleIR.to_json(exp)

# Deserialize from JSON
{:ok, decoded} = CrucibleIR.from_json(json, Experiment)
decoded.id == :test          # true
decoded.backend.id == :gpt4  # true

# Works with nested structs and reliability configs
```

## Fluent Builder API
Build experiments with a chainable, ergonomic API:

```elixir
alias CrucibleIR.Builder

{:ok, exp} =
  Builder.experiment(:comprehensive_test)
  |> Builder.with_description("Production reliability test")
  |> Builder.with_backend(:gpt4, profile: :fast)
  |> Builder.add_stage(:preprocessing, options: %{normalize: true})
  |> Builder.add_stage(:inference)
  |> Builder.add_stage(:postprocessing)
  |> Builder.with_dataset(:mmlu, split: :test)
  |> Builder.with_ensemble(:majority, models: [:gpt4, :claude])
  |> Builder.with_hedging(:fixed, delay_ms: 100)
  |> Builder.with_stats([:ttest, :bootstrap], alpha: 0.01)
  |> Builder.with_fairness(metrics: [:demographic_parity], threshold: 0.8)
  |> Builder.with_guardrails(profiles: [:strict], pii_detection: true)
  |> Builder.add_output(:results, formats: [:json, :html])
  |> Builder.build()  # Validates and returns {:ok, exp} or {:error, errors}

# Builder automatically validates - build() returns errors if invalid
{:error, errors} =
  Builder.experiment(:invalid)
  |> Builder.build()  # Missing backend and pipeline
```

Or use the convenience function from the main module:

```elixir
{:ok, exp} =
  CrucibleIR.experiment(:my_test)
  |> Builder.with_backend(:gpt4)
  |> Builder.add_stage(:inference)
  |> Builder.build()
```

## Examples
### Ensemble Voting Experiment

```elixir
alias CrucibleIR.{BackendRef, StageDef}
alias CrucibleIR.Reliability.{Config, Ensemble}

experiment = CrucibleIR.new_experiment(
  id: :ensemble_exp,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :inference}],
  reliability: %Config{
    ensemble: %Ensemble{
      strategy: :weighted,
      models: [:gpt4, :claude, :gemini],
      weights: %{gpt4: 0.5, claude: 0.3, gemini: 0.2},
      execution_mode: :parallel
    }
  }
)
```

### Hedging for Low Latency
```elixir
alias CrucibleIR.{BackendRef, StageDef}
alias CrucibleIR.Reliability.{Config, Hedging}

experiment = CrucibleIR.new_experiment(
  id: :low_latency_exp,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :inference}],
  reliability: %Config{
    hedging: %Hedging{
      strategy: :percentile,
      percentile: 0.95,
      max_hedges: 2,
      budget_percent: 15
    }
  }
)
```

### Statistical Testing
```elixir
alias CrucibleIR.{BackendRef, StageDef, DatasetRef}
alias CrucibleIR.Reliability.{Config, Stats}

experiment = CrucibleIR.new_experiment(
  id: :stats_exp,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :inference}],
  dataset: %DatasetRef{name: :mmlu},
  reliability: %Config{
    stats: %Stats{
      tests: [:ttest, :mannwhitney, :bootstrap],
      alpha: 0.01,
      effect_size_type: :cohens_d,
      bootstrap_iterations: 10000
    }
  }
)
```

### Fairness Checking
```elixir
alias CrucibleIR.{BackendRef, StageDef}
alias CrucibleIR.Reliability.{Config, Fairness}

experiment = CrucibleIR.new_experiment(
  id: :fairness_exp,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :inference}],
  reliability: %Config{
    fairness: %Fairness{
      enabled: true,
      metrics: [:demographic_parity, :equalized_odds],
      group_by: :gender,
      threshold: 0.8,
      fail_on_violation: true
    }
  }
)
```

### Security Guardrails
```elixir
alias CrucibleIR.{BackendRef, StageDef}
alias CrucibleIR.Reliability.{Config, Guardrail}

experiment = CrucibleIR.new_experiment(
  id: :secure_exp,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :inference}],
  reliability: %Config{
    guardrails: %Guardrail{
      profiles: [:strict],
      prompt_injection_detection: true,
      jailbreak_detection: true,
      pii_detection: true,
      pii_redaction: true,
      fail_on_detection: true
    }
  }
)
```

## Architecture
CrucibleIR follows a hierarchical structure:

```text
Experiment (top-level)
├── BackendRef (which LLM to use)
├── Pipeline (list of StageDef)
├── DatasetRef (what data to evaluate)
├── Reliability.Config
│   ├── Ensemble (multi-model voting)
│   ├── Hedging (latency optimization)
│   ├── Stats (statistical testing)
│   ├── Fairness (bias detection)
│   └── Guardrails (security)
└── Outputs (list of OutputSpec)
```

## Testing
All modules have comprehensive test coverage:

```shell
mix test
```

Current test stats: 174 tests, 0 failures (6 doctests + 168 unit tests).
New in v0.1.1:
- 41 validation tests
- 26 serialization tests
- 29 builder tests
- 3 new doctests
## Documentation

Generate HTML documentation:

```shell
mix docs
```

## Integration with Crucible Ecosystem
CrucibleIR is used by:
- `crucible_harness` - Experiment orchestration
- `crucible_ensemble` - Ensemble voting implementation
- `crucible_hedging` - Request hedging implementation
- `crucible_bench` - Statistical testing
- `crucible_telemetry` - Metrics and instrumentation
- `crucible_trace` - Causal transparency
## Design Principles

- **Immutable Data Structures**: All structs are immutable
- **Type Safety**: Full type specifications with `@type` and `@spec`
- **JSON-First**: All structs support JSON serialization
- **Documentation**: Every module and public function is documented
- **Test Coverage**: High test coverage with property-based testing
## Boundary and Serialization Contract

- CrucibleIR is data-only: structs, serialization, and structural validation only.
- Stage options (`StageDef.options`) are opaque maps; stage implementations validate them.
- `CrucibleIR.Serialization` is the canonical JSON round-trip layer; JSON keys must remain stable.
- Map keys should be JSON-friendly (strings) for stable round-trip in opaque fields like `options`.

See `docs/20251226/ir_boundary/IR_BOUNDARY_AND_CONTRACT.md` for the full contract.
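For example, an opaque `options` map with string keys survives the JSON round trip unchanged, whereas atom keys would come back as strings. A sketch using the serialization helpers shown earlier (the `"normalize"` option is a hypothetical stage setting):

```elixir
alias CrucibleIR.{Experiment, BackendRef, StageDef}

exp = %Experiment{
  id: :roundtrip_demo,
  backend: %BackendRef{id: :gpt4},
  # String keys in the opaque options map keep the round trip stable
  pipeline: [%StageDef{name: :inference, options: %{"normalize" => true}}]
}

json = CrucibleIR.to_json(exp)
{:ok, decoded} = CrucibleIR.from_json(json, Experiment)
# hd(decoded.pipeline).options == %{"normalize" => true}
```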
## Contributing

This library is part of the North-Shore-AI organization. Contributions welcome!
## License

MIT License - See LICENSE file for details.
## Links

- GitHub: https://github.com/North-Shore-AI/crucible_ir
- Documentation: https://hexdocs.pm/crucible_ir
- Crucible Framework: https://github.com/North-Shore-AI/crucible_framework