CrucibleIR
Intermediate Representation for the Crucible ML reliability ecosystem. Full docs: https://hexdocs.pm/crucible_ir
Overview
CrucibleIR provides shared data structures for defining ML reliability experiments across the Crucible ecosystem. It serves as the common language for experiment configuration, enabling consistency across all Crucible tools and components.
Requirements
- Elixir ~> 1.14 (and matching Erlang/OTP)
- jason for JSON encoding (included in deps)
Features
- Experiment Definition: Complete experiment specifications with backends, pipelines, and datasets
- Reliability Configurations: Ensemble voting, hedging, statistical testing, fairness, and guardrails
- Type Safety: Full type specifications for all structs
- JSON Serialization: All structs derive Jason.Encoder for easy serialization
- Comprehensive Documentation: 100% documentation coverage with examples
Installation
Add crucible_ir to your list of dependencies in mix.exs:
def deps do
  [
    {:crucible_ir, "~> 0.1.0"}
  ]
end
Fetch dependencies:
mix deps.get
Quick Start
alias CrucibleIR.{Experiment, BackendRef, StageDef, DatasetRef}
alias CrucibleIR.Reliability.{Config, Ensemble, Stats}

# Define a simple experiment
experiment = CrucibleIR.new_experiment(
  id: :gpt4_benchmark,
  backend: %BackendRef{id: :openai_gpt4},
  pipeline: [
    %StageDef{name: :preprocessing},
    %StageDef{name: :inference},
    %StageDef{name: :evaluation}
  ],
  dataset: %DatasetRef{name: :mmlu, split: :test}
)

# Add reliability mechanisms
experiment = %{experiment |
  reliability: %Config{
    ensemble: %Ensemble{
      strategy: :majority,
      models: [:gpt4, :claude, :gemini],
      execution_mode: :parallel
    },
    stats: %Stats{
      tests: [:ttest, :bootstrap],
      alpha: 0.05
    }
  }
}

# Serialize to JSON
{:ok, json} = Jason.encode(experiment)
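One thing to keep in mind when another service reads the serialized IR: JSON has no atom type, so Jason writes atoms as strings and decodes objects into plain string-keyed maps. The sketch below illustrates that round trip. It assumes the structs use the default derived encoder mentioned under Features, and it does not assume CrucibleIR ships a decoder back into structs (the source does not mention one).

```elixir
alias CrucibleIR.{BackendRef, StageDef}

experiment =
  CrucibleIR.new_experiment(
    id: :gpt4_benchmark,
    backend: %BackendRef{id: :openai_gpt4},
    pipeline: [%StageDef{name: :inference}]
  )

json = Jason.encode!(experiment)

# Jason.decode!/1 returns plain maps with string keys, not CrucibleIR structs.
decoded = Jason.decode!(json)

# Atom values were written as JSON strings, so they come back as binaries.
IO.inspect(is_map(decoded), label: "decoded is a map")
IO.inspect(is_binary(decoded["id"]), label: "id is a string")
IO.inspect(is_binary(decoded["backend"]["id"]), label: "backend id is a string")
```

A consumer that needs atoms or structs again has to convert the decoded map explicitly.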
Usage Workflow
- Define an Experiment with id, backend, and pipeline stages.
- Add a DatasetRef if the experiment targets a dataset.
- Attach Reliability.Config options (ensemble, hedging, stats, fairness, guardrails).
- Add OutputSpec entries to describe where and how to emit results.
- Serialize with Jason.encode/1 to pass the IR into other Crucible services.
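The OutputSpec step can be sketched as follows. This is a minimal illustration that builds the Experiment struct directly from the fields listed in the struct field reference; the output name :summary and the experiment id :report_exp are hypothetical, and the formats and sink values shown are simply the documented defaults written out.

```elixir
alias CrucibleIR.{BackendRef, Experiment, OutputSpec, StageDef}

experiment = %Experiment{
  id: :report_exp,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :inference}],
  outputs: [
    # :summary is a hypothetical name; formats and sink repeat the documented defaults.
    %OutputSpec{name: :summary, formats: [:markdown], sink: :file}
  ]
}

# Serialize the IR so it can be handed to another Crucible service.
{:ok, json} = Jason.encode(experiment)
```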
Core Components
Experiment Definition
- Experiment - Top-level experiment definition
- BackendRef - Reference to an LLM backend
- DatasetRef - Reference to a dataset
- StageDef - Processing stage definition
- OutputSpec - Output specification
Reliability Mechanisms
- Reliability.Config - Container for all reliability configurations
- Reliability.Ensemble - Multi-model ensemble voting
- Reliability.Hedging - Request hedging for tail latency reduction
- Reliability.Stats - Statistical testing configuration
- Reliability.Fairness - Fairness and bias detection
- Reliability.Guardrail - Security guardrails (prompt injection, PII, etc.)
Struct Field Reference
- Experiment: required id, backend, pipeline; optional description, owner, tags, metadata, dataset, reliability, outputs, created_at, updated_at.
- BackendRef: required id; optional profile (default :default), options.
- DatasetRef: required name; optional provider (default :crucible_datasets), split (default :train), options.
- StageDef: required name; optional module, options, enabled (default true).
- OutputSpec: required name; optional formats (default [:markdown]), sink (default :file), options.
- Reliability.Config: optional ensemble, hedging, stats, fairness, guardrails.
  - Ensemble: strategy (default :none), execution_mode (default :parallel), models, weights, min_agreement, timeout_ms, options.
  - Hedging: strategy (default :off), delay_ms, percentile, max_hedges, budget_percent, options.
  - Stats: tests (default [:ttest, :bootstrap]), alpha (default 0.05), confidence_level, effect_size_type, multiple_testing_correction, bootstrap_iterations, options.
  - Fairness: enabled (default false), metrics, group_by, threshold, fail_on_violation, options.
  - Guardrail: profiles (default [:default]), prompt_injection_detection, jailbreak_detection, pii_detection, pii_redaction, content_moderation, fail_on_detection, options.
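As a quick way to see the defaults from the reference above, the sketch below builds each struct with only its required field and inspects the optional ones. The values in the comments are the defaults as documented in this reference, not output verified against a particular release; the names :gpt4, :mmlu, :inference, and :summary are placeholders.

```elixir
alias CrucibleIR.{BackendRef, DatasetRef, OutputSpec, StageDef}

# Only the required field is supplied for each struct.
backend = %BackendRef{id: :gpt4}
dataset = %DatasetRef{name: :mmlu}
stage = %StageDef{name: :inference}
output = %OutputSpec{name: :summary}

IO.inspect(backend.profile, label: "profile")    # documented default: :default
IO.inspect(dataset.provider, label: "provider")  # documented default: :crucible_datasets
IO.inspect(dataset.split, label: "split")        # documented default: :train
IO.inspect(stage.enabled, label: "enabled")      # documented default: true
IO.inspect(output.formats, label: "formats")     # documented default: [:markdown]
IO.inspect(output.sink, label: "sink")           # documented default: :file
```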
Examples
Ensemble Voting Experiment
alias CrucibleIR.{BackendRef, StageDef}
alias CrucibleIR.Reliability.{Config, Ensemble}

experiment = CrucibleIR.new_experiment(
  id: :ensemble_exp,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :inference}],
  reliability: %Config{
    ensemble: %Ensemble{
      strategy: :weighted,
      models: [:gpt4, :claude, :gemini],
      weights: %{gpt4: 0.5, claude: 0.3, gemini: 0.2},
      execution_mode: :parallel
    }
  }
)
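The source does not say whether CrucibleIR validates ensemble weights, so a caller may want a sanity check before handing the IR to a downstream tool. The sketch below is a hypothetical helper (WeightCheck is not part of CrucibleIR). It assumes every model should have a weight and that weights should sum to 1.0, which is a common convention rather than a documented requirement.

```elixir
defmodule WeightCheck do
  @moduledoc "Hypothetical helper for illustration; not part of CrucibleIR."

  # Tolerance for floating-point rounding when summing weights.
  @tolerance 1.0e-9

  @spec validate([atom()], %{atom() => number()}) :: :ok | {:error, term()}
  def validate(models, weights) do
    # Models that have no entry in the weights map.
    missing = Enum.reject(models, &Map.has_key?(weights, &1))
    total = weights |> Map.values() |> Enum.sum()

    cond do
      missing != [] -> {:error, {:missing_weights, missing}}
      abs(total - 1.0) > @tolerance -> {:error, {:weights_sum, total}}
      true -> :ok
    end
  end
end

result =
  WeightCheck.validate(
    [:gpt4, :claude, :gemini],
    %{gpt4: 0.5, claude: 0.3, gemini: 0.2}
  )

IO.inspect(result)
```

With the weights from the example above, the check passes and returns :ok; a model without a weight yields an {:error, {:missing_weights, ...}} tuple instead.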
Hedging for Low Latency
alias CrucibleIR.{BackendRef, StageDef}
alias CrucibleIR.Reliability.{Config, Hedging}

experiment = CrucibleIR.new_experiment(
  id: :low_latency_exp,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :inference}],
  reliability: %Config{
    hedging: %Hedging{
      strategy: :percentile,
      percentile: 0.95,
      max_hedges: 2,
      budget_percent: 15
    }
  }
)
Statistical Testing
alias CrucibleIR.{BackendRef, DatasetRef, StageDef}
alias CrucibleIR.Reliability.{Config, Stats}

experiment = CrucibleIR.new_experiment(
  id: :stats_exp,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :inference}],
  dataset: %DatasetRef{name: :mmlu},
  reliability: %Config{
    stats: %Stats{
      tests: [:ttest, :mannwhitney, :bootstrap],
      alpha: 0.01,
      effect_size_type: :cohens_d,
      bootstrap_iterations: 10000
    }
  }
)
Fairness Checking
alias CrucibleIR.{BackendRef, StageDef}
alias CrucibleIR.Reliability.{Config, Fairness}

experiment = CrucibleIR.new_experiment(
  id: :fairness_exp,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :inference}],
  reliability: %Config{
    fairness: %Fairness{
      enabled: true,
      metrics: [:demographic_parity, :equalized_odds],
      group_by: :gender,
      threshold: 0.8,
      fail_on_violation: true
    }
  }
)
Security Guardrails
alias CrucibleIR.{BackendRef, StageDef}
alias CrucibleIR.Reliability.{Config, Guardrail}

experiment = CrucibleIR.new_experiment(
  id: :secure_exp,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :inference}],
  reliability: %Config{
    guardrails: %Guardrail{
      profiles: [:strict],
      prompt_injection_detection: true,
      jailbreak_detection: true,
      pii_detection: true,
      pii_redaction: true,
      fail_on_detection: true
    }
  }
)
Architecture
CrucibleIR follows a hierarchical structure:
Experiment (top-level)
├── BackendRef (which LLM to use)
├── Pipeline (list of StageDef)
├── DatasetRef (what data to evaluate)
├── Reliability.Config
│ ├── Ensemble (multi-model voting)
│ ├── Hedging (latency optimization)
│ ├── Stats (statistical testing)
│ ├── Fairness (bias detection)
│ └── Guardrails (security)
└── Outputs (list of OutputSpec)
Testing
All modules have comprehensive test coverage:
mix test
Current test stats: 78 tests, 0 failures (3 doctests, 75 unit tests)
Documentation
Generate HTML documentation:
mix docs
Integration with Crucible Ecosystem
CrucibleIR is used by:
- crucible_harness - Experiment orchestration
- crucible_ensemble - Ensemble voting implementation
- crucible_hedging - Request hedging implementation
- crucible_bench - Statistical testing
- crucible_telemetry - Metrics and instrumentation
- crucible_trace - Causal transparency
Design Principles
- Immutable Data Structures: All structs are immutable
- Type Safety: Full type specifications with @type and @spec
- JSON-First: All structs support JSON serialization
- Documentation: Every module and public function is documented
- Test Coverage: High test coverage with property-based testing
Contributing
This library is part of the North-Shore-AI organization. Contributions welcome!
License
MIT License - See LICENSE file for details
Links
- GitHub: https://github.com/North-Shore-AI/crucible_ir
- Documentation: https://hexdocs.pm/crucible_ir
- Crucible Framework: https://github.com/North-Shore-AI/crucible_framework