# CrucibleIR

Intermediate Representation for the Crucible ML reliability ecosystem. Full docs: https://hexdocs.pm/crucible_ir

## Overview

CrucibleIR provides shared data structures for defining ML reliability experiments across the Crucible ecosystem. It serves as the common language for experiment configuration, enabling consistency across all Crucible tools and components.

## Requirements

- Elixir ~> 1.14 (and matching Erlang/OTP)
- `jason` for JSON encoding (included in deps)
## Features

- **Experiment Definition**: Complete experiment specifications with backends, pipelines, and datasets
- **Backend Contracts**: Prompt/Completion IR with capabilities and options for backend calls
- **Reliability Configurations**: Ensemble voting, hedging, statistical testing, fairness, and guardrails
- **Validation**: Structural validation for IR structs with detailed error messages (no stage option validation)
- **JSON Serialization**: Bidirectional JSON conversion with automatic type handling
- **Fluent Builder API**: Chainable, ergonomic experiment construction
- **Type Safety**: Full type specifications for all structs
- **Comprehensive Documentation**: 100% documentation coverage with examples
- **Boundary Contract**: Data-only IR with no execution or orchestration logic
## Installation

Add `crucible_ir` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:crucible_ir, "~> 0.3.0"}
  ]
end
```

Fetch dependencies:

```shell
mix deps.get
```

## Quick Start
```elixir
alias CrucibleIR.{Experiment, BackendRef, StageDef, DatasetRef}
alias CrucibleIR.Reliability.{Config, Ensemble, Stats}

# Define a simple experiment
experiment = CrucibleIR.new_experiment(
  id: :gpt4_benchmark,
  backend: %BackendRef{id: :openai_gpt4},
  pipeline: [
    %StageDef{name: :preprocessing},
    %StageDef{name: :inference},
    %StageDef{name: :evaluation}
  ],
  dataset: %DatasetRef{name: :mmlu, split: :test}
)

# Add reliability mechanisms
experiment = %{experiment |
  reliability: %Config{
    ensemble: %Ensemble{
      strategy: :majority,
      models: [:gpt4, :claude, :gemini],
      execution_mode: :parallel
    },
    stats: %Stats{
      tests: [:ttest, :bootstrap],
      alpha: 0.05
    }
  }
}

# Serialize to JSON
{:ok, json} = Jason.encode(experiment)
```

## Backend IR Quick Start
```elixir
alias CrucibleIR.Backend.{Prompt, Options, Completion, Capabilities}

prompt = %Prompt{
  messages: [%{role: :user, content: "Summarize this text."}],
  options: %Options{model: "gpt-4o", temperature: 0.2, response_format: :text}
}

completion = %Completion{
  model: "gpt-4o",
  choices: [
    %{index: 0, message: %{role: :assistant, content: "Summary..."}, finish_reason: :stop}
  ]
}

caps = %Capabilities{backend_id: :openai, provider: "openai", models: ["gpt-4o"]}

{:ok, json} = Jason.encode(prompt)
```

## Examples Directory
See `examples/README.md` for a full set of API integration examples and setup notes for accounts and keys.
## Usage Workflow

1. Define an `Experiment` with `id`, `backend`, and `pipeline` stages.
2. Add a `DatasetRef` if the experiment targets a dataset.
3. Attach `Reliability.Config` options (ensemble, hedging, stats, fairness, guardrails).
4. Add `OutputSpec` entries to describe where and how to emit results.
5. Serialize with `Jason.encode/1` to pass the IR into other Crucible services.
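The steps above can be sketched end to end. This is a minimal illustration built from the structs documented in this README; the experiment id, model, and dataset names are placeholders:

```elixir
alias CrucibleIR.{Experiment, BackendRef, StageDef, DatasetRef, OutputSpec}
alias CrucibleIR.Reliability.{Config, Ensemble}

# Steps 1-4: experiment with backend, pipeline, dataset,
# reliability config, and output specs
experiment = %Experiment{
  id: :workflow_demo,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :inference}],
  dataset: %DatasetRef{name: :mmlu, split: :test},
  reliability: %Config{
    ensemble: %Ensemble{strategy: :majority, models: [:gpt4, :claude]}
  },
  outputs: [%OutputSpec{name: :results, formats: [:json]}]
}

# Step 5: serialize the IR for other Crucible services
{:ok, json} = Jason.encode(experiment)
```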
## Core Components

### Experiment Definition

- `Experiment` - Top-level experiment definition
- `BackendRef` - Reference to an LLM backend
- `DatasetRef` - Reference to a dataset
- `StageDef` - Processing stage definition
- `OutputSpec` - Output specification

### Backend IR

- `Backend.Prompt` - Backend input contract
- `Backend.Options` - Backend generation options
- `Backend.Completion` - Backend output contract
- `Backend.Capabilities` - Backend feature discovery

### Reliability Mechanisms

- `Reliability.Config` - Container for all reliability configurations
- `Reliability.Ensemble` - Multi-model ensemble voting
- `Reliability.Hedging` - Request hedging for tail latency reduction
- `Reliability.Stats` - Statistical testing configuration
- `Reliability.Fairness` - Fairness and bias detection
- `Reliability.Guardrail` - Security guardrails (prompt injection, PII, etc.)
## Struct Field Reference

- **Experiment**: required `id`, `backend`, `pipeline`; optional `description`, `owner`, `tags`, `metadata`, `dataset`, `reliability`, `outputs`, `created_at`, `updated_at`.
- **BackendRef**: required `id`; optional `profile` (default `:default`), `options`.
- **DatasetRef**: required `name`; optional `provider` (default `:crucible_datasets`), `split` (default `:train`), `options`.
- **StageDef**: required `name`; optional `module`, `options`, `enabled` (default `true`).
- **OutputSpec**: required `name`; optional `formats` (default `[:markdown]`), `sink` (default `:file`), `options`.
- **Backend.Prompt**: optional `messages`, `system`, `tools`, `tool_choice`, `options`, `request_id`, `trace_id`, `metadata`.
- **Backend.Options**: optional `model`, `temperature`, `max_tokens`, `top_p`, `top_k`, `frequency_penalty`, `presence_penalty`, `stop`, `response_format`, `json_schema`, `stream`, `cache_control`, `extended_thinking`, `thinking_budget_tokens`, `seed`, `timeout_ms`, `extra`.
- **Backend.Completion**: optional `choices`, `model`, `usage`, `latency_ms`, `time_to_first_token_ms`, `request_id`, `trace_id`, `raw_response`, `metadata`.
- **Backend.Capabilities**: required `backend_id`, `provider`; optional `models`, `default_model`, `supports_streaming`, `supports_tools`, `supports_vision`, `supports_audio`, `supports_json_mode`, `supports_extended_thinking`, `supports_caching`, `max_tokens`, `max_context_length`, `max_images_per_request`, `requests_per_minute`, `tokens_per_minute`, `cost_per_million_input`, `cost_per_million_output`, `metadata`.
- **Reliability.Config**: optional `ensemble`, `hedging`, `stats`, `fairness`, `guardrails`.
- **Ensemble**: `strategy` (default `:none`), `execution_mode` (default `:parallel`), `models`, `weights`, `min_agreement`, `timeout_ms`, `options`.
- **Hedging**: `strategy` (default `:off`), `delay_ms`, `percentile`, `max_hedges`, `budget_percent`, `options`.
- **Stats**: `tests` (default `[:ttest, :bootstrap]`), `alpha` (default `0.05`), `confidence_level`, `effect_size_type`, `multiple_testing_correction`, `bootstrap_iterations`, `options`.
- **Fairness**: `enabled` (default `false`), `metrics`, `group_by`, `threshold`, `fail_on_violation`, `options`.
- **Guardrail**: `profiles` (default `[:default]`), `prompt_injection_detection`, `jailbreak_detection`, `pii_detection`, `pii_redaction`, `content_moderation`, `fail_on_detection`, `options`.
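The defaults listed above apply automatically when a struct is built with only its required fields. A small sketch, assuming the default values documented in this reference:

```elixir
alias CrucibleIR.{BackendRef, DatasetRef, StageDef}

# DatasetRef with only the required :name field
ref = %DatasetRef{name: :mmlu}
ref.provider  # :crucible_datasets (default)
ref.split     # :train (default)

# Other structs behave the same way
%BackendRef{id: :gpt4}.profile       # :default
%StageDef{name: :inference}.enabled  # true
```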
## Validation

*New in v0.1.1.* Validate experiments before execution:

```elixir
alias CrucibleIR.{Experiment, BackendRef, StageDef}

# Valid experiment
exp = %Experiment{
  id: :test,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :run}]
}

{:ok, ^exp} = CrucibleIR.validate(exp)
true = CrucibleIR.valid?(exp)

# Invalid experiment
invalid = %Experiment{id: :test, backend: nil, pipeline: nil}
{:error, errors} = CrucibleIR.validate(invalid)
# errors: ["backend is required", "pipeline must be a list"]
```

## JSON Serialization
Serialize to/from JSON with automatic type conversion:

```elixir
alias CrucibleIR.{Experiment, BackendRef, StageDef}

# Create experiment
exp = %Experiment{
  id: :test,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :inference}]
}

# Serialize to JSON
json = CrucibleIR.to_json(exp)

# Deserialize from JSON
{:ok, decoded} = CrucibleIR.from_json(json, Experiment)
decoded.id == :test          # true
decoded.backend.id == :gpt4  # true

# Works with nested structs and reliability configs
```

## Fluent Builder API
Build experiments with a chainable, ergonomic API:

```elixir
alias CrucibleIR.Builder

{:ok, exp} =
  Builder.experiment(:comprehensive_test)
  |> Builder.with_description("Production reliability test")
  |> Builder.with_backend(:gpt4, profile: :fast)
  |> Builder.add_stage(:preprocessing, options: %{normalize: true})
  |> Builder.add_stage(:inference)
  |> Builder.add_stage(:postprocessing)
  |> Builder.with_dataset(:mmlu, split: :test)
  |> Builder.with_ensemble(:majority, models: [:gpt4, :claude])
  |> Builder.with_hedging(:fixed, delay_ms: 100)
  |> Builder.with_stats([:ttest, :bootstrap], alpha: 0.01)
  |> Builder.with_fairness(metrics: [:demographic_parity], threshold: 0.8)
  |> Builder.with_guardrails(profiles: [:strict], pii_detection: true)
  |> Builder.add_output(:results, formats: [:json, :html])
  |> Builder.build()  # Validates and returns {:ok, exp} or {:error, errors}

# Builder automatically validates - build() returns errors if invalid
{:error, errors} =
  Builder.experiment(:invalid)
  |> Builder.build()  # Missing backend and pipeline
```

Or use the convenience function from the main module:

```elixir
{:ok, exp} =
  CrucibleIR.experiment(:my_test)
  |> Builder.with_backend(:gpt4)
  |> Builder.add_stage(:inference)
  |> Builder.build()
```

## Examples
### Ensemble Voting Experiment

```elixir
alias CrucibleIR.{BackendRef, StageDef}
alias CrucibleIR.Reliability.{Config, Ensemble}

experiment = CrucibleIR.new_experiment(
  id: :ensemble_exp,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :inference}],
  reliability: %Config{
    ensemble: %Ensemble{
      strategy: :weighted,
      models: [:gpt4, :claude, :gemini],
      weights: %{gpt4: 0.5, claude: 0.3, gemini: 0.2},
      execution_mode: :parallel
    }
  }
)
```

### Hedging for Low Latency
```elixir
alias CrucibleIR.{BackendRef, StageDef}
alias CrucibleIR.Reliability.{Config, Hedging}

experiment = CrucibleIR.new_experiment(
  id: :low_latency_exp,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :inference}],
  reliability: %Config{
    hedging: %Hedging{
      strategy: :percentile,
      percentile: 0.95,
      max_hedges: 2,
      budget_percent: 15
    }
  }
)
```

### Statistical Testing
```elixir
alias CrucibleIR.{BackendRef, StageDef, DatasetRef}
alias CrucibleIR.Reliability.{Config, Stats}

experiment = CrucibleIR.new_experiment(
  id: :stats_exp,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :inference}],
  dataset: %DatasetRef{name: :mmlu},
  reliability: %Config{
    stats: %Stats{
      tests: [:ttest, :mannwhitney, :bootstrap],
      alpha: 0.01,
      effect_size_type: :cohens_d,
      bootstrap_iterations: 10000
    }
  }
)
```

### Fairness Checking
```elixir
alias CrucibleIR.{BackendRef, StageDef}
alias CrucibleIR.Reliability.{Config, Fairness}

experiment = CrucibleIR.new_experiment(
  id: :fairness_exp,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :inference}],
  reliability: %Config{
    fairness: %Fairness{
      enabled: true,
      metrics: [:demographic_parity, :equalized_odds],
      group_by: :gender,
      threshold: 0.8,
      fail_on_violation: true
    }
  }
)
```

### Security Guardrails
```elixir
alias CrucibleIR.{BackendRef, StageDef}
alias CrucibleIR.Reliability.{Config, Guardrail}

experiment = CrucibleIR.new_experiment(
  id: :secure_exp,
  backend: %BackendRef{id: :gpt4},
  pipeline: [%StageDef{name: :inference}],
  reliability: %Config{
    guardrails: %Guardrail{
      profiles: [:strict],
      prompt_injection_detection: true,
      jailbreak_detection: true,
      pii_detection: true,
      pii_redaction: true,
      fail_on_detection: true
    }
  }
)
```

## Architecture
CrucibleIR follows a hierarchical structure:

```text
Experiment (top-level)
├── BackendRef (which LLM to use)
├── Pipeline (list of StageDef)
├── DatasetRef (what data to evaluate)
├── Reliability.Config
│   ├── Ensemble (multi-model voting)
│   ├── Hedging (latency optimization)
│   ├── Stats (statistical testing)
│   ├── Fairness (bias detection)
│   └── Guardrails (security)
└── Outputs (list of OutputSpec)
```

## Testing
All modules have comprehensive test coverage:

```shell
mix test
```

Current test stats: 174 tests, 0 failures (6 doctests + 168 unit tests).
New in v0.1.1:
- 41 validation tests
- 26 serialization tests
- 29 builder tests
- 3 new doctests
## Documentation

Generate HTML documentation:

```shell
mix docs
```

## Integration with Crucible Ecosystem
CrucibleIR is used by:
- `crucible_harness` - Experiment orchestration
- `crucible_ensemble` - Ensemble voting implementation
- `crucible_hedging` - Request hedging implementation
- `crucible_bench` - Statistical testing
- `crucible_telemetry` - Metrics and instrumentation
- `crucible_trace` - Causal transparency
## Design Principles

- **Immutable Data Structures**: All structs are immutable
- **Type Safety**: Full type specifications with `@type` and `@spec`
- **JSON-First**: All structs support JSON serialization
- **Documentation**: Every module and public function is documented
- **Test Coverage**: High test coverage with property-based testing
## Boundary and Serialization Contract

- CrucibleIR is data-only: structs, serialization, and structural validation only.
- Stage options (`StageDef.options`) are opaque maps; stage implementations validate them.
- `CrucibleIR.Serialization` is the canonical JSON round-trip layer; JSON keys must remain stable.
- Map keys should be JSON-friendly (strings) for stable round-trip in opaque fields like `options`.

See `docs/20251226/ir_boundary/IR_BOUNDARY_AND_CONTRACT.md` for the full contract.
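For example, an opaque `options` map with string keys survives the JSON round trip unchanged, whereas atom keys would come back as strings. A sketch using the serialization helpers shown earlier (the `"normalize"` option is a hypothetical stage setting):

```elixir
alias CrucibleIR.{Experiment, BackendRef, StageDef}

exp = %Experiment{
  id: :roundtrip_demo,
  backend: %BackendRef{id: :gpt4},
  # String keys in the opaque options map keep the round trip stable
  pipeline: [%StageDef{name: :inference, options: %{"normalize" => true}}]
}

json = CrucibleIR.to_json(exp)
{:ok, decoded} = CrucibleIR.from_json(json, Experiment)
# hd(decoded.pipeline).options == %{"normalize" => true}
```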
## Contributing

This library is part of the North-Shore-AI organization. Contributions welcome!
## License

MIT License - See LICENSE file for details.
## Links

- GitHub: https://github.com/North-Shore-AI/crucible_ir
- Documentation: https://hexdocs.pm/crucible_ir
- Crucible Framework: https://github.com/North-Shore-AI/crucible_framework