CrucibleHarness

Automated Experiment Orchestration for AI Research

CrucibleHarness is an Elixir library for orchestrating, executing, and analyzing large-scale AI research experiments. It provides the infrastructure to run experiments systematically across multiple conditions, datasets, and configurations, with reproducibility, fault tolerance, and statistical analysis of the results.

Think of it as "pytest + MLflow + Weights & Biases" for Elixir AI research.

Features

Declarative Experiment Definition - DSL for expressing complex experimental designs
Parallel Execution - Leverage BEAM's concurrency for efficient multi-condition runs
Fault Tolerance - Resume experiments after failures without data loss
Statistical Analysis - Automated significance testing across all condition pairs
Multi-Format Reporting - Generate Markdown, LaTeX, HTML, and Jupyter notebooks
Cost Management - Estimate and control API costs before execution
Reproducibility - Version control for experiments, controlled random seeds, full audit trails

Quick Start

1. Define an Experiment

defmodule MyExperiment do
  use CrucibleHarness.Experiment

  name "My Research Experiment"
  description "Comparing baseline vs treatment"

  dataset :mmlu_200

  conditions [
    %{name: "baseline", fn: &baseline_condition/1},
    %{name: "treatment", fn: &treatment_condition/1}
  ]

  metrics [:accuracy, :latency_p99, :cost_per_query]
  repeat 3

  config %{
    timeout: 30_000,
    rate_limit: 10
  }

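  # Each condition function receives one query and returns a result map.
  def baseline_condition(_query) do
    # Placeholder values: replace with your implementation
    %{prediction: "answer", accuracy: 0.75, latency: 100, cost: 0.01}
  end

  def treatment_condition(_query) do
    # Placeholder values: replace with your implementation
    %{prediction: "answer", accuracy: 0.82, latency: 150, cost: 0.02}
  end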
end
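To make the metric names above concrete, here is a minimal sketch of how :accuracy, :latency_p99, and :cost_per_query could be derived from a list of result maps shaped like the ones the condition functions return. MetricsSketch and its functions are hypothetical and are not part of CrucibleHarness; the library's own aggregation may differ (for example, it may use a different percentile method than nearest-rank).

```elixir
defmodule MetricsSketch do
  @moduledoc "Hypothetical helpers for illustration; not part of CrucibleHarness."

  # Mean of the per-query :accuracy values.
  def accuracy(results) do
    Enum.sum(Enum.map(results, & &1.accuracy)) / length(results)
  end

  # 99th-percentile latency using the nearest-rank method:
  # sort ascending and take the value at rank ceil(0.99 * n).
  def latency_p99(results) do
    sorted = results |> Enum.map(& &1.latency) |> Enum.sort()
    rank = ceil(0.99 * length(sorted))
    Enum.at(sorted, rank - 1)
  end

  # Mean cost per query.
  def cost_per_query(results) do
    Enum.sum(Enum.map(results, & &1.cost)) / length(results)
  end
end

results = [
  %{prediction: "a", accuracy: 1.0, latency: 100, cost: 0.01},
  %{prediction: "b", accuracy: 0.0, latency: 250, cost: 0.01},
  %{prediction: "c", accuracy: 1.0, latency: 120, cost: 0.01},
  %{prediction: "d", accuracy: 1.0, latency: 90, cost: 0.01}
]

IO.inspect(MetricsSketch.accuracy(results), label: "accuracy")
IO.inspect(MetricsSketch.latency_p99(results), label: "latency_p99")
IO.inspect(MetricsSketch.cost_per_query(results), label: "cost_per_query")
```

With only a handful of samples, the nearest-rank p99 is simply the slowest observation, which is one reason to use repeat with a larger dataset before reading much into tail latency.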

2. Run the Experiment

# Estimate cost and time first
{:ok, estimates} = CrucibleHarness.estimate(MyExperiment)
IO.puts("Estimated cost: $#{estimates.cost.total_cost}")
IO.puts("Estimated time: #{estimates.time.estimated_duration}ms")

# Run the experiment
{:ok, report} = CrucibleHarness.run(MyExperiment,
  output_dir: "./results",
  formats: [:markdown, :latex, :html]
)

3. View Results

Reports are automatically generated in your specified formats:

results/exp_12345_report.markdown - Markdown report
results/exp_12345_report.latex - LaTeX tables and figures
results/exp_12345_report.html - Interactive HTML report

Advanced Features

Parameter Sweeps

defmodule EnsembleSizeSweep do
  use CrucibleHarness.Experiment

  name "Ensemble Size Sweep (1-10 models)"
  dataset :mmlu_200

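  # Parentheses are required: without them the do-block binds to
  # `conditions` rather than to `for`.
  conditions(
    for n <- 1..10 do
      %{
        name: "ensemble_#{n}",
        # ensemble/2 is assumed to be your own function in this module
        fn: &ensemble(&1, models: n)
      }
    end
  )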

  metrics [:accuracy, :latency_p99, :cost_per_query]
  repeat 5
end
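The sweep relies on an ordinary Elixir comprehension that closes over n in each condition function. The following standalone sketch shows that mechanism outside the DSL. SweepSketch and its stub ensemble/2 are hypothetical stand-ins, not library code.

```elixir
defmodule SweepSketch do
  # Stand-in for a real ensemble call (hypothetical). It only echoes
  # the ensemble size so the closure behaviour is visible.
  def ensemble(_query, opts) do
    n = Keyword.fetch!(opts, :models)
    %{prediction: "answer", models_used: n}
  end

  # Builds one condition map per ensemble size. Each :fn captures its own n.
  def conditions(range) do
    for n <- range do
      %{name: "ensemble_#{n}", fn: &ensemble(&1, models: n)}
    end
  end
end

conditions = SweepSketch.conditions(1..10)
IO.inspect(Enum.map(conditions, & &1.name), label: "conditions")

# Pick the third condition and call its function with a sample query.
%{name: name, fn: run} = Enum.at(conditions, 2)
IO.inspect({name, run.("What is 2 + 2?")})
```

With 10 conditions and repeat 5, a sweep like this amounts to 50 condition runs over the dataset, so it is worth estimating cost before starting it.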

Cost Budgets

cost_budget %{
  max_total: 100.00,          # $100 maximum
  max_per_condition: 25.00,   # $25 per condition max
  currency: :usd
}
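As an illustration of what a budget check involves, the sketch below compares estimated per-condition costs against the two limits. BudgetSketch, its return shape, and the sample estimates are assumptions made for this example; they are not CrucibleHarness's actual enforcement logic.

```elixir
defmodule BudgetSketch do
  # Returns :ok, or {:error, reasons} listing every violated limit.
  # `estimated_costs` maps condition names to estimated cost in the
  # budget's currency (hypothetical input shape).
  def check(estimated_costs, %{max_total: max_total, max_per_condition: max_per}) do
    total = estimated_costs |> Map.values() |> Enum.sum()

    over_condition =
      for {name, cost} <- estimated_costs, cost > max_per do
        {:condition_over_budget, name, cost}
      end

    over_total = if total > max_total, do: [{:total_over_budget, total}], else: []

    case over_total ++ over_condition do
      [] -> :ok
      reasons -> {:error, reasons}
    end
  end
end

budget = %{max_total: 100.00, max_per_condition: 25.00, currency: :usd}
estimates = %{"baseline" => 12.5, "treatment" => 31.0}

# The total (43.5) is within max_total, but "treatment" exceeds max_per_condition.
IO.inspect(BudgetSketch.check(estimates, budget))
```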

Statistical Analysis

statistical_analysis %{
  significance_level: 0.05,
  multiple_testing_correction: :bonferroni,
  confidence_interval: 0.95
}
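The Bonferroni correction matters because testing all condition pairs multiplies the number of comparisons: k conditions yield k(k-1)/2 pairs, and each pair is then tested at significance_level divided by that count. A short sketch of the arithmetic follows. BonferroniSketch is a hypothetical helper, not the library's implementation.

```elixir
defmodule BonferroniSketch do
  # Number of unordered condition pairs: k choose 2.
  def pair_count(k) when is_integer(k) and k >= 2, do: div(k * (k - 1), 2)

  # Per-comparison threshold after Bonferroni correction.
  def corrected_alpha(alpha, k), do: alpha / pair_count(k)

  # A pairwise p-value is significant only if it beats the corrected threshold.
  def significant?(p_value, alpha, k), do: p_value < corrected_alpha(alpha, k)
end

# A 10-condition sweep has 45 pairs, so each test runs at 0.05 / 45.
IO.inspect(BonferroniSketch.pair_count(10), label: "pairs")
IO.inspect(BonferroniSketch.corrected_alpha(0.05, 10), label: "corrected alpha")
IO.inspect(BonferroniSketch.significant?(0.01, 0.05, 10), label: "p = 0.01 significant?")
```

A p-value of 0.01 would pass an uncorrected 0.05 threshold but fails here, which shows how much stricter the test becomes as the number of conditions grows.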

Checkpointing and Resume

# Run experiment (will checkpoint automatically)
{:ok, report} = CrucibleHarness.run(MyExperiment)

# If interrupted, resume from last checkpoint
{:ok, report} = CrucibleHarness.resume("exp_12345")
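To show the general idea behind checkpoint and resume, here is a minimal sketch that persists completed work items to disk after each step and skips them on the next run. CheckpointSketch, the file format (Erlang external term format), and the file path are assumptions made for illustration; the library's actual checkpoint layout is not documented here and may differ.

```elixir
defmodule CheckpointSketch do
  # Write the state to a temp file first, then rename, so an interrupted
  # write never leaves a half-written checkpoint behind.
  def save(path, state) do
    File.mkdir_p!(Path.dirname(path))
    tmp = path <> ".tmp"
    File.write!(tmp, :erlang.term_to_binary(state))
    File.rename!(tmp, path)
  end

  def load(path) do
    case File.read(path) do
      {:ok, binary} -> {:ok, :erlang.binary_to_term(binary)}
      {:error, :enoent} -> :none
    end
  end

  # Runs `fun` only for items that have no recorded result yet,
  # checkpointing after every completed item.
  def run(path, work_items, fun) do
    done =
      case load(path) do
        {:ok, state} -> state
        :none -> %{}
      end

    Enum.reduce(work_items, done, fn item, acc ->
      if Map.has_key?(acc, item) do
        acc
      else
        acc = Map.put(acc, item, fun.(item))
        save(path, acc)
        acc
      end
    end)
  end
end

path = Path.join(System.tmp_dir!(), "checkpoint_sketch/exp_demo.ckpt")
File.rm(path)

# First run completes two items; the second run only computes item 3.
CheckpointSketch.run(path, [1, 2], fn n -> n * n end)
resumed = CheckpointSketch.run(path, [1, 2, 3], fn n -> n * n end)
IO.inspect(resumed)
```

Only call :erlang.binary_to_term/1 on files you wrote yourself, since decoding untrusted binaries is unsafe.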

Architecture

CrucibleHarness
├── Experiment (DSL & Definition)
├── Runner (Execution Engine with GenStage/Flow)
├── Collector (Results Aggregation & Statistical Analysis)
├── Reporter (Multi-Format Output Generation)
└── Utilities (Cost/Time Estimation, Checkpointing)
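The Runner is described as building on GenStage/Flow. As a simpler stand-in for the same idea, the sketch below fans condition/query pairs out over BEAM processes with the standard library's Task.async_stream/3 and groups the outcomes per condition. RunnerSketch is hypothetical and deliberately does not use GenStage or Flow. It is not the library's execution engine.

```elixir
defmodule RunnerSketch do
  # Runs every condition against every query concurrently.
  # Returns %{condition_name => [{:ok, result} | {:error, reason}, ...]}.
  def run(conditions, queries, opts \\ []) do
    timeout = Keyword.get(opts, :timeout, 30_000)
    jobs = for condition <- conditions, query <- queries, do: {condition, query}

    jobs
    |> Task.async_stream(
      fn {%{fn: run}, query} -> run.(query) end,
      timeout: timeout,
      # Kill a job that exceeds the timeout instead of crashing the caller.
      on_timeout: :kill_task
    )
    # Results arrive in input order, so they can be paired with their jobs.
    |> Enum.zip(jobs)
    |> Enum.map(fn
      {{:ok, result}, {%{name: name}, _query}} -> {name, {:ok, result}}
      {{:exit, reason}, {%{name: name}, _query}} -> {name, {:error, reason}}
    end)
    |> Enum.group_by(&elem(&1, 0), &elem(&1, 1))
  end
end

conditions = [
  %{name: "baseline", fn: fn q -> String.length(q) end},
  %{name: "treatment", fn: fn q -> String.length(q) * 2 end}
]

IO.inspect(RunnerSketch.run(conditions, ["ab", "abc"]))
```

Note that a condition function that raises will still crash the caller in this sketch, because the tasks are linked. A production runner would supervise the tasks and record the failure instead.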

Example Experiments

See the examples/ directory for complete examples:

simple_comparison.ex - Basic two-condition comparison
ensemble_comparison.ex - Multi-condition ensemble evaluation

API Reference

Main Functions

`CrucibleHarness.run/2`

Runs an experiment and generates reports.

Options:

:output_dir - Directory for results (default: "./results")
:formats - Report formats (default: [:markdown])
:checkpoint_dir - Checkpoint directory (default: "./checkpoints")
:dry_run - Validate without executing (default: false)
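One common way to apply defaults like these is to validate the caller's keyword list against them. The sketch below uses Keyword.validate!/2 (available in Elixir 1.13 and later) with the default values listed above. OptionsSketch is hypothetical, and whether CrucibleHarness.run/2 rejects unknown options the same way is an assumption.

```elixir
defmodule OptionsSketch do
  # Defaults copied from the option list above.
  @defaults [
    output_dir: "./results",
    formats: [:markdown],
    checkpoint_dir: "./checkpoints",
    dry_run: false
  ]

  # Fills in missing options and raises ArgumentError on unknown keys.
  def resolve(opts), do: Keyword.validate!(opts, @defaults)
end

opts = OptionsSketch.resolve(formats: [:markdown, :latex], dry_run: true)
IO.inspect(Keyword.fetch!(opts, :formats), label: "formats")
IO.inspect(Keyword.fetch!(opts, :output_dir), label: "output_dir")
```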

`CrucibleHarness.estimate/1`

Estimates cost and time without running the experiment.

`CrucibleHarness.resume/1`

Resumes a failed or interrupted experiment from its last checkpoint. Takes the experiment ID (for example, "exp_12345").

Experiment DSL

Required Fields

name - Experiment name
dataset - Dataset identifier
conditions - List of experimental conditions
metrics - Metrics to collect

Optional Fields

description - Detailed description
author - Experiment author
version - Experiment version
tags - Tags for organization
repeat - Number of repetitions (default: 1)
config - Execution configuration
cost_budget - Budget constraints
statistical_analysis - Analysis parameters
custom_metrics - Custom metric definitions

Configuration

Add to your config.exs:

config :crucible_harness,
  checkpoint_dir: "./checkpoints",
  results_dir: "./results"

Testing

mix test

Installation

Add crucible_harness to your list of dependencies in mix.exs:

def deps do
  [
    {:crucible_harness, "~> 0.1.0"}
  ]
end

Or install from GitHub:

def deps do
  [
    {:crucible_harness, github: "nshkrdotcom/elixir_ai_research", sparse: "apps/research_harness"}
  ]
end

Documentation

Documentation can be generated with ExDoc:

mix docs

Contributing

This is part of the Spectra AI research infrastructure. Contributions welcome!

License

MIT License - see LICENSE file for details

Acknowledgments

Built for systematic AI research experimentation with a focus on ensemble methods, hedging strategies, and model comparisons.