CrucibleTelemetry


Research-grade instrumentation and metrics collection for AI/ML experiments in Elixir.

CrucibleTelemetry provides specialized observability for rigorous scientific experimentation, going beyond standard production telemetry with features designed for AI/ML research workflows.

Features

Installation

def deps do
  [
    {:crucible_telemetry, "~> 0.3.0"}
  ]
end

Quick Start

# Start an experiment
{:ok, experiment} = CrucibleTelemetry.start_experiment(
  name: "bert_finetuning",
  hypothesis: "Fine-tuned BERT achieves >95% accuracy",
  tags: ["training", "bert", "nlp"]
)

# Events are automatically collected via telemetry
# Your existing :telemetry.execute() calls work unchanged

# Stop and analyze
{:ok, _} = CrucibleTelemetry.stop_experiment(experiment.id)
metrics = CrucibleTelemetry.calculate_metrics(experiment.id)
# => %{latency: %{mean: 150.5, p95: 250.0}, cost: %{total: 0.025}, ...}

# Export for analysis
{:ok, path} = CrucibleTelemetry.export(experiment.id, :csv)

Event Registry

CrucibleTelemetry provides a centralized registry of all supported telemetry events:

# Get all standard events
CrucibleTelemetry.Events.standard_events()
# Get events by category
CrucibleTelemetry.Events.training_events()
CrucibleTelemetry.Events.deployment_events()
CrucibleTelemetry.Events.framework_events()
CrucibleTelemetry.Events.llm_events()
# Get events organized by category
CrucibleTelemetry.Events.events_by_category()
# => %{llm: [...], training: [...], deployment: [...], ...}
# Get info about a specific event
CrucibleTelemetry.Events.event_info([:crucible_train, :epoch, :stop])
# => %{category: :training, description: "Epoch completed with metrics"}

Telemetry Events

LLM Events (req_llm)

| Event | Description |
| --- | --- |
| [:req_llm, :request, :start] | LLM request started |
| [:req_llm, :request, :stop] | LLM request completed |
| [:req_llm, :request, :exception] | LLM request failed |
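To see what a start/stop/exception triple like the one above looks like on the wire, the following sketch emits events with the same names using `:telemetry.span/3` and captures them with a plain handler. It is only an illustration for local experiments: `SpanDemo`, `fake_request/1`, the handler id, and the `model`/`success` metadata keys are hypothetical, and nothing here is taken from req_llm or CrucibleTelemetry internals.

```elixir
# Run as a script: `elixir span_demo.exs` (Mix.install fetches :telemetry).
Mix.install([{:telemetry, "~> 1.0"}])

defmodule SpanDemo do
  @prefix [:req_llm, :request]

  # Handler that forwards every event to the process given in the config.
  def handle_event(event, measurements, metadata, %{reply_to: pid}) do
    send(pid, {:event, event, measurements, metadata})
  end

  # Attach one handler to the :start, :stop and :exception events.
  def attach do
    events = for suffix <- [:start, :stop, :exception], do: @prefix ++ [suffix]
    :telemetry.attach_many("span-demo", events, &__MODULE__.handle_event/4, %{reply_to: self()})
  end

  # Hypothetical stand-in for a real LLM call.
  def fake_request(prompt) do
    :telemetry.span(@prefix, %{model: "demo-model"}, fn ->
      response = String.upcase(prompt)
      # The second tuple element becomes the metadata of the :stop event.
      {response, %{model: "demo-model", success: true}}
    end)
  end
end

:ok = SpanDemo.attach()
"HELLO" = SpanDemo.fake_request("hello")

receive do
  {:event, [:req_llm, :request, :stop], %{duration: duration}, _metadata} ->
    micros = System.convert_time_unit(duration, :native, :microsecond)
    IO.puts("request took #{micros} microseconds")
after
  1_000 -> IO.puts("no stop event received")
end
```

`:telemetry.span/3` reports `duration` in native time units, which is why the sketch converts it before printing.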

Training Events (crucible_train)

| Event | Description | Enriched Fields |
| --- | --- | --- |
| [:crucible_train, :training, :start] | Training job started | |
| [:crucible_train, :training, :stop] | Training job completed | |
| [:crucible_train, :epoch, :start] | Epoch started | epoch |
| [:crucible_train, :epoch, :stop] | Epoch completed | epoch, loss, accuracy, learning_rate |
| [:crucible_train, :batch, :stop] | Batch completed | epoch, batch, loss, gradient_norm |
| [:crucible_train, :checkpoint, :saved] | Checkpoint saved | epoch, checkpoint_path |

Deployment Events (crucible_deployment)

| Event | Description | Enriched Fields |
| --- | --- | --- |
| [:crucible_deployment, :inference, :start] | Inference started | model_name, model_version |
| [:crucible_deployment, :inference, :stop] | Inference completed | input_size, output_size, batch_size |
| [:crucible_deployment, :inference, :exception] | Inference failed | |

Framework Events (crucible_framework)

| Event | Description | Enriched Fields |
| --- | --- | --- |
| [:crucible_framework, :pipeline, :start] | Pipeline started | pipeline_id |
| [:crucible_framework, :pipeline, :stop] | Pipeline completed | pipeline_id |
| [:crucible_framework, :stage, :start] | Stage started | stage_name, stage_index |
| [:crucible_framework, :stage, :stop] | Stage completed | stage_name, stage_index |

Other Events

Training Integration

Track ML training jobs by emitting standard training events:

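defmodule MyTrainer do
  # train_epoch/2 and get_lr/0 are placeholders for your own training code.
  def train(model, data, epochs) do
    training_start = System.monotonic_time()

    :telemetry.execute(
      [:crucible_train, :training, :start],
      %{system_time: System.system_time()},
      %{model_name: "bert-base", config: %{epochs: epochs}}
    )

    # Keep each epoch's loss so the final loss can be reported at the end
    losses =
      for epoch <- 1..epochs do
        epoch_start = System.monotonic_time()

        :telemetry.execute(
          [:crucible_train, :epoch, :start],
          %{system_time: System.system_time()},
          %{epoch: epoch}
        )

        {loss, accuracy} = train_epoch(model, data)

        # Duration is a monotonic-time difference in native time units
        :telemetry.execute(
          [:crucible_train, :epoch, :stop],
          %{duration: System.monotonic_time() - epoch_start, loss: loss, accuracy: accuracy},
          %{epoch: epoch, learning_rate: get_lr()}
        )

        loss
      end

    :telemetry.execute(
      [:crucible_train, :training, :stop],
      %{duration: System.monotonic_time() - training_start},
      %{final_loss: List.last(losses)}
    )
  end
end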

Metrics & Analysis

metrics = CrucibleTelemetry.calculate_metrics(experiment.id)
# Latency
metrics.latency.mean # Average latency
metrics.latency.p95 # 95th percentile
metrics.latency.p99 # 99th percentile
# Cost
metrics.cost.total # Total cost in USD
metrics.cost.cost_per_1m_requests # Projected cost for 1M requests
# Reliability
metrics.reliability.success_rate # Success rate (0.0-1.0)
metrics.reliability.sla_99 # Meets 99% SLA?
# Tokens
metrics.tokens.total_prompt # Total prompt tokens
metrics.tokens.mean_total # Average tokens per request
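For readers who want to sanity-check these numbers against raw data, the sketch below shows one common way such aggregates can be computed. It is an illustration under stated assumptions, not CrucibleTelemetry's implementation: the README does not say which percentile method the library uses (nearest-rank is used here), and `MetricsSketch`, `meets_sla?/2`, and the `:success` field are hypothetical names.

```elixir
defmodule MetricsSketch do
  # Arithmetic mean of a non-empty list of numbers.
  def mean(values) when values != [], do: Enum.sum(values) / length(values)

  # Nearest-rank percentile for an integer p in 1..100: the smallest sample
  # such that at least p percent of all samples are less than or equal to it.
  def percentile(values, p) when values != [] and is_integer(p) and p in 1..100 do
    sorted = Enum.sort(values)
    rank = ceil(p * length(sorted) / 100)
    Enum.at(sorted, rank - 1)
  end

  # Fraction of events whose :success field is truthy (0.0 to 1.0).
  def success_rate(events) when events != [] do
    Enum.count(events, & &1.success) / length(events)
  end

  # True when the success rate reaches the target (99% by default).
  def meets_sla?(rate, target \\ 0.99), do: rate >= target

  # Extrapolates the observed average cost per request to one million requests.
  def cost_per_1m_requests(total_cost, request_count) when request_count > 0 do
    total_cost / request_count * 1_000_000
  end
end

latencies = [120, 135, 150, 160, 180, 210, 250, 300, 410, 900]
IO.inspect(MetricsSketch.mean(latencies), label: "mean")
IO.inspect(MetricsSketch.percentile(latencies, 95), label: "p95")
IO.inspect(MetricsSketch.cost_per_1m_requests(0.025, 100), label: "cost per 1M")
```

With only ten samples, the nearest-rank p95 is simply the largest value, which is a reminder that tail percentiles need a reasonably large sample to be meaningful.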

Streaming Metrics

Real-time metrics update on every collected event:

# Get live metrics
metrics = CrucibleTelemetry.StreamingMetrics.get_metrics(experiment.id)
# Reset accumulators
CrucibleTelemetry.StreamingMetrics.reset(experiment.id)
# Stop streaming
CrucibleTelemetry.StreamingMetrics.stop(experiment.id)
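Updating metrics on every event implies an accumulator that can absorb one observation at a time without re-reading history. The sketch below uses Welford's algorithm for a running mean and variance as one example of that idea. It is not taken from `CrucibleTelemetry.StreamingMetrics`; the `RunningStats` module and its functions are hypothetical.

```elixir
defmodule RunningStats do
  defstruct count: 0, mean: 0.0, m2: 0.0

  # Empty accumulator.
  def new, do: %__MODULE__{}

  # Folds one observation into the accumulator (Welford's algorithm).
  def update(%__MODULE__{count: n, mean: mean, m2: m2}, x) do
    n = n + 1
    delta = x - mean
    mean = mean + delta / n
    # m2 accumulates the sum of squared deviations from the running mean.
    %__MODULE__{count: n, mean: mean, m2: m2 + delta * (x - mean)}
  end

  # Sample variance; defined as 0.0 until there are at least two observations.
  def variance(%__MODULE__{count: n, m2: m2}) when n > 1, do: m2 / (n - 1)
  def variance(%__MODULE__{}), do: 0.0
end

stats = Enum.reduce([100, 150, 200], RunningStats.new(), &RunningStats.update(&2, &1))
IO.inspect(stats.mean, label: "mean")
IO.inspect(RunningStats.variance(stats), label: "variance")
```

Each update is constant time and constant memory, which is what makes this style of accumulator suitable for per-event updates. Exact percentiles cannot be maintained this way and need either stored samples or an approximation.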

Time-Window Queries

alias CrucibleTelemetry.Store
# Last 5 minutes
Store.query_window(experiment.id, {:last, 5, :minutes})
# Last 200 events
Store.query_window(experiment.id, {:last_n, 200})
# Specific time range with filter
Store.query_window(experiment.id, {:range, t_start, t_end}, &(&1.success))
# Sliding window metrics (5-minute windows, 1-minute step; both arguments in microseconds)
Store.windowed_metrics(experiment.id, 5 * 60_000_000, 60_000_000)
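The sliding-window idea can be sketched in plain Elixir as follows. This is a conceptual illustration only, not the `Store` implementation: `WindowSketch`, the `:timestamp` field (assumed to be in microseconds), and the shape of the returned maps are all hypothetical.

```elixir
defmodule WindowSketch do
  # Groups events into overlapping windows of `window_us` microseconds,
  # opening a new window every `step_us` microseconds, and counts the
  # events that fall into each half-open interval [start, start + window_us).
  def sliding_windows(events, window_us, step_us)
      when events != [] and window_us > 0 and step_us > 0 do
    {first, last} = events |> Enum.map(& &1.timestamp) |> Enum.min_max()

    first
    |> Stream.iterate(&(&1 + step_us))
    |> Enum.take_while(&(&1 <= last))
    |> Enum.map(fn start ->
      in_window =
        Enum.filter(events, fn event ->
          event.timestamp >= start and event.timestamp < start + window_us
        end)

      %{window_start: start, count: length(in_window)}
    end)
  end
end

minute = 60_000_000
events = for seconds <- [0, 30, 90, 400], do: %{timestamp: seconds * 1_000_000}

events
|> WindowSketch.sliding_windows(5 * minute, minute)
|> Enum.each(&IO.inspect/1)
```

Because the step is shorter than the window, a single event is counted in several consecutive windows; that overlap is what distinguishes sliding windows from fixed buckets.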

Pause & Resume

{:ok, paused} = CrucibleTelemetry.pause_experiment(experiment.id)
# ... maintenance ...
{:ok, resumed} = CrucibleTelemetry.resume_experiment(experiment.id)
CrucibleTelemetry.paused?(experiment.id) # => true/false

Export Formats

CSV

{:ok, path} = CrucibleTelemetry.export(experiment.id, :csv,
  path: "results/experiment.csv"
)

JSON Lines

{:ok, path} = CrucibleTelemetry.export(experiment.id, :jsonl,
  path: "results/experiment.jsonl"
)

A/B Testing Example

# Control group
{:ok, control} = CrucibleTelemetry.start_experiment(
  name: "control_single_model",
  condition: "control",
  tags: ["ab_test"]
)

# Treatment group
{:ok, treatment} = CrucibleTelemetry.start_experiment(
  name: "treatment_ensemble",
  condition: "treatment",
  tags: ["ab_test"]
)

# ... run workloads ...

# Compare results
comparison = CrucibleTelemetry.Analysis.compare_experiments([
  control.id,
  treatment.id
])
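The README does not describe the shape of `comparison`, so the sketch below does not model it. It only shows one simple comparison that can be done by hand on two metrics maps shaped like the `calculate_metrics/1` output in the Quick Start: the percentage change of treatment relative to control. `CompareSketch` and its result keys are hypothetical.

```elixir
defmodule CompareSketch do
  # Change of `treatment` relative to `control`, in percent.
  # Negative values mean the treatment is lower than the control.
  def percent_change(control, treatment) when control != 0 do
    (treatment - control) * 100 / control
  end

  # Compares mean latency and total cost of two metrics maps.
  def compare(control_metrics, treatment_metrics) do
    %{
      latency_mean_pct:
        percent_change(control_metrics.latency.mean, treatment_metrics.latency.mean),
      cost_total_pct:
        percent_change(control_metrics.cost.total, treatment_metrics.cost.total)
    }
  end
end

# Made-up numbers for illustration, not measured results.
control_metrics = %{latency: %{mean: 150.0}, cost: %{total: 0.020}}
treatment_metrics = %{latency: %{mean: 120.0}, cost: %{total: 0.025}}

IO.inspect(CompareSketch.compare(control_metrics, treatment_metrics))
```

A percentage change on its own says nothing about statistical significance; for a real A/B decision the per-request samples would also need a significance test.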

API Reference

CrucibleTelemetry

| Function | Description |
| --- | --- |
| start_experiment/1 | Start a new experiment |
| stop_experiment/1 | Stop an experiment |
| pause_experiment/1 | Pause data collection |
| resume_experiment/1 | Resume data collection |
| paused?/1 | Check if experiment is paused |
| get_experiment/1 | Get experiment details |
| list_experiments/0 | List all experiments |
| export/3 | Export data to file |
| calculate_metrics/1 | Calculate comprehensive metrics |

CrucibleTelemetry.Events

| Function | Description |
| --- | --- |
| standard_events/0 | All standard telemetry events |
| training_events/0 | Training-related events |
| deployment_events/0 | Deployment-related events |
| framework_events/0 | Framework-related events |
| llm_events/0 | LLM-related events |
| events_by_category/0 | Events organized by category |
| event_info/1 | Get info about a specific event |

CrucibleTelemetry.Store

| Function | Description |
| --- | --- |
| get_all/1 | Get all events |
| query/2 | Query with filters |
| query_window/3 | Time-window queries |
| windowed_metrics/3 | Sliding window metrics |

Performance

Testing

mix test
mix test --cover

Roadmap

License

MIT License — see LICENSE for details.