
CrucibleTelemetry





Research-grade instrumentation and metrics collection for AI/ML experiments in Elixir.
CrucibleTelemetry provides specialized observability for rigorous scientific experimentation, going beyond standard production telemetry with features designed for AI/ML research workflows.
Features
- Experiment Isolation — Run multiple experiments concurrently without cross-contamination
- Centralized Event Registry — Programmatic access to all telemetry event definitions
- Rich Metadata Enrichment — Automatic context, timestamps, and custom tags
- ML Training Support — Track epochs, batches, checkpoints, and training metrics
- Inference Monitoring — Model deployment and inference telemetry
- Pipeline Tracking — Framework stage execution observability
- Streaming Metrics — Real-time latency/cost/reliability stats with O(1) memory
- Time-Window Queries — Fetch last N events or ranges without full rescans
- Multiple Export Formats — CSV, JSON Lines for Python, R, Julia, Excel
- Pause/Resume Lifecycle — Temporarily halt collection without losing state
Installation
def deps do
  [
    {:crucible_telemetry, "~> 0.3.0"}
  ]
end
Quick Start
{:ok, experiment} = CrucibleTelemetry.start_experiment(
  name: "bert_finetuning",
  hypothesis: "Fine-tuned BERT achieves >95% accuracy",
  tags: ["training", "bert", "nlp"]
)
{:ok, _stopped} = CrucibleTelemetry.stop_experiment(experiment.id)
metrics = CrucibleTelemetry.calculate_metrics(experiment.id)
{:ok, path} = CrucibleTelemetry.export(experiment.id, :csv)
Event Registry
CrucibleTelemetry provides a centralized registry of all supported telemetry events:
CrucibleTelemetry.Events.standard_events()
CrucibleTelemetry.Events.training_events()
CrucibleTelemetry.Events.deployment_events()
CrucibleTelemetry.Events.framework_events()
CrucibleTelemetry.Events.llm_events()
CrucibleTelemetry.Events.events_by_category()
CrucibleTelemetry.Events.event_info([:crucible_train, :epoch, :stop])
Telemetry Events
LLM Events (req_llm)
| Event | Description |
|---|---|
| [:req_llm, :request, :start] | LLM request started |
| [:req_llm, :request, :stop] | LLM request completed |
| [:req_llm, :request, :exception] | LLM request failed |
Training Events (crucible_train)
| Event | Description | Enriched Fields |
|---|---|---|
| [:crucible_train, :training, :start] | Training job started | (none) |
| [:crucible_train, :training, :stop] | Training job completed | (none) |
| [:crucible_train, :epoch, :start] | Epoch started | epoch |
| [:crucible_train, :epoch, :stop] | Epoch completed | epoch, loss, accuracy, learning_rate |
| [:crucible_train, :batch, :stop] | Batch completed | epoch, batch, loss, gradient_norm |
| [:crucible_train, :checkpoint, :saved] | Checkpoint saved | epoch, checkpoint_path |
Deployment Events (crucible_deployment)
| Event | Description | Enriched Fields |
|---|---|---|
| [:crucible_deployment, :inference, :start] | Inference started | model_name, model_version |
| [:crucible_deployment, :inference, :stop] | Inference completed | input_size, output_size, batch_size |
| [:crucible_deployment, :inference, :exception] | Inference failed | (none) |
Framework Events (crucible_framework)
| Event | Description | Enriched Fields |
|---|---|---|
| [:crucible_framework, :pipeline, :start] | Pipeline started | pipeline_id |
| [:crucible_framework, :pipeline, :stop] | Pipeline completed | pipeline_id |
| [:crucible_framework, :stage, :start] | Stage started | stage_name, stage_index |
| [:crucible_framework, :stage, :stop] | Stage completed | stage_name, stage_index |
Other Events
- [:ensemble, :prediction, :start|stop]: Ensemble predictions
- [:ensemble, :vote, :completed]: Voting results
- [:hedging, :request, :start|duplicated|stop]: Request hedging
- [:causal_trace, :event, :created]: Reasoning traces
- [:altar, :tool, :start|stop]: Tool invocations
Training Integration
Track ML training jobs by emitting standard training events:
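defmodule MyTrainer do
  # train_epoch/2 and get_lr/0 are placeholders for your own training code.
  # Durations are measured in native time units via System.monotonic_time/0.
  def train(model, data, epochs) do
    training_start = System.monotonic_time()

    :telemetry.execute(
      [:crucible_train, :training, :start],
      %{system_time: System.system_time()},
      %{model_name: "bert-base", config: %{epochs: epochs}}
    )

    # Reduce over the epochs so the last epoch's loss is still available afterwards.
    final_loss =
      Enum.reduce(1..epochs, nil, fn epoch, _previous_loss ->
        epoch_start = System.monotonic_time()

        :telemetry.execute(
          [:crucible_train, :epoch, :start],
          %{system_time: System.system_time()},
          %{epoch: epoch}
        )

        {loss, accuracy} = train_epoch(model, data)
        epoch_duration = System.monotonic_time() - epoch_start

        :telemetry.execute(
          [:crucible_train, :epoch, :stop],
          %{duration: epoch_duration, loss: loss, accuracy: accuracy},
          %{epoch: epoch, learning_rate: get_lr()}
        )

        loss
      end)

    total_duration = System.monotonic_time() - training_start

    :telemetry.execute(
      [:crucible_train, :training, :stop],
      %{duration: total_duration},
      %{final_loss: final_loss}
    )
  end
end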
Metrics & Analysis
metrics = CrucibleTelemetry.calculate_metrics(experiment.id)
metrics.latency.mean
metrics.latency.p95
metrics.latency.p99
metrics.cost.total
metrics.cost.cost_per_1m_requests
metrics.reliability.success_rate
metrics.reliability.sla_99
metrics.tokens.total_prompt
metrics.tokens.mean_total
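This README does not say how calculate_metrics/1 computes its percentiles. As a rough illustration of what fields such as latency.mean, latency.p95, and latency.p99 summarise, the sketch below computes them with the nearest-rank method over a plain list of latencies. LatencySketch and its functions are hypothetical names used only for this example and are not part of CrucibleTelemetry.

```elixir
defmodule LatencySketch do
  # Hypothetical helper module, not part of the CrucibleTelemetry API.

  # Arithmetic mean of a list of numbers; nil for an empty list.
  def mean([]), do: nil
  def mean(values), do: Enum.sum(values) / length(values)

  # Nearest-rank percentile: the smallest value such that at least
  # p percent of the samples are less than or equal to it.
  def percentile([], _p), do: nil

  def percentile(values, p) when is_integer(p) and p > 0 and p <= 100 do
    sorted = Enum.sort(values)
    # Integer form of ceil(p * n / 100), which avoids float rounding.
    rank = div(p * length(sorted) + 99, 100)
    Enum.at(sorted, rank - 1)
  end
end

latencies = Enum.to_list(1..100)
IO.inspect(LatencySketch.mean(latencies), label: "mean")
IO.inspect(LatencySketch.percentile(latencies, 95), label: "p95")
IO.inspect(LatencySketch.percentile(latencies, 99), label: "p99")
```

Nearest-rank always returns a value that was actually observed, which keeps the sketch simple. Interpolating methods give slightly different numbers on small samples.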
Streaming Metrics
Real-time metrics update on every collected event:
metrics = CrucibleTelemetry.StreamingMetrics.get_metrics(experiment.id)
CrucibleTelemetry.StreamingMetrics.reset(experiment.id)
CrucibleTelemetry.StreamingMetrics.stop(experiment.id)
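One common way to keep running statistics in constant memory is Welford's online algorithm, which updates a count, a mean, and a sum of squared deviations once per sample. The sketch below shows that idea in isolation. It is an assumption about the general technique, not a description of how StreamingMetrics is implemented, and OnlineStats is a hypothetical module name.

```elixir
defmodule OnlineStats do
  # Hypothetical module illustrating Welford's algorithm; not part of CrucibleTelemetry.
  defstruct count: 0, mean: 0.0, m2: 0.0

  def new, do: %__MODULE__{}

  # Fold one sample into the running state. Only three numbers are stored,
  # no matter how many samples have been seen.
  def update(%__MODULE__{count: count, mean: mean, m2: m2}, x) do
    count = count + 1
    delta = x - mean
    mean = mean + delta / count
    m2 = m2 + delta * (x - mean)
    %__MODULE__{count: count, mean: mean, m2: m2}
  end

  # Sample variance; defined as 0.0 until at least two samples have arrived.
  def variance(%__MODULE__{count: count}) when count < 2, do: 0.0
  def variance(%__MODULE__{count: count, m2: m2}), do: m2 / (count - 1)
end

stats = Enum.reduce([2, 4, 4, 4, 5, 5, 7, 9], OnlineStats.new(), &OnlineStats.update(&2, &1))
IO.inspect(stats.mean, label: "mean")
IO.inspect(OnlineStats.variance(stats), label: "variance")
```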
Time-Window Queries
alias CrucibleTelemetry.Store
Store.query_window(experiment.id, {:last, 5, :minutes})
Store.query_window(experiment.id, {:last_n, 200})
Store.query_window(experiment.id, {:range, t_start, t_end}, &(&1.success))
Store.windowed_metrics(experiment.id, 5 * 60_000_000, 60_000_000)
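To show why an ordered store allows range reads without a full rescan, the sketch below walks an ETS ordered_set from the first key inside the window and stops at the first key past it. It assumes events are keyed by a unique integer timestamp. The real Store layout is not documented here, and WindowSketch is a hypothetical module name.

```elixir
defmodule WindowSketch do
  # Hypothetical module; assumes an :ordered_set table of {timestamp, event}
  # rows with unique integer timestamps as keys.

  # Returns events with t_start <= timestamp <= t_end, oldest first.
  def range(table, t_start, t_end) do
    # On an ordered_set, :ets.next/2 accepts a key that is not in the table
    # and returns the next larger key, so this jumps straight to the window.
    table
    |> :ets.next(t_start - 1)
    |> collect(table, t_end, [])
  end

  defp collect(:"$end_of_table", _table, _t_end, acc), do: Enum.reverse(acc)
  defp collect(key, _table, t_end, acc) when key > t_end, do: Enum.reverse(acc)

  defp collect(key, table, t_end, acc) do
    [{^key, event}] = :ets.lookup(table, key)
    collect(:ets.next(table, key), table, t_end, [event | acc])
  end
end

table = :ets.new(:events, [:ordered_set])
for ts <- [10, 20, 30, 40, 50], do: :ets.insert(table, {ts, %{timestamp: ts}})
IO.inspect(WindowSketch.range(table, 20, 40))
```

Only the keys inside the window, plus one past its end, are visited, so the cost tracks the window size rather than the table size.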
Pause & Resume
{:ok, paused} = CrucibleTelemetry.pause_experiment(experiment.id)
{:ok, resumed} = CrucibleTelemetry.resume_experiment(experiment.id)
CrucibleTelemetry.paused?(experiment.id)
Export Formats
CSV
{:ok, path} = CrucibleTelemetry.export(experiment.id, :csv,
  path: "results/experiment.csv"
)
JSON Lines
{:ok, path} = CrucibleTelemetry.export(experiment.id, :jsonl,
  path: "results/experiment.jsonl"
)
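For readers curious how an export can stream rows to disk instead of building the whole file in memory, here is a minimal sketch of a streaming CSV writer. The column names are invented for illustration, and CsvSketch is a hypothetical module. The actual columns and internals of export/3 are not documented in this README.

```elixir
defmodule CsvSketch do
  # Hypothetical module and column set, for illustration only.
  @columns [:timestamp, :event_name, :latency_ms, :success]

  # Writes a header plus one line per event, one row at a time.
  def write(events, path) do
    header = Enum.join(@columns, ",") <> "\n"

    rows =
      Stream.map(events, fn event ->
        line = @columns |> Enum.map(&escape(Map.get(event, &1))) |> Enum.join(",")
        line <> "\n"
      end)

    [header]
    |> Stream.concat(rows)
    |> Stream.into(File.stream!(path))
    |> Stream.run()

    {:ok, path}
  end

  # Quote a field when it contains a comma, quote, or newline; double embedded quotes.
  defp escape(value) do
    text = to_string(value)

    if String.contains?(text, [",", "\"", "\n"]) do
      "\"" <> String.replace(text, "\"", "\"\"") <> "\""
    else
      text
    end
  end
end

events = [%{timestamp: 1, event_name: "req_llm.request.stop", latency_ms: 12.5, success: true}]
{:ok, written} = CsvSketch.write(events, Path.join(System.tmp_dir!(), "csv_sketch_example.csv"))
IO.puts(File.read!(written))
```

Because the rows are a lazy stream, the same function works when events is itself a stream read from storage.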
A/B Testing Example
{:ok, control} = CrucibleTelemetry.start_experiment(
  name: "control_single_model",
  condition: "control",
  tags: ["ab_test"]
)
{:ok, treatment} = CrucibleTelemetry.start_experiment(
  name: "treatment_ensemble",
  condition: "treatment",
  tags: ["ab_test"]
)
comparison = CrucibleTelemetry.Analysis.compare_experiments([
  control.id,
  treatment.id
])
API Reference
CrucibleTelemetry
| Function | Description |
|---|---|
| start_experiment/1 | Start a new experiment |
| stop_experiment/1 | Stop an experiment |
| pause_experiment/1 | Pause data collection |
| resume_experiment/1 | Resume data collection |
| paused?/1 | Check if experiment is paused |
| get_experiment/1 | Get experiment details |
| list_experiments/0 | List all experiments |
| export/3 | Export data to file |
| calculate_metrics/1 | Calculate comprehensive metrics |
CrucibleTelemetry.Events
| Function | Description |
|---|---|
| standard_events/0 | All standard telemetry events |
| training_events/0 | Training-related events |
| deployment_events/0 | Deployment-related events |
| framework_events/0 | Framework-related events |
| llm_events/0 | LLM-related events |
| events_by_category/0 | Events organized by category |
| event_info/1 | Get info about a specific event |
CrucibleTelemetry.Store
| Function | Description |
|---|---|
| get_all/1 | Get all events |
| query/2 | Query with filters |
| query_window/3 | Time-window queries |
| windowed_metrics/3 | Sliding window metrics |
Performance
- Event handling: <1μs per event (in-memory ETS insert)
- Storage: Up to 1M events in memory (~100-500MB)
- Query: Fast filtering with ETS ordered_set
- Export: Streaming to avoid memory spikes
- Streaming metrics: O(1) space using online algorithms
Testing
mix test
mix test --cover
Roadmap
License
MIT License — see LICENSE for details.