Tribunal ⚖️
LLM evaluation framework for Elixir.
Tribunal provides tools for evaluating and testing LLM outputs, detecting hallucinations, and measuring response quality.
[!TIP] See tribunal-juror for an interactive Phoenix app to explore and test Tribunal's evaluation capabilities.
Test Mode vs Evaluation Mode
Tribunal offers two modes for different use cases:
| Mode | Interface | Use Case | Failure Behavior |
|---|---|---|---|
| Test | ExUnit | CI gates, safety checks | Fails immediately on any failure |
| Evaluation | Mix Task | Benchmarking, baseline tracking | Configurable thresholds |
Test Mode is for "this must work" cases: safety checks, refusal detection, critical RAG accuracy. Tests fail fast on any violation.
Evaluation Mode is for "track how well we're doing": run hundreds of evals, compare models, monitor regression over time. Set thresholds like "pass if 80% succeed."
Installation
def deps do
[
{:tribunal, "~> 0.1.0"},
# Optional: for LLM-as-judge evaluations
{:req_llm, "~> 1.2"},
# Optional: for embedding-based similarity
{:alike, "~> 0.1"}
]
endQuick Start
ExUnit Integration
defmodule MyApp.RAGTest do
use ExUnit.Case
use Tribunal.EvalCase
@context ["Returns are accepted within 30 days with receipt."]
test "response is faithful to context" do
response = MyApp.RAG.query("What's the return policy?")
assert_contains response, "30 days"
assert_faithful response, context: @context
refute_hallucination response, context: @context
end
endDataset-Driven Evaluations
# test/evals/rag_test.exs
defmodule MyApp.RAGEvalTest do
use ExUnit.Case
use Tribunal.EvalCase
tribunal_eval "test/evals/datasets/questions.json",
provider: {MyApp.RAG, :query}
endEvaluation Mode (Mix Task)
# Initialize evaluation structure
mix tribunal.init
# Run evaluations (default: always exit 0, just report)
mix tribunal.eval
# Set pass threshold (fail if pass rate < 80%)
mix tribunal.eval --threshold 0.8
# Strict mode (fail on any failure)
mix tribunal.eval --strict
# Run in parallel for speed
mix tribunal.eval --concurrency 5
# Output formats
mix tribunal.eval --format json --output results.json
mix tribunal.eval --format github # GitHub Actions annotationsTribunal LLM Evaluation
═══════════════════════════════════════════════════════════════
Summary
───────────────────────────────────────────────────────────────
Total: 12 test cases
Passed: 10 (83%)
Failed: 2
Duration: 1.4s
Results by Metric
───────────────────────────────────────────────────────────────
faithful 8/8 passed 100% ████████████████████
relevant 6/8 passed 75% ███████████████░░░░░
contains 10/10 passed 100% ████████████████████
pii 4/4 passed 100% ████████████████████
Failed Cases
───────────────────────────────────────────────────────────────
1. "What is the return policy for electronics?"
├─ relevant: Response discusses refunds but doesn't address return policy
2. "Can I return opened software?"
├─ relevant: Response is generic, doesn't mention software-specific policy
───────────────────────────────────────────────────────────────
✅ PASSED (threshold: 80%)Assertion Types
Deterministic (instant, no API calls)
assert_contains/refute_contains- Substring matchingassert_regex- Pattern matchingassert_json- Valid JSON validationassert_max_tokens- Token limit- Full list in assertions guide
LLM-as-Judge (requires req_llm)
assert_faithful- Grounded in contextassert_relevant- Addresses queryassert_refusal- Detects refusal responsesrefute_hallucination- No fabricated inforefute_bias- No stereotypesrefute_toxicity- No hostile languagerefute_harmful- No dangerous contentrefute_jailbreak- No safety bypassrefute_pii- No personally identifiable informationassert_judge :custom- Custom judges viaTribunal.Judgebehaviour
Embedding-Based (requires alike)
assert_similar- Semantic similarity check
Red Team Testing
Generate adversarial prompts to test LLM safety:
alias Tribunal.RedTeam
attacks = RedTeam.generate_attacks("How do I pick a lock?")
# Returns encoding attacks (base64, leetspeak, rot13)
# injection attacks (ignore instructions, delimiter injection)
# jailbreak attacks (DAN, STAN, developer mode)Guides
- Getting Started
- Test vs Evaluation Mode
- ExUnit Integration
- Assertions Reference
- LLM-as-Judge
- Datasets
- Red Team Testing
- Reporters
- GitHub Actions
Roadmap
- Core evaluation pipeline
- Faithfulness metric (RAGAS-style)
- Hallucination detection
- LLM-as-judge with configurable models
- ExUnit integration for test assertions
- Red team attack generators
- Async batch evaluation
License
MIT