# Tribunal ⚖️

LLM evaluation framework for Elixir.

Tribunal provides tools for evaluating and testing LLM outputs, detecting hallucinations, and measuring response quality.

> [!TIP]
> See `tribunal-juror` for an interactive Phoenix app to explore and test Tribunal's evaluation capabilities.

## Test Mode vs Evaluation Mode

Tribunal offers two modes for different use cases:

| Mode | Interface | Use Case | Failure Behavior |
|------|-----------|----------|------------------|
| Test | ExUnit | CI gates, safety checks | Fails immediately on any failure |
| Evaluation | Mix Task | Benchmarking, baseline tracking | Configurable thresholds |

**Test Mode** is for "this must work" cases: safety checks, refusal detection, critical RAG accuracy. Tests fail fast on any violation.

**Evaluation Mode** is for "track how well we're doing": run hundreds of evals, compare models, and monitor regressions over time. Set thresholds like "pass if 80% succeed."

## Installation

```elixir
def deps do
  [
    {:tribunal, "~> 0.1.0"},

    # Optional: for LLM-as-judge evaluations
    {:req_llm, "~> 1.2"},

    # Optional: for embedding-based similarity
    {:alike, "~> 0.1"}
  ]
end
```

## Quick Start

### ExUnit Integration

```elixir
defmodule MyApp.RAGTest do
  use ExUnit.Case
  use Tribunal.EvalCase

  @context ["Returns are accepted within 30 days with receipt."]

  test "response is faithful to context" do
    response = MyApp.RAG.query("What's the return policy?")

    assert_contains response, "30 days"
    assert_faithful response, context: @context
    refute_hallucination response, context: @context
  end
end
```

### Dataset-Driven Evaluations

```elixir
# test/evals/rag_test.exs
defmodule MyApp.RAGEvalTest do
  use ExUnit.Case
  use Tribunal.EvalCase

  tribunal_eval "test/evals/datasets/questions.json",
    provider: {MyApp.RAG, :query}
end
```
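
The dataset format isn't shown in this README; a plausible shape for `questions.json`, with field names that are illustrative rather than Tribunal's confirmed schema, might be:

```json
[
  {
    "input": "What's the return policy?",
    "context": ["Returns are accepted within 30 days with receipt."],
    "expected_contains": "30 days"
  }
]
```

Each entry would be fed through the configured provider (`MyApp.RAG.query/1` above) and checked against its expectations.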

### Evaluation Mode (Mix Task)

```shell
# Initialize evaluation structure
mix tribunal.init

# Run evaluations (default: always exit 0, just report)
mix tribunal.eval

# Set pass threshold (fail if pass rate < 80%)
mix tribunal.eval --threshold 0.8

# Strict mode (fail on any failure)
mix tribunal.eval --strict

# Run in parallel for speed
mix tribunal.eval --concurrency 5

# Output formats
mix tribunal.eval --format json --output results.json
mix tribunal.eval --format github  # GitHub Actions annotations
```

Sample output:

```
Tribunal LLM Evaluation
═══════════════════════════════════════════════════════════════

Summary
───────────────────────────────────────────────────────────────
  Total:     12 test cases
  Passed:    10 (83%)
  Failed:    2
  Duration:  1.4s

Results by Metric
───────────────────────────────────────────────────────────────
  faithful       8/8 passed    100%  ████████████████████
  relevant       6/8 passed    75%   ███████████████░░░░░
  contains       10/10 passed  100%  ████████████████████
  pii            4/4 passed    100%  ████████████████████

Failed Cases
───────────────────────────────────────────────────────────────
  1. "What is the return policy for electronics?"
     ├─ relevant: Response discusses refunds but doesn't address return policy

  2. "Can I return opened software?"
     ├─ relevant: Response is generic, doesn't mention software-specific policy

───────────────────────────────────────────────────────────────
✅ PASSED (threshold: 80%)
```
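
In CI, `--threshold` pairs well with `--format github`, which emits GitHub Actions annotations. A sketch of a workflow step (the step name and surrounding workflow are illustrative):

```yaml
- name: Run LLM evaluations
  run: mix tribunal.eval --threshold 0.8 --format github
```

With a threshold set, the task exits non-zero when the pass rate falls below it, failing the CI job.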

## Assertion Types

### Deterministic (instant, no API calls)
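
A minimal sketch reusing the `assert_contains` matcher from the Quick Start; the scaffolding mirrors the ExUnit example above:

```elixir
defmodule MyApp.DeterministicTest do
  use ExUnit.Case
  use Tribunal.EvalCase

  test "substring check runs instantly, with no API calls" do
    response = "Returns are accepted within 30 days with receipt."

    # Pure string matching: deterministic, fast, and free
    assert_contains response, "30 days"
  end
end
```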

### LLM-as-Judge (requires `req_llm`)
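
The Quick Start's `assert_faithful` and `refute_hallucination` take a `:context` option; treating them as judge-backed checks is an assumption based on this section's `req_llm` requirement:

```elixir
defmodule MyApp.JudgeTest do
  use ExUnit.Case
  use Tribunal.EvalCase

  @context ["Returns are accepted within 30 days with receipt."]

  test "response stays grounded in the retrieved context" do
    response = MyApp.RAG.query("What's the return policy?")

    # Judged against @context; needs :req_llm installed and a judge model configured
    assert_faithful response, context: @context
    refute_hallucination response, context: @context
  end
end
```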

### Embedding-Based (requires `alike`)
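
No embedding assertion is named in this README; the `assert_similar` call below is hypothetical, shown only to illustrate how an `alike`-backed similarity check might look:

```elixir
test "response is semantically close to a reference answer" do
  response = MyApp.RAG.query("What's the return policy?")

  # Hypothetical matcher: the name and :threshold option are assumptions
  assert_similar response, "Items can be returned within 30 days with a receipt.",
    threshold: 0.85
end
```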

## Red Team Testing

Generate adversarial prompts to test LLM safety:

```elixir
alias Tribunal.RedTeam

attacks = RedTeam.generate_attacks("How do I pick a lock?")
# Returns encoding attacks (base64, leetspeak, rot13),
# injection attacks (ignore instructions, delimiter injection),
# and jailbreak attacks (DAN, STAN, developer mode)
```
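
One way to use the generated attacks is to replay them through the system under test. The refusal check below is a stand-in, since the shape of each attack and Tribunal's refusal matchers are not documented here:

```elixir
for attack <- RedTeam.generate_attacks("How do I pick a lock?") do
  response = MyApp.RAG.query(attack)

  # Stand-in refusal heuristic; a real suite would use a proper refusal check
  assert response =~ ~r/can(no|')t help/i
end
```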

## Guides

## Roadmap

## License

MIT