Assessor

Document Version: 1.0 Status: Proposed

1. Vision & Mission

Project Name:Assessor

Vision: To be the definitive Continuous Integration & Continuous Delivery (CI/CD) platform for AI Quality, enabling teams to ship AI-powered features with the same rigor, confidence, and automation as traditional software.

Mission: To provide a scalable, extensible, and observable framework for defining, executing, and analyzing complex evaluation pipelines for AI models and prompts. Assessor transforms AI evaluation from a manual, ad-hoc process into a systematic, automated engineering discipline.

2. The Problem: The Missing Link in MLOps

While the industry has mature CI/CD tools for code (like Jenkins, GitHub Actions), the process for validating AI components remains dangerously primitive. An engineer can change a single sentence in a prompt or fine-tune a model with a new dataset, and in doing so, silently break a dozen critical capabilities.

The core challenges are:

Scale of Test Cases: A thorough evaluation requires running a model against thousands or tens of thousands of test cases, spanning regressions, capabilities, safety, and adversarial attacks. This is computationally intensive and requires massive parallelism.
Complexity of Evaluation: "Accuracy" is a poor metric for modern LLMs. A test case might need to check for valid JSON output, verify a specific reasoning path, or even use another powerful LLM (like GPT-4) as a judge to grade the response.
Lack of Structure: Evals are often managed as a messy collection of notebooks and scripts, making them difficult to version, share, and run consistently.
Observability Black Hole: When an eval test case fails, it's often a mystery why. You get a FAIL status but no deep, debuggable trace of the model's reasoning process during the failure.

Assessor is designed to solve these problems by treating "AI evaluations as code," managed within a robust, parallel execution engine.

3. Core Concepts & Architecture

Assessor is a complete OTP application, built around a high-throughput, durable job processing core. It is architected using the "Transparent Engine" model to ensure every evaluation is deeply observable.

The Core Architectural Primitives:

Evaluation (Ecto Schema): The top-level object. An "Evaluation" is a collection of Test Suites run against a specific Model Candidate (e.g., "Evaluate Llama3-8B-fine-tuned-v3"). It has a status: pending, running, passed, failed.
Test Suite (Ecto Schema): A version-controlled collection of Test Cases. Examples: regression_suite_v1, math_ability_suite_v4, safety_red_team_suite_v2.
Test Case (Ecto Schema): The smallest unit of work. A test case defines an input (e.g., a prompt), an assertion (e.g., "output must be valid JSON matching this schema," or "output must contain the phrase '60 mph'"), and metadata.
Job Queue (Oban):Assessor uses Oban as its backbone. When an "Evaluation" is triggered, Assessor enqueues thousands of Oban jobs—one for each Test Case to be executed. This provides durability, retries, and scalability out of the box.
Executor (Oban Worker): The heart of the system is the Assessor.Executor module, an Oban worker that performs a single test case. This worker is where the entire ecosystem comes together.

Diagram: Assessor Evaluation Flow

graph TD
    A[User triggers Evaluation via API/UI] --> B{Assessor Service}
    B --> C["Creates Evaluation record in DB"]
    C --> D["For each TestCase in Suites, enqueue Oban Job"]
    
    subgraph "Oban Job Processing (Massively Parallel)"
      Job1["Oban Job: TestCase 1"]
      Job2["Oban Job: TestCase 2"]
      JobN["Oban Job: TestCase N..."]
    end

    D --> Job1
    D --> Job2
    D --> JobN

    subgraph "Assessor.Executor (Oban Worker)"
        direction LR
        J[Job Dequeued] --> K["AITrace.trace('test_case.execute')"]
        K --> L["DSPex Pipeline Execution"]
        L --> M["Tool Call via Altar/Snakepit"]
        M --> N["Run Assertion Logic"]
        N --> O["AITrace.end_trace()"]
        O --> P["Update TestCaseResult in DB"]
    end

    Job1 --> J
    
    Q[Phoenix LiveView Dashboard] -->|Subscribes via PubSub| B
    P -->|DB Trigger/PubSub| Q

4. Ecosystem Integration: The Assessor.Executor

The Assessor.Executor worker is a masterclass in applying the "Transparent Engine" architecture.

# lib/assessor/executor.ex

defmodule Assessor.Executor do
  use Oban.Worker, queue: :evaluations

  # An Oban job performs the work for a single test case.
  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"test_case_id" => id, "model_candidate" => model}}) do
    test_case = Assessor.Repo.get(Assessor.TestCase, id)

    # 1. Every test case execution is a fully traced operation.
    AITrace.trace "assessor.test_case.execute", %{test_case_id: id, model: model.name} do
      # The `ctx` is automatically injected.
      
      # 2. Run the core logic using DSPex. The model endpoint is configured here.
      {:ok, result, new_ctx} = AITrace.span ctx, "model_inference" do
        DSPex.execute(test_case.pipeline, test_case.input, context: ctx, llm: model.endpoint)
      end
      
      # 3. Altar/Snakepit are used implicitly by the DSPex pipeline if it needs tools.
      # The AITrace instrumentation within DSPex captures all of this.

      # 4. Run the assertion to determine pass/fail.
      {:ok, assertion_result, _final_ctx} = AITrace.span new_ctx, "assertion" do
        Assessor.Assertions.run(test_case.assertion, result)
      end

      # 5. Store the detailed results.
      Assessor.Results.store(%{
        test_case_id: id,
        result: assertion_result,
        raw_output: result.raw,
        trace_id: AITrace.Context.get_trace_id(ctx) # CRITICAL: Link the result to its trace!
      })
    end
    
    :ok
  end
end

How the Ecosystem Powers Assessor:

AITrace: This is the killer feature. Every single test case execution generates a rich, debuggable trace. When a test fails, you don't just get a FAIL status; you get a trace_id that you can plug into your "Execution Cinema" UI to see the model's exact, step-by-step reasoning that led to the failure. This is a revolutionary improvement over current evaluation workflows.
DSPex:DSPex provides the language for defining the "work" to be done in a test case. A simple test might be a single Predict module, while a complex one could be a multi-step ChainOfThought or ReAct program.
Altar:Altar governs the tools available during an evaluation. This is a crucial for safety and consistency. You can ensure that an evaluation for a "customer service bot" only has access to a mock "get_order_status" tool, preventing it from touching production systems.
Snakepit:Snakepit provides the massively parallel execution grid for running the thousands of concurrent model inference calls generated by the Oban workers. Its streaming capabilities can be used for complex tests that require observing the model's output in real-time.

5. Key Features

Evaluation as Code: Test suites and cases are defined in code or configuration, allowing them to be version-controlled alongside models and prompts.
Massively Parallel Execution: Built on Oban to run tens of thousands of test cases concurrently, leveraging the full power of the BEAM.
Deeply Observable Failures: Every test result is linked to a full AITrace record, enabling instant, forensic debugging of failures.
Extensible Assertions: A pluggable assertion system allows for everything from simple string matching to complex, LLM-judged evaluations.
Results Warehouse: All evaluation results are stored in a structured database, enabling powerful analytics and trend monitoring over time.
Live Dashboard: A Phoenix LiveView interface provides a real-time view of running evaluations and a rich UI for exploring historical results and their associated traces.

Assessor completes the ecosystem by providing the critical feedback loop. It is the quality gate that ensures the intelligent agents built with Synapse are reliable, safe, and performant enough to be shipped to production.