Assessor

Document Version: 1.0
Status: Proposed

1. Vision & Mission

Project Name: Assessor

Vision: To be the definitive Continuous Integration & Continuous Delivery (CI/CD) platform for AI Quality, enabling teams to ship AI-powered features with the same rigor, confidence, and automation as traditional software.

Mission: To provide a scalable, extensible, and observable framework for defining, executing, and analyzing complex evaluation pipelines for AI models and prompts. Assessor transforms AI evaluation from a manual, ad-hoc process into a systematic, automated engineering discipline.

2. The Problem: The Missing Link in MLOps

While the industry has mature CI/CD tools for code (such as Jenkins and GitHub Actions), the process for validating AI components remains dangerously primitive. An engineer can change a single sentence in a prompt, or fine-tune a model on a new dataset, and silently break a dozen critical capabilities.

The core challenges are:

Assessor is designed to solve these problems by treating "AI evaluations as code," managed within a robust, parallel execution engine.
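As a rough illustration of what "AI evaluations as code" could look like, the sketch below defines a test case as plain Elixir data that can be reviewed and versioned like any other source file. The module name `EvalSketch.TestCase`, its fields, and the `validate/1` helper are hypothetical and are not taken from Assessor's actual schema.

```elixir
defmodule EvalSketch.TestCase do
  @moduledoc "Hypothetical shape of a version-controlled test case (an assumption, not Assessor's schema)."

  @enforce_keys [:name, :input, :assertion]
  defstruct [:name, :input, :assertion, tags: []]

  # Returns {:ok, test_case} when the fields are usable, else {:error, reason}.
  def validate(%__MODULE__{name: name}) when not is_binary(name) or name == "",
    do: {:error, :missing_name}

  # Only two assertion kinds are recognised in this sketch.
  def validate(%__MODULE__{assertion: {kind, _expected}} = test_case)
      when kind in [:equals, :contains],
      do: {:ok, test_case}

  def validate(%__MODULE__{}), do: {:error, :unsupported_assertion}
end
```

Because the definition is ordinary data, a change to a test case shows up in a diff and can be validated before any model is called.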

3. Core Concepts & Architecture

Assessor is a complete OTP application, built around a high-throughput, durable job processing core. It is architected using the "Transparent Engine" model to ensure every evaluation is deeply observable.

The Core Architectural Primitives:

Diagram: Assessor Evaluation Flow

graph TD
    A[User triggers Evaluation via API/UI] --> B{Assessor Service}
    B --> C["Creates Evaluation record in DB"]
    C --> D["For each TestCase in Suites, enqueue Oban Job"]
    
    subgraph "Oban Job Processing (Massively Parallel)"
      Job1["Oban Job: TestCase 1"]
      Job2["Oban Job: TestCase 2"]
      JobN["Oban Job: TestCase N..."]
    end

    D --> Job1
    D --> Job2
    D --> JobN

    subgraph "Assessor.Executor (Oban Worker)"
        direction LR
        J[Job Dequeued] --> K["AITrace.trace('test_case.execute')"]
        K --> L["DSPex Pipeline Execution"]
        L --> M["Tool Call via Altar/Snakepit"]
        M --> N["Run Assertion Logic"]
        N --> O["AITrace.end_trace()"]
        O --> P["Update TestCaseResult in DB"]
    end

    Job1 --> J
    
    Q[Phoenix LiveView Dashboard] -->|Subscribes via PubSub| B
    P -->|DB Trigger/PubSub| Q
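
The fan-out step in the diagram ("For each TestCase in Suites, enqueue Oban Job") can be sketched as a pure function that builds one args map per test case. `FanOutSketch`, the `test_case_ids` field on a suite, and the shape of the model candidate map are assumptions made for illustration; only the `"test_case_id"` and `"model_candidate"` keys come from the worker shown in this document.

```elixir
defmodule FanOutSketch do
  @moduledoc "Hypothetical helper that builds one job-args map per test case."

  # Builds the args for every test case in every suite.
  # Keys are strings because Oban persists job args as JSON.
  def job_args(suites, %{name: name, endpoint: endpoint}) do
    for suite <- suites, test_case_id <- suite.test_case_ids do
      %{
        "test_case_id" => test_case_id,
        "model_candidate" => %{"name" => name, "endpoint" => endpoint}
      }
    end
  end
end
```

In a running application, each map would presumably be turned into a job changeset with `Assessor.Executor.new/1` and the whole list inserted in one call with `Oban.insert_all/1`; that wiring is left out here so the sketch runs without a database.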

4. Ecosystem Integration: The Assessor.Executor

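The Assessor.Executor worker shows how the "Transparent Engine" architecture is applied in practice: each job wraps model inference and assertion checking in traced spans, then stores the result alongside its trace ID.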

# lib/assessor/executor.ex

defmodule Assessor.Executor do
  use Oban.Worker, queue: :evaluations

  # An Oban job performs the work for a single test case.
  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"test_case_id" => id, "model_candidate" => model}}) do
    # Oban stores args as JSON, so `model` arrives as a map with string keys.
    %{"name" => model_name, "endpoint" => model_endpoint} = model
    # Fail loudly if the test case no longer exists instead of crashing on nil later.
    test_case = Assessor.Repo.get!(Assessor.TestCase, id)

    # 1. Every test case execution is a fully traced operation.
    AITrace.trace "assessor.test_case.execute", %{test_case_id: id, model: model_name} do
      # The `ctx` is automatically injected.

      # 2. Run the core logic using DSPex. The model endpoint is configured here.
      {:ok, result, new_ctx} = AITrace.span ctx, "model_inference" do
        DSPex.execute(test_case.pipeline, test_case.input, context: ctx, llm: model_endpoint)
      end
      
      # 3. Altar/Snakepit are used implicitly by the DSPex pipeline if it needs tools.
      # The AITrace instrumentation within DSPex captures all of this.

      # 4. Run the assertion to determine pass/fail.
      {:ok, assertion_result, _final_ctx} = AITrace.span new_ctx, "assertion" do
        Assessor.Assertions.run(test_case.assertion, result)
      end

      # 5. Store the detailed results.
      Assessor.Results.store(%{
        test_case_id: id,
        result: assertion_result,
        raw_output: result.raw,
        trace_id: AITrace.Context.get_trace_id(ctx) # CRITICAL: Link the result to its trace!
      })
    end
    
    :ok
  end
end
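
The worker delegates the pass/fail decision to `Assessor.Assertions.run/2`, whose implementation is not shown in this document. The following is a minimal sketch of how such a runner might be structured, assuming assertions are tagged tuples and the model output is a string; `AssertionSketch` and the three assertion kinds are hypothetical.

```elixir
defmodule AssertionSketch do
  @moduledoc "Hypothetical assertion runner; not the real Assessor.Assertions."

  # Each clause returns {:pass, nil} or {:fail, message} for one assertion kind.
  def run({:equals, expected}, output) when is_binary(output) do
    verdict(output == expected, "expected exact match")
  end

  def run({:contains, fragment}, output) when is_binary(output) do
    verdict(String.contains?(output, fragment), "expected output to contain #{inspect(fragment)}")
  end

  def run({:matches, %Regex{} = pattern}, output) when is_binary(output) do
    verdict(Regex.match?(pattern, output), "expected output to match #{inspect(pattern)}")
  end

  defp verdict(true, _message), do: {:pass, nil}
  defp verdict(false, message), do: {:fail, message}
end
```

Dispatching on the tuple tag keeps each assertion kind in its own function clause, so a new kind can be added without touching the existing ones.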

How the Ecosystem Powers Assessor:

5. Key Features

Assessor completes the ecosystem by providing the critical feedback loop. It is the quality gate that ensures the intelligent agents built with Synapse are reliable, safe, and performant enough to be shipped to production.