cmdc_eval

CMDC Agent benchmark harness: runs public benchmarks and custom suites, and writes JSONL reports in a single format that LangSmith / Langfuse / Datadog can all consume.

cmdc_eval is a standalone sub-library of cmdc. It provides:

Installation

def deps do
  [
    {:cmdc, "~> 0.5"},
    {:cmdc_eval, "~> 0.1"}
  ]
end

Quick Start

1. Run the Internal Suite (cmdc kernel self-check)

$ mix cmdc.eval --suite=internal --model="anthropic:claude-sonnet-4-5" --report=internal.jsonl

Output:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Suite:   internal
Model:   anthropic:claude-sonnet-4-5
Cases:   5
Pass:    5  (rate=1.0)
Fail:    0
Latency: avg=1234.0ms total=6170ms
Tokens:  in=234 out=567
Cost:    $0.0123
Report:  internal.jsonl
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

2. Run BFCL v3 (public benchmark)

# 1. Fetch the fixtures first (v0.1 writes placeholders; for real data see the cmdc.eval.fetch_bfcl moduledoc)
$ mix cmdc.eval.fetch_bfcl

# 2. Run the evals
$ mix cmdc.eval --suite=bfcl --model="openai:gpt-4o" --report=bfcl.jsonl

3. Programmatic usage

{:ok, report} = CMDCEval.run(
  suite: CMDCEval.Suites.Internal,
  model: "anthropic:claude-sonnet-4-5",
  concurrency: 4,
  timeout_ms: 60_000
)

report.summary
# => %{total: 5, pass: 5, fail: 0, pass_rate: 1.0, ...}

report.runs
# => [%CMDCEval.Run{case_id: "basic_text", pass: true, ...}, ...]
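
Once a report is in hand, a common next step is listing the cases that failed. The sketch below assumes only that each entry in report.runs carries the case_id and pass fields shown above. FailedCases is a hypothetical helper, not part of cmdc_eval, and the "tool_call" case id is made up for illustration. Plain maps stand in for %CMDCEval.Run{} structs so the snippet runs on its own.

```elixir
defmodule FailedCases do
  # Hypothetical helper (not part of cmdc_eval).
  # Returns the case_id of every run whose pass field is false.
  # The map pattern also matches structs that carry these two keys.
  def ids(runs) do
    for %{pass: false, case_id: id} <- runs, do: id
  end
end

# Stand-in data; in real use this would be report.runs.
runs = [
  %{case_id: "basic_text", pass: true},
  %{case_id: "tool_call", pass: false}
]

IO.inspect(FailedCases.ids(runs))
# => ["tool_call"]
```

Runs that do not match the pattern are skipped by the comprehension, so passing runs never reach the result list.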

4. Custom Suite

defmodule MyApp.MySuite do
  @behaviour CMDCEval.Suite

  alias CMDCEval.Case

  @impl true
  def name, do: "my_app_evals"

  @impl true
  def cases do
    [
      Case.new(id: "task_a", input: "Solve task A", expected: ~r/done/),
      Case.new(id: "task_b", input: "Solve task B", expected: ~r/completed/)
    ]
  end

  @impl true
  # A case passes when the model's reply matches its expected regex.
  def assert(%Case{expected: %Regex{} = re}, reply), do: Regex.match?(re, reply)
end

# Run it
{:ok, report} = CMDCEval.run(
  suite: MyApp.MySuite,
  model: "anthropic:claude-sonnet-4-5"
)
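
The assert/2 callback above reduces to a regex match on the reply, so the expected patterns can be sanity-checked without calling a model. A minimal sketch, assuming hand-written replies and plain maps in place of %CMDCEval.Case{} (AssertSketch is a hypothetical module, not part of cmdc_eval):

```elixir
defmodule AssertSketch do
  # Mirrors the regex-based check from MyApp.MySuite, on plain maps.
  def pass?(%{expected: %Regex{} = re}, reply) when is_binary(reply) do
    Regex.match?(re, reply)
  end
end

task_a = %{id: "task_a", expected: ~r/done/}

IO.inspect(AssertSketch.pass?(task_a, "Task A is done."))
# => true
IO.inspect(AssertSketch.pass?(task_a, "Still working on it"))
# => false
```

Note that ~r/done/ is case-sensitive, so a reply of "Done" would fail; adding the i modifier (~r/done/i) is one way to loosen the check if that is what the suite intends.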

Report JSONL Schema

Each line is the JSON for one Run. LangSmith / Langfuse / Datadog can consume it directly downstream:

{
  "suite": "internal",
  "case_id": "basic_text",
  "model": "anthropic:claude-sonnet-4-5",
  "pass": true,
  "latency_ms": 1234,
  "tokens_in": 234,
  "tokens_out": 567,
  "cost_usd": 0.0123,
  "events_digest": null,
  "error": null,
  "timestamp": "2026-05-18T12:34:56Z",
  "metadata": {"category": "smoke"}
}

Fields are stable: no field is removed or changed in a minor version, and new fields are passed through via metadata.
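
Because every line is an independent JSON object, a report can be aggregated by streaming it line by line. The sketch below recomputes a pass rate and total latency from the pass and latency_ms fields in the schema above. ReportSummary is a hypothetical helper, not part of cmdc_eval; it assumes Elixir 1.18 or later for the built-in JSON module, and the sample lines are made up.

```elixir
defmodule ReportSummary do
  # Hypothetical helper (not part of cmdc_eval). Relies on the JSON module
  # that ships with Elixir 1.18+; swap in another decoder on older versions.

  # Summarize a report file that holds one JSON object per line.
  def from_file(path), do: path |> File.stream!() |> from_lines()

  # Summarize any enumerable of JSONL lines.
  def from_lines(lines) do
    lines
    |> Stream.map(&String.trim/1)
    |> Stream.reject(&(&1 == ""))
    |> Stream.map(&JSON.decode!/1)
    |> Enum.reduce(%{total: 0, pass: 0, latency_ms: 0}, fn run, acc ->
      %{
        total: acc.total + 1,
        pass: acc.pass + if(run["pass"] == true, do: 1, else: 0),
        latency_ms: acc.latency_ms + (run["latency_ms"] || 0)
      }
    end)
    |> with_pass_rate()
  end

  # An empty report gets a pass rate of 0.0 instead of dividing by zero.
  defp with_pass_rate(%{total: 0} = acc), do: Map.put(acc, :pass_rate, 0.0)
  defp with_pass_rate(acc), do: Map.put(acc, :pass_rate, acc.pass / acc.total)
end

# Made-up sample lines; in real use call ReportSummary.from_file("internal.jsonl").
lines = [
  ~s({"case_id": "basic_text", "pass": true, "latency_ms": 1234}),
  ~s({"case_id": "tool_call", "pass": false, "latency_ms": 766})
]

IO.inspect(ReportSummary.from_lines(lines))
```

The reducer reads only the fields it needs and ignores everything else, so the helper keeps working if later versions add keys under metadata.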

v0.1 Scope

Implemented

🔁 Deferred to v0.2

CLI Exit Codes

License

Apache-2.0