cmdc_eval

CMDC Agent benchmark harness: runs public benchmarks and custom suites, and writes JSONL reports that LangSmith, Langfuse, and Datadog can all consume from the same source.

cmdc_eval is a standalone sub-library of cmdc. It provides:

Installation

def deps do
  [
    {:cmdc, "~> 0.5"},
    {:cmdc_eval, "~> 0.2"}
  ]
end

Quick Start

1. Run the Internal Suite (cmdc kernel self-verification)

$ mix cmdc.eval --suite=internal --model="anthropic:claude-sonnet-4-5" --report=internal.jsonl

Output:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Suite: internal
Model: anthropic:claude-sonnet-4-5
Cases: 5
Pass: 5 (rate=1.0)
Fail: 0
Latency: avg=1234.0ms total=6170ms
Tokens: in=234 out=567
Cost: $0.0123
Report: internal.jsonl
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

2. Run BFCL v3 (public benchmark)

# 1. Fetch the fixtures first (v0.1 writes placeholders; for real data see the cmdc.eval.fetch_bfcl moduledoc)
$ mix cmdc.eval.fetch_bfcl
# 2. Run the evals
$ mix cmdc.eval --suite=bfcl --model="openai:gpt-4o" --report=bfcl.jsonl

3. Programmatic usage

{:ok, report} =
  CMDCEval.run(
    suite: CMDCEval.Suites.Internal,
    model: "anthropic:claude-sonnet-4-5",
    concurrency: 4,
    timeout_ms: 60_000
  )

report.summary
# => %{total: 5, pass: 5, fail: 0, pass_rate: 1.0, ...}

report.runs
# => [%CMDCEval.Run{case_id: "basic_text", pass: true, ...}, ...]

4. Custom Suite

defmodule MyApp.MySuite do
  @behaviour CMDCEval.Suite
  alias CMDCEval.Case

  @impl true
  def name, do: "my_app_evals"

  @impl true
  def cases do
    [
      Case.new(id: "task_a", input: "Solve task A", expected: ~r/done/),
      Case.new(id: "task_b", input: "Solve task B", expected: ~r/completed/)
    ]
  end

  # A case passes when the reply matches the case's expected regex.
  @impl true
  def assert(%Case{expected: %Regex{} = re}, reply), do: Regex.match?(re, reply)
end

# Run it
{:ok, report} =
  CMDCEval.run(
    suite: MyApp.MySuite,
    model: "anthropic:claude-sonnet-4-5"
  )

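Report JSONL Schema

Each line of the report is the JSON for one Run (pretty-printed here for readability). Downstream, LangSmith, Langfuse, and Datadog can consume it directly: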

{
  "suite": "internal",
  "case_id": "basic_text",
  "model": "anthropic:claude-sonnet-4-5",
  "pass": true,
  "latency_ms": 1234,
  "tokens_in": 234,
  "tokens_out": 567,
  "cost_usd": 0.0123,
  "events_digest": null,
  "error": null,
  "timestamp": "2026-05-18T12:34:56Z",
  "metadata": {"category": "smoke"}
}

Fields are stable: no field is removed or changed in a minor version, and new fields are passed through via metadata.
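Because every line is one self-contained Run object, a report can be re-aggregated without cmdc_eval itself. The following is a minimal sketch, not part of the library: `ReportSummary` and its function names are hypothetical, the summary keys simply mirror the ones shown in `report.summary` above, and `from_file/1` assumes Elixir 1.18+ for the built-in `JSON` module (swap in another JSON decoder on older versions).

```elixir
defmodule ReportSummary do
  @moduledoc "Hypothetical helper for illustration; not part of cmdc_eval."

  # Summarize decoded report rows (maps with string keys, one per JSONL line).
  def summarize(rows) do
    total = length(rows)
    passed = Enum.count(rows, &(&1["pass"] == true))

    %{
      total: total,
      pass: passed,
      fail: total - passed,
      pass_rate: if(total == 0, do: 0.0, else: passed / total),
      tokens_in: sum_field(rows, "tokens_in"),
      tokens_out: sum_field(rows, "tokens_out"),
      cost_usd: sum_field(rows, "cost_usd")
    }
  end

  # Read a JSONL file line by line, skipping blank lines.
  # Assumes Elixir 1.18+ (built-in JSON module).
  def from_file(path) do
    path
    |> File.stream!()
    |> Stream.map(&String.trim/1)
    |> Stream.reject(&(&1 == ""))
    |> Enum.map(&JSON.decode!/1)
    |> summarize()
  end

  # Treat a missing or null field as zero so partial rows do not crash the sum.
  defp sum_field(rows, key) do
    rows |> Enum.map(&(&1[key] || 0)) |> Enum.sum()
  end
end
```

With a report produced by the commands above, `ReportSummary.from_file("internal.jsonl")` would return a map in the same spirit as `report.summary`.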

RAG Suite Example

defmodule MyApp.RAGEvalSuite do
  @behaviour CMDCEval.Suite
  alias CMDCEval.{Assertions.RAG, Case}

  def name, do: "rag_regression"

  def cases do
    [
      Case.new(
        id: "policy-approval",
        input: "Do high-risk operations require approval?",
        expected: %{rag: %{expected_chunk_ids: ["approval-policy-c1"]}},
        metadata: %{allowed_collections: ["policies"]}
      )
    ]
  end

  # The case passes only if every RAG assertion holds for the run context.
  def assert(_case, _reply, context) do
    RAG.recall_at_k(context, 5, 1.0) and
      RAG.contains_citation(context) and
      RAG.no_unauthorized_source(context) and
      RAG.faithfulness_min(context, 0.8)
  end
end
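To make the first assertion concrete, here is a sketch of the conventional recall@k metric: the fraction of expected chunk ids that appear among the top k retrieved ids. This is an assumption about what `RAG.recall_at_k(context, 5, 1.0)` measures, not the library's implementation; `RecallSketch` and its argument shapes are hypothetical.

```elixir
defmodule RecallSketch do
  @moduledoc "Hypothetical illustration of recall@k; not part of cmdc_eval."

  # retrieved_ids: chunk ids in ranked order; expected_ids: ids that should be found.
  def recall_at_k(retrieved_ids, expected_ids, k) do
    top_k = retrieved_ids |> Enum.take(k) |> MapSet.new()
    expected = MapSet.new(expected_ids)

    case MapSet.size(expected) do
      # Nothing was expected, so nothing can be missed.
      0 -> 1.0
      n -> MapSet.size(MapSet.intersection(top_k, expected)) / n
    end
  end

  # Threshold form: true when recall@k reaches at least `min`.
  def recall_at_k?(retrieved_ids, expected_ids, k, min) do
    recall_at_k(retrieved_ids, expected_ids, k) >= min
  end
end
```

Under this reading, a threshold of 1.0 at k = 5 means every expected chunk id must appear in the top five retrieved chunks.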

Workflow Eval Example

alias CMDCEval.Assertions.Workflow

{:ok, snapshot} = CMDCOrchestrator.status(run_id)

context =
  CMDCEval.Workflow.from_status(snapshot,
    expected_branches: ["approved", "default"]
  )

Workflow.gate(context,
  completion_rate_min: 1.0,
  node_failure_rate_lte: 0.0,
  branch_coverage_min: 1.0,
  human_task_sla_ms_lte: 86_400_000,
  retry_count_lte: 2,
  cost_usd_lte: 1.0,
  latency_ms_lte: 300_000,
  require_fork_join_satisfied: true
)
# => true / false

Workflow.gate_failures(context, human_task_sla_ms_lte: 1_000)
# => [%{metric: :human_task_sla_ms, expected: {:<=, 1000}, actual: 4200}]

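Workflow Eval consumes only the stable ledger shape of Run / NodeRun / RunEvent and does not depend on the Phoenix schema, the Trace Viewer, or the Eval Dashboard. An enterprise platform can run this set of gates before approving the release of an AgentSpec / Workflow and, when a gate fails, show the result of gate_failures/2 to the person publishing.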
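One way to present those failures to the publisher is to turn each entry into a readable line. The sketch below is not part of cmdc_eval: `GateReport` is a hypothetical helper, and it assumes every failure has the `%{metric: ..., expected: {op, limit}, actual: ...}` shape shown in the example output above.

```elixir
defmodule GateReport do
  @moduledoc "Hypothetical helper for illustration; not part of cmdc_eval."

  # Turn gate failure maps into one human-readable line each.
  def format(failures) do
    Enum.map(failures, fn %{metric: metric, expected: {op, limit}, actual: actual} ->
      "#{metric}: expected #{op} #{inspect(limit)}, got #{inspect(actual)}"
    end)
  end
end

failures = [%{metric: :human_task_sla_ms, expected: {:<=, 1000}, actual: 4200}]

failures
|> GateReport.format()
|> Enum.each(&IO.puts/1)
# prints: human_task_sla_ms: expected <= 1000, got 4200
```

An empty list formats to an empty list, so the same call works whether or not the gate passed.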

v0.2 Scope

Added

v0.1 Scope

Implemented

🔁 Deferred to v0.2

CLI Exit Codes

License

Apache-2.0