cmdc_eval
CMDC Agent benchmark harness: plugs in public benchmarks and custom suites, and writes JSONL reports that LangSmith / Langfuse / Datadog can all consume from the same source.
cmdc_eval is a standalone sub-library of cmdc. It provides:
- A standard Suite behaviour (CMDCEval.Suite): implement 3 callbacks to register an evaluation set
- A built-in Internal Suite: a regression benchmark that verifies cmdc kernel internals (DAG / Steering / Checkpoint)
- A BFCL v3 Suite integration framework: Berkeley Function Calling Leaderboard, public data
- The Mix.Tasks.Cmdc.Eval CLI: run evals with a single command and write JSONL
- A stable JSONL report schema: suite / case_id / model / pass / latency_ms / tokens_in / tokens_out / cost_usd / events_digest
Installation
def deps do
  [
    {:cmdc, "~> 0.5"},
    {:cmdc_eval, "~> 0.1"}
  ]
end
Quick Start
1. Run the Internal Suite (cmdc kernel self-verification)
$ mix cmdc.eval --suite=internal --model="anthropic:claude-sonnet-4-5" --report=internal.jsonl
Output:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Suite: internal
Model: anthropic:claude-sonnet-4-5
Cases: 5
Pass: 5 (rate=1.0)
Fail: 0
Latency: avg=1234.0ms total=6170ms
Tokens: in=234 out=567
Cost: $0.0123
Report: internal.jsonl
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
2. Run BFCL v3 (public benchmark)
# 1. Fetch the fixtures first (v0.1 writes placeholders; for real data see the cmdc.eval.fetch_bfcl moduledoc)
$ mix cmdc.eval.fetch_bfcl
# 2. Run the evals
$ mix cmdc.eval --suite=bfcl --model="openai:gpt-4o" --report=bfcl.jsonl
3. Programmatic Usage
{:ok, report} = CMDCEval.run(
  suite: CMDCEval.Suites.Internal,
  model: "anthropic:claude-sonnet-4-5",
  concurrency: 4,
  timeout_ms: 60_000
)
report.summary
# => %{total: 5, pass: 5, fail: 0, pass_rate: 1.0, ...}
report.runs
# => [%CMDCEval.Run{case_id: "basic_text", pass: true, ...}, ...]
4. Custom Suite
defmodule MyApp.MySuite do
  @behaviour CMDCEval.Suite
  alias CMDCEval.Case

  # Suite name, written to the "suite" field of each report row
  @impl true
  def name, do: "my_app_evals"

  # The cases to run; here `expected` is a regex the reply must match
  @impl true
  def cases do
    [
      Case.new(id: "task_a", input: "Solve task A", expected: ~r/done/),
      Case.new(id: "task_b", input: "Solve task B", expected: ~r/completed/)
    ]
  end

  # A case passes when the reply matches its expected regex
  @impl true
  def assert(%Case{expected: %Regex{} = re}, reply), do: Regex.match?(re, reply)
end
# Run
{:ok, report} = CMDCEval.run(
  suite: MyApp.MySuite,
  model: "anthropic:claude-sonnet-4-5"
)
Report JSONL Schema
Each line is the JSON for one Run. Downstream tools such as LangSmith / Langfuse / Datadog can consume it directly:
{
  "suite": "internal",
  "case_id": "basic_text",
  "model": "anthropic:claude-sonnet-4-5",
  "pass": true,
  "latency_ms": 1234,
  "tokens_in": 234,
  "tokens_out": 567,
  "cost_usd": 0.0123,
  "events_digest": null,
  "error": null,
  "timestamp": "2026-05-18T12:34:56Z",
  "metadata": {"category": "smoke"}
}
The fields are stable: no field is removed or changed in a minor release, and new fields are passed through metadata.
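Because every row carries the same fields, the summary figures printed by the CLI can in principle be recomputed from the rows alone. The following is a minimal sketch under stated assumptions, not the library's own aggregation: `ReportSummarySketch`, `summarize/1`, and the second sample case are hypothetical, and the rows are assumed to be already decoded into maps with string keys by whichever JSON decoder you use.

```elixir
defmodule ReportSummarySketch do
  # Hypothetical helper for illustration; not part of cmdc_eval.

  # Takes decoded report rows (maps with string keys) and returns totals.
  def summarize(rows) when is_list(rows) do
    total = length(rows)
    pass = Enum.count(rows, &(&1["pass"] == true))
    latency_total = sum_field(rows, "latency_ms")

    %{
      total: total,
      pass: pass,
      fail: total - pass,
      # Guard against division by zero for an empty report.
      pass_rate: if(total == 0, do: 0.0, else: pass / total),
      latency_total_ms: latency_total,
      latency_avg_ms: if(total == 0, do: 0.0, else: latency_total / total),
      tokens_in: sum_field(rows, "tokens_in"),
      tokens_out: sum_field(rows, "tokens_out"),
      cost_usd: sum_field(rows, "cost_usd")
    }
  end

  # Sums one numeric field, treating missing or null values as 0.
  defp sum_field(rows, key) do
    rows |> Enum.map(&(Map.get(&1, key) || 0)) |> Enum.sum()
  end
end

# Two sample rows; "case_b" is a made-up case id for this example.
rows = [
  %{"case_id" => "basic_text", "pass" => true, "latency_ms" => 1200,
    "tokens_in" => 40, "tokens_out" => 100, "cost_usd" => 0.002},
  %{"case_id" => "case_b", "pass" => false, "latency_ms" => 800,
    "tokens_in" => 60, "tokens_out" => 50, "cost_usd" => nil}
]

summary = ReportSummarySketch.summarize(rows)
IO.inspect(summary)
```

Treating null values as 0 keeps the totals defined when a row records an error instead of usage figures; whether cmdc_eval itself does this is not stated here.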
v0.1 Scope
✅ Implemented:
- Suite behaviour + 4 structs (Case / Run / Report / Suite)
- Mix.Tasks.Cmdc.Eval + Mix.Tasks.Cmdc.Eval.FetchBfcl
- CMDCEval.Suites.Internal: 5 cmdc kernel self-verification cases
- CMDCEval.Suites.BFCL: framework + fetch_bfcl placeholder
- JSONL report schema + summary aggregation
- 11+ unit tests covering the structs + Suite + Runner (end-to-end with a mock provider)
🔁 Deferred to v0.2:
- Automatic fetch of BFCL v3 fixtures (v0.1 writes placeholders; a manual git clone of upstream is required)
- tau2-bench airline suite
- MemoryAgentBench subset (depends on the cmdc_memory_pg PG integration)
- Direct LangSmith sync (OTLP)
- The full 5 BFCL subcategories (multiple / parallel / parallel_multiple / multi_turn)
CLI Exit Codes
- 0: all cases pass
- 1: at least one case failed
- 2: the suite has no cases (e.g. BFCL fixtures have not been fetched)
- 3: the suite module does not exist or is invalid
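The exit codes make the task straightforward to gate in CI. As a hedged sketch of how a run outcome could map onto the codes listed above: `ExitCodeSketch`, `from_outcome/1`, and the `{:error, :invalid_suite}` tuple are hypothetical names chosen for illustration, and the real task may represent these outcomes differently.

```elixir
defmodule ExitCodeSketch do
  # Hypothetical mapping that mirrors the exit code list above.
  # Clause order matters: an empty suite (total: 0) is checked
  # before the "no failures" clause, since it also has fail: 0.

  # 3: the suite module does not exist or is invalid
  def from_outcome({:error, :invalid_suite}), do: 3
  # 2: the suite has no cases
  def from_outcome({:ok, %{total: 0}}), do: 2
  # 0: all cases pass
  def from_outcome({:ok, %{fail: 0}}), do: 0
  # 1: at least one case failed
  def from_outcome({:ok, %{fail: fail}}) when fail > 0, do: 1
end

IO.inspect(ExitCodeSketch.from_outcome({:ok, %{total: 5, pass: 5, fail: 0}}))
IO.inspect(ExitCodeSketch.from_outcome({:ok, %{total: 5, pass: 3, fail: 2}}))
```

In a script, the returned integer could be handed to `System.halt/1` so that the shell sees the same code.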
License
Apache-2.0