Firebreak

Find the process coupling that crosses your OTP supervision tree — the synchronous dependencies a restart turns into :noproc/:timeout, which the supervision tree itself can't show.

Your supervision tree declares how your app is structured; your code declares how your processes actually talk to each other. Those are two different graphs, and the dangerous bugs live in the gap: a process in one branch of the tree synchronously depends on a process in another, so a restart the tree calls "contained" surfaces an error somewhere that looks unrelated.

That's firebreak's headline finding. Run it on Livebook, for instance, and it flags that HomeLive/OpenLive/SessionLive call NotebookManager synchronously inside mount/3 — but NotebookManager lives in a different branch of the supervision tree, so if it restarts mid-request the page fails to render. One static run, no app boot.

It reads the tree the way OTP does, by calling each supervisor's init/1 (child specs are runtime data; init/1 returns them without starting anything), and falls back to static AST parsing for code it can't load, marking which supervisors were read that way. The coupling graph is always static. By default: no app boot, no LLM, no runtime tracing; fast and deterministic enough for a CI gate.
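A minimal sketch of why calling init/1 is safe, assuming an ordinary module-based supervisor. Sketch.Sup and its :counter child are hypothetical names for illustration, not part of Firebreak. Calling the callback directly only returns the flags and child specs; no process is started.

```elixir
defmodule Sketch.Sup do
  use Supervisor

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(_arg) do
    children = [
      # A hypothetical child, given an explicit id.
      Supervisor.child_spec({Agent, fn -> 0 end}, id: :counter)
    ]

    Supervisor.init(children, strategy: :one_for_all)
  end
end

# Call the callback directly, the way a reader of the tree could.
{:ok, {flags, child_specs}} = Sketch.Sup.init([])

IO.inspect(flags.strategy, label: "strategy")
IO.inspect(Enum.map(child_specs, & &1.id), label: "child ids")
# The supervisor itself was never started, so its name is unregistered.
IO.inspect(Process.whereis(Sketch.Sup), label: "registered pid")
```

The returned flags and specs are plain data, which is what makes an exact read possible without booting anything.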

What it is / isn't. Best-effort static analysis, not a type system: it surfaces hazards, not certainties, and metaprogramming or runtime-computed names can still hide an edge. The default report leads with the high-confidence coupling/correctness findings and collapses the advisory ones to a count (--all shows them). For exact runtime truth — live DynamicSupervisor children, recovered names — the opt-in --observe mode attaches to a running node.

What it finds

firebreak - supervision & coupling analysis
  1 files | 7 modules | 3 supervisors (2 exact, 1 static) | 1 coupling edges

Supervision roots:
  - Demo.App

Findings (3):
  [HIGH] missing_trap_exit (best-effort)  Demo.Worker1 (demo.ex:40)
         links a process (task_start_link) but does not trap exits; a crash in the linked process takes Demo.Worker1 down with it. Use a supervised Task or set Process.flag(:trap_exit, true).
  [MED ] one_for_all_blast_radius (exact)  Demo.SupA (demo.ex:9)
         :one_for_all with 3 children - any single child crash restarts all 3. If these children aren't genuinely interdependent, a narrower strategy contains the blast.
  [LOW ] cross_tree_coupling (best-effort)  Demo.SupA (demo.ex:9)
         1 module(s) outside Demo.SupA's subtree depend on processes inside it (Demo.Api); restarting it can surface :timeout/:noproc in those callers - coupling the supervision tree does not show.

Summary: 1 high, 1 medium, 1 low, 0 info

The (2 exact, 1 static) header counts how each supervisor's tree was read: Demo.SupA/Demo.SupB via init/1 (exact), the Application statically (its start/2 boots the tree, so Firebreak never calls it). Each finding is tagged (exact) or (best-effort) so you know whether it rests on the real tree or on a static read.

That last finding is the one no other tool gives you. Demo.Api lives under Demo.SupB, but it synchronously calls Demo.Cache, which lives under Demo.SupA. The supervision tree says these subtrees are independent. The coupling graph says a Demo.SupA restart can surface :timeout/:noproc inside Demo.Api — a failure path the tree alone would call "contained."
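A minimal runnable sketch of that failure path, under stated assumptions. CrossTree.Cache and CrossTree.Api are hypothetical stand-ins for Demo.Cache and Demo.Api, the two supervisors are started anonymously rather than under an application, and the restart window is simulated by stopping the first supervisor outright so the result is deterministic.

```elixir
defmodule CrossTree.Cache do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, %{greeting: "hi"}, name: __MODULE__)

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:get, key}, _from, state), do: {:reply, Map.get(state, key), state}
end

defmodule CrossTree.Api do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)
  def lookup(key), do: GenServer.call(__MODULE__, {:lookup, key})

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:lookup, key}, _from, state) do
    # Synchronous call into a process owned by the other subtree.
    reply =
      try do
        {:ok, GenServer.call(CrossTree.Cache, {:get, key})}
      catch
        :exit, {:noproc, _} -> {:error, :noproc}
      end

    {:reply, reply, state}
  end
end

# Two sibling subtrees that look independent in the supervision tree.
{:ok, sup_a} = Supervisor.start_link([CrossTree.Cache], strategy: :one_for_one)
{:ok, _sup_b} = Supervisor.start_link([CrossTree.Api], strategy: :one_for_one)

IO.inspect(CrossTree.Api.lookup(:greeting), label: "cache subtree up")

# Take the cache subtree down; the Api subtree is untouched but still affected.
Supervisor.stop(sup_a)
IO.inspect(CrossTree.Api.lookup(:greeting), label: "cache subtree down")
```

Here the exit is caught and turned into an error tuple. Without the try/catch, the :noproc exit would crash the caller in the other subtree.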

Tier 1 — structural checks

Facts you can read straight off each module, no cross-module graph needed:

Check	What it flags
`one_for_all_blast_radius`	`:one_for_all` over many children — one crash restarts every sibling
`missing_trap_exit`	a GenServer that links a process but doesn't trap exits
`shutdown_exceeds_intensity_window`	a child shutdown timeout that can burn the supervisor's whole restart budget
`default_restart_intensity`	a large supervisor (5+ children) on a tight restart budget — the default `3 restarts / 5s`, or less
`start_link_in_callback`	a process started with a direct `start_link` inside an OTP callback (`handle_*`/`init`) — re-spawned on every invocation, so it leaks duplicates or fails on `:already_started`. Should be a supervised child
`lookup_or_create_race`	a non-atomic registry test-and-set: one function that reads the registry (`whereis`/`Registry.lookup`/`:global.whereis_name`) and then creates/registers in the same body. Two callers can both miss and both create; the loser crashes (a raising `register`) or gets `{:error, {:already_started, pid}}`, and its just-spawned process leaks as a ghost (Christakis & Sagonas, PADL 2010)
`unhandled_port_exit`	a `GenServer`/`:gen_statem` that opens a port (`Port.open`/`:erlang.open_port`) but no `handle_info` clause handles the port's termination (`{port, {:exit_status, _}}` / `{:EXIT, port, _}`) — the external program can die without the owner noticing, or take it down via the linked exit
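As an illustration of one row above, here is a sketch of the shape `lookup_or_create_race` describes, next to an atomic alternative. Sketch.Rooms and both function names are hypothetical. Whether Firebreak flags exactly this form is an assumption based on the table's description.

```elixir
defmodule Sketch.Rooms do
  # Racy shape: the read (whereis) and the create (start) are separate steps,
  # so two callers can both see nil and both try to start the process.
  def racy_get_or_start(name) do
    case Process.whereis(name) do
      nil ->
        # The losing caller raises a MatchError on {:error, {:already_started, pid}}.
        {:ok, pid} = Agent.start(fn -> %{} end, name: name)
        pid

      pid ->
        pid
    end
  end

  # Atomic alternative: always attempt the start and let name registration
  # arbitrate; the loser receives the winner's pid instead of crashing.
  def get_or_start(name) do
    case Agent.start(fn -> %{} end, name: name) do
      {:ok, pid} -> pid
      {:error, {:already_started, pid}} -> pid
    end
  end
end

first = Sketch.Rooms.get_or_start(:sketch_room)
second = Sketch.Rooms.get_or_start(:sketch_room)
IO.inspect(first == second, label: "same process")
```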

Tier 2 — coupling across the tree

The differentiator. Resolves the process-to-process coupling graph (GenServer.call/cast, :gen_server, :gen_statem, registered names, Registry, :global, Process.whereis, :ets, Phoenix.PubSub, :pg), maps each call to the module that owns the target, then reasons about the gap between that graph and the supervision forest:

Check	What it flags
`cross_tree_coupling`	a module outside a supervisor's subtree depends on a process inside it — the restart the tree calls "contained" surfaces `:timeout`/`:noproc` in the outside caller. Synchronous callers from several modules rank highest
`supervisor_subtree_blast`	`:one_for_all`/`:rest_for_one` over child supervisors; a crash restarts whole sibling subtrees, not just a worker
`dynamic_supervisor_restart_blast`	a `DynamicSupervisor`/`Registry` get-or-start race where a restart drops the registration callers rely on
`boot_order_dependency`	an `init/1` that synchronously calls a sibling started after it — the dependency isn't alive yet on first boot
`crash_cascade`	failure simulation: "if this process crashes now, who blocks?" — follows the restart closure (`:one_for_all`/`:rest_for_one`) so a crash that co-restarts a depended-on sibling is caught even when the coupling is invisible from the call sites alone
`cyclic_coupling`	a cycle of synchronous calls (A→B→A) — a deadlock hazard: each can block in `handle_call` awaiting the other
`boot_order_cycle`	a cycle of synchronous calls made inside `init/1` — the tree can't start: on boot each `init` waits on a peer that isn't running yet. The sharper, `:high` sibling of `cyclic_coupling`
`orphaned_stateful_process`	a `GenServer`/`:gen_statem`/`GenStage`/`Agent` in no supervisor's subtree. Sharpened with evidence: supervised via a `child_spec` builder or via `DynamicSupervisor.start_child` (relabelled, likely fine), hand-started via a direct `start_link` outside any supervisor (the exact call site is named), or genuinely unknown
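The restart closure that `crash_cascade` follows can be sketched as below. This is an illustration of the idea under the standard meaning of the three OTP strategies, not Firebreak's implementation. Sketch.RestartClosure and the child names are hypothetical.

```elixir
defmodule Sketch.RestartClosure do
  # Children that go down together when `crashed` exits.
  # `children` is the supervisor's child list in start order.
  def closure(:one_for_one, _children, crashed), do: [crashed]
  def closure(:one_for_all, children, _crashed), do: children

  def closure(:rest_for_one, children, crashed),
    do: Enum.drop_while(children, &(&1 != crashed))

  # Callers whose synchronous target is inside the restart closure.
  # `sync_edges` is a list of {caller, target} pairs.
  def blocked_callers(strategy, children, crashed, sync_edges) do
    down = closure(strategy, children, crashed)
    for {caller, target} <- sync_edges, target in down, do: caller
  end
end

children = [:db, :cache, :mailer]
edges = [{:api, :mailer}]

IO.inspect(Sketch.RestartClosure.blocked_callers(:one_for_one, children, :cache, edges))
IO.inspect(Sketch.RestartClosure.blocked_callers(:rest_for_one, children, :cache, edges))
```

Under :one_for_one a :cache crash leaves :api unaffected. Under :rest_for_one the same crash also restarts :mailer, so :api shows up as blocked even though no call site mentions :cache.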

Crossings are weighted by synchronicity: only a synchronous caller (GenServer.call/whereis) is exposed to :noproc when the target is down, so async-only coupling (cast/send/pubsub) is rated lower. Per-entity :via/:syn targets ({:via, _, {Reg, {Owner, id}}}) resolve to the keyed owner module, not the shared registry.
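A small sketch of the asymmetry behind that weighting, using a name that nothing registers. Sketch.Probe and :sketch_missing_server are hypothetical. The only behaviour relied on is that of GenServer.call/2 and GenServer.cast/2 themselves.

```elixir
defmodule Sketch.Probe do
  # A hypothetical name that no process registers.
  @target :sketch_missing_server

  # A synchronous call to a missing name exits with :noproc.
  def sync do
    GenServer.call(@target, :ping)
  catch
    :exit, {:noproc, _} -> {:error, :noproc}
  end

  # A cast to the same missing name returns :ok; the message is dropped.
  def async, do: GenServer.cast(@target, :ping)
end

IO.inspect(Sketch.Probe.sync(), label: "call")
IO.inspect(Sketch.Probe.async(), label: "cast")
```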

The text report leads with these coupling/correctness findings and groups the structural/advisory ones (blast-radius strategies, orphan heuristics — real, but often by-design) beneath them.

Wrapper-call coupling. Apps rarely scatter GenServer.call(Server, …); they wrap it in a public API (Server.fetch(id)). Firebreak does first-level inter-procedural analysis: when A calls M.f(...) and M.f itself couples to a process, it synthesises the edge from A onto that process — so a dependency routed through an API module isn't invisible.
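A sketch of the wrapper shape described here. Sketch.Store and Sketch.Report are hypothetical modules. The point is that Sketch.Report contains no GenServer.call at all, yet still blocks on the Sketch.Store process.

```elixir
defmodule Sketch.Store do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, %{1 => "alice"}, name: __MODULE__)

  # Public wrapper: the only place GenServer.call appears.
  def fetch(id), do: GenServer.call(__MODULE__, {:fetch, id})

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:fetch, id}, _from, state), do: {:reply, Map.fetch(state, id), state}
end

defmodule Sketch.Report do
  # No process primitives here; the coupling is routed through Sketch.Store.fetch/1.
  def title(id) do
    case Sketch.Store.fetch(id) do
      {:ok, name} -> "Report for " <> name
      :error -> "Report for unknown user"
    end
  end
end

{:ok, _pid} = Sketch.Store.start_link([])
IO.puts(Sketch.Report.title(1))
```

A first-level inter-procedural pass would attribute an edge from Sketch.Report to the Sketch.Store process, because the function it calls performs the synchronous call.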

Runtime observation (`--observe`)

mix firebreak --observe app@host attaches to a running node over distributed Erlang and folds its real shape into the analysis. Live DynamicSupervisor children become part of the forest, and registered names that static analysis couldn't bind are recovered. A runtime_fanout finding reports supervisors running far more children than the source models. A runtime_mailbox_backlog finding flags a process with a deep mailbox (≥1000 queued) that something calls synchronously: a live back-pressure chokepoint where callers block on the backlog. All of these carry :exact confidence, since they are observed reality. The target needs nothing installed; reads use standard-library :rpc calls only.
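To make the "standard-library :rpc calls only" point concrete, here is a sketch of how a mailbox depth could be read from another node. The source does not say which calls Firebreak issues, so the choice of :erlang.whereis/1 and :erlang.process_info/2 is an assumption, and Sketch.Observe is a hypothetical module. The 1000 default mirrors the threshold quoted above.

```elixir
defmodule Sketch.Observe do
  # Mailbox depth of a locally registered process on `node`, or nil when the
  # name is not registered, the process is gone, or the node is unreachable.
  def mailbox_depth(node, name) do
    with pid when is_pid(pid) <- :rpc.call(node, :erlang, :whereis, [name]),
         {:message_queue_len, depth} <-
           :rpc.call(node, :erlang, :process_info, [pid, :message_queue_len]) do
      depth
    else
      _ -> nil
    end
  end

  def backlog?(node, name, threshold \\ 1000) do
    case mailbox_depth(node, name) do
      nil -> false
      depth -> depth >= threshold
    end
  end
end

# Local demonstration: a process that never reads its mailbox.
pid = spawn(fn -> Process.sleep(:infinity) end)
Process.register(pid, :sketch_busy)
for _ <- 1..3, do: send(pid, :work)

IO.inspect(Sketch.Observe.mailbox_depth(node(), :sketch_busy), label: "queued")
IO.inspect(Sketch.Observe.backlog?(node(), :sketch_busy), label: "backlog?")
```

Both calls are plain BEAM built-ins, which is consistent with the target node needing nothing installed.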

mix firebreak --observe app@host --format overlay projects the join: every synchronous cross-tree crossing from the static IR, annotated with its target's live state — alive?, mailbox depth, instance count. It answers the question neither view answers alone: of my static crossings, which targets are hot right now? (The judgements stay in the runtime_fanout/runtime_mailbox_backlog findings; the overlay is the structured ground-truth layer they're read from.)

Formal specs (experimental): `mix firebreak.spec`

Firebreak can project its findings into a verified supervision model and generate a TLA+ lifecycle spec per supervisor — turning the separate static warnings into a single, model-checkable failure scenario with a counterexample trace.

mix firebreak --format model        # the model IR: per-supervisor strategy,
                                     # intensity, ordered children (+restart type),
                                     # parent, and inbound crossings (sync/async)
mix firebreak.spec --out specs/      # one <Supervisor>.tla + .cfg per supervisor
# then, with TLC (tla2tools.jar):
java -cp tla2tools.jar tlc2.TLC -deadlock -config <Sup>.cfg <Sup>.tla

# or generate the same model as Quint:
mix firebreak.spec --lang quint     # one <Supervisor>.qnt per supervisor
quint verify --invariant SupNeverDies <Supervisor>.qnt

Each generated spec is a pure function of --format model — nothing is hand-written. It models the restart-intensity budget and escalation, and (only where firebreak found a synchronous crossing) the external caller's permanent :noproc after escalation. So a supervisor with no crossing gets only the SupNeverDies (budget) property; one with a real sync crossing also gets ExtNeverStuck. TLC then composes findings the report lists separately — e.g. "four :one_for_all child crashes exhaust the 3-in-5s budget, the supervisor escalates, and a cross-tree caller is left permanently :noproc" — into one proven trace.

Sharper cases the model captures:

- :temporary target: never restarted, so a single crash permanently breaks the caller (TLC shows it in two steps, supervisor still alive).
- :one_for_all transient amplification (TargetTransientlySafe): any one child crash transiently downs every child, so a cross-tree caller is hit even by an unrelated sibling's crash. :one_for_one isolates and gets no such property; the contrast is the signal.
- Real max_seconds window: the budget is spent within a time window (Tick ages it; it resets after max_seconds), so an escalation trace shows the crashes were a fast burst, not spread out. (A fixed-window approximation of OTP's sliding window.)
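For intuition about the budget being modelled, here is a sketch of a sliding-window restart budget, assuming the usual OTP rule that a supervisor gives up once more than max_restarts restarts fall within the last max_seconds. It is not the generated spec, which uses the fixed-window approximation. Sketch.Budget is a hypothetical module.

```elixir
defmodule Sketch.Budget do
  # `crash_times` are ascending timestamps in seconds.
  # Returns {:escalates_at, time} for the crash that exceeds the budget,
  # or :ok if the budget is never exceeded.
  def first_escalation(crash_times, max_restarts \\ 3, max_seconds \\ 5) do
    crash_times
    |> Enum.reduce_while([], fn now, recent ->
      # Keep only the restarts still inside the window, plus this one.
      window = [now | Enum.filter(recent, &(now - &1 <= max_seconds))]

      if length(window) > max_restarts do
        {:halt, {:escalates_at, now}}
      else
        {:cont, window}
      end
    end)
    |> case do
      {:escalates_at, _} = hit -> hit
      _window -> :ok
    end
  end
end

IO.inspect(Sketch.Budget.first_escalation([0, 1, 2, 3]), label: "fast burst")
IO.inspect(Sketch.Budget.first_escalation([0, 2, 4, 6]), label: "spread out")
```

The same four crashes escalate as a burst but not when spread out, which is the distinction the max_seconds window in the spec is meant to preserve.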

Scope (honest): :one_for_all/:one_for_one/:rest_for_one are templated. It verifies the declared topology, so it inherits firebreak's static blind spots — --observe narrows those.

The --format model output is a versioned, documented contract — TLA+ is just one consumer. If you want to build your own backend (a different model checker, a diagram, a lockstep scenario), see notes/model-ir-contract.md: the schema, serialization, design law, and a backend-author guide. Firebreak.Model.valid?/1 checks a projection against it.

Reproduce it dynamically: `mix firebreak.lockstep`

The dynamic counterpart to firebreak.spec. For each synchronous cross-tree crossing, it generates a lockstep ctest scaffold: the starting point for a test that reproduces the :noproc failure in the running BEAM, not just in a model. It names the two processes, sets up the harness, and marks the app-specific TODOs (start the target, drive the call, assert it's handled). Static finding → proof (TLA+) → executable regression test (lockstep), all from the same model IR.

Usage

# human-readable report for the current project
mix firebreak

# point it at another project
mix firebreak ../some_app

# JSON artifact (CI handoff / tooling)
mix firebreak --format json

# graph the supervision forest + coupling (crossing edges highlighted)
mix firebreak --format dot | dot -Tsvg -o firebreak.svg
mix firebreak --format mermaid          # paste into a Markdown doc
mix firebreak --format html > report.html   # findings + graph in one page
mix firebreak --format failure          # Mermaid of just the failure modes (who :noproc-blocks)

# a structural supervision-risk score + per-supervisor ranking (dashboards/trend)
mix firebreak --format score

# join the static crossings against a live node's observed state (needs --observe)
mix firebreak --observe app@host --format overlay

# CI: emit GitHub Actions annotations (one per finding, on the PR diff)
mix firebreak --format github

# fold in a running node's real runtime shape
mix firebreak --observe my_app@127.0.0.1 --cookie secret

# only show medium and above
mix firebreak --min-severity medium

# extra source globs (repeatable)
mix firebreak --path "test/support/**/*.ex"

# skip compilation and analyse statically only
mix firebreak --no-compile

mix firebreak compiles the current project first so it can read supervision trees exactly from init/1; in CI, where the app is already built, that's a no-op. It never starts your tree — init/1 only returns child specs. Pass --no-compile to stay purely static (best-effort), or point Firebreak at another project (mix firebreak ../some_app), where it uses that project's _build artifacts if present and falls back to static parsing otherwise.

CI gate

Fail the build when a new high-severity finding lands:

# .github/workflows/firebreak.yml
- run: mix firebreak --fail-on high

--fail-on <severity> exits non-zero if any finding at or above that severity is present. Pair it with --format json if you want to archive the full report as a build artifact.
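The "at or above" rule can be sketched as a severity ranking. Sketch.Gate and the finding map shape are hypothetical. The four severity names are the ones that appear in the report summary.

```elixir
defmodule Sketch.Gate do
  @ranks %{info: 0, low: 1, medium: 2, high: 3}

  # Exit status for a list of findings: 1 if any finding is at or above
  # `threshold`, otherwise 0.
  def exit_status(findings, threshold) do
    limit = Map.fetch!(@ranks, threshold)

    if Enum.any?(findings, &(Map.fetch!(@ranks, &1.severity) >= limit)) do
      1
    else
      0
    end
  end
end

findings = [
  %{check: :one_for_all_blast_radius, severity: :medium},
  %{check: :cross_tree_coupling, severity: :low}
]

IO.inspect(Sketch.Gate.exit_status(findings, :high), label: "--fail-on high")
IO.inspect(Sketch.Gate.exit_status(findings, :medium), label: "--fail-on medium")
```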

There's a bundled GitHub Action (action.yml) and an example workflow in .github/workflows/firebreak.yml: it runs --format github to annotate the PR diff with each finding, then gates the job on --fail-on. Copy the workflow into your project, or uses: b-erdem/firebreak@main once it's published.

Suppression and baselines

For an existing codebase with a backlog, gate on new coupling rather than the whole pile:

# accept a reviewed finding forever: commit a .firebreak.exs in the project root
#   %{suppress: [
#     %{check: :cross_tree_coupling, module: MyApp.Cache.Supervisor},
#     "boot_order_dependency:MyApp.Early/MyApp.Late"   # or an exact signature
#   ]}

# snapshot today's findings once, on a green commit
mix firebreak --write-baseline .firebreak_baseline.exs

# thereafter, fail only on findings absent from the baseline
mix firebreak --baseline .firebreak_baseline.exs --fail-on info

Both match findings by a stable signature (check:module) that ignores line numbers and message wording, so the allowlist doesn't churn as unrelated code moves. --config overrides the default .firebreak.exs path.
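A sketch of signature-based matching, assuming a finding is a map with :check, :module, :line and :message keys. That shape and the Sketch.Baseline module are hypothetical. Only the check:module form of the signature comes from the text above.

```elixir
defmodule Sketch.Baseline do
  # The signature deliberately ignores :line and :message.
  def signature(%{check: check, module: module}), do: "#{check}:#{inspect(module)}"

  # Findings whose signature is not already in the baseline.
  def new_findings(findings, baseline_signatures) do
    known = MapSet.new(baseline_signatures)
    Enum.reject(findings, &MapSet.member?(known, signature(&1)))
  end
end

baseline = ["cross_tree_coupling:Demo.SupA"]

findings = [
  # Same finding as in the baseline, but the code moved to another line.
  %{check: :cross_tree_coupling, module: Demo.SupA, line: 42, message: "reworded"},
  %{check: :missing_trap_exit, module: Demo.Worker1, line: 40, message: "links a process"}
]

findings
|> Sketch.Baseline.new_findings(baseline)
|> Enum.map(&Sketch.Baseline.signature/1)
|> IO.inspect(label: "new since baseline")
```

Because line numbers and wording are excluded from the signature, moving or rewording code does not make an accepted finding reappear as new.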

Topology conformance (`--expect`)

The baseline pins the findings you've accepted; conformance pins the shape of the tree you designed. Snapshot the intended supervision topology once, commit it, and fail the build when the tree drifts from it — a strategy quietly flipped to :one_for_all, a child dropped out of a supervisor, the restart intensity loosened:

# snapshot the intended topology on a green commit, and commit the file
mix firebreak --write-expected config/expected_topology.exs

# thereafter, report topology_drift findings (and gate on them) when the tree changes
mix firebreak --expect config/expected_topology.exs --fail-on medium

Drift findings carry a stable topology_drift:<sup>/<subtype> signature, so they suppress and baseline like any other finding. The spec is a plain Elixir term (read like .firebreak.exs) — diff it in code review to see the topology change a PR introduces.

Installation

Add firebreak to your dev/test dependencies:

def deps do
  [
    {:firebreak, "~> 0.2.0", only: [:dev, :test], runtime: false}
  ]
end

How it works

1. Parse every source file to AST and collect module facts: use/behaviours, supervisor strategy and intensity, child specs, name registrations, links and spawns, and outbound calls.
2. Resolve the tree exactly where possible: for each loadable supervisor, call Mod.init/1 (the same call OTP's supervisor makes) to read the real {flags, child_specs}, without starting a thing, and replace the static guess. Un-loadable modules and Application roots keep the static read.
3. Resolve the coupling graph: map each call target (a module, or a registered name) to the module that owns it, and add first-level wrapper edges (a caller of a public API that itself couples to a process). Always static.
4. Build the forest: supervisors, roots, and each supervisor's subtree.
5. Check: run the Tier-1 structural rules and the Tier-2 cross-tree pass (including failure simulation and the orphan check), tagging each finding exact or best-effort.
6. Optionally observe (--observe): attach to a live node and fold its real runtime shape into the analysis before the checks run.

The coupling graph is a best-effort static read, not a type system: runtime name computation and metaprogramming can hide an edge, and when in doubt Firebreak stays quiet rather than guessing. The supervision tree, when read from init/1, is exact.
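The core comparison between the forest and the coupling graph can be sketched in a few lines. This illustrates the idea only, not Firebreak's code. Sketch.Crossings, the forest map, and the edge tuples are hypothetical shapes, and the child lists only loosely mirror the demo report.

```elixir
defmodule Sketch.Crossings do
  # `forest` maps each supervisor to its direct children.
  # Returns every descendant of `sup`.
  def subtree(forest, sup) do
    forest
    |> Map.get(sup, [])
    |> Enum.flat_map(fn child -> [child | subtree(forest, child)] end)
  end

  # Edges whose target is inside `sup`'s subtree but whose caller is outside it.
  # `edges` is a list of {caller, target, :sync | :async} tuples.
  def crossings(forest, sup, edges) do
    inside = MapSet.new(subtree(forest, sup))

    for {caller, target, kind} <- edges,
        MapSet.member?(inside, target),
        not MapSet.member?(inside, caller),
        do: {caller, target, kind}
  end
end

forest = %{
  Demo.App => [Demo.SupA, Demo.SupB],
  Demo.SupA => [Demo.Cache],
  Demo.SupB => [Demo.Api]
}

edges = [{Demo.Api, Demo.Cache, :sync}]

IO.inspect(Sketch.Crossings.crossings(forest, Demo.SupA, edges), label: "crossing Demo.SupA")
IO.inspect(Sketch.Crossings.crossings(forest, Demo.App, edges), label: "crossing Demo.App")
```

The edge crosses Demo.SupA's boundary but not Demo.App's, since both endpoints sit under the root. That is why the finding is attributed to the narrower supervisor.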

License

MIT