Firebreak
Find the process coupling that crosses your OTP supervision tree — the
synchronous dependencies a restart turns into :noproc/:timeout, which the
supervision tree itself can't show.
Your supervision tree declares how your app is structured; your code declares how your processes actually talk to each other. Those are two different graphs, and the dangerous bugs live in the gap: a process in one branch of the tree synchronously depends on a process in another, so a restart the tree calls "contained" surfaces an error somewhere that looks unrelated.
That's firebreak's headline finding. Run it on Livebook, for instance, and it
flags that HomeLive/OpenLive/SessionLive call NotebookManager
synchronously inside mount/3 — but NotebookManager lives in a different
branch of the supervision tree, so if it restarts mid-request the page fails to
render. One static run, no app boot.
It reads the tree the way OTP does, by calling each supervisor's init/1 (child
specs are runtime data; init/1 returns them without starting anything), and
falls back to static AST parsing for code it can't load, marking which. The
coupling graph is always static. By default: no app boot, no LLM, no runtime
tracing. That keeps it fast and deterministic enough for a CI gate.
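To make the "init/1 returns child specs without starting anything" point concrete, here is a minimal sketch. Demo.SupA and its two Agent children are hypothetical stand-ins, not Firebreak's code; the only real API used is Supervisor.init/2.

```elixir
defmodule Demo.SupA do
  use Supervisor

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(_arg) do
    # Hypothetical children, for illustration only.
    children = [
      %{id: :cache, start: {Agent, :start_link, [fn -> %{} end]}},
      %{id: :stats, start: {Agent, :start_link, [fn -> 0 end]}}
    ]

    Supervisor.init(children, strategy: :one_for_all)
  end
end

# Calling init/1 directly only builds data: no supervisor or child is started.
{:ok, {flags, child_specs}} = Demo.SupA.init([])

IO.inspect(flags.strategy, label: "strategy")
IO.inspect(Enum.map(child_specs, & &1.id), label: "children")
IO.inspect(Process.whereis(Demo.SupA), label: "running supervisor")
```

The flags map also carries the restart budget (intensity and period), which is what the intensity-related checks would need to read.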
What it is / isn't. Best-effort static analysis, not a type system: it surfaces hazards, not certainties, and metaprogramming or runtime-computed names can still hide an edge. The default report leads with the high-confidence coupling/correctness findings and collapses the advisory ones to a count (--all shows them). For exact runtime truth (live DynamicSupervisor children, recovered names), the opt-in --observe mode attaches to a running node.
What it finds
firebreak - supervision & coupling analysis
1 files | 7 modules | 3 supervisors (2 exact, 1 static) | 1 coupling edges
Supervision roots:
- Demo.App
Findings (3):
[HIGH] missing_trap_exit (best-effort) Demo.Worker1 (demo.ex:40)
links a process (task_start_link) but does not trap exits; a crash in the linked process takes Demo.Worker1 down with it. Use a supervised Task or set Process.flag(:trap_exit, true).
[MED ] one_for_all_blast_radius (exact) Demo.SupA (demo.ex:9)
:one_for_all with 3 children - any single child crash restarts all 3. If these children aren't genuinely interdependent, a narrower strategy contains the blast.
[LOW ] cross_tree_coupling (best-effort) Demo.SupA (demo.ex:9)
1 module(s) outside Demo.SupA's subtree depend on processes inside it (Demo.Api); restarting it can surface :timeout/:noproc in those callers - coupling the supervision tree does not show.
Summary: 1 high, 1 medium, 1 low, 0 info
The (2 exact, 1 static) header counts how each supervisor's tree was read:
Demo.SupA/Demo.SupB via init/1 (exact), the Application statically (its
start/2 boots the tree, so Firebreak never calls it). Each finding is tagged
(exact) or (best-effort) so you know whether it rests on the real tree or on
a static read.
That last finding is the one no other tool gives you. Demo.Api lives under
Demo.SupB, but it synchronously calls Demo.Cache, which lives under
Demo.SupA. The supervision tree says these subtrees are independent. The
coupling graph says a Demo.SupA restart can surface :timeout/:noproc
inside Demo.Api — a failure path the tree alone would call "contained."
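The shape of that crossing can be sketched as follows. The demo source itself is not shown here, so these module bodies are assumptions that only mirror the names in the report: Demo.Cache is a named GenServer, and Demo.Api calls it synchronously from outside its subtree.

```elixir
defmodule Demo.Cache do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:get, key}, _from, state), do: {:reply, Map.get(state, key), state}
end

defmodule Demo.Api do
  # Synchronous call into a process that another subtree owns.
  # While Demo.Cache is down or mid-restart, the call exits in the caller.
  def fetch(key) do
    {:ok, GenServer.call(Demo.Cache, {:get, key})}
  catch
    :exit, {:noproc, _call} -> {:error, :cache_down}
    :exit, {:timeout, _call} -> {:error, :cache_timeout}
  end
end
```

Catching the exit is one way to handle the crossing; moving the caller under the same supervisor or making the dependency asynchronous are others. Which fits depends on the app.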
Tier 1 — structural checks
Facts you can read straight off each module, no cross-module graph needed:
| Check | What it flags |
|---|---|
| one_for_all_blast_radius | :one_for_all over many children: one crash restarts every sibling |
| missing_trap_exit | a GenServer that links a process but doesn't trap exits |
| shutdown_exceeds_intensity_window | a child shutdown timeout that can burn the supervisor's whole restart budget |
| default_restart_intensity | a large supervisor (5+ children) on a tight restart budget: the default 3 restarts / 5s, or less |
| start_link_in_callback | a process started with a direct start_link inside an OTP callback (handle_*/init): re-spawned on every invocation, so it leaks duplicates or fails on :already_started. Should be a supervised child |
| lookup_or_create_race | a non-atomic registry test-and-set: one function that reads the registry (whereis/Registry.lookup/:global.whereis_name) and then creates/registers in the same body. Two callers can both miss and both create; the loser crashes (a raising register) or gets {:error, {:already_started, pid}}, and its just-spawned process leaks as a ghost (Christakis & Sagonas, PADL 2010) |
| unhandled_port_exit | a GenServer/:gen_statem that opens a port (Port.open/:erlang.open_port) but has no handle_info clause for the port's termination ({port, {:exit_status, _}} / {:EXIT, port, _}), so the external program can die without the owner noticing, or take it down via the linked exit |
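As an illustration of the lookup_or_create_race row, the sketch below contrasts a read-then-create function with an atomic variant that lets registration decide the winner. Demo.Sessions and Demo.Registry are hypothetical names; whether Firebreak flags this exact shape is an assumption.

```elixir
defmodule Demo.Sessions do
  # Assumes a unique-keys Registry named Demo.Registry is already running.
  defp via(id), do: {:via, Registry, {Demo.Registry, id}}

  # Racy shape: read the registry, then create in the same body.
  def racy_get_or_start(id) do
    case Registry.lookup(Demo.Registry, id) do
      [{pid, _value}] ->
        {:ok, pid}

      [] ->
        # Another caller can register `id` between the lookup and this start;
        # the loser then fails this match and crashes.
        {:ok, pid} = Agent.start(fn -> %{} end, name: via(id))
        {:ok, pid}
    end
  end

  # Atomic shape: always try to start, and treat :already_started as success.
  def get_or_start(id) do
    case Agent.start(fn -> %{} end, name: via(id)) do
      {:ok, pid} -> {:ok, pid}
      {:error, {:already_started, pid}} -> {:ok, pid}
    end
  end
end
```

Usage: start the registry with Registry.start_link(keys: :unique, name: Demo.Registry), then call Demo.Sessions.get_or_start(:a) twice; both calls return the same pid.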
Tier 2 — coupling across the tree
The differentiator. Resolves the process-to-process coupling graph
(GenServer.call/cast, :gen_server, :gen_statem, registered names,
Registry, :global, Process.whereis, :ets, Phoenix.PubSub, :pg),
maps each call to the module that owns the target, then reasons about the gap
between that graph and the supervision forest:
| Check | What it flags |
|---|---|
| cross_tree_coupling | a module outside a supervisor's subtree depends on a process inside it, so the restart the tree calls "contained" surfaces :timeout/:noproc in the outside caller. Synchronous callers from several modules rank highest |
| supervisor_subtree_blast | :one_for_all/:rest_for_one over child supervisors: a crash restarts whole sibling subtrees, not just a worker |
| dynamic_supervisor_restart_blast | a DynamicSupervisor/Registry get-or-start race where a restart drops the registration callers rely on |
| boot_order_dependency | an init/1 that synchronously calls a sibling started after it; the dependency isn't alive yet on first boot |
| crash_cascade | failure simulation: "if this process crashes now, who blocks?" It follows the restart closure (:one_for_all/:rest_for_one) so a crash that co-restarts a depended-on sibling is caught even when the coupling is invisible from the call sites alone |
| cyclic_coupling | a cycle of synchronous calls (A→B→A), a deadlock hazard: each can block in handle_call awaiting the other |
| boot_order_cycle | a cycle of synchronous calls made inside init/1, so the tree can't start: on boot each init waits on a peer that isn't running yet. The sharper, :high sibling of cyclic_coupling |
| orphaned_stateful_process | a GenServer/:gen_statem/GenStage/Agent in no supervisor's subtree. Sharpened with evidence: supervised via a child_spec builder or via DynamicSupervisor.start_child (relabelled, likely fine), hand-started via a direct start_link outside any supervisor (the exact call site is named), or genuinely unknown |
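The "restart closure" that crash_cascade follows comes straight from OTP's strategy semantics. A minimal sketch, assuming children are listed in start order; RestartClosure is a hypothetical helper, not part of Firebreak:

```elixir
defmodule RestartClosure do
  # Which children get restarted when `crashed` dies, per OTP strategy.
  # `children` is the supervisor's child list in start order.
  def restarted(children, crashed, strategy) do
    case strategy do
      :one_for_one -> [crashed]
      :one_for_all -> children
      # The crashed child and every child started after it.
      :rest_for_one -> Enum.drop_while(children, &(&1 != crashed))
    end
  end
end

IO.inspect(RestartClosure.restarted([:a, :b, :c], :b, :rest_for_one))
```

A caller that depends on :c is therefore exposed to a crash in :b under :rest_for_one or :one_for_all, even though no call site mentions :b.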
Crossings are weighted by synchronicity: only a synchronous caller
(GenServer.call/whereis) blocks on :noproc, so async-only coupling
(cast/send/pubsub) is rated lower. Per-entity :via/:syn targets
({:via, _, {Reg, {Owner, id}}}) resolve to the keyed owner module, not the
shared registry.
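Why synchronicity matters can be seen with plain GenServer calls against a name nobody has registered. This is a small sketch (SyncVsAsync is a hypothetical module): the cast returns :ok regardless, while the call exits in the caller.

```elixir
defmodule SyncVsAsync do
  # Probe a target that may not be running.
  def probe(target) do
    # cast is fire-and-forget: it returns :ok even if the target is absent.
    async = GenServer.cast(target, :refresh)

    # call blocks on the target and exits in the caller when it is missing.
    sync =
      try do
        GenServer.call(target, :get, 100)
      catch
        :exit, {reason, _call} -> {:exit, reason}
      end

    %{cast: async, call: sync}
  end
end

IO.inspect(SyncVsAsync.probe(:no_such_process))
```

The async path still loses the message, which is why it is rated lower rather than ignored.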
The text report leads with these coupling/correctness findings and groups the structural/advisory ones (blast-radius strategies, orphan heuristics — real, but often by-design) beneath them.
Wrapper-call coupling. Apps rarely scatter GenServer.call(Server, …); they
wrap it in a public API (Server.fetch(id)). Firebreak does first-level
inter-procedural analysis: when A calls M.f(...) and M.f itself couples to
a process, it synthesises the edge from A onto that process — so a dependency
routed through an API module isn't invisible.
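A minimal sketch of the wrapper pattern this covers, with hypothetical modules: Demo.Page never mentions GenServer, yet it depends synchronously on the Demo.Server process through Demo.Server.fetch/1.

```elixir
defmodule Demo.Server do
  use GenServer

  def start_link(_opts),
    do: GenServer.start_link(__MODULE__, %{1 => "one"}, name: __MODULE__)

  # Public API: the synchronous call is hidden behind an ordinary function.
  def fetch(id), do: GenServer.call(__MODULE__, {:fetch, id})

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:fetch, id}, _from, state), do: {:reply, Map.fetch(state, id), state}
end

defmodule Demo.Page do
  # No GenServer.call in this module, but render/1 still blocks on Demo.Server.
  def render(id) do
    case Demo.Server.fetch(id) do
      {:ok, value} -> "item: " <> value
      :error -> "not found"
    end
  end
end
```

First-level analysis would link Demo.Page to Demo.Server here; a second wrapper layer on top of Demo.Page would presumably fall outside it.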
Runtime observation (--observe)
mix firebreak --observe app@host attaches to a running node over distributed
Erlang and folds its real shape into the analysis: live DynamicSupervisor
children become part of the forest, registered names static analysis couldn't
bind are recovered, a runtime_fanout finding reports supervisors running
far more children than the source models, and a runtime_mailbox_backlog finding
flags a process with a deep mailbox (≥1000 queued) that something calls
synchronously — a live back-pressure chokepoint where callers block on the
backlog (all at :exact confidence — it's observed reality). The target needs
nothing installed; reads use standard-library :rpc calls only.
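For a sense of what a standard-library read like this can look like, here is a sketch that fetches a registered process's mailbox depth over :rpc. It is not Firebreak's implementation; ObserveSketch and its return shapes are assumptions, and only :rpc.call/4, :erlang.whereis/1 and :erlang.process_info/2 are real APIs.

```elixir
defmodule ObserveSketch do
  # Reads the mailbox depth of a registered process on `node`.
  # Both lookups run on the target node, so the pid is local where it is inspected.
  def mailbox_depth(node, registered_name) do
    with pid when is_pid(pid) <- :rpc.call(node, :erlang, :whereis, [registered_name]),
         {:message_queue_len, len} <-
           :rpc.call(node, :erlang, :process_info, [pid, :message_queue_len]) do
      {:ok, len}
    else
      # Name not registered, or the process died between the two reads.
      :undefined -> {:error, :not_running}
      {:badrpc, reason} -> {:error, {:badrpc, reason}}
    end
  end
end
```

Usage: ObserveSketch.mailbox_depth(:"app@host", MyApp.Cache) from a node that shares the target's cookie; the node and process names here are placeholders.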
mix firebreak --observe app@host --format overlay projects the join: every
synchronous cross-tree crossing from the static IR, annotated with its target's
live state — alive?, mailbox depth, instance count. It answers the question
neither view answers alone: of my static crossings, which targets are hot right
now? (The judgements stay in the runtime_fanout/runtime_mailbox_backlog
findings; the overlay is the structured ground-truth layer they're read from.)
Formal specs (experimental): mix firebreak.spec
Firebreak can project its findings into a verified supervision model and generate a TLA+ lifecycle spec per supervisor — turning the separate static warnings into a single, model-checkable failure scenario with a counterexample trace.
mix firebreak --format model # the model IR: per-supervisor strategy,
# intensity, ordered children (+restart type),
# parent, and inbound crossings (sync/async)
mix firebreak.spec --out specs/ # one <Supervisor>.tla + .cfg per supervisor
# then, with TLC (tla2tools.jar):
java -cp tla2tools.jar tlc2.TLC -deadlock -config <Sup>.cfg <Sup>.tla
# or generate the same model as Quint:
mix firebreak.spec --lang quint # one <Supervisor>.qnt per supervisor
quint verify --invariant SupNeverDies <Supervisor>.qnt
Each generated spec is a pure function of --format model — nothing is
hand-written. It models the restart-intensity budget and escalation, and (only
where firebreak found a synchronous crossing) the external caller's permanent
:noproc after escalation. So a supervisor with no crossing gets only the
SupNeverDies (budget) property; one with a real sync crossing also gets
ExtNeverStuck. TLC then composes findings the report lists separately — e.g.
"four :one_for_all child crashes exhaust the 3-in-5s budget, the supervisor
escalates, and a cross-tree caller is left permanently :noproc" — into one
proven trace.
Sharper cases the model captures:
- :temporary target: never restarted, so a single crash permanently breaks the caller (TLC shows it in two steps, supervisor still alive).
- :one_for_all transient amplification (TargetTransientlySafe): any one child crash transiently downs every child, so a cross-tree caller is hit even by an unrelated sibling's crash. :one_for_one isolates and gets no such property; the contrast is the signal.
- Real max_seconds window: the budget is spent within a time window (Tick ages it; it resets after max_seconds), so an escalation trace shows the crashes were a fast burst, not spread out. (A fixed-window approximation of OTP's sliding window.)
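The budget arithmetic behind these traces can be sketched directly. This follows OTP's sliding-window rule (more than max_restarts restarts within max_seconds escalates), not the generated spec's fixed-window approximation; IntensitySketch is a hypothetical helper.

```elixir
defmodule IntensitySketch do
  # crash_times: crash timestamps in seconds.
  # Escalates when more than `max_restarts` crashes fall within any
  # `max_seconds` window ending at a crash (OTP defaults: 3 in 5s).
  def escalates?(crash_times, max_restarts \\ 3, max_seconds \\ 5) do
    Enum.any?(crash_times, fn now ->
      in_window = Enum.count(crash_times, &(&1 <= now and now - &1 <= max_seconds))
      in_window > max_restarts
    end)
  end
end

# Four crashes in a fast burst exhaust the 3-in-5s budget.
IO.inspect(IntensitySketch.escalates?([0, 1, 2, 3]))
# The same four crashes spread out stay within it.
IO.inspect(IntensitySketch.escalates?([0, 2, 4, 7]))
```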
Scope (honest): :one_for_all/:one_for_one/:rest_for_one are templated. It
verifies the declared topology, so it inherits firebreak's static blind spots
— --observe narrows those.
The --format model output is a versioned, documented contract — TLA+ is just
one consumer. If you want to build your own backend (a different model checker, a
diagram, a lockstep scenario), see notes/model-ir-contract.md:
the schema, serialization, design law, and a backend-author guide.
Firebreak.Model.valid?/1 checks a projection against it.
Reproduce it dynamically: mix firebreak.lockstep
The dynamic counterpart to firebreak.spec. For each synchronous cross-tree
crossing, it generates a lockstep ctest
scaffold — the starting point for a test that reproduces the :noproc failure in
the running BEAM, not just in a model. It names the two processes, sets up the
harness, and marks the app-specific TODOs (start the target, drive the call,
assert it's handled). Static finding → proof (TLA+) → executable regression test
(lockstep), all from the same model IR.
Usage
# human-readable report for the current project
mix firebreak
# point it at another project
mix firebreak ../some_app
# JSON artifact (CI handoff / tooling)
mix firebreak --format json
# graph the supervision forest + coupling (crossing edges highlighted)
mix firebreak --format dot | dot -Tsvg -o firebreak.svg
mix firebreak --format mermaid # paste into a Markdown doc
mix firebreak --format html > report.html # findings + graph in one page
mix firebreak --format failure # Mermaid of just the failure modes (who :noproc-blocks)
# a structural supervision-risk score + per-supervisor ranking (dashboards/trend)
mix firebreak --format score
# join the static crossings against a live node's observed state (needs --observe)
mix firebreak --observe app@host --format overlay
# CI: emit GitHub Actions annotations (one per finding, on the PR diff)
mix firebreak --format github
# fold in a running node's real runtime shape
mix firebreak --observe my_app@127.0.0.1 --cookie secret
# only show medium and above
mix firebreak --min-severity medium
# extra source globs (repeatable)
mix firebreak --path "test/support/**/*.ex"
# skip compilation and analyse statically only
mix firebreak --no-compile
mix firebreak compiles the current project first so it can read supervision
trees exactly from init/1; in CI, where the app is already built, that's a
no-op. It never starts your tree — init/1 only returns child specs. Pass
--no-compile to stay purely static (best-effort), or point Firebreak at
another project (mix firebreak ../some_app), where it uses that project's
_build artifacts if present and falls back to static parsing otherwise.
CI gate
Fail the build when a new high-severity finding lands:
# .github/workflows/firebreak.yml
- run: mix firebreak --fail-on high
--fail-on <severity> exits non-zero if any finding at or above that severity
is present. Pair it with --format json if you want to archive the full report
as a build artifact.
There's a bundled GitHub Action (action.yml) and an example workflow in
.github/workflows/firebreak.yml: it runs --format github to annotate the PR
diff with each finding, then gates the job on --fail-on. Copy the workflow
into your project, or uses: b-erdem/firebreak@main once it's published.
Suppression and baselines
For an existing codebase with a backlog, gate on new coupling rather than the whole pile:
# accept a reviewed finding forever: commit a .firebreak.exs in the project root
# %{suppress: [
# %{check: :cross_tree_coupling, module: MyApp.Cache.Supervisor},
# "boot_order_dependency:MyApp.Early/MyApp.Late" # or an exact signature
# ]}
# snapshot today's findings once, on a green commit
mix firebreak --write-baseline .firebreak_baseline.exs
# thereafter, fail only on findings absent from the baseline
mix firebreak --baseline .firebreak_baseline.exs --fail-on info
Both match findings by a stable signature (check:module) that ignores line
numbers and message wording, so the allowlist doesn't churn as unrelated code
moves. --config overrides the default .firebreak.exs path.
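The idea of a line-independent signature can be sketched like this. The finding map and the way the signature string is built are assumptions made for illustration; the real format is only known from the check:module shape above.

```elixir
defmodule BaselineSketch do
  # Hypothetical finding shape: %{check: atom, module: module, line: integer}.
  # The signature deliberately leaves out line numbers and message text.
  def signature(%{check: check, module: mod}), do: "#{check}:#{inspect(mod)}"

  # Keep only findings whose signature is absent from the baseline.
  def new_findings(findings, baseline_signatures) do
    baseline = MapSet.new(baseline_signatures)
    Enum.reject(findings, &MapSet.member?(baseline, signature(&1)))
  end
end

findings = [
  %{check: :cross_tree_coupling, module: MyApp.Cache.Supervisor, line: 9},
  %{check: :missing_trap_exit, module: MyApp.Worker, line: 40}
]

baseline = ["cross_tree_coupling:MyApp.Cache.Supervisor"]
IO.inspect(BaselineSketch.new_findings(findings, baseline))
```

Because the line number is not part of the key, moving MyApp.Cache.Supervisor within its file leaves the baseline match intact.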
Topology conformance (--expect)
The baseline pins the findings you've accepted; conformance pins the shape of
the tree you designed. Snapshot the intended supervision topology once, commit
it, and fail the build when the tree drifts from it — a strategy quietly flipped
to :one_for_all, a child dropped out of a supervisor, the restart intensity
loosened:
# snapshot the intended topology on a green commit, and commit the file
mix firebreak --write-expected config/expected_topology.exs
# thereafter, report topology_drift findings (and gate on them) when the tree changes
mix firebreak --expect config/expected_topology.exs --fail-on medium
Drift findings carry a stable topology_drift:<sup>/<subtype> signature, so they
suppress and baseline like any other finding. The spec is a plain Elixir term
(read like .firebreak.exs) — diff it in code review to see the topology
change a PR introduces.
Installation
Add firebreak to your dev/test dependencies:
def deps do
  [
    {:firebreak, "~> 0.2.0", only: [:dev, :test], runtime: false}
  ]
end
How it works
- Parse every source file to AST and collect module facts: use/behaviours, supervisor strategy and intensity, child specs, name registrations, links and spawns, and outbound calls.
- Resolve the tree exactly where possible: for each loadable supervisor, call Mod.init/1 (the same call OTP's supervisor makes) to read the real {flags, child_specs}, without starting a thing, and replace the static guess. Un-loadable modules and Application roots keep the static read.
- Resolve the coupling graph: map each call target (a module, or a registered name) to the module that owns it, and add first-level wrapper edges (a caller of a public API that itself couples to a process). Always static.
- Build the forest: supervisors, roots, and each supervisor's subtree.
- Check: run the Tier-1 structural rules and the Tier-2 cross-tree pass (including failure simulation and the orphan check), tagging each finding exact or best-effort.
- Optionally observe (--observe): attach to a live node and fold its real runtime shape into the analysis before the checks run.
The coupling graph is a best-effort static read, not a type system: runtime name
computation and metaprogramming can hide an edge, and when in doubt Firebreak
stays quiet rather than guessing. The supervision tree, when read from init/1,
is exact.
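The static half of this pipeline can be illustrated with a toy version of the first step: parse source to AST and collect literal GenServer.call/cast targets. It is a sketch built on Code.string_to_quoted/1 and Macro.prewalk/3, not Firebreak's collector, and it misses piped calls, wrappers and computed names.

```elixir
defmodule CouplingSketch do
  # Returns [{:call | :cast, "Target"}] for direct GenServer.call/cast sites.
  def call_targets(source) do
    {:ok, ast} = Code.string_to_quoted(source)

    {_ast, targets} =
      Macro.prewalk(ast, [], fn
        # Matches the remote call GenServer.<fun>(target, ...).
        {{:., _, [{:__aliases__, _, [:GenServer]}, fun]}, _, [target | _]} = ast_node, acc
        when fun in [:call, :cast] ->
          {ast_node, [{fun, Macro.to_string(target)} | acc]}

        ast_node, acc ->
          {ast_node, acc}
      end)

    Enum.reverse(targets)
  end
end

source = """
defmodule Demo.Api do
  def fetch(key), do: GenServer.call(Demo.Cache, {:get, key})
  def bump, do: GenServer.cast(Demo.Stats, :bump)
end
"""

IO.inspect(CouplingSketch.call_targets(source))
```

A target that is a variable or function result shows up here as its source text rather than a module name, which is the kind of edge a static read cannot bind.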
License
MIT