phi_accrual

A source-agnostic φ-accrual failure detector for Elixir/OTP, built on Hayashibara et al. 2004 with a dual-α EWMA estimator, head-of-line and local-pause awareness, and a telemetry-first API.

⚠️ Alpha — v0.1.x. The API and configuration surface may change before v1.0. The telemetry event schema is already stable (see Versioning), but everything else is subject to tuning based on real-deployment feedback. Production use at your own risk; please open issues as you find rough edges.

Observability-grade, not decision-grade. Designed for dashboards, alerting, and operator intuition — not for automated routing, quorum, or correctness decisions. See limitations for why.

Quick start

# mix.exs
def deps do
  [{:phi_accrual, "~> 0.1"}]
end

The application auto-starts. Feed in heartbeat arrivals from anywhere your code already receives cross-node traffic, and read out φ on demand:

# Call this whenever you receive evidence that a peer is alive —
# a GenServer reply, a :pg broadcast, an :rpc response, a custom ping.
# First call for an unknown node auto-tracks it with defaults.
PhiAccrual.observe(:"peer@host")

# Query φ at any time.
PhiAccrual.phi(:"peer@host")
#=> {:ok, 0.42, :steady}

That's the whole core loop: feed in arrivals, read out φ. Everything below is about making it useful in production — reference heartbeat sources if you have none of your own, telemetry wiring for Prometheus, thresholding with hysteresis, and honest limitations.

What it does

Given a stream of heartbeat arrivals from a remote node, the detector maintains an EWMA estimate of the inter-arrival distribution (mean and variance, independently smoothed) and emits a continuous suspicion value φ. φ is calibrated so that φ ≈ -log₁₀(P(arrival still pending)):

φ value   Rough meaning
1         1-in-10 chance a heartbeat is still coming (suspicion at this level is wrong ~10% of the time)
3         1-in-1 000
8         1-in-100 000 000 — very likely down

Thresholding is a consumer concern. The detector does not decide whether a node is up or down; it publishes φ, and you (or the optional PhiAccrual.Threshold module) decide what crosses what line.
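Under the Gaussian model above, the φ formula itself fits in a few lines. This is an illustrative sketch of the math, not the library's internal code; PhiSketch and its argument names are hypothetical.

```elixir
defmodule PhiSketch do
  # φ = -log10(P(arrival still pending)), assuming Gaussian inter-arrivals
  # with the estimator's smoothed mean and standard deviation.
  def phi(elapsed_ms, mean_ms, std_dev_ms) do
    z = (elapsed_ms - mean_ms) / std_dev_ms
    # One-sided tail: probability of seeing a gap at least this long
    # even though the node is alive. Floored to avoid log10(0.0).
    p_pending = max(0.5 * :math.erfc(z / :math.sqrt(2.0)), 1.0e-300)
    -:math.log10(p_pending)
  end
end

PhiSketch.phi(1_000, 1_000, 300)   # on-schedule arrival: φ ≈ 0.3
PhiSketch.phi(3_000, 1_000, 300)   # 2 s late: φ well above 4
```

Note how an on-schedule arrival sits at φ ≈ 0.3, matching the 1-in-2 tail probability of the median.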

Why another failure detector?

The Elixir/OTP ecosystem has plenty of cluster-management libraries (libcluster, swarm, horde, partisan), but all of them use binary up/down detectors or entangle detection with membership. phi_accrual is the thing that goes alongside them: a pure detector, unopinionated about who sends heartbeats, what the topology looks like, or what to do when φ gets high.

Usage — bring your own signal

Anything that arrives from a remote node is evidence of liveness. If your app already has cross-node traffic, call observe/2 from the receive path — no extra network cost:

defmodule MyApp.Chatter do
  use GenServer

  def handle_info({:reply_from, node}, state) do
    PhiAccrual.observe(node)
    {:noreply, state}
  end
end

Then pattern-match on phi/1 to handle every result state:

case PhiAccrual.phi(:"node_a@host") do
  {:ok, phi, :steady}        -> # warm estimator, normal
  {:ok, phi, :recovering}    -> # warm estimator, absorbing a recent gap
  {:insufficient_data, n}    -> # still in bootstrap, `n` samples remaining
  {:stale, elapsed_ms}       -> # no arrival for > stale_after_ms
  {:error, :not_tracked}     -> # never observed
end

Call PhiAccrual.track(node, opts) before your first observe if you need custom per-node estimator options; otherwise the first observe auto-tracks with defaults.

Usage — reference source

If you have no existing cross-node chatter, enable the bundled DistributionPing source in config:

# config/runtime.exs
config :phi_accrual,
  distribution_ping: [interval_ms: 1_000, auto_track: true]

Each node then pings every peer every interval_ms over BEAM distribution. Cheap per-ping, but cluster cost is O(N²) — at 50 nodes and 1 s interval that's 2 500 pings/second of distribution traffic.

This source inherits HoL blocking — see limitations. The v2 UdpSource will escape it.

What happens when a node fails

Suppose :node_a@host has been heartbeating every ~1 s for a few minutes. Its estimator has mean ≈ 1 000 ms, σ ≈ 300 ms, and φ hovers around 0.3 (the median for an on-schedule arrival).

Then the node goes dark. Here is the timeline, using the default options and a threshold instance configured at suspect_at: 4.0, recover_at: 3.0:

t=0s    last heartbeat arrives. φ ≈ 0.3.
        → [:phi_accrual, :sample, :observed]  (interval_ms: ~1000)

t=1s    no new heartbeat. φ ≈ 0.3 (still on-schedule).
        → [:phi_accrual, :phi, :computed]  (periodic gauge tick)

t=2s    φ ≈ 3.5. starting to get suspicious.
        → [:phi_accrual, :phi, :computed]

t=3s    φ crosses 4.0.
        → [:phi_accrual, :phi, :computed]
        → [:phi_accrual, :threshold, :suspected]

t=10s   φ very high. state still :steady (stale_after_ms default 60 s).
        → [:phi_accrual, :phi, :computed]

t=60s   elapsed > stale_after_ms.
        → [:phi_accrual, :phi, :computed]  (state: :stale)

If :node_a@host comes back at t=15s and resumes heartbeating, the first-arrival interval of 15 000 ms exceeds recovering_threshold_ms (default 10 000). The state transitions to :recovering for the next 3 samples while the EWMA absorbs the outlier, and the threshold instance fires :recovered once φ drops back below 3.0:

t=15s   first heartbeat after outage. interval = 15 000 ms.
        → [:phi_accrual, :sample, :observed]
        state becomes :recovering.

t=16s   next heartbeat. φ has fallen sharply (elapsed is small).
        → [:phi_accrual, :phi, :computed]  (state: :recovering)
        → [:phi_accrual, :threshold, :recovered]    (φ crossed 3.0 downward)

t=19s   three samples since the outlier.
        → state returns to :steady.

Nowhere in this flow does the library decide the node is "down." It just publishes φ and state labels; the Threshold module (or your own consumer) decides what to do. That separation is why the detector can be wired to a dashboard, an alert, and an automated-routing policy simultaneously with different thresholds.

Telemetry schema (v1.x stable)

Event names, measurement keys, and metadata keys are a contract. Breaking changes only in v2.

[:phi_accrual, :sample, :observed]
  measurements: %{interval_ms}
  metadata:     %{node, local_pause?}

[:phi_accrual, :phi, :computed]                  # periodic gauge stream
  measurements: %{phi, elapsed_ms}
  metadata:     %{node, state, local_pause?, confidence}
    # state ∈ [:steady, :recovering, :insufficient_data, :stale]

[:phi_accrual, :local_pause, :start]             # rising edge
  metadata:     %{kind}                          # :long_gc | :long_schedule | :busy_dist_port
[:phi_accrual, :local_pause, :stop]              # falling edge

[:phi_accrual, :overload, :shed]
  measurements: %{mailbox_len}
  metadata:     %{node}

[:phi_accrual, :source, :started]
  metadata:     %{source, interval_ms}

[:phi_accrual, :threshold, :suspected]           # emitted by Threshold module
[:phi_accrual, :threshold, :recovered]
  measurements: %{phi}
  metadata:     %{node, instance, threshold, confidence, detector_state}

Pipe these to Prometheus via telemetry_metrics_prometheus, to logs, or to your own alerting (see next section).

Wiring telemetry to Prometheus

Pull in telemetry_metrics_prometheus (or your preferred telemetry_metrics reporter) and declare the metrics you care about:

# mix.exs — add dependency
{:telemetry_metrics_prometheus, "~> 1.1"}

# In your supervision tree
children = [
  {TelemetryMetricsPrometheus,
   metrics: [
     # φ as a gauge — one series per (node, state, confidence) combination.
     Telemetry.Metrics.last_value(
       "phi_accrual.phi.computed.phi",
       event_name: [:phi_accrual, :phi, :computed],
       measurement: :phi,
       tags: [:node, :state, :confidence]
     ),

     # Counter of every heartbeat observed.
     Telemetry.Metrics.counter(
       "phi_accrual.sample.observed.count",
       event_name: [:phi_accrual, :sample, :observed],
       tags: [:node]
     ),

     # Local-pause events — correlate noise in φ with GC / HoL.
     Telemetry.Metrics.counter(
       "phi_accrual.local_pause.start.count",
       event_name: [:phi_accrual, :local_pause, :start],
       tags: [:kind]
     ),

     # Overload shedding — if this is ever non-zero in steady state,
     # tune α instead of raising :shed_threshold.
     Telemetry.Metrics.counter(
       "phi_accrual.overload.shed.count",
       event_name: [:phi_accrual, :overload, :shed],
       tags: [:node]
     ),

     # Discrete alert events from the Threshold module.
     Telemetry.Metrics.counter(
       "phi_accrual.threshold.suspected.count",
       event_name: [:phi_accrual, :threshold, :suspected],
       tags: [:node, :instance]
     ),
     Telemetry.Metrics.counter(
       "phi_accrual.threshold.recovered.count",
       event_name: [:phi_accrual, :threshold, :recovered],
       tags: [:node, :instance]
     )
   ]}
]

For ad-hoc logging, attach a handler directly:

:telemetry.attach_many(
  "phi-accrual-logger",
  [
    [:phi_accrual, :threshold, :suspected],
    [:phi_accrual, :threshold, :recovered]
  ],
  &MyApp.PhiLogger.handle/4,
  nil
)

defmodule MyApp.PhiLogger do
  require Logger

  def handle([:phi_accrual, :threshold, kind], %{phi: phi}, %{node: node}, _) do
    Logger.warning("node=#{node} #{kind} phi=#{Float.round(phi, 2)}")
  end
end

Thresholding (optional)

PhiAccrual.Threshold converts the φ gauge stream into discrete :suspected / :recovered events with hysteresis:

# In your supervision tree
children = [
  {PhiAccrual.Threshold, name: :dash, suspect_at: 4.0, recover_at: 3.0},
  {PhiAccrual.Threshold, name: :route, suspect_at: 8.0, recover_at: 7.0}
]

Multiple instances coexist — one for dashboards at φ=4, another for automated routing at φ=8. Skip the module entirely if you want to roll your own.
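Hysteresis means the suspect and recover lines differ, so φ oscillating between them cannot flap the state. A minimal sketch of that state machine (illustrative only, not the Threshold module's actual implementation):

```elixir
defmodule HysteresisSketch do
  # Returns {new_state, event}; `event` is nil when no line was crossed.
  def step(:ok, phi, suspect_at, _recover_at) when phi >= suspect_at,
    do: {:suspected, :suspected}

  def step(:suspected, phi, _suspect_at, recover_at) when phi <= recover_at,
    do: {:ok, :recovered}

  # Between the two lines, or still on the same side as before: no event.
  def step(state, _phi, _suspect_at, _recover_at), do: {state, nil}
end
```

With suspect_at: 4.0 and recover_at: 3.0, a φ of 3.5 after a :suspected event keeps the node suspected; only dropping below 3.0 emits :recovered.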

Configuration

# config/runtime.exs
config :phi_accrual,
  # enable the node-global :erlang.system_monitor hook (default: true).
  # Disable if another library already subscribes.
  pause_monitor: true,

  # back-pressure threshold — observe/2 sheds samples when mailbox
  # exceeds this count and emits [:overload, :shed] telemetry.
  shed_threshold: 10_000,

  # bundled reference source — off by default, opt in:
  distribution_ping: [interval_ms: 1_000, auto_track: true]

Per-node estimator options (passed to PhiAccrual.track/2):

Option                      Default   Notes
:alpha_mean                 0.125     EWMA smoothing for mean
:alpha_var                  0.125     EWMA smoothing for variance (tune lower)
:min_std_dev_ms             50.0      Floor on σ — prevents singular distribution
:min_samples                8         Bootstrap gate before φ is reported
:stale_after_ms             60_000    Elapsed past which state becomes :stale
:recovering_threshold_ms    10_000    Large-gap detection for :recovering tag
:recovering_grace_samples   3         Samples the :recovering tag persists for
:initial_interval_ms        1_000     Prior mean before any observation
:initial_std_dev_ms         500       Prior σ (variance = σ²)
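To make the two α options concrete, here is a textbook dual-α EWMA update step. This is a sketch under standard recurrences; the library's actual estimator may order the operations differently.

```elixir
defmodule EwmaSketch do
  # One update step: smooth mean and variance independently, each with
  # its own α, from a newly observed inter-arrival interval.
  def update({mean, var}, interval_ms, alpha_mean \\ 0.125, alpha_var \\ 0.125) do
    new_mean = (1 - alpha_mean) * mean + alpha_mean * interval_ms
    dev = interval_ms - new_mean
    new_var = (1 - alpha_var) * var + alpha_var * dev * dev
    {new_mean, new_var}
  end
end

# A perfectly on-schedule arrival leaves the mean unchanged and shrinks variance:
EwmaSketch.update({1_000.0, 2_500.0}, 1_000.0)   # {1000.0, 2187.5}
```

Lowering :alpha_var relative to :alpha_mean makes σ react more slowly to jitter bursts, which damps φ spikes at the cost of slower adaptation.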

Limitations

Read these before wiring φ to anything that takes irreversible action.

Head-of-line blocking (primary v1 caveat). DistributionPing, and any other source that travels over BEAM distribution, shares a TCP socket with user traffic. A large GenServer reply or :pg broadcast can delay heartbeats for arbitrary periods. PauseMonitor subscribes to :busy_dist_port so you can observe this (pause telemetry plus confidence: false on φ events), but the underlying problem cannot be fixed by this library while the source is distribution-based. The v2 UdpSource solves it by using a dedicated socket.

Local-pause suppression is best-effort. :erlang.system_monitor fires on :long_gc, :long_schedule, and :busy_dist_port. The monitor marks φ output with local_pause?: true and confidence: false for a short lockout window after any event. It does not freeze φ or widen the variance — we decided the silent-detector failure mode is worse than noisy φ. Consumers are expected to filter on the confidence flag (the Threshold module passes it through in metadata).
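As an example of filtering on the flag, a consumer-side handler might drop low-confidence readings before alerting. MyApp.ConfidentAlerts and its threshold of 8.0 are hypothetical, not part of this library:

```elixir
defmodule MyApp.ConfidentAlerts do
  require Logger

  # Telemetry handler shape: (event, measurements, metadata, config).
  def handle([:phi_accrual, :phi, :computed], %{phi: phi}, meta, _config) do
    case meta do
      # Reading taken during a local-pause lockout window: skip it.
      %{confidence: false} ->
        :ignored

      %{node: node} when phi > 8.0 ->
        Logger.warning("node=#{node} phi=#{Float.round(phi, 2)} (very likely down)")
        :alerted

      _ ->
        :ok
    end
  end
end
```

Attach it with :telemetry.attach/4 on [:phi_accrual, :phi, :computed], as in the logging example above.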

Gaussian assumption misbehaves under bimodal distributions. BEAM GC produces intermittent large pauses that, combined with normal intervals, yield a bimodal inter-arrival distribution. A Gaussian EWMA is a poor fit and will over-alert. Correlate φ with :erlang.statistics(:garbage_collection) before acting on high φ. A non-parametric estimator (Satzger or a two-component mixture) is a v2 consideration once we have real traces from deployments.

One :erlang.system_monitor per node. Only one subscription can exist; a later :erlang.system_monitor/2 call silently replaces the earlier one. If another library installs its own, disable pause_monitor in config and feed pause state to PhiAccrual.PauseMonitor.put_state/1 yourself if you need coexistence.

Testing strategy

Failure detectors are hard to test against wall-clock time. This project:

Versioning

v1.x is telemetry-schema-stable: event names, measurement keys, and metadata keys will not change until v2. Per-node option defaults may be tuned within v1.x.

Roadmap

v1 (shipped)

v2 (planned)

Related ideas

This library is the first of three composable primitives: φ-accrual → HLC + causal broadcast → SWIM-Lifeguard standalone.

License

Apache-2.0. See LICENSE.