nova_resilience

Production-grade resilience patterns for Nova web applications.

Bridges Nova and Seki to provide dependency health checking, Kubernetes-ready probes, circuit breakers, bulkheads, deadline propagation, and ordered graceful shutdown — all via declarative configuration.

Features

Health endpoints — /health, /ready, /live for Kubernetes probes
Startup gating — traffic held until critical dependencies are healthy
Circuit breakers — stop calling failing dependencies, allow recovery
Bulkheads — limit concurrent requests per dependency
Retry — configurable retry with exponential backoff and jitter
Deadline propagation — per-request timeouts via headers or defaults
Graceful shutdown — ordered teardown with drain, priority groups, and LB coordination
Telemetry — events for all resilience operations (calls, breakers, shutdown, health)
Pluggable adapters — built-in support for pgo, kura, brod, or custom

Quick start

Add to your deps:

{deps, [
    nova,
    nova_resilience
]}.

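Add nova_resilience to the applications list in your .app.src:

{applications, [kernel, stdlib, nova, nova_resilience]}.

Then register it under nova_apps for your main application in sys.config:

{my_app, [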
    {nova_apps, [nova_resilience]}
]}.

Configure dependencies in sys.config:

{nova_resilience, [
    {dependencies, [
        #{name => primary_db,
          type => database,
          adapter => pgo,
          pool => default,
          critical => true,
          breaker => #{failure_threshold => 5, wait_duration => 30000},
          bulkhead => #{max_concurrent => 25},
          shutdown_priority => 2}
    ]}
]}.

That’s it. Your app now has /health, /ready, and /live endpoints, automatic startup gating, circuit breakers, bulkheads, and ordered shutdown.
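To give a feel for what the breaker settings in the configuration above mean, here is a minimal sketch of a circuit breaker as a pure state machine. It is not Seki's implementation: the module name, the map layout, and the choice to count consecutive failures are assumptions made for illustration, and the real breaker may use a different failure model.

```erlang
-module(breaker_sketch).
-export([new/2, allow/2, record/3]).

%% Hypothetical breaker state: closed, open, or half_open.
new(FailureThreshold, WaitDurationMs) ->
    #{threshold => FailureThreshold, wait => WaitDurationMs,
      state => closed, failures => 0, opened_at => undefined}.

%% Returns {Allowed, NewBreaker}: whether a call may proceed at NowMs.
allow(#{state := closed} = B, _NowMs) ->
    {true, B};
allow(#{state := half_open} = B, _NowMs) ->
    {true, B};
allow(#{state := open, opened_at := T, wait := W} = B, NowMs) when NowMs - T >= W ->
    %% wait_duration has elapsed: let a trial call through.
    {true, B#{state := half_open}};
allow(#{state := open} = B, _NowMs) ->
    {false, B}.

%% Records the outcome of a call that was allowed.
record(success, B, _NowMs) ->
    B#{state := closed, failures := 0};
record(failure, #{state := half_open} = B, NowMs) ->
    %% A failed trial call reopens the breaker.
    B#{state := open, opened_at := NowMs};
record(failure, #{failures := F, threshold := Th} = B, NowMs) when F + 1 >= Th ->
    B#{state := open, failures := 0, opened_at := NowMs};
record(failure, #{failures := F} = B, _NowMs) ->
    B#{failures := F + 1}.
```

With failure_threshold => 5 and wait_duration => 30000, this sketch would reject calls for 30 seconds after the fifth consecutive failure, then allow a trial call to test recovery.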

How it works

Startup

The app starts and nova_resilience provisions Seki primitives for each dependency
Health checks run — /ready returns 503 until all critical deps are healthy
Kubernetes readiness probe holds traffic until ready
Once healthy, /ready returns 200 and traffic flows
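
The gating steps above amount to a poll-until-ready loop bounded by a timeout. The following is a rough sketch of that idea only, not the library's code. The module and function names are hypothetical, and CheckFun stands in for "are all critical dependencies healthy?". The two numeric arguments play the roles of gate_timeout and gate_check_interval.

```erlang
-module(gate_sketch).
-export([await_ready/3]).

%% Polls CheckFun until it returns true or TimeoutMs elapses.
-spec await_ready(fun(() -> boolean()), non_neg_integer(), pos_integer()) ->
    ready | timeout.
await_ready(CheckFun, TimeoutMs, IntervalMs) ->
    Deadline = erlang:monotonic_time(millisecond) + TimeoutMs,
    loop(CheckFun, Deadline, IntervalMs).

loop(CheckFun, Deadline, IntervalMs) ->
    case CheckFun() of
        true ->
            ready;
        false ->
            Now = erlang:monotonic_time(millisecond),
            case Now >= Deadline of
                true ->
                    timeout;
                false ->
                    %% Sleep for the interval, but never past the deadline.
                    timer:sleep(min(IntervalMs, Deadline - Now)),
                    loop(CheckFun, Deadline, IntervalMs)
            end
    end.
```

For example, gate_sketch:await_ready(fun() -> true end, 30000, 1000) returns ready immediately, while a check that never succeeds returns timeout after roughly 30 seconds.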

Running

Wrap calls to external dependencies through the resilience stack:

case nova_resilience:call(primary_db, fun() ->
    pgo:query(~"SELECT * FROM users WHERE id = $1", [Id])
end) of
    {ok, #{rows := Rows}} ->
        {json, #{users => Rows}};
    {error, circuit_open} ->
        {json, 503, #{}, #{error => ~"db unavailable"}};
    {error, bulkhead_full} ->
        {json, 503, #{}, #{error => ~"overloaded"}};
    {error, deadline_exceeded} ->
        {json, 504, #{}, #{error => ~"timeout"}}
end.
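
If several controllers wrap calls this way, the error mapping can be pulled into one helper. The sketch below assumes only the result shapes shown in the example above. The module name, the to_nova/2 function, and the fallback clause for other errors are hypothetical. It uses <<"...">> binaries, which are equivalent to the ~"..." sigil used elsewhere in this README.

```erlang
-module(resilience_response).
-export([to_nova/2]).

%% Maps the result of nova_resilience:call/2 to a Nova controller return.
%% OkFun builds the success body from the wrapped function's result.
to_nova({ok, Result}, OkFun) ->
    {json, OkFun(Result)};
to_nova({error, circuit_open}, _OkFun) ->
    {json, 503, #{}, #{error => <<"dependency unavailable">>}};
to_nova({error, bulkhead_full}, _OkFun) ->
    {json, 503, #{}, #{error => <<"overloaded">>}};
to_nova({error, deadline_exceeded}, _OkFun) ->
    {json, 504, #{}, #{error => <<"timeout">>}};
to_nova({error, _Other}, _OkFun) ->
    %% Assumption: any other error is reported as a generic 500.
    {json, 500, #{}, #{error => <<"internal error">>}}.

%% Usage inside a controller (primary_db as configured in Quick start):
%%   Result = nova_resilience:call(primary_db, QueryFun),
%%   resilience_response:to_nova(Result, fun(#{rows := Rows}) -> #{users => Rows} end).
```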

Shutdown

On SIGTERM (or application stop):

/ready immediately returns 503 (load balancer stops sending traffic)
Waits shutdown_delay for LB health checks to propagate
Drains in-flight requests (monitors bulkhead occupancy)
Tears down dependencies in shutdown_priority order
Nova drains HTTP connections and stops

No manual prep_stop calls needed — shutdown is fully automatic.
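
To illustrate how shutdown_priority determines teardown order, here is a small sketch that groups dependency configs into priority groups, lowest value first. It is an illustration of the ordering rule only. The module and function names are hypothetical, and the default of 10 is the documented default for shutdown_priority.

```erlang
-module(shutdown_order_sketch).
-export([priority_groups/1]).

%% Groups dependency configs by shutdown_priority, lowest priority first.
%% Dependencies without the key use the documented default of 10.
-spec priority_groups([map()]) -> [{non_neg_integer(), [atom()]}].
priority_groups(Deps) ->
    Grouped = lists:foldl(
        fun(#{name := Name} = Dep, Acc) ->
            Prio = maps:get(shutdown_priority, Dep, 10),
            maps:update_with(Prio, fun(Names) -> [Name | Names] end, [Name], Acc)
        end,
        #{},
        Deps
    ),
    %% lists:keysort/2 orders the groups by ascending priority.
    [{Prio, lists:reverse(Names)}
     || {Prio, Names} <- lists:keysort(1, maps:to_list(Grouped))].
```

Given dependencies with priorities 5, 2, unset, and 5, the groups come out as 2, then 5 (two dependencies), then 10.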

Health endpoints

Endpoint	Purpose	Response
`GET /health`	Full diagnostic report	`{"status":"healthy","dependencies":{...},"vm":{...}}`
`GET /ready`	Kubernetes readiness probe	200 when ready, 503 when not
`GET /live`	Kubernetes liveness probe	200 if process is responsive

The /health endpoint returns per-dependency status with circuit breaker state, bulkhead occupancy, and VM metrics (memory, process count, run queue, uptime, node).
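
For reference, the VM metrics listed above can all be read with standard BEAM calls. The sketch below shows one way to gather them. The module name and the map keys are hypothetical and may not match the field names in the actual /health report.

```erlang
-module(vm_metrics_sketch).
-export([collect/0]).

%% Collects basic BEAM metrics using standard BIFs.
collect() ->
    %% wall_clock returns {TotalMs, MsSinceLastCall}; the total approximates uptime.
    {UptimeMs, _SinceLastCall} = erlang:statistics(wall_clock),
    #{
        memory_total_bytes => erlang:memory(total),
        process_count => erlang:system_info(process_count),
        run_queue => erlang:statistics(run_queue),
        uptime_ms => UptimeMs,
        node => node()
    }.
```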

Configuration

Application environment

{nova_resilience, [
    {dependencies, [...]},            %% List of dependency configs
    {health_check_interval, 10000},   %% ms between health checks
    {vm_checks, true},                %% Include BEAM VM info in health report
    {gate_enabled, true},             %% false to skip startup gating (dev/test)
    {gate_timeout, 30000},            %% Max ms to wait for deps on startup
    {gate_check_interval, 1000},      %% ms between gate readiness checks
    {health_severity, info},          %% Set to critical to make /health return 503 when unhealthy
    {shutdown_delay, 5000},           %% ms to wait after marking not-ready
    {shutdown_drain_timeout, 15000},  %% Max ms to drain per priority group
    {drain_poll_interval, 100},       %% ms between drain occupancy polls
    {health_prefix, ~""}              %% Prefix for health routes (e.g. ~"/internal")
]}.

Unknown config keys are logged as warnings on startup to catch typos.
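
As an example of combining these keys, a development configuration might skip startup gating and move the health routes under a prefix. This fragment uses only keys listed above. The values are illustrative, and the resulting paths (presumably /internal/health, /internal/ready, /internal/live) are an inference from the health_prefix description rather than something stated here.

```erlang
%% Fragment of a development sys.config (values are illustrative)
{nova_resilience, [
    {gate_enabled, false},           %% do not wait for dependencies on startup
    {health_check_interval, 5000},   %% check dependencies every 5 seconds
    {health_prefix, ~"/internal"}    %% serve health routes under a prefix
]}.
```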

Dependency config

#{
    name => atom(),                    %% Required — unique identifier
    type => database | kafka | custom, %% Optional — infers adapter
    adapter => pgo | kura | brod | module(), %% Optional — inferred from type
    critical => boolean(),             %% Default: false — gates /ready
    shutdown_priority => non_neg_integer(), %% Default: 10 — lower = first
    default_timeout => pos_integer(),  %% Default deadline in ms
    health_check => {module(), function()}, %% Override adapter health check

    %% Circuit breaker
    breaker => #{
        failure_threshold => pos_integer(),
        wait_duration => pos_integer(),
        slow_call_duration => pos_integer(),
        half_open_requests => pos_integer()
    },

    %% Concurrency limiter
    bulkhead => #{
        max_concurrent => pos_integer()
    },

    %% Retry with backoff
    retry => #{
        max_attempts => pos_integer(),
        base_delay => non_neg_integer(),
        max_delay => non_neg_integer()
    }
}
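
The retry keys map naturally onto an exponential backoff schedule. The source does not state the exact formula Seki uses, so the following is a sketch of one common scheme (doubling per attempt, capped at max_delay, with "full jitter"). The module and function names are hypothetical.

```erlang
-module(backoff_sketch).
-export([delay/3, delay_with_jitter/3]).

%% Exponential delay in ms for a 1-based attempt, capped at MaxDelay.
-spec delay(pos_integer(), non_neg_integer(), non_neg_integer()) -> non_neg_integer().
delay(Attempt, BaseDelay, MaxDelay) when Attempt >= 1 ->
    min(MaxDelay, BaseDelay * (1 bsl (Attempt - 1))).

%% Full jitter: a uniform random delay between 0 and the capped value.
-spec delay_with_jitter(pos_integer(), non_neg_integer(), non_neg_integer()) ->
    non_neg_integer().
delay_with_jitter(Attempt, BaseDelay, MaxDelay) ->
    Cap = delay(Attempt, BaseDelay, MaxDelay),
    %% rand:uniform(N) returns 1..N, so shift the result to 0..Cap.
    rand:uniform(Cap + 1) - 1.
```

With base_delay => 100 and max_delay => 5000, the un-jittered delays in this sketch are 100, 200, 400, and so on, until they reach the 5000 ms cap.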

Built-in adapters

Type	Adapter	Health check	Shutdown
`database`	`pgo` (default)	`SELECT 1` via pgo pool	no-op
`database`	`kura`	`SELECT 1` via kura repo	no-op
`kafka`	`brod`	`brod:get_partitions_count/2`	`brod:stop_client/1`
any	custom module	`nova_resilience_adapter` behaviour	custom

Guides

Getting Started — Installation and basic setup
Circuit Breakers & Bulkheads — Protecting dependencies
Deadline Propagation — Per-request timeout budgets
Adapters — Built-in and custom adapters
Graceful Shutdown — Ordered teardown and Kubernetes integration
Telemetry — Observability and monitoring

License

Apache-2.0