RpcLoadBalancer

Hex version badgeTestCredoDialyzerCoverage

An Elixir library for executing Remote Procedure Calls across distributed BEAM nodes with a built-in load balancer. It wraps Erlang's :erpc module with structured error handling and provides a pluggable node selection layer powered by OTP's :pg process groups.

Why This Exists

OTP's built-in node connection list (Node.list/0) does not automatically remove nodes that have crashed or become unreachable — they linger until the net kernel detects the failure, which can take seconds or longer depending on heartbeat configuration. During that window, any RPC call routed to the stale node will hang until it times out.

This library solves the problem by using :pg process groups instead of the raw node list. When a node goes down, its process group members are removed immediately because the backing processes exit. The load balancer only ever selects from nodes that have a live, registered process, so stale entries are never returned.

This gives you:

Features

Installation

Add rpc_load_balancer to your dependencies:

def deps do
  [
    {:rpc_load_balancer, "~> 0.1.0"}
  ]
end

Quick Start

Direct RPC

{:ok, result} =
  RpcLoadBalancer.call(
    :"worker@host",
    MyModule,
    :some_fun,
    ["arg"],
    timeout: :timer.seconds(5)
  )

:ok = RpcLoadBalancer.cast(:"worker@host", MyModule, :some_fun, ["arg"])

Load-Balanced RPC

Start a load balancer, then route calls through it with the :load_balancer option:

{:ok, _pid} =
  RpcLoadBalancer.start_link(
    name: :my_balancer,
    selection_algorithm: RpcLoadBalancer.LoadBalancer.SelectionAlgorithm.RoundRobin
  )

{:ok, result} =
  RpcLoadBalancer.call(node(), MyModule, :my_fun, [arg], load_balancer: :my_balancer)

When the :load_balancer option is present, the first argument (node) is ignored — the balancer selects the target node for you.

Supervision tree ordering: The load balancer should be the last child in your supervision tree. OTP shuts down children in reverse order, so placing it last means it shuts down first during deployment — the node deregisters from the :pg group and drains in-flight calls before your application logic stops.

children = [
  MyApp.Repo,
  MyApp.Endpoint,
  {RpcLoadBalancer,
   name: :my_balancer,
   selection_algorithm: RpcLoadBalancer.LoadBalancer.SelectionAlgorithm.RoundRobin}
]

Algorithms

Algorithm Description Selection cost (8 nodes) Scales with cluster size
CallDirect Executes locally via apply/3, bypassing :erpc — ideal for tests 0.04 μs / 23.4 M ips constant
Random Picks a random node (default) 0.11 μs / 8.9 M ips constant
RoundRobin Cycles through nodes with an atomic counter 0.81 μs / 1.24 M ips constant
PowerOfTwo Picks 2 random nodes, chooses the one with fewer connections 2.56 μs / 391 K ips sub-linear (2 ETS reads)
WeightedRoundRobin Round robin with configurable per-node weights 2.78 μs / 359 K ips sub-linear (rebuilds expanded list)
LeastConnections Selects the node with fewest active connections 5.87 μs / 170 K ips linear (reads every node's counter)
LeastCpu Selects the node with the lowest cached CPU utilization 17.4 μs / 57 K ips linear (reads every node's CPU entry)
HashRing Consistent hash-based routing via a :key option 28.7 μs / 35 K ips constant (cached libring lookup)

Numbers from mix run bench/select_node_bench.exs on Apple M1 Max, Elixir 1.19.5 / OTP 28.3.3, 8-node synthetic cluster, single process. Full per-cluster-size numbers, parallel-contention results, and identified bottlenecks are in bench/README.md.

Picking an algorithm. Use Random or RoundRobin unless you have a reason not to — they're the cheapest and distribute load uniformly. Reach for PowerOfTwo over LeastConnections once the cluster is large enough that linear scanning matters (~8+ nodes); it gives near-identical distribution at a fraction of the cost. HashRing is the most expensive single-process choice and the worst under parallel read contention — only use it when you actually need key affinity.

Configuration

All values are optional and can be set via application config:

config :rpc_load_balancer,
  call_directly?: false,
  retry?: true,
  retry_count: 5
Key Default Description
:call_directly?false Execute all load-balanced calls locally via apply/3
:retry?true Enable automatic retry when no nodes are available
:retry_count5 Maximum number of retries

Testing

In tests you typically don't have a multi-node cluster. Use the CallDirect algorithm so the load balancer executes calls locally instead of through :erpc:

{:ok, _pid} =
  RpcLoadBalancer.start_link(
    name: :my_balancer,
    selection_algorithm: RpcLoadBalancer.LoadBalancer.SelectionAlgorithm.CallDirect
  )

{:ok, result} =
  RpcLoadBalancer.call(node(), MyModule, :my_fun, [arg], load_balancer: :my_balancer)

To switch automatically based on environment, use a compile-time module attribute:

@selection_algorithm if Mix.env() === :test,
                       do: RpcLoadBalancer.LoadBalancer.SelectionAlgorithm.CallDirect,
                       else: RpcLoadBalancer.LoadBalancer.SelectionAlgorithm.RoundRobin

See the Testing with CallDirect how-to guide for full examples.

Documentation

This project's documentation follows the Diátaxis framework:

License

MIT — see LICENSE for details.