DistAgent

An Elixir framework to run a distributed, fault-tolerant variant of Agent.


Overview

dist_agent is an Elixir library (or framework) for running many “distributed agents” within a cluster of Erlang VM nodes. A “distributed agent” has the following in common with Elixir’s Agent:

On the other hand, a “distributed agent” offers the following additional features:

Concepts

Design

Raft protocol and libraries

dist_agent depends heavily on the Raft consensus protocol for synchronous replication and failover. The core protocol is implemented in rafted_value, and the cluster management and fault-tolerance mechanisms are provided by raft_fleet.

Although Raft consensus groups provide an important building block for distributed agents, it is not obvious how the concept of “distributed agents” should be mapped to consensus groups. It is easy to see that the following two extremes are not optimal for a wide range of use cases:

And, of course, the number of distributed agents in a system changes over time. We take the following approach:

This dynamic “sharding” of distributed agents, as well as the agent-ID-based data model, is defined by raft_kv. This design may introduce a potential problem: a distributed agent can be blocked by a long-running operation of another agent that happens to reside in the same consensus group. It is the responsibility of implementers of the callback modules for distributed agents to ensure that query/command/timeout handlers do not take a long time.
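As an illustration, a callback module with deliberately quick handlers might look like the sketch below. The callback names and signatures here are assumptions made for the sake of the example, not dist_agent's actual behaviour definition; consult the API reference for the real callbacks.

```elixir
defmodule MyApp.Counter do
  # Hypothetical callback module for a distributed agent holding an
  # integer counter. Callback names/signatures are illustrative only.
  #
  # Handlers run inside a Raft consensus group that is shared with
  # other agents, so each handler must return quickly: no network
  # calls, no heavy computation.

  # Commands mutate state; return the reply and the new state.
  def handle_command(count, {:add, n}) when is_integer(n) do
    {:ok, count + n}
  end

  # Queries are read-only and return a value computed from the state.
  def handle_query(count, :get) do
    count
  end
end
```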

Even with the reduced number of consensus groups explained above, state replication and health checks involve a high rate of inter-node communication. To reduce network traffic and TCP overhead (at the cost of slightly increased latency), remote communications between nodes can be batched with the help of batched_communication. It is not included as a dependency of dist_agent; to use it you have to add it as a dependency of your project and set the BatchedCommunication module in the following options:
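Assuming a standard Mix project, adding the optional dependency would look like the following (the version requirements shown are illustrative, not pinned recommendations):

```elixir
# mix.exs — add batched_communication alongside dist_agent
defp deps do
  [
    {:dist_agent, "~> 0.1"},
    # Optional: enables batching of inter-node messages.
    # Version requirement is illustrative.
    {:batched_communication, "~> 0.1"}
  ]
end
```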

Since reaching consensus on (committing) a command in the Raft protocol requires round trips to remote nodes, it is a relatively expensive operation. In order not to overwhelm the Raft member processes, accesses to each agent may be rate-limited using the token bucket algorithm. Rate limiting is (when enabled) imposed on a per-node basis: in each node there exists one bucket per distributed agent. We use foretoken as the token bucket implementation.
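The per-agent token bucket can be pictured with a minimal, self-contained sketch of the algorithm itself (this is not foretoken's actual API): a bucket refills at a fixed rate up to a maximum, and each access consumes one token or is rejected when the bucket is empty.

```elixir
defmodule TokenBucketSketch do
  # Bucket state: {tokens_left, last_refill_timestamp_ms}.
  # All names here are illustrative, not foretoken's API.

  def new(max_tokens), do: {max_tokens, now_ms()}

  # Try to take one token; refill based on elapsed time first.
  def take({tokens, last}, max_tokens, millis_per_token) do
    refilled = min(max_tokens, tokens + div(now_ms() - last, millis_per_token))

    if refilled >= 1 do
      {:ok, {refilled - 1, now_ms()}}      # access allowed
    else
      {:error, {refilled, last}}           # rate limit exceeded
    end
  end

  defp now_ms, do: System.monotonic_time(:millisecond)
end
```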

Quota management

The current statuses of all quotas are managed by a special Raft consensus group named DistAgent.Quota. Its internal state consists of

When adding a new distributed agent, the upper limit is checked by consulting this Raft consensus group. The %{quota_id => count} reported from each node is valid for 15 minutes; counts from removed or unreachable nodes are thus automatically cleaned up.
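The 15-minute validity can be understood as a pruning step over the per-node reports. A sketch of that step follows; the data layout (node name mapped to a report and its timestamp) is an assumption for illustration, not dist_agent's internal representation.

```elixir
defmodule QuotaPruneSketch do
  # Reports map node name => {%{quota_id => count}, reported_at_ms}.
  # Entries older than the validity window are dropped, which is how
  # counts from removed/unreachable nodes disappear automatically.
  @validity_ms 15 * 60 * 1000

  def prune(reports, now_ms) do
    reports
    |> Enum.filter(fn {_node, {_counts, reported_at}} ->
      now_ms - reported_at <= @validity_ms
    end)
    |> Map.new()
  end
end
```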

In each node, a GenServer named DistAgent.Quota.Reporter periodically aggregates the number of distributed agents by querying the consensus leader processes that reside in that node, and publishes the aggregated value to DistAgent.Quota.
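The shape of such a periodic reporter can be sketched as a plain GenServer driven by Process.send_after. The module name, the reporting interval, and the publish function below are assumptions for illustration; they are not DistAgent.Quota.Reporter's actual implementation.

```elixir
defmodule ReporterSketch do
  # Illustrative periodic reporter in the spirit of
  # DistAgent.Quota.Reporter. Interval and names are assumptions.
  use GenServer

  @interval_ms 60_000

  def start_link(publish_fun) do
    GenServer.start_link(__MODULE__, publish_fun)
  end

  @impl true
  def init(publish_fun) do
    schedule()
    {:ok, publish_fun}
  end

  @impl true
  def handle_info(:report, publish_fun) do
    # In dist_agent this would query the local consensus leader
    # processes and publish %{quota_id => count} to DistAgent.Quota.
    publish_fun.(aggregate_local_counts())
    schedule()
    {:noreply, publish_fun}
  end

  defp aggregate_local_counts(), do: %{}  # placeholder aggregation

  defp schedule(), do: Process.send_after(self(), :report, @interval_ms)
end
```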

Quotas are checked only when creating a new distributed agent, i.e., the quota limit is verified on receipt of the first message to a distributed agent. An already-created distributed agent is never blocked or stopped due to a quota limit; in particular, agent migration and failover are not affected.

Things dist_agent won’t do

Currently we have no plan to: