ExAtlas


A composable, pluggable Elixir SDK for infrastructure management. Two concerns under one roof:

  1. GPU / CPU compute across cloud providers. Spawn pods, run serverless inference, orchestrate transient per-user GPU sessions. Swap providers by changing one option.
  2. Fly.io platform operations. First-class deploys, log streaming, and token lifecycle — independent of the compute pipeline. See ExAtlas.Fly and the Fly guide.

Installation

The one-liner — uses the Igniter installer to add the dep, write sensible config, and create storage directories:

mix igniter.install ex_atlas

Or add manually to mix.exs:

def deps do
  [
    {:ex_atlas, "~> 0.2"}
  ]
end

…then run mix ex_atlas.install once to wire config defaults, or configure things yourself (see Configuration).

For the optional orchestrator + LiveDashboard features, also include:

{:phoenix_pubsub, "~> 2.1"},           # PubSub broadcasts from the orchestrator
{:phoenix_live_dashboard, "~> 0.8"}    # ExAtlas.LiveDashboard.ComputePage tab

ExAtlas declares both as optional: true, so they are never pulled into projects that consume ExAtlas purely as a library.

Upgrading

To upgrade ex_atlas and run any version-specific migrations:

mix deps.update ex_atlas
mix ex_atlas.upgrade

The upgrade task is idempotent and runs only the steps needed between your previous and current ex_atlas version.

Architecture at a glance

┌────────────────────────────────────────────────────────────────────┐
│  ExAtlas (top-level provider-agnostic API)                         │
│  ExAtlas.spawn_compute/1 · run_job/2 · stream_job/1 · terminate/1  │
└───────────────────────────┬────────────────────────────────────────┘
                            │
            ┌───────────────▼───────────────┐    ┌────────────────────┐
            │  ExAtlas.Provider (behaviour) │◄───│  ExAtlas.Spec.*    │
            └───────────────┬───────────────┘    │  normalized structs│
                            │                    └────────────────────┘
    ┌────────┬──────────────┼─────────────┬─────────────┐
    │        │              │             │             │
 ┌──▼───┐ ┌──▼───┐ ┌────────▼───────┐ ┌───▼────┐ ┌──────▼──────┐
 │RunPod│ │ Fly  │ │  Lambda Labs   │ │ Vast   │ │  Mock (test)│
 │ v0.1 │ │ v0.2 │ │     v0.2       │ │ v0.3   │ │    v0.1     │
 └──────┘ └──────┘ └────────────────┘ └────────┘ └─────────────┘

┌────────────────────────────────────────────────────────────────────┐
│  ExAtlas.Orchestrator (opt-in supervision tree)                    │
│  ComputeServer (GenServer/resource) · Registry · DynamicSupervisor │
│  · Reaper · PubSub events                                          │
└────────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────────┐
│  ExAtlas.Auth                                                      │
│  Token (bearer mint/verify) · SignedUrl (S3-style HMAC)            │
└────────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────────┐
│  ExAtlas.LiveDashboard.ComputePage                                 │
│  Live-refreshing table · per-row Touch/Stop/Terminate              │
└────────────────────────────────────────────────────────────────────┘

Quick start — Fly.io platform ops

ExAtlas gives you a clean Elixir API over fly deploy, the Fly Machines log API, and Fly token lifecycle. Works with or without Phoenix.

Discover apps

ExAtlas.Fly.discover_apps("/path/to/project")
# => [{"my-api", "/path/to/project"}, {"my-web", "/path/to/project/web"}]

Tail logs

ExAtlas.Fly.subscribe_logs("my-api", "/path/to/project")

# In the subscriber:
def handle_info({:ex_atlas_fly_logs, "my-api", entries}, state) do
  # entries :: [ExAtlas.Fly.Logs.LogEntry.t()]
  ...
end

A single streamer runs per app regardless of subscriber count, and stops once all subscribers disconnect. Automatic 401 retry is built in.
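
The per-app fan-out described above can be pictured as reference-counted bookkeeping: one streamer exists while at least one subscriber is registered. The sketch below is purely illustrative (module and function names are hypothetical, not the library's internals):

```elixir
# Hypothetical sketch: one streamer per app, ref-counted by subscriber pid.
defmodule FanOut do
  # subs :: %{app_name => MapSet.t(pid)}
  def subscribe(subs, app, pid) do
    Map.update(subs, app, MapSet.new([pid]), &MapSet.put(&1, pid))
  end

  def unsubscribe(subs, app, pid) do
    case Map.fetch(subs, app) do
      {:ok, set} ->
        set = MapSet.delete(set, pid)
        # Once the last subscriber is gone, drop the app — the streamer stops.
        if MapSet.size(set) == 0, do: Map.delete(subs, app), else: Map.put(subs, app, set)

      :error ->
        subs
    end
  end

  def streamer_running?(subs, app), do: Map.has_key?(subs, app)
end
```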

Stream a deploy

ExAtlas.Fly.subscribe_deploy(ticket_id)
Task.start(fn ->
  ExAtlas.Fly.stream_deploy(project_path, "web", ticket_id)
end)

def handle_info({:ex_atlas_fly_deploy, ^ticket_id, line}, state) do
  ...
end

Deploys are guarded by a 5-minute activity timer (which resets on output) and a 30-minute absolute cap.

Tokens

ExAtlas.Fly.Tokens resolves tokens via ETS → DETS (durable) → ~/.fly/config.yml → fly tokens create readonly → manual override. You usually don't call it directly — the log client uses it transparently — but you can:

{:ok, token} = ExAtlas.Fly.Tokens.get("my-api")
ExAtlas.Fly.Tokens.invalidate("my-api")
ExAtlas.Fly.Tokens.set_manual("my-api", "fo1_...")

Full docs: Fly guide.

Quick start — transient per-user GPU pod

The motivating use case: a Fly.io-hosted Phoenix app spawns a RunPod GPU per user, hands the browser a preshared key, the browser runs real-time video inference directly against the pod, and ExAtlas reaps the pod when the session ends or goes idle.

# config/config.exs
config :ex_atlas, default_provider: :runpod
config :ex_atlas, :runpod, api_key: System.get_env("RUNPOD_API_KEY")
config :ex_atlas, start_orchestrator: true

# LiveView.mount/3
{:ok, pid, compute} =
  ExAtlas.Orchestrator.spawn(
    gpu: :h100,
    image: "ghcr.io/me/my-inference-server:latest",
    ports: [{8000, :http}],
    auth: :bearer,
    user_id: socket.assigns.current_user.id,
    idle_ttl_ms: 15 * 60_000,
    name: "atlas-" <> to_string(socket.assigns.current_user.id)
  )

Phoenix.PubSub.subscribe(ExAtlas.PubSub, "compute:" <> compute.id)

assign(socket,
  inference_url: hd(compute.ports).url,       # https://<pod-id>-8000.proxy.runpod.net
  inference_token: compute.auth.token         # handed straight to the browser
)

Inside the inference server running in the pod:

# Any request from the browser must carry the preshared key.
def authenticated?(conn) do
  preshared = System.fetch_env!("ATLAS_PRESHARED_KEY")

  case Plug.Conn.get_req_header(conn, "authorization") do
    ["Bearer " <> token] -> Plug.Crypto.secure_compare(token, preshared)
    _ -> false
  end
end

Heartbeat while the browser is active:

ExAtlas.Orchestrator.touch(compute.id)

When the user leaves, or after idle_ttl_ms with no heartbeat, the ComputeServer shuts down and terminates the upstream pod automatically. You can also terminate manually:

:ok = ExAtlas.Orchestrator.stop_tracked(compute.id)

Quick start — serverless inference

{:ok, job} =
  ExAtlas.run_job(
    provider: :runpod,
    endpoint: "abc123",
    input: %{prompt: "a beautiful sunset"},
    mode: :async
  )

{:ok, done} = ExAtlas.get_job(job.id, provider: :runpod, endpoint: "abc123")
done.output

# Synchronous with a hard timeout (wrapped in Task.async + Task.yield internally)
{:ok, done} =
  ExAtlas.run_job(
    provider: :runpod,
    endpoint: "abc123",
    input: %{prompt: "a beautiful sunset"},
    mode: :sync,
    timeout_ms: 60_000
  )

# Stream partial output
ExAtlas.stream_job(job.id, provider: :runpod, endpoint: "abc123")
|> Enum.each(&IO.inspect/1)

Swapping providers

# Today
ExAtlas.spawn_compute(provider: :runpod,      gpu: :h100, image: "...")

# v0.2
ExAtlas.spawn_compute(provider: :fly,         gpu: :a100_80g, image: "...")
ExAtlas.spawn_compute(provider: :lambda_labs, gpu: :h100, image: "...")

# v0.3
ExAtlas.spawn_compute(provider: :vast,        gpu: :rtx_4090, image: "...")

# Your in-house cloud, today:
ExAtlas.spawn_compute(provider: MyCompany.Cloud.Provider, gpu: :h100, image: "...")

All built-in and user-defined providers implement ExAtlas.Provider.

Configuration

# config/config.exs

# Provider resolution: per-call :provider option > :default_provider > raise
config :ex_atlas, default_provider: :runpod

# API keys: per-call :api_key > :ex_atlas / :<provider> config > env var
config :ex_atlas, :runpod,      api_key: System.get_env("RUNPOD_API_KEY")
config :ex_atlas, :fly,         api_key: System.get_env("FLY_API_TOKEN")
config :ex_atlas, :lambda_labs, api_key: System.get_env("LAMBDA_LABS_API_KEY")
config :ex_atlas, :vast,        api_key: System.get_env("VAST_API_KEY")

# Start the orchestrator (Registry + DynamicSupervisor + PubSub + Reaper).
# When false (default), ExAtlas boots no processes.
config :ex_atlas, start_orchestrator: true

# Reaper: periodic orphan reconciliation and idle-TTL enforcement.
config :ex_atlas, :orchestrator,
  reap_interval_ms: 60_000,
  reap_providers: [:runpod],
  reap_name_prefix: "atlas-"     # safety switch: only reap resources ExAtlas spawned

Default environment variable names used when nothing else is set:

Provider      Env var
:runpod       RUNPOD_API_KEY
:fly          FLY_API_TOKEN
:lambda_labs  LAMBDA_LABS_API_KEY
:vast         VAST_API_KEY

Providers

Provider      Module                        Version shipped  Capabilities
:runpod       ExAtlas.Providers.RunPod      v0.1             :spot, :serverless, :network_volumes, :http_proxy, :raw_tcp, :symmetric_ports, :webhooks, :global_networking
:fly          ExAtlas.Providers.Fly         v0.2 (stub)      :http_proxy, :raw_tcp, :global_networking
:lambda_labs  ExAtlas.Providers.LambdaLabs  v0.2 (stub)      :raw_tcp
:vast         ExAtlas.Providers.Vast        v0.3 (stub)      :spot, :raw_tcp
:mock         ExAtlas.Providers.Mock        v0.1 (tests)     :spot, :serverless, :network_volumes, :http_proxy, :raw_tcp, :webhooks

Stub modules return {:error, %ExAtlas.Error{kind: :unsupported}} from every non-capabilities/0 callback so the name is reserved and callers get a clear error — no FunctionClauseErrors.

Canonical GPU atoms

ExAtlas refers to GPUs by stable atoms. ExAtlas.Spec.GpuCatalog maps each atom to each provider's native identifier.

Canonical   RunPod                       Lambda Labs              Fly.io            Vast.ai
:h200       "NVIDIA H200"                                                          "H200"
:h100       "NVIDIA H100 80GB HBM3"      "gpu_1x_h100_pcie"                        "H100"
:a100_80g   "NVIDIA A100 80GB PCIe"      "gpu_1x_a100_sxm4_80gb"  "a100-80gb"      "A100_80GB"
:a100_40g   "NVIDIA A100-SXM4-40GB"      "gpu_1x_a100_sxm4"       "a100-pcie-40gb" "A100"
:l40s       "NVIDIA L40S"                                         "l40s"
:l4         "NVIDIA L4"
:a6000      "NVIDIA RTX A6000"           "gpu_1x_a6000"                            "RTX_A6000"
:rtx_4090   "NVIDIA GeForce RTX 4090"                                              "RTX_4090"
:rtx_3090   "NVIDIA GeForce RTX 3090"                                              "RTX_3090"
:mi300x     "AMD Instinct MI300X OAM"

See ExAtlas.Spec.GpuCatalog for the full mapping.

The ExAtlas.Provider behaviour

Every provider implements one callback per operation. See ExAtlas.Provider for the full contract.

Callback                  Purpose
spawn_compute/2           Provision a GPU/CPU resource
get_compute/2             Fetch current status
list_compute/2            List with optional filters
stop/2 / start/2          Pause / resume
terminate/2               Destroy
run_job/2                 Submit a serverless job
get_job/2 / cancel_job/2  Job control
stream_job/2              Stream partial outputs
capabilities/0            Declare supported features
list_gpu_types/1          Catalog + pricing

Callers can check ExAtlas.capabilities(:runpod) before relying on an optional feature:

if :serverless in ExAtlas.capabilities(provider) do
  ExAtlas.run_job(provider: provider, endpoint: "...", input: %{...})
end

Capability atoms

Atom                Meaning
:spot               Interruptible/spot instances
:serverless         run_job/2 and friends
:network_volumes    Attach persistent volumes
:http_proxy         Provider terminates TLS on a *.proxy.* hostname
:raw_tcp            Public IP + mapped TCP ports
:symmetric_ports    internal == external port guarantee
:webhooks           Push completion callbacks
:global_networking  Private networking across datacenters

Normalized specs (ExAtlas.Spec.*)

Requests and responses flow through normalized structs so callers don't have to know each provider's native shape.

Every spec struct has a :raw field preserving the provider's native response for callers who need fields ExAtlas hasn't yet normalized.

The :provider_opts field on request structs is the escape hatch for provider-specific options ExAtlas doesn't model — values are stringified and merged into the outgoing REST body.
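
A minimal sketch of that merge step, assuming the normalized fields should win on conflict (module name and precedence rule are illustrative, not the library's actual code):

```elixir
# Hypothetical sketch: fold :provider_opts into a request body.
# Keys are stringified; normalized fields take precedence on conflict.
defmodule ProviderOptsSketch do
  def merge_body(normalized_body, provider_opts) do
    provider_opts
    |> Map.new(fn {k, v} -> {to_string(k), v} end)
    |> Map.merge(normalized_body)   # normalized fields win
  end
end
```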

Auth primitives

ExAtlas.Auth.Token and ExAtlas.Auth.SignedUrl are exposed directly if you want them without the rest of the orchestration layer.

Bearer tokens

mint = ExAtlas.Auth.Token.mint()
# %{
#   token: "kX9fP...",                              # hand to client once
#   hash:  "4c1...",                                # persist this
#   header: "Authorization: Bearer kX9fP...",
#   env:   %{"ATLAS_PRESHARED_KEY" => "kX9fP..."}   # inject into the pod
# }

ExAtlas.Auth.Token.valid?(candidate, mint.hash)

When you pass auth: :bearer to spawn_compute/1, ExAtlas mints a token, adds it to the pod's env as ATLAS_PRESHARED_KEY, and returns the handle in compute.auth — all in one round-trip.
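
The mint/verify shape above can be reproduced with nothing but :crypto and Base — a minimal sketch, not ExAtlas.Auth.Token's actual implementation:

```elixir
# Illustrative only: random token, SHA-256 hash persisted server-side,
# comparison that doesn't leak the matching prefix through timing.
defmodule TokenSketch do
  def mint do
    token = 32 |> :crypto.strong_rand_bytes() |> Base.url_encode64(padding: false)
    hash = :sha256 |> :crypto.hash(token) |> Base.encode16(case: :lower)

    %{
      token: token,                           # hand to client once
      hash: hash,                             # persist this
      header: "Authorization: Bearer " <> token,
      env: %{"ATLAS_PRESHARED_KEY" => token}  # inject into the pod
    }
  end

  def valid?(candidate, hash) do
    candidate_hash = :sha256 |> :crypto.hash(candidate) |> Base.encode16(case: :lower)
    # Hash both sides before comparing so timing doesn't reveal the match prefix.
    :crypto.hash(:sha256, candidate_hash) == :crypto.hash(:sha256, hash)
  end
end
```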

S3-style signed URLs

For <video src>, <img src>, or any client that can't set request headers:

url =
  ExAtlas.Auth.SignedUrl.sign(
    "https://pod-id-8000.proxy.runpod.net/stream",
    secret: signing_secret,
    expires_in: 3600
  )

:ok = ExAtlas.Auth.SignedUrl.verify(url, secret: signing_secret)

The signature covers the path + canonicalized query + expiry with HMAC-SHA256; verification uses constant-time comparison.
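
To make the canonicalization concrete, here is a sketch of that scheme. The query parameter names (`exp`, `sig`) are assumptions for illustration — see ExAtlas.Auth.SignedUrl for the real URL format:

```elixir
# Illustrative HMAC-SHA256 signed URL: signature covers path + sorted query + expiry.
defmodule SignedUrlSketch do
  def sign(url, secret, expires_in) do
    exp = System.system_time(:second) + expires_in
    uri = URI.parse(url)
    query = (uri.query || "") |> URI.decode_query() |> Map.put("exp", to_string(exp))
    sig = signature(uri.path, query, secret)
    URI.to_string(%{uri | query: URI.encode_query(Map.put(query, "sig", sig))})
  end

  def verify(url, secret) do
    uri = URI.parse(url)
    {sig, query} = uri.query |> URI.decode_query() |> Map.pop("sig")

    cond do
      sig != signature(uri.path, query, secret) -> {:error, :bad_signature}
      String.to_integer(query["exp"]) < System.system_time(:second) -> {:error, :expired}
      true -> :ok
    end
  end

  defp signature(path, query, secret) do
    # Canonicalize: sort query pairs so parameter order can't break verification.
    canonical = path <> "?" <> (query |> Enum.sort() |> URI.encode_query())
    :crypto.mac(:hmac, :sha256, secret, canonical) |> Base.url_encode64(padding: false)
  end
end
```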

Orchestrator — lifecycle, events, reaper

ExAtlas.Orchestrator.spawn/1

Spawns the resource via the provider, then starts an ExAtlas.Orchestrator.ComputeServer under ExAtlas.Orchestrator.ComputeSupervisor that:

  1. Registers itself in ExAtlas.Orchestrator.ComputeRegistry under {:compute, id}.
  2. Traps exits — its terminate/2 always calls ExAtlas.terminate/2 on the upstream provider, whether the supervisor shuts it down or it exits on an idle timeout.
  3. Tracks :last_activity_ms and compares against :idle_ttl_ms on every heartbeat tick. If idle, the server stops normally and the upstream resource is destroyed.
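
The idle decision in step 3 reduces to pure arithmetic over two timestamps. A sketch of just that decision (the real ComputeServer tracks more state than this):

```elixir
# Illustrative only: touch/tick logic behind the idle-TTL check.
defmodule IdleSketch do
  # state :: %{last_activity_ms: integer, idle_ttl_ms: integer}
  def touch(state, now_ms), do: %{state | last_activity_ms: now_ms}

  def tick(state, now_ms) do
    # No heartbeat within the TTL window -> stop normally, destroy upstream.
    if now_ms - state.last_activity_ms >= state.idle_ttl_ms,
      do: {:stop, :idle_timeout},
      else: :continue
  end
end
```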

PubSub events

Every state change is broadcast over ExAtlas.PubSub on the topic "compute:<id>" as {:atlas_compute, id, event}:

Event Emitted when
{:status, :running}ComputeServer starts
{:heartbeat, monotonic_ms} Heartbeat tick (no idle timeout)
{:terminating, reason} Server is about to shut down
{:status, :terminated} Upstream provider confirmed termination
{:terminate_failed, error} Upstream terminate call returned an error

Subscribe in a LiveView:

Phoenix.PubSub.subscribe(ExAtlas.PubSub, "compute:" <> compute.id)

def handle_info({:atlas_compute, _id, {:status, :terminated}}, socket) do
  {:noreply, put_flash(socket, :info, "Session ended")}
end

Reaper

ExAtlas.Orchestrator.Reaper runs periodically (configurable, default 60s) and:

  1. Lists each configured provider's running resources.
  2. Compares against the resources tracked by the local ComputeRegistry.
  3. Terminates any orphan whose :name starts with :reap_name_prefix (default "atlas-").

The prefix is a safety switch so ExAtlas never touches pods created by other tools on the same cloud account. Set it to "" to disable.
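
The orphan selection is a set difference guarded by the prefix filter. A sketch under the assumption that both sides are identified by name (the real Reaper works on richer structs):

```elixir
# Illustrative only: provider-side names minus locally tracked ones,
# restricted to the safety prefix so foreign pods are never touched.
defmodule ReaperSketch do
  def orphans(remote_names, tracked_names, prefix) do
    tracked = MapSet.new(tracked_names)

    remote_names
    |> Enum.filter(&String.starts_with?(&1, prefix))  # safety switch
    |> Enum.reject(&MapSet.member?(tracked, &1))      # still tracked -> not an orphan
  end
end
```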

Phoenix LiveDashboard integration

If your Phoenix app already mounts Phoenix.LiveDashboard, adding an ExAtlas tab is a one-liner — the library ships ExAtlas.LiveDashboard.ComputePage:

# lib/my_app_web/router.ex
import Phoenix.LiveDashboard.Router

live_dashboard "/dashboard",
  metrics: MyAppWeb.Telemetry,
  allow_destructive_actions: true,   # required for Stop/Terminate buttons
  additional_pages: [
    atlas: ExAtlas.LiveDashboard.ComputePage
  ]

Visit /dashboard/atlas to see a live-refreshing table of every tracked compute resource with per-row Touch, Stop, and Terminate controls. The page is only compiled when :phoenix_live_dashboard is in your deps (both LiveDashboard and LiveView are declared as optional: true in ExAtlas, so library-only users pay nothing).

HTTP layer + telemetry

Every provider uses Req under the hood.

Telemetry events

Every request emits [:ex_atlas, <provider>, :request]:

Measurement  Value
status       HTTP status code

Metadata  Value
api       :management / :runtime / :graphql
method    :get / :post / :delete / ...
url       Full request URL

Wire into your existing telemetry pipeline:

:telemetry.attach(
  "atlas-http-logger",
  [:ex_atlas, :runpod, :request],
  fn _event, measurements, metadata, _ ->
    Logger.info("ExAtlas → RunPod #{metadata.method} #{metadata.url} → #{measurements.status}")
  end,
  nil
)

Per-call Req overrides

Any option accepted by Req.new/1 can be passed via req_options:

ExAtlas.spawn_compute(
  provider: :runpod,
  gpu: :h100,
  image: "...",
  req_options: [receive_timeout: 60_000, max_retries: 5, plug: MyPlug]
)

Error handling

All provider callbacks return {:ok, value} or {:error, %ExAtlas.Error{}}. The error struct has a stable :kind atom you can pattern-match on:

Kind           When it happens
:unauthorized  Bad or missing API key (HTTP 401)
:forbidden     API key lacks permission (HTTP 403)
:not_found     Resource doesn't exist (HTTP 404)
:rate_limited  Provider 429
:timeout       Client-side timeout (e.g. run_sync over cap)
:unsupported   Provider lacks this capability
:validation    ExAtlas-side validation (e.g. missing :endpoint)
:provider      Provider-reported 4xx/5xx with no finer bucket
:transport     HTTP/socket failure
:unknown       Anything else

case ExAtlas.spawn_compute(provider: :runpod, gpu: :h100, image: "...") do
  {:ok, compute} -> ...
  {:error, %ExAtlas.Error{kind: :unauthorized}} -> rotate_key()
  {:error, %ExAtlas.Error{kind: :rate_limited}} -> backoff()
  {:error, err} -> Logger.error(Exception.message(err))
end

Writing your own provider

defmodule MyCloud.Provider do
  @behaviour ExAtlas.Provider

  @impl true
  def capabilities, do: [:http_proxy]

  @impl true
  def spawn_compute(%ExAtlas.Spec.ComputeRequest{} = req, ctx) do
    # translate `req` into your cloud's native payload,
    # POST it with Req, normalize the response into %ExAtlas.Spec.Compute{}
  end

  # ... implement the other callbacks ...
end

# Use it without any further configuration:
ExAtlas.spawn_compute(provider: MyCloud.Provider, gpu: :h100, image: "...")

Register it with a short atom by mapping it in your own code — ExAtlas accepts modules directly, so the atom is a convenience:

defmodule MyApp.ExAtlas do
  defdelegate spawn_compute(opts), to: ExAtlas
  # Or wrap ExAtlas and inject a default provider module
end

Testing

The ExAtlas.Test.ProviderConformance macro runs a shared ExUnit suite against any provider implementation:

defmodule MyCloud.ProviderTest do
  use ExUnit.Case, async: false

  use ExAtlas.Test.ProviderConformance,
    provider: MyCloud.Provider,
    reset: {MyCloud.TestHelpers, :reset_fixtures, []}
end

For unit tests that don't actually talk to a cloud, use the built-in ExAtlas.Providers.Mock:

setup do
  ExAtlas.Providers.Mock.reset()
  :ok
end

test "my code is provider-agnostic" do
  {:ok, compute} = MyApp.do_work(provider: :mock)
  assert compute.status == :running
end

RunPod tests against the live cloud are tagged @tag :live and are excluded from mix test by default — set RUNPOD_API_KEY and run mix test --only live to enable them.

Security considerations

Troubleshooting & FAQ

Q: (RuntimeError) ExAtlas.Orchestrator is not started You didn't set config :ex_atlas, start_orchestrator: true. The orchestrator is opt-in.

Q: {:error, %ExAtlas.Error{kind: :unauthorized}} on every RunPod call Your API key is missing or wrong. Check the resolution order: per-call api_key: → config :ex_atlas, :runpod, api_key: → RUNPOD_API_KEY env var.

Q: get_job/2 returns {:error, %ExAtlas.Error{kind: :validation}} with "requires :endpoint" RunPod's serverless API is scoped to an endpoint id. Pass it: ExAtlas.get_job(job.id, provider: :runpod, endpoint: "abc123").

Q: My LiveDashboard ExAtlas tab is empty. Either the orchestrator isn't running, or nothing has been spawned with ExAtlas.Orchestrator.spawn/1. Non-tracked resources (spawned via ExAtlas.spawn_compute/1 directly) don't show in the table — they're not under supervision.

Q: Stop/Terminate buttons don't show. Set allow_destructive_actions: true on the live_dashboard call.

Q: I want to use ExAtlas with httpc / Mint / Finch directly instead of Req. Rewrite the provider module, or pass a custom Req adapter via req_options: [adapter: my_adapter]. The ExAtlas.Provider contract doesn't mandate Req — it's an implementation choice of the bundled providers.

Roadmap

All future providers will be additive; adding a provider never breaks existing call sites.

Contributing

PRs welcome. Before opening:

mix format
mix compile --warnings-as-errors
mix test
mix docs              # verify docstrings render

For new providers, the shared conformance suite (test/support/provider_conformance.ex) must pass against your module.

License

Apache-2.0. See LICENSE.