ExTorch

Elixir bindings for libtorch -- production ML model serving on the BEAM.

Train in Python, serve from Elixir. ExTorch runs PyTorch models with OTP fault tolerance, with inference 1.35x faster on average than Python's own export runtime.

Why ExTorch?

Faster than Python. The pre-compiled graph executor beats Python's FX interpreter on every tested model -- 1.35x faster on average, bit-for-bit identical outputs.

| Model | Python Export | ExTorch Compiled | Speedup |
|---|---|---|---|
| ResNet50 | 7.21ms | 4.96ms | 1.45x |
| MobileNetV2 | 6.56ms | 4.07ms | 1.61x |
| ViT-B/16 | 9.53ms | 9.46ms | 1.01x |
| SqueezeNet | 2.77ms | 1.98ms | 1.40x |
| DistilBERT | 0.78ms | 0.59ms | 1.32x |

RTX 3060, median latency, 30 iterations. Full results for 12 models in examples/models.

Four inference paths for every use case:

| Path | Use case | ViT-B/16 latency |
|---|---|---|
| forward/2 | Debug, profile, op-by-op introspection | 54.9ms |
| forward_native/2 | Production, single NIF call | 11.9ms |
| forward_compiled/2 | Pre-compiled graph, fastest Export path | 9.5ms |
| ExTorch.AOTI | Compiled kernels, maximum throughput | 8.8ms |

Production-ready serving. GenServer model pools, telemetry events, ETS-backed metrics, zero-downtime hot model reload -- not bolted on, designed in.

Extensible op ecosystem. The generic c10::Dispatcher bridge lets pure-Elixir packages register new ops without C++ code. ExTorch.Vision adds torchvision ops (NMS, ROI Align, deformable conv, image I/O) this way.

Zero-copy with Nx. Share tensor memory between ExTorch and Nx/Torchx via raw pointer exchange -- no data copying.

Bit-for-bit accurate. All inference paths produce identical outputs to Python (verified across 11 models, 3 paths each, max absolute error = 0.0).

Features

Requirements

Installation

Add extorch to your dependencies in mix.exs:

def deps do
  [
    {:extorch, "~> 0.3.0"}
  ]
end

ExTorch downloads libtorch automatically on first compile. To use a local installation:

config :extorch, libtorch: [
  version: :local,
  folder: :python  # or an absolute path to libtorch
]

Quick Start

Train in Python, Serve from Elixir

# Python: export your model
import torch
import torchvision

model = torchvision.models.resnet50(pretrained=True).eval()
exported = torch.export.export(model, (torch.randn(1, 3, 224, 224),))
torch.export.save(exported, "resnet50.pt2")

# Elixir: load and serve
model = ExTorch.Export.load("resnet50.pt2", device: :cuda)
input = ExTorch.Tensor.to(ExTorch.randn({1, 3, 224, 224}), device: :cuda)

# Fastest path — pre-compiled graph, zero per-op overhead
output = ExTorch.Export.forward_compiled(model, [input])

# Or use AOTI for maximum throughput (requires pre-compilation in Python)
aoti_model = ExTorch.AOTI.load("resnet50_aoti.pt2", device_index: 0)
[output] = ExTorch.AOTI.forward(aoti_model, [input])

Production Serving with GenServer

# Supervised model server with telemetry
{:ok, _} = ExTorch.Export.Server.start_link(
  path: "resnet50.pt2",
  device: :cuda,
  name: :resnet
)

# Thread-safe inference
{:ok, output} = ExTorch.Export.Server.predict(:resnet, [input])

# Monitor performance
ExTorch.Metrics.setup()
ExTorch.Metrics.get("resnet50.pt2")
# => %{inference_count: 1500, min_duration_ms: 4.9, max_duration_ms: 12.1, ...}
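Beyond the built-in metrics store, the telemetry events can be consumed with a standard `:telemetry` handler. A minimal sketch, assuming ExTorch emits a `[:extorch, :inference, :stop]` event with a `:duration` measurement and a `:model` metadata key (check the ExTorch docs for the actual event names and payloads):

```elixir
# Attach a handler that logs per-request inference latency.
# Event name and payload shape are assumptions for illustration.
:telemetry.attach(
  "log-inference-latency",
  [:extorch, :inference, :stop],
  fn _event, %{duration: duration}, %{model: model}, _config ->
    ms = System.convert_time_unit(duration, :native, :millisecond)
    IO.puts("#{model} inference took #{ms}ms")
  end,
  nil
)
```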

Hot Model Reload

# Swap models without dropping requests
# See examples/serving/hot_reload.exs for the full pattern
GenServer.cast(:resnet, {:reload, "resnet50_v2.pt2"})
# In-flight requests complete on old model, new requests use new model
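The core of the pattern can be sketched as a `handle_cast/2` clause in the serving GenServer (the full, production-ready version lives in examples/serving/hot_reload.exs; the state fields below are assumptions). Loading the new model before swapping it into state is what lets in-flight calls on the old model finish untouched:

```elixir
# Hypothetical reload callback: load first, then swap atomically.
# Requests already executing still hold a reference to the old model.
def handle_cast({:reload, path}, state) do
  new_model = ExTorch.Export.load(path, device: state.device)
  {:noreply, %{state | model: new_model, path: path}}
end
```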

Extending with Custom Ops

# Load torchvision ops (NMS, ROI Align, etc.)
ExTorch.Native.load_torch_library("/path/to/libtorchvision.so")

# Call any registered op by name
keep = ExTorch.Native.dispatch_op("torchvision::nms", "", [
  {:tensor, boxes}, {:tensor, scores}, {:float, 0.5}
])

# Or use ExTorch.Vision for a clean API
ExTorch.Vision.nms(boxes, scores, 0.5)
ExTorch.Vision.roi_align(features, rois, 1.0, 7, 7)

Zero-Copy Tensor Exchange with Nx

# ExTorch → Nx (via Torchx): share memory, no copy
blob = ExTorch.Tensor.Blob.to_blob(tensor)
# => %Blob{ptr: 140234567890, shape: {3, 224, 224}, dtype: :float, ...}

# Nx → ExTorch: wrap foreign memory
view = ExTorch.Tensor.Blob.from_blob(
  %{ptr: torchx_ptr, shape: {3, 224, 224}, dtype: :float32},
  owner: nx_tensor
)

CUDA Support

ExTorch.Native.cuda_is_available()    # => true
ExTorch.Native.cuda_device_count()    # => 2

model = ExTorch.Export.load("model.pt2", device: :cuda)
ExTorch.Native.cuda_memory_allocated(0)  # bytes on GPU 0

Deployment Examples

See examples/serving/ for production serving patterns, and examples/models/ for real-world model deployment.

Architecture

Three-layer design: C++ (libtorch wrapper) → Rust (cxx bridge + Rustler NIFs) → Elixir (macro-generated API).

The generic c10::Dispatcher NIF bridge (dispatch_op, execute_graph, compile_graph) enables calling any PyTorch op without per-op C++ wrappers, and the OpHandler behaviour allows external packages to extend the Export interpreter.
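As a rough sketch of what extending the interpreter might look like, an external package could implement the behaviour and delegate to the generic dispatcher. The callback name, argument shape, and return convention here are assumptions, not the actual `OpHandler` contract; consult the ExTorch.Export.OpHandler documentation:

```elixir
# Hypothetical OpHandler implementation: intercept one op,
# pass everything else back to the default interpreter.
defmodule MyPackage.OpHandler do
  @behaviour ExTorch.Export.OpHandler

  def handle_op("mypackage::fused_gelu", _overload, [{:tensor, t}]) do
    {:ok, ExTorch.Native.dispatch_op("aten::gelu", "", [{:tensor, t}])}
  end

  def handle_op(_name, _overload, _args), do: :pass
end
```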

License

MIT