Edifice


A comprehensive ML architecture library for Elixir, built on Nx and Axon.

186 neural network architectures across 25 families — from MLPs to Mamba, transformers to graph networks, VAEs to spiking neurons, audio codecs to robotics, scientific ML to 3D generation.

Why Edifice?

The Elixir ML ecosystem has excellent numerical computing (Nx) and model building (Axon) foundations, but no comprehensive collection of ready-to-use architectures. Edifice fills that gap with a single, consistent build API across all of the families below.

Installation

Add edifice to your dependencies in mix.exs:

def deps do
  [
    {:edifice, "~> 0.2.0"}
  ]
end

Edifice requires Nx ~> 0.10 and Axon ~> 0.8. For GPU acceleration, add EXLA:

{:exla, "~> 0.10"}
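To route Nx computation through EXLA by default, the standard Nx backend setting applies (this is plain Nx/EXLA configuration, not an Edifice-specific option):

```elixir
# config/config.exs
import Config

# Make EXLA the default backend for all Nx tensor operations
config :nx, default_backend: EXLA.Backend
```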

Tip: On Elixir 1.19+, set MIX_OS_DEPS_COMPILE_PARTITION_COUNT=4 to compile dependencies in parallel (up to 4x faster first build).

Quick Start

# Build any architecture by name
model = Edifice.build(:mamba, embed_size: 256, hidden_size: 512, num_layers: 4)

# Or use the module directly for more control
model = Edifice.SSM.Mamba.build(
  embed_size: 256,
  hidden_size: 512,
  state_size: 16,
  num_layers: 4,
  window_size: 60
)

# Build and run
{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 60, 256}, :f32), Axon.ModelState.empty())
output = predict_fn.(params, input)

# Explore what's available
Edifice.list_architectures()
# => [:attention, :bayesian, :capsule, :deep_sets, :densenet, :diffusion, ...]

Edifice.list_families()
# => %{ssm: [:mamba, :mamba_ssd, :s5, ...], attention: [:attention, :retnet, ...], ...}

Architecture Families

Feedforward

Architecture | Module | Key Feature
MLP | Edifice.Feedforward.MLP | Multi-layer perceptron with configurable hidden sizes
KAN | Edifice.Feedforward.KAN | Kolmogorov-Arnold Networks, learnable activation functions
KAT | Edifice.Feedforward.KAT | Kolmogorov-Arnold Transformer (KAN + attention)
TabNet | Edifice.Feedforward.TabNet | Attentive feature selection for tabular data
BitNet | Edifice.Feedforward.BitNet | Ternary/binary weight quantization (1.58-bit)

Transformer

Architecture | Module | Key Feature
Decoder-Only | Edifice.Transformer.DecoderOnly | GPT-style with GQA, RoPE/iRoPE, SwiGLU, RMSNorm
Multi-Token Prediction | Edifice.Transformer.MultiTokenPrediction | Predict next N tokens simultaneously
Byte Latent Transformer | Edifice.Transformer.ByteLatentTransformer | Byte-level processing via encoder-latent-decoder
Nemotron-H | Edifice.Transformer.NemotronH | NVIDIA's hybrid Mamba-Transformer

State Space Models

Architecture | Module | Key Feature
S4 | Edifice.SSM.S4 | HiPPO DPLR initialization, long-range memory
S4D | Edifice.SSM.S4D | Diagonal state space, simplified S4
S5 | Edifice.SSM.S5 | MIMO diagonal SSM with D skip connection
H3 | Edifice.SSM.H3 | Two SSMs with multiplicative gating + short convolution
Hyena | Edifice.SSM.Hyena | Long convolution hierarchy, implicit filters
Mamba | Edifice.SSM.Mamba | Selective SSM, parallel associative scan
Mamba-2 (SSD) | Edifice.SSM.MambaSSD | Structured state space duality, chunk-wise matmul
Mamba (Cumsum) | Edifice.SSM.MambaCumsum | Mamba with configurable scan algorithm
Mamba (Hillis-Steele) | Edifice.SSM.MambaHillisSteele | Mamba with max-parallelism scan
BiMamba | Edifice.SSM.BiMamba | Bidirectional Mamba for non-causal tasks
GatedSSM | Edifice.SSM.GatedSSM | Gated temporal SSM with gradient checkpointing
Jamba | Edifice.SSM.Hybrid | Mamba + Attention hybrid (configurable ratio)
Zamba | Edifice.SSM.Zamba | Mamba + single shared attention layer
StripedHyena | Edifice.SSM.StripedHyena | Interleaved Hyena long conv + gated conv
Mamba-3 | Edifice.SSM.Mamba3 | Complex states, trapezoidal discretization, MIMO
GSS | Edifice.SSM.GSS | Gated State Space (simplified S4 with gating)
Hymba | Edifice.SSM.Hymba | Hybrid Mamba + attention with learnable meta tokens
SS Transformer | Edifice.SSM.SSTransformer | State Space Transformer

Attention & Linear Attention

Architecture | Module | Key Feature
Multi-Head Attention | Edifice.Attention.MultiHead | Sliding window, QK LayerNorm
GQA | Edifice.Attention.GQA | Grouped Query Attention, fewer KV heads
Perceiver | Edifice.Attention.Perceiver | Cross-attention to learned latents, input-agnostic
FNet | Edifice.Attention.FNet | Fourier transform replacing attention
Linear Transformer | Edifice.Attention.LinearTransformer | Kernel-based O(N) attention
Nystromformer | Edifice.Attention.Nystromformer | Nystrom approximation of attention matrix
Performer | Edifice.Attention.Performer | FAVOR+ random feature attention
RetNet | Edifice.Attention.RetNet | Multi-scale retention, O(1) recurrent inference
RWKV-7 | Edifice.Attention.RWKV | Linear attention, O(1) space, "Goose" architecture
GLA | Edifice.Attention.GLA | Gated Linear Attention with data-dependent decay
HGRN-2 | Edifice.Attention.HGRN | Hierarchically gated linear RNN, state expansion
Griffin/Hawk | Edifice.Attention.Griffin | RG-LRU + local attention (Griffin) or pure RG-LRU (Hawk)
Diff Transformer | Edifice.Attention.DiffTransformer | Noise-cancelling dual softmax subtraction
MLA | Edifice.Attention.MLA | Multi-Head Latent Attention (DeepSeek KV compression)
Based | Edifice.Attention.Based | Taylor expansion linear attention
Mega | Edifice.Attention.Mega | Moving average + gated attention
InfiniAttention | Edifice.Attention.InfiniAttention | Compressive memory for unbounded context
Conformer | Edifice.Attention.Conformer | Conv-augmented transformer for audio/speech
Ring Attention | Edifice.Attention.RingAttention | Distributed chunked attention for long sequences
Lightning Attention | Edifice.Attention.LightningAttention | Hybrid linear/softmax with I/O-aware tiling
Gated Attention | Edifice.Attention.GatedAttention | Sigmoid post-attention gate (NeurIPS 2025)
NSA | Edifice.Attention.NSA | Native Sparse Attention (DeepSeek three-path)
KDA | Edifice.Attention.KDA | Kimi Delta Attention, channel-wise decay
Flash Linear Attention | Edifice.Attention.FlashLinearAttention | Optimized linear attention
YaRN | Edifice.Attention.YARN | RoPE context extension via frequency scaling
Dual Chunk | Edifice.Attention.DualChunk | Dual Chunk Attention for long contexts
TMRoPE | Edifice.Attention.TMRoPE | Time-aligned Multimodal RoPE
RNoPE-SWA | Edifice.Attention.RNoPESWA | No positional encoding + sliding window
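Any entry in this table can be built through the unified registry shown in Quick Start. A sketch using RetNet (the :retnet atom and the embed_size/hidden_size/num_layers options appear in the registry example later in this README; the input shape and any RetNet-specific defaults are assumptions to adjust for your model):

```elixir
# Build a RetNet model via the registry, then run the usual
# Axon build/init/predict cycle on a {batch, seq_len, embed} input
model = Edifice.build(:retnet, embed_size: 256, hidden_size: 512, num_layers: 4)

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 60, 256}, :f32), Axon.ModelState.empty())
output = predict_fn.(params, Nx.broadcast(0.5, {1, 60, 256}))
```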

Recurrent Networks

Architecture | Module | Key Feature
LSTM/GRU | Edifice.Recurrent | Classic recurrent cells with multi-layer stacking
xLSTM | Edifice.Recurrent.XLSTM | Exponential gating, matrix memory (sLSTM/mLSTM)
MinGRU | Edifice.Recurrent.MinGRU | Minimal GRU, parallel-scannable
MinLSTM | Edifice.Recurrent.MinLSTM | Minimal LSTM, parallel-scannable
DeltaNet | Edifice.Recurrent.DeltaNet | Delta rule-based linear RNN
TTT | Edifice.Recurrent.TTT | Test-Time Training, self-supervised at inference
Titans | Edifice.Recurrent.Titans | Neural long-term memory, surprise-gated
Reservoir | Edifice.Recurrent.Reservoir | Echo State Networks with fixed random reservoir
sLSTM | Edifice.Recurrent.SLSTM | Scalar LSTM with exponential gating
xLSTM v2 | Edifice.Recurrent.XLSTMv2 | Updated mLSTM with matrix memory
Gated DeltaNet | Edifice.Recurrent.GatedDeltaNet | Linear attention with data-dependent gating
TTT-E2E | Edifice.Recurrent.TTTE2E | End-to-end test-time training
Native Recurrence | Edifice.Recurrent.NativeRecurrence | Native recurrence block

Vision

Architecture | Module | Key Feature
ViT | Edifice.Vision.ViT | Vision Transformer, patch embedding
DeiT | Edifice.Vision.DeiT | Data-efficient ViT with distillation token
Swin | Edifice.Vision.SwinTransformer | Shifted window attention, hierarchical features
U-Net | Edifice.Vision.UNet | Encoder-decoder with skip connections
ConvNeXt | Edifice.Vision.ConvNeXt | Modernized ConvNet with transformer-inspired design
MLP-Mixer | Edifice.Vision.MLPMixer | Pure MLP with token/channel mixing
FocalNet | Edifice.Vision.FocalNet | Focal modulation, hierarchical context
PoolFormer | Edifice.Vision.PoolFormer | Average pooling token mixer (MetaFormer)
NeRF | Edifice.Vision.NeRF | Neural radiance field, coordinate-to-color mapping
Gaussian Splat | Edifice.Vision.GaussianSplat | 3D Gaussian Splatting (NeRF successor)
MambaVision | Edifice.Vision.MambaVision | 4-stage hierarchical CNN+Mamba+Attention
DINOv2 | Edifice.Vision.DINOv2 | Self-distillation vision backbone
MetaFormer | Edifice.Vision.MetaFormer | Architecture-first framework (+ CAFormer variant)
EfficientViT | Edifice.Vision.EfficientViT | Linear attention ViT

Convolutional

Architecture | Module | Key Feature
Conv1D/2D | Edifice.Convolutional.Conv | Configurable convolution blocks with BN, activation, dropout
ResNet | Edifice.Convolutional.ResNet | Residual/bottleneck blocks, configurable depth
DenseNet | Edifice.Convolutional.DenseNet | Dense connections, feature reuse
TCN | Edifice.Convolutional.TCN | Dilated causal convolutions for sequences
MobileNet | Edifice.Convolutional.MobileNet | Depthwise separable convolutions
EfficientNet | Edifice.Convolutional.EfficientNet | Compound scaling (depth, width, resolution)

Generative Models

Architecture | Module | Key Feature
VAE | Edifice.Generative.VAE | Reparameterization trick, KL divergence, beta-VAE
VQ-VAE | Edifice.Generative.VQVAE | Discrete codebook, straight-through estimator
GAN | Edifice.Generative.GAN | Generator/discriminator, WGAN-GP support
Diffusion (DDPM) | Edifice.Generative.Diffusion | Denoising diffusion, sinusoidal time embedding
DDIM | Edifice.Generative.DDIM | Deterministic diffusion sampling, fast inference
DiT | Edifice.Generative.DiT | Diffusion Transformer, AdaLN-Zero conditioning
Latent Diffusion | Edifice.Generative.LatentDiffusion | Diffusion in compressed latent space
Consistency Model | Edifice.Generative.ConsistencyModel | Single-step generation via consistency training
Score SDE | Edifice.Generative.ScoreSDE | Continuous SDE framework (VP-SDE, VE-SDE)
Flow Matching | Edifice.Generative.FlowMatching | ODE-based generation, multiple loss variants
Normalizing Flow | Edifice.Generative.NormalizingFlow | Affine coupling layers (RealNVP-style)
MMDiT | Edifice.Generative.MMDiT | Multimodal Diffusion Transformer (FLUX.1, SD3)
SoFlow | Edifice.Generative.SoFlow | Flow matching + consistency loss
VAR | Edifice.Generative.VAR | Visual Autoregressive (next-scale prediction)
Linear DiT (SANA) | Edifice.Generative.LinearDiT | Linear attention for diffusion, 100x speedup
SiT | Edifice.Generative.SiT | Scalable Interpolant Transformer
Transfusion | Edifice.Generative.Transfusion | Unified AR text + diffusion images
MAR | Edifice.Generative.MAR | Masked Autoregressive generation
CogVideoX | Edifice.Generative.CogVideoX | 3D causal VAE + expert transformer for video
TRELLIS | Edifice.Generative.TRELLIS | Sparse 3D lattice + rectified flow

Contrastive & Self-Supervised

Architecture | Module | Key Feature
SimCLR | Edifice.Contrastive.SimCLR | NT-Xent contrastive loss, projection head
BYOL | Edifice.Contrastive.BYOL | No negatives, momentum encoder
Barlow Twins | Edifice.Contrastive.BarlowTwins | Cross-correlation redundancy reduction
MAE | Edifice.Contrastive.MAE | Masked Autoencoder, 75% patch masking
VICReg | Edifice.Contrastive.VICReg | Variance-Invariance-Covariance regularization
JEPA | Edifice.Contrastive.JEPA | Joint Embedding Predictive Architecture
Temporal JEPA | Edifice.Contrastive.TemporalJEPA | V-JEPA for video/temporal sequences
SigLIP | Edifice.Contrastive.SigLIP | Sigmoid contrastive learning (CLIP improvement)

Graph & Set Networks

Architecture | Module | Key Feature
GCN | Edifice.Graph.GCN | Spectral graph convolutions (Kipf & Welling)
GAT | Edifice.Graph.GAT | Graph attention with multi-head support
GIN | Edifice.Graph.GIN | Graph Isomorphism Network, maximally expressive
GraphSAGE | Edifice.Graph.GraphSAGE | Inductive learning, neighborhood sampling
Graph Transformer | Edifice.Graph.GraphTransformer | Full attention over nodes with edge features
PNA | Edifice.Graph.PNA | Principal Neighbourhood Aggregation
GINv2 | Edifice.Graph.GINv2 | GIN with edge features
SchNet | Edifice.Graph.SchNet | Continuous-filter convolutions for molecules
EGNN | Edifice.Graph.EGNN | E(n)-equivariant GNN for molecular simulation
DeepSets | Edifice.Sets.DeepSets | Permutation-invariant set functions
PointNet | Edifice.Sets.PointNet | Point cloud processing with T-Net alignment

Energy, Probabilistic & Memory

Architecture | Module | Key Feature
EBM | Edifice.Energy.EBM | Energy-based models, contrastive divergence
Hopfield | Edifice.Energy.Hopfield | Modern continuous Hopfield networks
Neural ODE | Edifice.Energy.NeuralODE | Continuous-depth networks via ODE solvers
Bayesian NN | Edifice.Probabilistic.Bayesian | Weight uncertainty, variational inference
MC Dropout | Edifice.Probabilistic.MCDropout | Uncertainty estimation via dropout at inference
Evidential NN | Edifice.Probabilistic.EvidentialNN | Dirichlet priors for uncertainty
NTM | Edifice.Memory.NTM | Neural Turing Machine, differentiable memory
Memory Network | Edifice.Memory.MemoryNetwork | End-to-end memory with multi-hop attention
Engram | Edifice.Memory.Engram | O(1) hash-based associative memory

Meta-Learning & Specialized

Architecture | Module | Key Feature
MoE | Edifice.Meta.MoE | Mixture of Experts with top-k/hash routing
Switch MoE | Edifice.Meta.SwitchMoE | Top-1 routing with load balancing
Soft MoE | Edifice.Meta.SoftMoE | Fully differentiable soft token routing
LoRA | Edifice.Meta.LoRA | Low-Rank Adaptation for parameter-efficient fine-tuning
Adapter | Edifice.Meta.Adapter | Bottleneck adapter modules for transfer learning
Hypernetwork | Edifice.Meta.Hypernetwork | Networks that generate other networks' weights
Capsule | Edifice.Meta.Capsule | Dynamic routing between capsules
MixtureOfDepths | Edifice.Meta.MixtureOfDepths | Dynamic per-token compute allocation
MixtureOfAgents | Edifice.Meta.MixtureOfAgents | Multi-model proposer + aggregator
RLHF Head | Edifice.Meta.RLHFHead | Reward model and preference heads
DPO | Edifice.Meta.DPO | Direct Preference Optimization
GRPO | Edifice.Meta.GRPO | Group Relative Policy Optimization (DeepSeek-R1)
KTO | Edifice.Meta.KTO | Kahneman-Tversky Optimization (binary feedback)
MoE v2 | Edifice.Meta.MoEv2 | Expert-choice routing + shared experts + bias balancing
DoRA | Edifice.Meta.DoRA | Weight-decomposed LoRA
Speculative Decoding | Edifice.Meta.SpeculativeDecoding | Draft + verify inference acceleration
Test-Time Compute | Edifice.Meta.TestTimeCompute | Adaptive test-time compute
Mixture of Tokenizers | Edifice.Meta.MixtureOfTokenizers | Multi-tokenization expert routing
QAT | Edifice.Meta.QAT | Quantization-Aware Training
Hybrid Builder | Edifice.Meta.HybridBuilder | Configurable SSM/Attention ratio
Liquid NN | Edifice.Liquid | Continuous-time ODE dynamics (LTC cells)
SNN | Edifice.Neuromorphic.SNN | Leaky integrate-and-fire, surrogate gradients
ANN2SNN | Edifice.Neuromorphic.ANN2SNN | Convert trained ANNs to spiking networks

Interpretability

Architecture | Module | Key Feature
Sparse Autoencoder | Edifice.Interpretability.SparseAutoencoder | Feature extraction from model activations
Transcoder | Edifice.Interpretability.Transcoder | Cross-layer mechanistic interpretability

Scientific ML

Architecture | Module | Key Feature
FNO | Edifice.Scientific.FNO | Fourier Neural Operator for solving PDEs

Audio

Architecture | Module | Key Feature
EnCodec | Edifice.Audio.EnCodec | Neural audio codec (encoder → RVQ → decoder)
VALL-E | Edifice.Audio.VALLE | Codec language model for zero-shot TTS
SoundStorm | Edifice.Audio.SoundStorm | Parallel audio token generation

Robotics

Architecture | Module | Key Feature
ACT | Edifice.Robotics.ACT | Action Chunking Transformer for imitation learning
OpenVLA | Edifice.Robotics.OpenVLA | Vision-Language-Action model for robot control

RL & World Models

Architecture | Module | Key Feature
PolicyValue | Edifice.RL.PolicyValue | Actor-critic policy-value network
World Model | Edifice.WorldModel.WorldModel | Encoder + dynamics + reward head
Medusa | Edifice.Inference.Medusa | Multi-head speculative decoding

Multimodal

Architecture | Module | Key Feature
Multimodal Fusion | Edifice.Multimodal.Fusion | MLP projection, cross-attention, Perceiver resampler

Building Blocks

Block | Module | Key Feature
RMSNorm | Edifice.Blocks.RMSNorm | Root Mean Square normalization
SwiGLU | Edifice.Blocks.SwiGLU | Gated FFN with SiLU activation
RoPE | Edifice.Blocks.RoPE | Rotary position embedding
ALiBi | Edifice.Blocks.ALiBi | Attention with linear biases
Patch Embed | Edifice.Blocks.PatchEmbed | Image-to-patch tokenization
Sinusoidal PE | Edifice.Blocks.SinusoidalPE | Fixed sinusoidal position encoding
Adaptive Norm | Edifice.Blocks.AdaptiveNorm | Condition-dependent normalization (AdaLN)
Cross Attention | Edifice.Blocks.CrossAttention | Cross-attention between two sequences
Conv1D/2D | Edifice.Convolutional.Conv | Configurable convolution blocks
FFN | Edifice.Blocks.FFN | Standard and gated feed-forward networks
Transformer Block | Edifice.Blocks.TransformerBlock | Pre-norm block with pluggable attention
Causal Mask | Edifice.Blocks.CausalMask | Unified causal mask creation
Depthwise Conv | Edifice.Blocks.DepthwiseConv | 1D depthwise separable convolution
Model Builder | Edifice.Blocks.ModelBuilder | Sequence/vision model skeletons
Message Passing | Edifice.Graph.MessagePassing | Generic MPNN framework, global pooling
Scalable-Softmax | Edifice.Blocks.SSMax | Drop-in softmax replacement for long sequences
Softpick | Edifice.Blocks.Softpick | Non-saturating sparse attention function
KV Cache | Edifice.Blocks.KVCache | Inference-time KV caching

Guides

New to ML?

Start here if you’re new to machine learning. These guides build from zero to fluency with Edifice’s API and architecture families.

  1. ML Foundations — What neural networks are, how they learn, tensors and shapes
  2. Core Vocabulary — Essential terminology used across all guides
  3. The Problem Landscape — Classification, generation, sequence modeling — which architectures solve which problems
  4. Reading Edifice — The build/init/predict pattern, Axon graphs, shapes, and runnable examples
  5. Learning Path — A guided tour through the architecture families

Reference

Architecture Guides

Conceptual guides covering theory, architecture evolution, and decision tables for each family:

  - Sequence Processing
  - Representation Learning
  - Generative & Dynamic
  - Composition & Enhancement

Examples

See examples/ for runnable scripts including mlp_basics.exs, sequence_comparison.exs, graph_classification.exs, vae_generation.exs, and architecture_tour.exs.

Mamba for Sequence Modeling

model = Edifice.SSM.Mamba.build(
  embed_size: 128,
  hidden_size: 256,
  state_size: 16,
  num_layers: 4,
  window_size: 100
)

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 100, 128}, :f32), Axon.ModelState.empty())
output = predict_fn.(params, Nx.broadcast(0.5, {1, 100, 128}))
# => {1, 256}

Graph Classification with GCN

model = Edifice.Graph.GCN.build_classifier(
  input_dim: 16,
  hidden_dims: [64, 64],
  num_classes: 2,
  pool: :mean
)

{init_fn, predict_fn} = Axon.build(model)

params = init_fn.(
  %{
    "nodes" => Nx.template({4, 10, 16}, :f32),
    "adjacency" => Nx.template({4, 10, 10}, :f32)
  },
  Axon.ModelState.empty()
)

output = predict_fn.(params, %{
  "nodes" => Nx.broadcast(0.5, {4, 10, 16}),
  "adjacency" => Nx.eye(10) |> Nx.broadcast({4, 10, 10})
})
# => {4, 2}

VAE with Reparameterization

{encoder, decoder} = Edifice.Generative.VAE.build(
  input_size: 784,
  latent_size: 32,
  encoder_sizes: [512, 256],
  decoder_sizes: [256, 512]
)

# Encoder outputs mu and log_var
{init_fn, predict_fn} = Axon.build(encoder)
params = init_fn.(Nx.template({1, 784}, :f32), Axon.ModelState.empty())
%{mu: mu, log_var: log_var} = predict_fn.(params, Nx.broadcast(0.5, {1, 784}))

# Sample latent vector (requires PRNG key for stochastic sampling)
key = Nx.Random.key(42)
{z, _new_key} = Edifice.Generative.VAE.reparameterize(mu, log_var, key)

# KL divergence for training
kl_loss = Edifice.Generative.VAE.kl_divergence(mu, log_var)

Permutation-Invariant Set Processing

model = Edifice.Sets.DeepSets.build(
  input_dim: 3,
  hidden_dim: 64,
  output_dim: 10,
  pool: :mean
)

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({4, 20, 3}, :f32), Axon.ModelState.empty())
# Process sets of 20 3D points
output = predict_fn.(params, Nx.broadcast(0.5, {4, 20, 3}))
# => {4, 10}

API Design

Every architecture module follows the same pattern:

# Module.build(opts) returns an Axon model
model = Edifice.SSM.Mamba.build(embed_size: 256, hidden_size: 512)

# Some modules expose layer-level builders for composition
layer = Edifice.Graph.GCN.gcn_layer(nodes, adjacency, output_dim)

# Generative models may return tuples
{encoder, decoder} = Edifice.Generative.VAE.build(input_size: 784)

# Utility functions for training
loss = Edifice.Generative.VAE.loss(reconstruction, target, mu, log_var)
energy = Edifice.Energy.Hopfield.energy(query, patterns, beta)

The unified registry lets you build any architecture by name:

# Useful for hyperparameter search, config-driven experiments
for arch <- [:mamba, :retnet, :griffin, :gla] do
  model = Edifice.build(arch, embed_size: 256, hidden_size: 512, num_layers: 4)
  # ... train and evaluate
end
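Because build/2 returns an ordinary Axon graph, the result feeds straight into Axon's training loop. A minimal sketch, assuming the model pools its sequence input to a {batch, hidden_size} output (the synthetic data, loss, and optimizer here are illustrative choices, not Edifice requirements):

```elixir
model = Edifice.build(:mamba, embed_size: 256, hidden_size: 512, num_layers: 4)

# Synthetic {input, target} batches; real code would stream a dataset
inputs = Nx.broadcast(0.5, {8, 60, 256})
targets = Nx.broadcast(1.0, {8, 512})
data = Stream.repeatedly(fn -> {inputs, targets} end) |> Stream.take(10)

# Axon.Loop handles init, gradient steps, and metrics
model
|> Axon.Loop.trainer(:mean_squared_error, :adam)
|> Axon.Loop.run(data, %{}, epochs: 2)
```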

Requirements

Nx ~> 0.10 and Axon ~> 0.8, as noted under Installation. EXLA is optional and only needed for GPU acceleration.

License

MIT License. See LICENSE for details.