BumblebeeQuantized

4-bit quantized LLM inference with LoRA adapters for Apple Silicon.

Run 8B-parameter models in ~5GB of RAM, with LoRA fine-tuning support.

Features

- 4-bit quantized inference for MLX-format models stored as safetensors
- Runtime LoRA adapter loading and application
- LoRA training workflow (shells out to Python mlx_lm)
- Text generation through Nx.Serving
- Tokenization via Bumblebee

Requirements

- An Apple Silicon Mac (MLX runs on Metal)
- Elixir with Nx and Bumblebee
- The EMLX fork with quantization ops (see Installation below)
- Python with mlx_lm, if you want to train adapters

Installation

def deps do
  [
    {:bumblebee_quantized, "~> 0.1.0"},
    # REQUIRED: EMLX with quantization ops (not on Hex yet)
    {:emlx, github: "notactuallytreyanastasio/emlx", branch: "feat/quantization-ops"}
  ]
end

Note: The EMLX quantization ops are pending upstream merge (PR #95). Once merged, you'll only need {:bumblebee_quantized, "~> 0.1.0"}.

Quick Start

# Load a quantized model
{:ok, model} = BumblebeeQuantized.load_model(
  "/path/to/Qwen3-8B-MLX-4bit"
)

# Load a LoRA adapter (optional)
{:ok, adapter} = BumblebeeQuantized.load_adapter("/path/to/adapter")

# Load tokenizer via Bumblebee
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-8B"})

# Create a serving and generate text
serving = BumblebeeQuantized.Serving.new(model, tokenizer,
  adapter: adapter,
  max_new_tokens: 100,
  temperature: 0.8
)

Nx.Serving.run(serving, "Write a post about Elixir")
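
In a long-running application, the same serving can be started under your supervision tree and shared across processes. This is standard Nx.Serving usage (the MyApp.TextServing name below is a placeholder):

# In your application's supervision tree
children = [
  {Nx.Serving, serving: serving, name: MyApp.TextServing}
]

# From any process in the app:
Nx.Serving.batched_run(MyApp.TextServing, "Write a post about Elixir")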

Full Training Workflow

# 1. Prepare training data
posts = ["First post...", "Second post..."]  # one string per training example

BumblebeeQuantized.Training.prepare_data(posts, "/path/to/data",
  prompt: "Write a post in my style",
  min_length: 160
)

# 2. Train adapter (calls Python mlx_lm)
{:ok, adapter_path} = BumblebeeQuantized.Training.train(
  base_model: "lmstudio-community/Qwen3-8B-MLX-4bit",
  training_data: "/path/to/data",
  output_path: "/path/to/adapter",
  iterations: 25_000,
  rank: 8,
  scale: 20.0
)

# 3. Load and use
{:ok, model} = BumblebeeQuantized.load_model("/path/to/Qwen3-8B-MLX-4bit")
{:ok, adapter} = BumblebeeQuantized.load_adapter(adapter_path)
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-8B"})

serving = BumblebeeQuantized.Serving.new(model, tokenizer, adapter: adapter)
Nx.Serving.run(serving, "Write a post")

Performance

Tested on Apple Silicon:

| Metric | Value |
| --- | --- |
| Model | Qwen3-8B-4bit |
| Memory Usage | ~5GB |
| Model Load Time | 4-6 seconds |
| Single Token Latency | ~7ms (135 tok/s) |
| Generation Throughput | ~21 tok/s |
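
These figures will vary with hardware and prompt length. A quick way to sanity-check throughput locally, reusing the serving from Quick Start (which generates up to max_new_tokens: 100):

{micros, _result} =
  :timer.tc(fn ->
    Nx.Serving.run(serving, "Write a post about Elixir")
  end)

IO.puts("Elapsed: #{micros / 1_000_000} s")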

Modules

| Module | Description |
| --- | --- |
| BumblebeeQuantized.Loader | Load quantized models from safetensors |
| BumblebeeQuantized.Adapters | Load, apply, and train LoRA adapters |
| BumblebeeQuantized.Serving | Nx.Serving for text generation |
| BumblebeeQuantized.Training | LoRA training workflow |
| BumblebeeQuantized.Models.Qwen3 | Qwen3 quantized model definition |

Supported Models

Currently supported:

- Qwen3 (4-bit MLX quantization, e.g. Qwen3-8B)

Planned:

How It Works

  1. Quantized Weights: Models are stored in MLX 4-bit format with weight triplets (packed uint32, scales, biases); see the dequantization sketch after this list

  2. EMLX Backend: Uses our EMLX fork with a quantized_matmul NIF

  3. Runtime LoRA: Adapters are applied at inference time: output = base_output + scale * (x @ A @ B); sketched after this list

  4. Bumblebee Tokenizer: Uses Bumblebee's tokenizer for text encoding/decoding
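
For illustration, here is a minimal sketch of step 1's dequantization in plain Nx. The low-nibble-first packing order and the group size of 64 are assumptions about the MLX layout rather than guarantees, and in practice this work happens inside EMLX's quantized_matmul NIF, not in Elixir:

defmodule DequantSketch do
  # packed: {rows, cols / 8} uint32 (8 4-bit values per word)
  # scales/biases: {rows, cols / group_size}, one pair per quantization group
  # group_size of 64 is an assumption, not a confirmed constant
  def dequantize(packed, scales, biases, group_size \\ 64) do
    {rows, packed_cols} = Nx.shape(packed)
    cols = packed_cols * 8
    groups = div(cols, group_size)

    # Pull the 8 nibbles out of each uint32 (assumed low nibble first).
    shifts = Nx.multiply(Nx.iota({1, 1, 8}), 4)

    packed
    |> Nx.reshape({rows, packed_cols, 1})
    |> Nx.right_shift(shifts)
    |> Nx.bitwise_and(0xF)
    |> Nx.reshape({rows, groups, group_size})
    # Affine dequantization per group: w = q * scale + bias
    |> Nx.multiply(Nx.reshape(scales, {rows, groups, 1}))
    |> Nx.add(Nx.reshape(biases, {rows, groups, 1}))
    |> Nx.reshape({rows, cols})
  end
end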
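
Step 3's runtime LoRA update is also easy to picture in Nx. A minimal sketch, assuming the adapter carries A ({in_features, rank}), B ({rank, out_features}), and a scale (the field names here are hypothetical, not the actual adapter struct):

# output = base_output + scale * (x @ A @ B)
def apply_lora(base_output, x, %{a: a, b: b, scale: scale}) do
  delta = x |> Nx.dot(a) |> Nx.dot(b)
  Nx.add(base_output, Nx.multiply(scale, delta))
end

Because the rank is small (8 in the training example above), the two extra matmuls are cheap relative to the frozen base layer.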

Related Projects

- Bumblebee - pre-trained models and tokenizers for Elixir (provides the tokenizer used here)
- EMLX - MLX backend for Nx (this project depends on a fork with quantization ops)
- mlx_lm - the Python package the training workflow shells out to

License

MIT