BumblebeeQuantized

4-bit quantized LLM inference with LoRA adapters for Apple Silicon.

Run 8B-parameter models in ~5GB of RAM, with LoRA fine-tuning support.

Features

- 4-bit quantized inference for MLX-format models stored as safetensors
- Runtime LoRA adapter loading and application
- LoRA training workflow (shells out to Python mlx_lm)
- Text generation through Nx.Serving
- Tokenization via Bumblebee

Requirements

- An Apple Silicon Mac (MLX runs on Metal)
- Elixir with Nx and Bumblebee
- The EMLX fork with quantization ops (see Installation below)
- Python with mlx_lm, if you want to train adapters

Installation

def deps do
  [
    {:bumblebee_quantized, "~> 0.1.0"},
    # REQUIRED: EMLX with quantization ops (not on Hex yet)
    {:emlx, github: "notactuallytreyanastasio/emlx", branch: "feat/quantization-ops"}
  ]
end

Note: The EMLX quantization ops are pending upstream merge (PR #95). Once merged, you'll only need {:bumblebee_quantized, "~> 0.1.0"}.

Quick Start

# Load a quantized model
{:ok, model} = BumblebeeQuantized.load_model(
  "/path/to/Qwen3-8B-MLX-4bit"
)

# Load a LoRA adapter (optional)
{:ok, adapter} = BumblebeeQuantized.load_adapter("/path/to/adapter")

# Load tokenizer via Bumblebee
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-8B"})

# Create a serving and generate text
serving = BumblebeeQuantized.Serving.new(model, tokenizer,
  adapter: adapter,
  max_new_tokens: 100,
  temperature: 0.8
)

Nx.Serving.run(serving, "Write a post about Elixir")
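
In a long-running application, the same serving can be started under your supervision tree and shared across processes. This is standard Nx.Serving usage (the MyApp.TextServing name below is a placeholder):

# In your application's supervision tree
children = [
  {Nx.Serving, serving: serving, name: MyApp.TextServing}
]

# From any process in the app:
Nx.Serving.batched_run(MyApp.TextServing, "Write a post about Elixir")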

Full Training Workflow

# 1. Prepare training data
posts = ["First post...", "Second post..."]  # one string per training example

BumblebeeQuantized.Training.prepare_data(posts, "/path/to/data",
  prompt: "Write a post in my style",
  min_length: 160
)

# 2. Train adapter (calls Python mlx_lm)
{:ok, adapter_path} = BumblebeeQuantized.Training.train(
  base_model: "lmstudio-community/Qwen3-8B-MLX-4bit",
  training_data: "/path/to/data",
  output_path: "/path/to/adapter",
  iterations: 25_000,
  rank: 8,
  scale: 20.0
)

# 3. Load and use
{:ok, model} = BumblebeeQuantized.load_model("/path/to/Qwen3-8B-MLX-4bit")
{:ok, adapter} = BumblebeeQuantized.load_adapter(adapter_path)
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-8B"})

serving = BumblebeeQuantized.Serving.new(model, tokenizer, adapter: adapter)
Nx.Serving.run(serving, "Write a post")

Performance

Tested on Apple Silicon:

| Metric | Value |
| --- | --- |
| Model | Qwen3-8B-4bit |
| Memory Usage | ~5GB |
| Model Load Time | 4-6 seconds |
| Single Token Latency | ~7ms (135 tok/s) |
| Generation Throughput | ~21 tok/s |
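
These figures will vary with hardware and prompt length. A quick way to sanity-check throughput locally, reusing the serving from Quick Start (which generates up to max_new_tokens: 100):

{micros, _result} =
  :timer.tc(fn ->
    Nx.Serving.run(serving, "Write a post about Elixir")
  end)

IO.puts("Elapsed: #{micros / 1_000_000} s")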

Modules

| Module | Description |
| --- | --- |
| BumblebeeQuantized.Loader | Load quantized models from safetensors |
| BumblebeeQuantized.Adapters | Load, apply, and train LoRA adapters |
| BumblebeeQuantized.Serving | Nx.Serving for text generation |
| BumblebeeQuantized.Training | LoRA training workflow |
| BumblebeeQuantized.Models.Qwen3 | Qwen3 quantized model definition |

Supported Models

Currently supported:

- Qwen3 (4-bit MLX quantization, e.g. Qwen3-8B)

Planned:

How It Works

  1. Quantized Weights: Models are stored in MLX 4-bit format with weight triplets (packed uint32, scales, biases); see the dequantization sketch after this list

  2. EMLX Backend: Uses our EMLX fork with a quantized_matmul NIF

  3. Runtime LoRA: Adapters are applied at inference time: output = base_output + scale * (x @ A @ B); sketched after this list

  4. Bumblebee Tokenizer: Uses Bumblebee's tokenizer for text encoding/decoding
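
For illustration, here is a minimal sketch of step 1's dequantization in plain Nx. The low-nibble-first packing order and the group size of 64 are assumptions about the MLX layout rather than guarantees, and in practice this work happens inside EMLX's quantized_matmul NIF, not in Elixir:

defmodule DequantSketch do
  # packed: {rows, cols / 8} uint32 (8 4-bit values per word)
  # scales/biases: {rows, cols / group_size}, one pair per quantization group
  # group_size of 64 is an assumption, not a confirmed constant
  def dequantize(packed, scales, biases, group_size \\ 64) do
    {rows, packed_cols} = Nx.shape(packed)
    cols = packed_cols * 8
    groups = div(cols, group_size)

    # Pull the 8 nibbles out of each uint32 (assumed low nibble first).
    shifts = Nx.multiply(Nx.iota({1, 1, 8}), 4)

    packed
    |> Nx.reshape({rows, packed_cols, 1})
    |> Nx.right_shift(shifts)
    |> Nx.bitwise_and(0xF)
    |> Nx.reshape({rows, groups, group_size})
    # Affine dequantization per group: w = q * scale + bias
    |> Nx.multiply(Nx.reshape(scales, {rows, groups, 1}))
    |> Nx.add(Nx.reshape(biases, {rows, groups, 1}))
    |> Nx.reshape({rows, cols})
  end
end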
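
Step 3's runtime LoRA update is also easy to picture in Nx. A minimal sketch, assuming the adapter carries A ({in_features, rank}), B ({rank, out_features}), and a scale (the field names here are hypothetical, not the actual adapter struct):

# output = base_output + scale * (x @ A @ B)
def apply_lora(base_output, x, %{a: a, b: b, scale: scale}) do
  delta = x |> Nx.dot(a) |> Nx.dot(b)
  Nx.add(base_output, Nx.multiply(scale, delta))
end

Because the rank is small (8 in the training example above), the two extra matmuls are cheap relative to the frozen base layer.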

Related Projects

- Bumblebee - pre-trained models and tokenizers for Elixir (provides the tokenizer used here)
- EMLX - MLX backend for Nx (this project depends on a fork with quantization ops)
- mlx_lm - the Python package the training workflow shells out to

License

MIT