# BumblebeeQuantized

4-bit quantized LLM inference with LoRA adapters for Apple Silicon.
Run 8B-parameter models in ~5 GB of RAM, with LoRA fine-tuning support.
## Features
- **4-bit Quantized Inference** - Run quantized models using MLX's fused Metal kernels
- **Runtime LoRA Adapters** - Load and apply fine-tuned adapters at inference time
- **Training Integration** - Train your own LoRA adapters via `mlx_lm`
- **Apple Silicon Optimized** - Uses unified memory for zero-copy GPU access
## Requirements

- Apple Silicon Mac (M1/M2/M3/M4)
- Elixir 1.15+
- Python 3.10+ with `mlx_lm` (for training only)
## Installation
```elixir
def deps do
  [
    {:bumblebee_quantized, "~> 0.1.0"},
    # REQUIRED: EMLX with quantization ops (not on Hex yet)
    {:emlx, github: "notactuallytreyanastasio/emlx", branch: "feat/quantization-ops"}
  ]
end
```

Note: The EMLX quantization ops are pending upstream merge (PR #95). Once merged, you'll only need `{:bumblebee_quantized, "~> 0.1.0"}`.
## Quick Start
```elixir
# Load a quantized model
{:ok, model} = BumblebeeQuantized.load_model(
  "/path/to/Qwen3-8B-MLX-4bit"
)

# Load a LoRA adapter (optional)
{:ok, adapter} = BumblebeeQuantized.load_adapter("/path/to/adapter")

# Load tokenizer via Bumblebee
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-8B"})

# Create a serving and generate text
serving = BumblebeeQuantized.Serving.new(model, tokenizer,
  adapter: adapter,
  max_new_tokens: 100,
  temperature: 0.8
)

Nx.Serving.run(serving, "Write a post about Elixir")
```

## Full Training Workflow
```elixir
# 1. Prepare training data
posts = ["First post...", "Second post...", ...]

BumblebeeQuantized.Training.prepare_data(posts, "/path/to/data",
  prompt: "Write a post in my style",
  min_length: 160
)

# 2. Train adapter (calls Python mlx_lm)
{:ok, adapter_path} = BumblebeeQuantized.Training.train(
  base_model: "lmstudio-community/Qwen3-8B-MLX-4bit",
  training_data: "/path/to/data",
  output_path: "/path/to/adapter",
  iterations: 25_000,
  rank: 8,
  scale: 20.0
)

# 3. Load and use
{:ok, model} = BumblebeeQuantized.load_model("/path/to/Qwen3-8B-4bit")
{:ok, adapter} = BumblebeeQuantized.load_adapter(adapter_path)
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-8B"})

serving = BumblebeeQuantized.Serving.new(model, tokenizer, adapter: adapter)
Nx.Serving.run(serving, "Write a post")
```

## Performance
Tested on Apple Silicon:
| Metric | Value |
|---|---|
| Model | Qwen3-8B-4bit |
| Memory Usage | ~5GB |
| Model Load Time | 4-6 seconds |
| Single Token Latency | ~7ms (135 tok/s) |
| Generation Throughput | ~21 tok/s |
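The ~5 GB figure is roughly what 4-bit storage predicts. A back-of-the-envelope check (the group size of 64 and fp16 scale/bias storage are assumptions based on MLX's quantization defaults, not measured from this library):

```python
# Rough memory estimate for an 8B-parameter model in MLX 4-bit format.
# Assumes a quantization group size of 64 with an fp16 scale and bias per
# group (MLX defaults); activations and KV cache account for the remainder.
params = 8e9
packed_weights = params * 4 / 8              # 4-bit weights -> 4.0 GB
scales_and_biases = params / 64 * 2 * 2      # fp16 scale + fp16 bias per group
total_gb = (packed_weights + scales_and_biases) / 1e9
print(round(total_gb, 2))  # -> 4.5
```

That leaves roughly half a gigabyte for activations, KV cache, and runtime overhead at the observed ~5 GB total.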
## Modules

| Module | Description |
|---|---|
| `BumblebeeQuantized.Loader` | Load quantized models from safetensors |
| `BumblebeeQuantized.Adapters` | Load, apply, and train LoRA adapters |
| `BumblebeeQuantized.Serving` | `Nx.Serving` for text generation |
| `BumblebeeQuantized.Training` | LoRA training workflow |
| `BumblebeeQuantized.Models.Qwen3` | Qwen3 quantized model definition |
## Supported Models
Currently supported:
- Qwen3 (8B, other sizes should work)
Planned:
- LLaMA 2/3
- Mistral
## How It Works

- **Quantized Weights**: Models are stored in MLX 4-bit format as weight triplets (packed uint32 values, scales, biases)
- **EMLX Backend**: Uses our EMLX fork with a `quantized_matmul` NIF
- **Runtime LoRA**: Adapters are applied at inference time: `output = base_output + scale * (x @ A @ B)`
- **Bumblebee Tokenizer**: Uses Bumblebee's tokenizer for text encoding/decoding
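The packing and the runtime LoRA rule are plain arithmetic, so a NumPy sketch makes them concrete. All shapes here are hypothetical, the LSB-first nibble order is an assumption about the packing, and a random dense matrix stands in for the quantized base projection:

```python
import numpy as np

# Unpacking one uint32 into eight 4-bit codes (nibble order assumed LSB-first);
# each code q is then dequantized per group as w = scale * q + bias.
packed = np.uint32(0x87654321)
codes = [(int(packed) >> (4 * i)) & 0xF for i in range(8)]
# codes == [1, 2, 3, 4, 5, 6, 7, 8]

# Runtime LoRA: output = base_output + scale * (x @ A @ B)
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 64))        # input activations
W = rng.standard_normal((64, 128))      # stand-in for the quantized base weight
A = rng.standard_normal((64, 8))        # LoRA down-projection (rank 8)
B = rng.standard_normal((8, 128))       # LoRA up-projection
scale = 20.0

base_output = x @ W                     # quantized_matmul in the real backend
output = base_output + scale * (x @ A @ B)

# The adapter only perturbs a rank-8 subspace of the 128-dim output.
assert np.linalg.matrix_rank(A @ B) <= 8
```

Because the rank is small, the two extra matmuls add little cost on top of the quantized base projection, which is why adapters can be applied at inference time rather than merged into the weights.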
## Related Projects
- bobby_posts - The project that spawned this library
- EMLX Fork - EMLX with quantization ops
- safetensors_ex - Safetensors parser for Elixir
## License
MIT