Magika
An Elixir binding of Google's Magika — deep-learning file content type detection.
Magika identifies the content type of a file (e.g. html, python, pdf,
zip, png) from its bytes using a small, fast ONNX model. This library is a
faithful port of the reference Python implementation's standard_v3_3 model:
it runs the same ONNX model (vendored under priv/) via
OnnxRuntime and reproduces the same
feature extraction, confidence thresholds, and label-resolution logic. Its
output matches the Python tool exactly on the upstream test corpus (77/77
files).
How it works
- Corner cases first. Empty inputs → empty; very small or whitespace-only inputs are classified as txt/unknown by a UTF-8 check, without invoking the model. Directories and symlinks get dedicated labels.
- Feature extraction. For larger inputs, Magika reads up to block_size (4096) bytes from the start and end, strips ASCII whitespace, and takes beg_size (1024) bytes from the front and end_size (1024) from the back, padding with a dedicated padding token (256). This yields a 2048-int vector.
- Inference. The vector is fed to the ONNX model (int32[batch, 2048] → float32[batch, 214], a softmax over 214 content types). The argmax is the raw "deep-learning" label and its probability is the score.
- Label resolution. An overwrite map and per-content-type confidence thresholds turn the raw label into the final output. Low-confidence predictions are generalized to txt (text) or unknown (binary).
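The feature-extraction step can be sketched roughly as follows. This is an illustration, not the library's code: the module and function names are hypothetical, the sizes are the ones quoted above, and it assumes that the front block is stripped of leading whitespace and padded on the right while the back block is stripped of trailing whitespace and padded on the left.

```elixir
defmodule FeatureSketch do
  # Hypothetical module; constants are the values quoted in the text above.
  @block_size 4096
  @beg_size 1024
  @end_size 1024
  @padding_token 256
  # ASCII whitespace: tab, newline, vertical tab, form feed, carriage return, space.
  @whitespace [9, 10, 11, 12, 13, 32]

  # Returns a list of @beg_size + @end_size integers (bytes 0..255 or the padding token).
  def extract(content) when is_binary(content) do
    size = byte_size(content)
    block = min(@block_size, size)

    beg_ints =
      content
      |> binary_part(0, block)
      |> strip_leading()
      |> :binary.bin_to_list()
      |> Enum.take(@beg_size)
      |> pad_trailing(@beg_size)

    end_ints =
      content
      |> binary_part(size - block, block)
      |> strip_trailing()
      |> :binary.bin_to_list()
      |> Enum.take(-@end_size)
      |> pad_leading(@end_size)

    beg_ints ++ end_ints
  end

  # Drop whitespace bytes from the front of a binary.
  defp strip_leading(<<c, rest::binary>>) when c in @whitespace, do: strip_leading(rest)
  defp strip_leading(bin), do: bin

  # Drop whitespace bytes from the back of a binary.
  defp strip_trailing(<<>>), do: <<>>

  defp strip_trailing(bin) do
    if :binary.last(bin) in @whitespace do
      strip_trailing(binary_part(bin, 0, byte_size(bin) - 1))
    else
      bin
    end
  end

  # Assumption: the front features are padded on the right, the back features on the left.
  defp pad_trailing(ints, n), do: ints ++ List.duplicate(@padding_token, n - length(ints))
  defp pad_leading(ints, n), do: List.duplicate(@padding_token, n - length(ints)) ++ ints
end
```

Calling FeatureSketch.extract/1 on any binary returns a 2048-element list, which is the shape the model input described above expects for a batch of one.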
Installation
Add magika to your dependencies in mix.exs:
    def deps do
      [
        {:magika, "~> 0.1.0"}
      ]
    end
The ONNX runtime is provided by the
onnxruntime package (Elixir bindings
for Microsoft ONNX Runtime), which fetches precompiled native binaries for
common platforms, so no manual toolchain setup is required for installation.
Usage
The model is loaded once and hosted by a supervised Magika.Server that starts
automatically with the :magika application — so you call the API directly,
without managing or passing around an instance:
# Identify raw bytes:
{:ok, result} = Magika.identify("<!DOCTYPE html>\n<html>...</html>")
result.prediction.output.label #=> "html"
result.prediction.output.mime_type #=> "text/html"
result.prediction.output.group #=> "code"
result.prediction.score #=> 0.86...
# Identify a file on disk:
{:ok, result} = Magika.identify_path("/path/to/document.pdf")
result.prediction.output.label #=> "pdf"
# Missing/unreadable files return an {:error, result} with a status:
{:error, result} = Magika.identify_path("/nope")
result.status #=> :file_not_found
# Identify from an open binary device:
{:ok, device} = File.open("photo.png", [:read, :binary])
{:ok, result} = Magika.identify_stream(device)
File.close(device)
result.prediction.output.label #=> "png"
Inference runs in the calling process: the server owns the instance's
lifecycle and publishes it via :persistent_term, so concurrent calls don't
serialize through a single mailbox and the configuration isn't copied per call.
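The publish-once, read-anywhere pattern described here can be sketched as below. The module name, the key, and the stand-in instance map are hypothetical and are not taken from the library's source; only GenServer and :persistent_term are real APIs.

```elixir
defmodule PublishSketch.Server do
  use GenServer

  # Hypothetical key under which the instance is published.
  @key {__MODULE__, :instance}

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # Runs in the caller's process: a :persistent_term read, with no message
  # sent to the server and no copy of the stored term onto the caller's heap.
  def instance, do: :persistent_term.get(@key)

  @impl true
  def init(opts) do
    # Stand-in for loading the model; a real instance would hold the ONNX session.
    instance = %{prediction_mode: Keyword.get(opts, :prediction_mode, :high_confidence)}
    :persistent_term.put(@key, instance)
    {:ok, instance}
  end
end
```

Because readers never message the server, any number of processes can fetch the instance concurrently. The trade-off is that updating a persistent term is expensive, so this pattern suits values that are written once at startup.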
Prediction mode
The prediction mode controls how strict Magika is before trusting the model's
guess. The hosted server uses :high_confidence by default; change it in your
application config:
# config/config.exs
config :magika, prediction_mode: :best_guess
- :high_confidence (default): keep the model prediction only when its score clears the threshold for that content type (or the generic medium threshold when the type has no dedicated one).
- :medium_confidence: keep it when the score clears the generic medium threshold.
- :best_guess: always return the raw model prediction.
When the score is too low for the chosen mode, the output is generalized to
txt (for textual content types) or unknown (for binary ones), and
result.prediction.overwrite_reason is set to :low_confidence.
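As a rough illustration of how the three modes differ, the decision can be sketched like this. The module, the threshold values, and the use of >= for "clears" are assumptions made for the example; the real thresholds come from the model's configuration file, and the overwrite map is left out.

```elixir
defmodule ResolutionSketch do
  # Hypothetical values, for illustration only.
  @medium_threshold 0.5
  @thresholds %{"markdown" => 0.75}

  # Index and value of the highest score in the model's output row.
  def argmax(scores) do
    scores |> Enum.with_index() |> Enum.max_by(fn {score, _index} -> score end)
  end

  # Whether the raw model prediction is trusted in the given mode.
  def keep?(:best_guess, _label, _score), do: true
  def keep?(:medium_confidence, _label, score), do: score >= @medium_threshold

  def keep?(:high_confidence, label, score) do
    # Assumption: types without a dedicated threshold use the medium one.
    score >= Map.get(@thresholds, label, @medium_threshold)
  end

  # Generalize to "txt" or "unknown" when the score is too low for the mode.
  def resolve(mode, %{label: label, is_text: is_text}, score) do
    if keep?(mode, label, score) do
      %{label: label, overwrite_reason: :none}
    else
      fallback = if is_text, do: "txt", else: "unknown"
      %{label: fallback, overwrite_reason: :low_confidence}
    end
  end
end
```

With these made-up thresholds, a "markdown" prediction scored 0.6 is kept in :medium_confidence and :best_guess but generalized to "txt" in :high_confidence.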
Standalone instances (advanced)
You normally don't need this. For one-off scripts or tests you can build an
instance directly with Magika.new/1 and pass it as the first argument,
bypassing the supervised server:
magika = Magika.new(prediction_mode: :best_guess)
{:ok, result} = Magika.identify(magika, "<!DOCTYPE html>...")
Result shape
Magika.identify*/2 returns {:ok, %Magika.Result{}} or, for filesystem
errors, {:error, %Magika.Result{}}:
    %Magika.Result{
      status: :ok,                            # or :file_not_found | :permission_error
      path: "/path/to/file" | nil,
      prediction: %Magika.Prediction{
        output: %Magika.ContentTypeInfo{      # what Magika reports to you
          label: "python",
          mime_type: "text/x-python",
          group: "code",
          description: "Python source",
          extensions: ["py", ...],
          is_text: true
        },
        dl: %Magika.ContentTypeInfo{...},     # raw model prediction (or `undefined`)
        score: 0.9998,                        # model confidence for `dl`
        overwrite_reason: :none               # :none | :low_confidence | :overwrite_map
      }
    }
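One way to consume this shape is to pattern-match on the tagged tuple. The helper below is a hypothetical sketch, not part of the library; it matches on map keys only, so it accepts the structs shown above as well as plain maps with the same fields.

```elixir
defmodule ResultSketch do
  # Hypothetical helper: turns an identify* return value into a short message.
  def describe({:ok, %{prediction: %{output: %{label: label, mime_type: mime}}}}) do
    "#{label} (#{mime})"
  end

  # Error results carry a status atom and the offending path.
  def describe({:error, %{status: status, path: path}}) do
    "could not read #{inspect(path)}: #{status}"
  end
end
```

For instance, the return value of Magika.identify_path/1 could be passed straight to ResultSketch.describe/1 to get one line per file in a report.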
Model
The vendored model is Magika's standard_v3_3 (Apache-2.0), copied verbatim
from the upstream repository along with its
config.min.json and content_types_kb.min.json.
License
Apache-2.0, matching upstream Magika. The bundled model and configuration files are © Google LLC.