Magika

An Elixir port of Google's Magika, a deep-learning file content type detector.

Magika identifies the content type of a file (e.g. html, python, pdf, zip, png) from its bytes using a small, fast ONNX model. This library is a faithful port of the reference Python implementation, targeting its standard_v3_3 model: it runs the same ONNX model (vendored under priv/) via OnnxRuntime and reproduces the same feature extraction, confidence thresholds, and label-resolution logic. On the upstream test corpus its output matches the Python tool exactly (77/77 files).

How it works

  1. Corner cases first. Empty inputs are labeled empty; very small or whitespace-only inputs are labeled txt or unknown based on a UTF-8 validity check, without invoking the model. Directories and symlinks get dedicated labels.
  2. Feature extraction. For larger inputs, Magika reads up to block_size (4096) bytes from the start and from the end, strips ASCII whitespace, and takes beg_size (1024) bytes from the front and end_size (1024) bytes from the back, filling any shortfall with a dedicated padding token (256). This yields a vector of 2048 integers.
  3. Inference. The vector is fed to the ONNX model (int32[batch, 2048] → float32[batch, 214], a softmax over 214 content types). The argmax is the raw "deep-learning" label and its probability is the score.
  4. Label resolution. An overwrite map and per-content-type confidence thresholds turn the raw label into the final output. Low-confidence predictions are generalized to txt (text) or unknown (binary).
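
Steps 2 and 3 can be sketched in plain Elixir. This is a minimal illustration, not the library's actual code: the module name PipelineSketch and its functions are hypothetical. It also assumes that leading whitespace is stripped from the start block and trailing whitespace from the end block, and that padding goes after the front bytes and before the back bytes; the description above does not spell out these details.

```elixir
defmodule PipelineSketch do
  # Sizes and padding token as described above.
  @block_size 4096
  @beg_size 1024
  @end_size 1024
  @padding_token 256
  # ASCII whitespace bytes: tab, LF, VT, FF, CR, space.
  @whitespace [9, 10, 11, 12, 13, 32]

  # Turns raw bytes into a list of @beg_size + @end_size integers.
  def extract(content) when is_binary(content) do
    size = byte_size(content)
    block = min(@block_size, size)

    # Front: first block, leading whitespace dropped (assumption), first 1024 bytes.
    beg =
      content
      |> binary_part(0, block)
      |> :binary.bin_to_list()
      |> Enum.drop_while(&(&1 in @whitespace))
      |> Enum.take(@beg_size)

    # Back: last block, trailing whitespace dropped (assumption), last 1024 bytes.
    tail =
      content
      |> binary_part(size - block, block)
      |> :binary.bin_to_list()
      |> Enum.reverse()
      |> Enum.drop_while(&(&1 in @whitespace))
      |> Enum.take(@end_size)
      |> Enum.reverse()

    pad_right(beg, @beg_size) ++ pad_left(tail, @end_size)
  end

  # Picks the most probable class from one row of model output.
  # Returns {index, score}; mapping the index to a label is left out here.
  def argmax(probs) when is_list(probs) do
    {score, index} =
      probs
      |> Enum.with_index()
      |> Enum.max_by(fn {p, _i} -> p end)

    {index, score}
  end

  defp pad_right(list, n), do: list ++ List.duplicate(@padding_token, n - length(list))
  defp pad_left(list, n), do: List.duplicate(@padding_token, n - length(list)) ++ list
end
```

Because bytes only take values 0 to 255, the padding token 256 can never be confused with real content.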

Installation

Add magika to your dependencies in mix.exs:

    def deps do
      [
        {:magika, "~> 0.1.0"}
      ]
    end

The ONNX runtime is provided by the onnxruntime package (Elixir bindings for Microsoft ONNX Runtime), which fetches precompiled native binaries for common platforms, so installation requires no manual toolchain setup.

Usage

The model is loaded once and hosted by a supervised Magika.Server that starts automatically with the :magika application, so you call the API directly without creating or passing around an instance:

    # Identify raw bytes:
    {:ok, result} = Magika.identify("<!DOCTYPE html>\n<html>...</html>")
    result.prediction.output.label      #=> "html"
    result.prediction.output.mime_type  #=> "text/html"
    result.prediction.output.group      #=> "code"
    result.prediction.score             #=> 0.86...

    # Identify a file on disk:
    {:ok, result} = Magika.identify_path("/path/to/document.pdf")
    result.prediction.output.label      #=> "pdf"

    # Missing/unreadable files return an {:error, result} with a status:
    {:error, result} = Magika.identify_path("/nope")
    result.status                       #=> :file_not_found

    # Identify from an open binary device:
    {:ok, device} = File.open("photo.png", [:read, :binary])
    {:ok, result} = Magika.identify_stream(device)
    File.close(device)
    result.prediction.output.label      #=> "png"

Inference runs in the calling process: the server owns the instance's lifecycle and publishes it via :persistent_term, so concurrent calls don't serialize through a single mailbox and the configuration isn't copied per call.
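
The pattern described above can be illustrated with a small, self-contained sketch. It is not Magika.Server itself: the module InstanceHostSketch, its key, and the map standing in for the loaded instance are all hypothetical. It only shows how a GenServer can own a value's lifecycle while readers fetch it from :persistent_term in their own process.

```elixir
defmodule InstanceHostSketch do
  use GenServer

  # Hypothetical key under which the instance is published.
  @key {__MODULE__, :instance}

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  # Runs entirely in the caller's process: no message is sent to the server,
  # and :persistent_term.get/1 returns the term without copying it.
  def instance, do: :persistent_term.get(@key)

  @impl true
  def init(opts) do
    # Stand-in for the expensive one-time load (model session plus config).
    instance = %{prediction_mode: Keyword.get(opts, :prediction_mode, :high_confidence)}
    :persistent_term.put(@key, instance)
    {:ok, instance}
  end
end
```

Writing to :persistent_term is expensive, so this approach suits values that are written once at startup and read many times afterwards.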

Prediction mode

The prediction mode controls how strict Magika is before trusting the model's guess. The hosted server uses :high_confidence by default; change it in your application config:

    # config/config.exs
    config :magika, prediction_mode: :best_guess

When the score is too low for the chosen mode, the output is generalized to txt (for textual content types) or unknown (for binary ones), and result.prediction.overwrite_reason is set to :low_confidence.
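
As a rough illustration of how a prediction mode, an overwrite map, and per-type thresholds might interact, here is a hedged sketch. The module ResolveSketch, the threshold values, the overwrite entry, and the list of text labels are all invented for the example; the real values come from the model's configuration files. The order of the checks is also an assumption.

```elixir
defmodule ResolveSketch do
  # Hypothetical values, for illustration only.
  @thresholds %{"python" => 0.9}
  @default_threshold 0.5
  @overwrite_map %{"randomtxt" => "txt"}
  @text_labels ["python", "html", "txt"]

  # Returns {output_label, overwrite_reason} for a raw model label and score.
  def resolve(dl_label, score, mode) do
    label = Map.get(@overwrite_map, dl_label, dl_label)
    reason = if label == dl_label, do: :none, else: :overwrite_map

    if trusted?(mode, dl_label, score) do
      {label, reason}
    else
      # Too uncertain: generalize to txt (textual types) or unknown (binary types).
      fallback = if label in @text_labels, do: "txt", else: "unknown"
      {fallback, :low_confidence}
    end
  end

  # :best_guess always trusts the model's top label.
  defp trusted?(:best_guess, _label, _score), do: true

  # :high_confidence requires the score to reach the per-type threshold.
  defp trusted?(:high_confidence, label, score) do
    score >= Map.get(@thresholds, label, @default_threshold)
  end
end
```

With these made-up numbers, ResolveSketch.resolve("python", 0.6, :high_confidence) falls back to txt, while the same call with :best_guess keeps python.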

Standalone instances (advanced)

You normally don't need this. For one-off scripts or tests you can build an instance directly with Magika.new/1 and pass it as the first argument, bypassing the supervised server:

    magika = Magika.new(prediction_mode: :best_guess)
    {:ok, result} = Magika.identify(magika, "<!DOCTYPE html>...")

Result shape

Magika.identify*/2 returns {:ok, %Magika.Result{}} or, for filesystem errors, {:error, %Magika.Result{}}:

    %Magika.Result{
      status: :ok,                          # or :file_not_found | :permission_error
      path: "/path/to/file" | nil,
      prediction: %Magika.Prediction{
        output: %Magika.ContentTypeInfo{    # what Magika reports to you
          label: "python",
          mime_type: "text/x-python",
          group: "code",
          description: "Python source",
          extensions: ["py", ...],
          is_text: true
        },
        dl: %Magika.ContentTypeInfo{...},   # raw model prediction (or `undefined`)
        score: 0.9998,                      # model confidence for `dl`
        overwrite_reason: :none             # :none | :low_confidence | :overwrite_map
      }
    }
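
One way to consume this shape is to pattern-match on the tagged tuple. The helper below is a hypothetical sketch (ResultSketch.summarize/1 is not part of the library). It matches on plain map keys, so it works with the structs shown above as well as with the bare maps used to exercise it here.

```elixir
defmodule ResultSketch do
  # Success: report label, MIME type and score, noting any overwrite.
  def summarize({:ok, %{prediction: %{output: output, score: score, overwrite_reason: reason}}}) do
    base = "#{output.label} (#{output.mime_type}), score #{Float.round(score, 2)}"
    if reason == :none, do: base, else: base <> " [#{reason}]"
  end

  # Filesystem error: only status and path are inspected.
  def summarize({:error, %{status: status, path: path}}) do
    "could not read #{inspect(path)}: #{status}"
  end
end
```

In an application this would typically be called as ResultSketch.summarize(Magika.identify_path(path)).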

Model

The vendored model is Magika's standard_v3_3 (Apache-2.0), copied verbatim from the upstream repository along with its config.min.json and content_types_kb.min.json.

License

Apache-2.0, matching upstream Magika. The bundled model and configuration files are © Google LLC.