Image.OCR (image_ocr)

Idiomatic Elixir interface to the Tesseract OCR engine. Implemented as a NIF over the Tesseract 5.x C++ API; accepts Vix.Vips.Image structs, file paths, or in-memory image binaries and returns recognised text.

Requirements

Installation

def deps do
  [
    {:image_ocr, "~> 0.1.0"}
  ]
end

Build the NIF on first compile:

mix deps.get
mix compile

image_ocr ships the English (eng) tessdata_fast model in priv/tessdata/ so the package is usable out of the box.

Quick start

{:ok, ocr}  = Image.OCR.new()                       # defaults to locale: "en"
{:ok, text} = Image.OCR.read_text(ocr, "page.png")

read_text/3 accepts a Vix.Vips.Image struct, a file path, or an in-memory image binary.

For per-word output with confidence and bounding boxes, use Image.OCR.recognize/3:

{:ok, words} = Image.OCR.recognize(ocr, image)
# => [%{text: "Hello", confidence: 96.4, bbox: {32, 18, 198, 64}}, …]
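
Because the result is a plain list of maps, ordinary Enum functions are enough for post-processing. The sketch below keeps only words above a confidence threshold and joins them back into a string. WordFilter and the 80.0 threshold are hypothetical and not part of image_ocr, and the sketch assumes confidence is on a 0–100 scale, as the sample output above suggests.

```elixir
defmodule WordFilter do
  # Hypothetical helper for illustration; not part of image_ocr.

  # Keeps the words whose confidence is at least `min_confidence`.
  # Assumes each word is a map with :text and :confidence keys.
  def confident(words, min_confidence \\ 80.0) do
    Enum.filter(words, fn %{confidence: c} -> c >= min_confidence end)
  end

  # Joins the text of the words with single spaces, in list order.
  def to_text(words) do
    Enum.map_join(words, " ", & &1.text)
  end
end

# Sample data in the shape shown above (values are made up).
words = [
  %{text: "Hello", confidence: 96.4, bbox: {32, 18, 198, 64}},
  %{text: "w0rld", confidence: 41.2, bbox: {210, 18, 380, 64}}
]

words
|> WordFilter.confident()
|> WordFilter.to_text()
|> IO.puts()
```

With the sample data only "Hello" clears the default threshold, so that is the single word printed.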

Locales

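The :locale option (and the mix-task language arguments) accepts ISO 639-1 codes such as "en", "fr", or "de" and BCP-47 tags such as "zh-Hans", "zh-Hant", or "sr-Latn".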

"zh" on its own is rejected as ambiguous — use "zh-Hans" or "zh-Hant". See Image.OCR.Languages for the full mapping table.

To enable BCP-47 parsing add Localize to your project:

def deps do
  [
    {:image_ocr, "~> 0.1.0"},
    {:localize, "~> 0.25"}
  ]
end

Concurrency

A single Image.OCR instance wraps one tesseract::TessBaseAPI, which is not safe for concurrent use. The NIF guards each instance with a mutex, so accidental sharing degrades to serialisation rather than undefined behaviour, but real parallelism needs one instance per worker. The simplest way to get that is the included pool:

children = [
  {Image.OCR.Pool, name: MyOcr, locale: "en", pool_size: 4}
]

Supervisor.start_link(children, strategy: :one_for_one)

{:ok, text} = Image.OCR.Pool.read_text(MyOcr, "page.png")

pool_size defaults to System.schedulers_online(). Each worker holds the loaded language model in memory (typically 2–50 MB depending on the language and trained-data variant), so size the pool deliberately if you load multiple languages or run on small hosts.

Recognition runs on dirty CPU schedulers, so it does not block the normal schedulers regardless of pool size.
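
One way to fan a batch of files out over the pool is Task.async_stream/3. The sketch below is a minimal illustration, not something the library ships: BatchOcr is a hypothetical module, and it takes the reader as a one-argument function so it can be tried with a stub. Keeping max_concurrency at or below pool_size is assumed to be a sensible starting point.

```elixir
defmodule BatchOcr do
  # Hypothetical helper for illustration; not part of image_ocr.

  # Runs `read_fun` over `paths` concurrently and returns
  # [{path, result}] in the same order as `paths`.
  # `read_fun` is any 1-arity function, for example
  #   &Image.OCR.Pool.read_text(MyOcr, &1)
  def run(paths, read_fun, max_concurrency \\ System.schedulers_online()) do
    paths
    # timeout: :infinity because a large page may take longer than
    # the 5-second default of Task.async_stream/3.
    |> Task.async_stream(read_fun, max_concurrency: max_concurrency, timeout: :infinity)
    |> Enum.zip(paths)
    |> Enum.map(fn {{:ok, result}, path} -> {path, result} end)
  end
end

# Stub reader so the sketch runs without Tesseract installed.
stub = fn path -> {:ok, String.upcase(path)} end

BatchOcr.run(["a.png", "b.png"], stub, 2)
|> IO.inspect()
```

Against a real pool, the stub would be replaced by &Image.OCR.Pool.read_text(MyOcr, &1).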

Trained-data (tessdata)

The trained-data directory is resolved in this order:

  1. The :datapath option passed to Image.OCR.new/1.
  2. Application.get_env(:image_ocr, :tessdata_path).
  3. The TESSDATA_PREFIX environment variable.
  4. The vendored fallback at priv/tessdata/.
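
The lookup order above can be sketched as a chain of fallbacks. This is illustrative only and the library's real implementation may differ in detail. TessdataPath is a hypothetical module, and the vendored directory is passed in as a plain relative path rather than resolved from the package's priv directory.

```elixir
defmodule TessdataPath do
  # Hypothetical helper for illustration; not part of image_ocr.

  # Returns the first configured location, in the documented order:
  # option, application env, environment variable, vendored fallback.
  def resolve(opts \\ [], vendored \\ Path.join("priv", "tessdata")) do
    Keyword.get(opts, :datapath) ||
      Application.get_env(:image_ocr, :tessdata_path) ||
      System.get_env("TESSDATA_PREFIX") ||
      vendored
  end
end

# An explicit option wins over every other source.
IO.inspect(TessdataPath.resolve(datapath: "/opt/tessdata"))
```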

Configure a project-wide location once:

# config/config.exs
config :image_ocr, tessdata_path: "/var/lib/image_ocr/tessdata"

Mix tasks

Manage trained-data files without leaving your project:

# Install one or more languages (ISO 639-1 codes)
mix image.ocr.tessdata.add fr de

# BCP-47 for region/script-specific variants
mix image.ocr.tessdata.add zh-Hans zh-Hant sr-Latn

# Pick a variant: fast (default, ~2-4 MB), best (~10-15 MB), legacy (largest)
mix image.ocr.tessdata.add en --variant best

# Write to a specific directory (overrides config and TESSDATA_PREFIX)
mix image.ocr.tessdata.add ja --path /var/lib/tessdata

# Refresh every installed language to its latest upstream commit
mix image.ocr.tessdata.update

# Show what's installed
mix image.ocr.tessdata.list

# Remove a language
mix image.ocr.tessdata.remove de

The tasks read from and write to the same path that Image.OCR.new/1 does, so there is one source of truth.

Tesseract 4.x vs 5.x

image_ocr requires Tesseract 5.x (currently 5.5+) and refuses to build against older versions. 5.x is actively maintained, ships in current LTS distros, and runs noticeably faster than 4.x on modern CPUs thanks to better SIMD use and float32 models. The C++ API surface we use is identical between 4.x and 5.x, so 4.1+ would likely work — but we keep the support matrix tight.
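
To check an installed Tesseract before building, one option is to parse the output of tesseract --version. The sketch below is a hypothetical preflight helper, not the check the build itself performs. It assumes the first line looks like "tesseract 5.5.0" and treats 5.5.0 as the minimum, per the requirement stated above.

```elixir
defmodule TesseractVersion do
  # Hypothetical helper for illustration; not part of image_ocr.

  # Extracts the version from `tesseract --version` output.
  # Returns {:ok, %Version{}} or :error.
  def parse(output) do
    case Regex.run(~r/tesseract\s+v?(\d+\.\d+\.\d+)/, output) do
      [_, version] -> Version.parse(version)
      _ -> :error
    end
  end

  # True when the reported version is at least `minimum`.
  def supported?(output, minimum \\ "5.5.0") do
    case parse(output) do
      {:ok, version} -> Version.compare(version, minimum) != :lt
      :error -> false
    end
  end
end

# Sample strings; on a real host the output could be captured with
#   {out, 0} = System.cmd("tesseract", ["--version"], stderr_to_stdout: true)
IO.inspect(TesseractVersion.supported?("tesseract 5.5.0"))
IO.inspect(TesseractVersion.supported?("tesseract 4.1.1"))
```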

Livebook

An interactive demonstration is at livebooks/demo.livemd. It covers one-shot OCR, reusable instances, per-word bounding boxes, the NimblePool, PSM/SetVariable tweaks, and uploading your own image.

License

Apache-2.0.