tesseract_js

Phoenix-friendly wrapper for tesseract.js. Run OCR in the browser — manga, document scanning, receipt parsing, anything with text in an image — without writing tesseract.js boilerplate.

Installation

Add the dependency to mix.exs:

def deps do
  [
    {:tesseract_js, "~> 0.1"}
  ]
end

Then fetch it and install the bundled assets:

mix deps.get
mix tesseract_js.install_assets    # copies the bundled JS into priv/static/

In your layout:

<%!-- root.html.heex --%>
<TesseractJs.Component.preload />
<TesseractJs.Component.script />

In your JS:

const { recognize } = window.TesseractJs;
// canvasOrImg is a <canvas> or <img> element; run this inside an async function or a module script.
const { data } = await recognize(canvasOrImg);
console.log(data.text);

That's it. CDN mode (jsDelivr) is the default — no setup needed.
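
In a LiveView app, one way to wire this up is a client hook that runs OCR when a file is chosen and pushes the text to the server. The following is a minimal sketch, not part of this package: the hook name and the ocr_result / ocr_error event names are hypothetical, and it assumes <TesseractJs.Component.script /> has already defined window.TesseractJs.

```javascript
// Hypothetical LiveView hook: OCR a chosen file and push the text to the server.
// Assumes window.TesseractJs has been defined by the layout's script component.
const OcrHook = {
  mounted() {
    // this.el is the <input type="file"> element the hook is attached to.
    this.el.addEventListener("change", (event) => {
      this.handleFile(event.target.files[0]);
    });
  },

  async handleFile(file) {
    if (!file) return; // user cancelled the file dialog
    try {
      // A File is a Blob, which recognize() accepts.
      const { data } = await window.TesseractJs.recognize(file);
      this.pushEvent("ocr_result", { text: data.text }); // hypothetical event name
    } catch (err) {
      this.pushEvent("ocr_error", { message: String(err) }); // hypothetical event name
    }
  },
};
```

Register it when creating the LiveSocket (for example hooks: { Ocr: OcrHook }) and attach it with phx-hook="Ocr" on a file input that has a unique id.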

Configuration

# config/config.exs
config :tesseract_js,
  lang: "eng",                # default language
  source: :cdn,               # :cdn (default) or :local
  tessdata_repo: :standard,   # :standard or :best
  core_variant: :simd_lstm    # :simd_lstm | :simd | :basic

Languages

Pick any tesseract language code, or combine with +:

config :tesseract_js, lang: "eng+jpn_vert"
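
When the set of languages is decided on the JS side (say, from a user setting), the +-joined string can be built from a list. A small sketch: buildLangString is a hypothetical helper, and passing lang through opts assumes the per-call options use the same key as the Elixir config, which this README does not confirm.

```javascript
// Hypothetical helper: join tesseract language codes into the "a+b" form.
function buildLangString(codes) {
  const cleaned = codes.map((c) => c.trim()).filter((c) => c.length > 0);
  if (cleaned.length === 0) {
    throw new Error("at least one language code is required");
  }
  // Drop duplicates while keeping the caller's order.
  return [...new Set(cleaned)].join("+");
}

const lang = buildLangString(["eng", "jpn_vert", "eng"]);
console.log(lang); // "eng+jpn_vert"

// Assumption: opts accepts a `lang` key mirroring the Elixir config.
// const { data } = await window.TesseractJs.recognize(canvasOrImg, { lang });
```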

The package ships a curated registry that supplies help text and checksums. Print it with mix tesseract_js.download --list:

| code | name | code | name |
| --- | --- | --- | --- |
| eng | English | nld | Dutch |
| jpn | Japanese | pol | Polish |
| jpn_vert | Japanese (vertical, manga-friendly) | tur | Turkish |
| chi_sim | Chinese (simplified) | vie | Vietnamese |
| chi_tra | Chinese (traditional) | tha | Thai |
| kor | Korean | ukr | Ukrainian |
| fra | French | ara | Arabic |
| deu | German | hin | Hindi |
| spa | Spanish | rus | Russian |
| ita | Italian | por | Portuguese |

Any code outside the curated list (e.g. swe, nor, dan, heb) still works at runtime — it just falls through to the URL template without checksum verification. Full list of language codes: tesseract-ocr/tessdata · LANGUAGES.
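
Since codes outside the registry skip checksum verification, an app may want to flag them before use. The sketch below hard-codes the twenty codes from the table above; that list is a hand-copied assumption and can drift from the registry printed by mix tesseract_js.download --list. The helper name is hypothetical.

```javascript
// Codes copied by hand from the curated table above (assumed current for v0.1).
const CURATED = new Set([
  "eng", "jpn", "jpn_vert", "chi_sim", "chi_tra", "kor", "fra", "deu", "spa", "ita",
  "nld", "pol", "tur", "vie", "tha", "ukr", "ara", "hin", "rus", "por",
]);

// Split an "a+b" lang string and return the codes that lack checksum coverage.
function uncuratedCodes(langString) {
  return langString.split("+").filter((code) => code && !CURATED.has(code));
}

console.log(uncuratedCodes("eng+swe+heb")); // ["swe", "heb"]
```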

Quality tiers (tessdata_repo)

| tier | size | accuracy | notes |
| --- | --- | --- | --- |
| :standard (default) | ~11 MB/lang gzipped | full | LSTM+legacy combined |
| :best | ~3 MB/lang gzipped | LSTM-only (the _best_int jsDelivr variant) | smaller and faster to download, similar accuracy for most langs |

The :fast tier (from tesseract-ocr/tessdata_fast) is not supported yet: it requires uncompressed .traineddata files served from a different source. Support is slated for v0.2.
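
For a rough first-load download budget, the approximate per-language figures can be added up. This is back-of-the-envelope arithmetic using the ~11 MB, ~3 MB and ~3.8 MB (SIMD + LSTM core) numbers quoted in this README, not measured sizes; the function name is hypothetical.

```javascript
// Approximate gzipped sizes in MB, taken from this README (not measured).
const APPROX_MB_PER_LANG = { standard: 11, best: 3 };
const APPROX_CORE_MB = 3.8; // SIMD + LSTM core

// Estimate the download for a "+"-joined lang string, rounded to one decimal.
function estimateDownloadMb(langString, tier = "standard") {
  const perLang = APPROX_MB_PER_LANG[tier];
  if (perLang === undefined) throw new Error(`unknown tier: ${tier}`);
  const count = langString.split("+").filter(Boolean).length;
  return Number((APPROX_CORE_MB + count * perLang).toFixed(1));
}

console.log(estimateDownloadMb("eng+jpn_vert"));         // 25.8
console.log(estimateDownloadMb("eng+jpn_vert", "best")); // 9.8
```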

Local mode

Switch off jsDelivr — useful in production, restricted networks, or offline:

mix tesseract_js.download eng jpn
# core + two langs into priv/static/assets/vendor/tesseract/

Then in config.exs:

config :tesseract_js, source: :local

Task options:

mix tesseract_js.download --tier best eng jpn  # smaller LSTM-only model
mix tesseract_js.download --core-only          # just the WASM core
mix tesseract_js.download --list               # print the registry
mix tesseract_js.download --force              # re-download

CDN URLs & manual downloads

The package builds these URLs from TesseractJs.Models. Use the same URLs to download files manually (curl, wget, browser, mirror) if the Mix task can't reach the network from where you're deploying.

Pinned versions (v0.1.0)

| component | version | source |
| --- | --- | --- |
| tesseract.js-core (WASM runtime) | 5.1.1 | npm |
| tessdata (language models) | 4.0.0 | @tesseract.js-data/* on npm |

Core WASM (one variant per app)

# SIMD + LSTM (default, fastest on modern CPUs — ~3.8 MB)
curl -O https://cdn.jsdelivr.net/npm/tesseract.js-core@5.1.1/tesseract-core-simd-lstm.wasm.js

# SIMD only
curl -O https://cdn.jsdelivr.net/npm/tesseract.js-core@5.1.1/tesseract-core-simd.wasm.js

# Basic (no SIMD — fallback for very old browsers / restricted environments)
curl -O https://cdn.jsdelivr.net/npm/tesseract.js-core@5.1.1/tesseract-core.wasm.js

Language models — :standard tier (full LSTM+legacy, ~11 MB/lang gzipped)

URL template: https://cdn.jsdelivr.net/npm/@tesseract.js-data/<LANG>@1.0.0/4.0.0/<LANG>.traineddata.gz

# English
curl -O https://cdn.jsdelivr.net/npm/@tesseract.js-data/eng@1.0.0/4.0.0/eng.traineddata.gz

# Japanese
curl -O https://cdn.jsdelivr.net/npm/@tesseract.js-data/jpn@1.0.0/4.0.0/jpn.traineddata.gz

# Japanese (vertical, manga-friendly)
curl -O https://cdn.jsdelivr.net/npm/@tesseract.js-data/jpn_vert@1.0.0/4.0.0/jpn_vert.traineddata.gz

# Korean
curl -O https://cdn.jsdelivr.net/npm/@tesseract.js-data/kor@1.0.0/4.0.0/kor.traineddata.gz

# Substitute any lang code from the registry above (or any tesseract code)
curl -O https://cdn.jsdelivr.net/npm/@tesseract.js-data/<LANG>@1.0.0/4.0.0/<LANG>.traineddata.gz

Language models — :best tier (LSTM-only, ~3 MB/lang gzipped)

URL template: https://cdn.jsdelivr.net/npm/@tesseract.js-data/<LANG>@1.0.0/4.0.0_best_int/<LANG>.traineddata.gz

# English (best, ~3 MB)
curl -O https://cdn.jsdelivr.net/npm/@tesseract.js-data/eng@1.0.0/4.0.0_best_int/eng.traineddata.gz

# Japanese (best)
curl -O https://cdn.jsdelivr.net/npm/@tesseract.js-data/jpn@1.0.0/4.0.0_best_int/jpn.traineddata.gz
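
For tooling outside Elixir (a Node download script, for instance), the same URLs can be rebuilt from the templates and pinned versions above. The sketch below only mirrors those templates; the function names are hypothetical and this is not the package's own implementation.

```javascript
const CDN = "https://cdn.jsdelivr.net/npm";

// Path segment per tier, as in the URL templates above.
const TIER_PATH = { standard: "4.0.0", best: "4.0.0_best_int" };

// File name per core variant, as in the curl examples above.
const CORE_FILE = {
  simd_lstm: "tesseract-core-simd-lstm.wasm.js",
  simd: "tesseract-core-simd.wasm.js",
  basic: "tesseract-core.wasm.js",
};

// Build the traineddata URL for one language code and tier.
function tessdataUrl(lang, tier = "standard") {
  const tierPath = TIER_PATH[tier];
  if (!tierPath) throw new Error(`unknown tier: ${tier}`);
  return `${CDN}/@tesseract.js-data/${lang}@1.0.0/${tierPath}/${lang}.traineddata.gz`;
}

// Build the core WASM URL for one variant.
function coreUrl(variant = "simd_lstm") {
  const file = CORE_FILE[variant];
  if (!file) throw new Error(`unknown core variant: ${variant}`);
  return `${CDN}/tesseract.js-core@5.1.1/${file}`;
}

console.log(tessdataUrl("jpn_vert", "best"));
console.log(coreUrl("basic"));
```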

Where to put them for :local mode

Drop everything into your Phoenix app's priv/static/assets/vendor/tesseract/:

priv/static/assets/vendor/tesseract/
├── tesseract.min.js                      ← installed by `mix tesseract_js.install_assets`
├── worker.min.js                         ← installed by `mix tesseract_js.install_assets`
├── tesseract_js.umd.js                   ← installed by `mix tesseract_js.install_assets`
├── tesseract-core-simd-lstm.wasm.js      ← curl OR `mix tesseract_js.download --core-only`
├── eng.traineddata.gz                    ← curl OR `mix tesseract_js.download eng`
└── jpn_vert.traineddata.gz               ← curl OR `mix tesseract_js.download jpn_vert`

Then set config :tesseract_js, source: :local.
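
Before switching to :local, it can help to confirm that every expected file is actually in place. The following Node sketch checks the directory against the layout shown above; it is a hypothetical helper rather than something the package provides, and it assumes the default SIMD + LSTM core unless told otherwise.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Files that `mix tesseract_js.install_assets` is described as installing.
const BUNDLED = ["tesseract.min.js", "worker.min.js", "tesseract_js.umd.js"];

// Return the expected files that are missing from `dir`.
// `coreFile` defaults to the SIMD + LSTM core; pass another name for other variants.
function missingLocalFiles(dir, langs, coreFile = "tesseract-core-simd-lstm.wasm.js") {
  const expected = [
    ...BUNDLED,
    coreFile,
    ...langs.map((lang) => `${lang}.traineddata.gz`),
  ];
  return expected.filter((name) => !fs.existsSync(path.join(dir, name)));
}

const missing = missingLocalFiles("priv/static/assets/vendor/tesseract", ["eng", "jpn_vert"]);
console.log(missing.length === 0 ? "local assets complete" : `missing: ${missing.join(", ")}`);
```

Run it from the Phoenix app root so the relative path resolves.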

Generating any URL programmatically

iex> TesseractJs.Models.cdn_url("eng")
"https://cdn.jsdelivr.net/npm/@tesseract.js-data/eng@1.0.0/4.0.0/eng.traineddata.gz"

iex> TesseractJs.Models.cdn_url("jpn_vert", :best)
"https://cdn.jsdelivr.net/npm/@tesseract.js-data/jpn_vert@1.0.0/4.0.0_best_int/jpn_vert.traineddata.gz"

iex> TesseractJs.Models.core_cdn_url(:simd_lstm)
"https://cdn.jsdelivr.net/npm/tesseract.js-core@5.1.1/tesseract-core-simd-lstm.wasm.js"

JS API

window.TesseractJs.getOcrWorker(opts?)   // Promise<Worker>; singleton
window.TesseractJs.recognize(imageLike, opts?)
window.TesseractJs.resetWorker()         // terminate + clear singleton

imageLike is anything tesseract.js's own recognize() accepts: a canvas, an img element, a Blob, a URL, or ImageData. opts overrides the inline defaults set by <TesseractJs.Component.script />.
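
Because the worker is a singleton, a batch of images can reuse it and release it once at the end. A sketch, assuming recognize() resolves to an object with data.text as in the quick-start snippet; ocrAll is a hypothetical helper, and its api parameter exists only so the function can be exercised outside the browser.

```javascript
// Hypothetical helper: OCR several images one after another with the shared
// worker, then terminate it so its memory is released.
async function ocrAll(images, api = globalThis.TesseractJs) {
  const texts = [];
  try {
    for (const image of images) {
      // Sequential on purpose: every call goes through the same singleton worker.
      const { data } = await api.recognize(image);
      texts.push(data.text);
    }
    return texts;
  } finally {
    // Runs on success and on failure. Assumption: a later recognize() call
    // creates a fresh worker after the singleton has been cleared.
    await api.resetWorker();
  }
}
```

In the browser this would be called as ocrAll([canvasA, canvasB]) from an async function.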

How it loads

| File | How it gets there |
| --- | --- |
| tesseract.min.js, worker.min.js, tesseract_js.umd.js | shipped in the package; copied into your priv/static/ by mix tesseract_js.install_assets |
| tesseract-core-simd-lstm.wasm.js | jsDelivr (CDN mode) or mix tesseract_js.download (local mode) |
| <lang>.traineddata.gz | jsDelivr (CDN mode) or mix tesseract_js.download <langs> (local mode) |

Tasks

| task | what it does |
| --- | --- |
| mix tesseract_js.install_assets | copies the bundled JS into your priv/static/. Run once after install, and again after upgrading the package |
| mix tesseract_js.download | downloads core WASM + traineddata for local mode |
| mix tesseract_js.download --list | prints the curated language registry |

License

MIT.