nx_tflite_mob
Cross-platform TensorFlow Lite NIF for Mob apps. Loads a .tflite
model, attaches the right per-platform delegate (NNAPI on Android,
CoreML on iOS), and runs inference. Same Elixir API, same .tflite
model file on both OSes.
On the Moto G Power 5G (2024) / Dimensity 7020 / IMG PowerVR BXM-8-256,
a YOLOv8n forward pass takes 155 ms through the MediaTek mtk-gpu_shim
NNAPI accelerator. This is the headline result that finally unlocks the
chip after both NxVulkan (3.9s) and IREE (1.7s) stopped short.
Platforms
| Target | Delegate path | Output |
|---|---|---|
| android_arm64 | XNNPACK CPU / NNAPI (vendor GPU + NPU HALs) | priv/android_arm64/libtflite_nif.{so,a} |
| ios_device | XNNPACK CPU / CoreML (ANE) / Metal (GPU) | priv/ios_device/libtflite_nif.a |
| ios_sim | XNNPACK CPU (no ANE on simulator) | priv/ios_sim/libtflite_nif.a |
Status
| Surface | State |
|---|---|
| Standalone Android bench CLI: YOLOv8n via TFLite XNNPACK CPU INT8 | ✅ 273 ms mean / 202 ms min |
| Standalone Android bench CLI: YOLOv8n via TFLite + NNAPI(mtk-gpu_shim) | ✅ 155 ms mean / 118 ms min |
| Standalone Android bench CLI: YOLOv8n via TFLite + NNAPI(mtk-neuron_shim) NPU | 618 ms (partial NPU coverage, falls back) |
| C NIF compiles + cross-compiles for Android arm64 | ✅ priv/android/libtflite_nif.so (~16 KB) |
| NIF loads inside Mob's running BEAM | ⏸ blocked: same enif_* namespace isolation we hit in nx_iree_mob |
The full perf ladder on a budget Moto BXM-8-256
| Stack | Median | vs original |
|---|---|---|
| Original NxVulkan (hand-rolled Vulkan compute) | 3.9 s | 1× |
| IREE CPU (f32, LLVM autovectorized) | 2.07 s | 1.9× |
| TFLite XNNPACK CPU (f32) | 525 ms | 7.4× |
| TFLite XNNPACK INT8 (QDQ, f32 input) | 411 ms | 9.5× |
| TFLite XNNPACK full_integer_quant (INT8 input) | 273 ms | 14× |
| TFLite + NNAPI → mtk-gpu_shim | 155 ms | 25× |
| TFLite + NNAPI → mtk-neuron_shim (APU NPU) | 618 ms | 6× (only partial op coverage) |
~155 ms per frame ≈ 6.5 FPS sustained on a budget Android phone, fully GPU-accelerated through the device's NNAPI HAL. This is the number Mob+Android+YOLO has been chasing across three backends.
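The speedup column and the FPS figure follow directly from the latencies in the table. A minimal sketch of that arithmetic, assuming the 3.9 s NxVulkan row as the baseline; the helper names `speedup` and `fps_from_ms` are illustrative and not part of this repo:

```shell
#!/bin/sh
# Speedup of a stack relative to a baseline latency (both in milliseconds).
speedup() {
    awk -v base="$1" -v ms="$2" 'BEGIN { printf "%.1f\n", base / ms }'
}

# Frames per second implied by a per-frame latency in milliseconds.
fps_from_ms() {
    awk -v ms="$1" 'BEGIN { printf "%.1f\n", 1000 / ms }'
}

# Baseline: 3900 ms (NxVulkan row). Candidate: 155 ms (NNAPI mtk-gpu_shim row).
echo "speedup: $(speedup 3900 155)x"
echo "fps:     $(fps_from_ms 155)"
```

The table rounds these ratios to whole numbers for the faster rows; the sketch keeps one decimal place so that nearby rows stay distinguishable.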
Why TFLite + NNAPI wins where NxVulkan + IREE Vulkan didn't
The BXM-8-256's PowerVR driver caps at Vulkan 1.1. IREE's Vulkan HAL requires Vulkan 1.3 + timeline semaphores + scalarBlockLayout + synchronization2. So our direct Vulkan compute path was blocked at the runtime baseline check, regardless of how good the shaders were.
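A quick way to see this kind of cap before committing to a Vulkan backend is to decode the Vulkan version a device advertises. The sketch below assumes the version arrives as a packed integer in the standard Vulkan encoding (major in bits 31-22, minor in bits 21-12, patch in bits 11-0), which is also how Android reports its `android.hardware.vulkan.version` feature. The helper names are illustrative, and the `adb` line in the comment is an assumption about how you would fetch the value on your device:

```shell
#!/bin/sh
# Decode a packed Vulkan version integer into major.minor.patch.
decode_vk_version() {
    echo "$(( $1 >> 22 )).$(( ($1 >> 12) & 1023 )).$(( $1 & 4095 ))"
}

# Succeed if the packed version ($1) is at least major ($2) . minor ($3).
vk_at_least() {
    [ "$1" -ge $(( ($2 << 22) | ($3 << 12) )) ]
}

# On a real device the packed value could come from something like:
#   adb shell pm list features | grep vulkan.version
# Here a literal stands in for it: 4198400 is the packed form of 1.1.0.
packed=4198400
echo "Vulkan $(decode_vk_version "$packed")"
if vk_at_least "$packed" 1 3; then
    echo "meets a 1.3 baseline"
else
    echo "below a 1.3 baseline"
fi
```

A device that reports 1.1 fails this check no matter how well the shaders are tuned, which is the situation described above.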
NNAPI sidesteps that — it routes through MediaTek's own NN HAL driver
(mtk-gpu_shim), which talks to the PowerVR using vendor-specific code
paths that don't go through the public Vulkan 1.3 surface.
Lesson: for non-flagship Android phones, the cleanest "use the GPU" path isn't to write Vulkan compute yourself — it's to compile to TFLite, run with NNAPI, and let the vendor's HAL choose the kernel.
Where the NPU went
mtk-neuron_shim exists and is the actual APU/MDLA, but TFLite + NNAPI
running YOLOv8n on it lands at 618 ms — slower than the GPU path. The
NPU only natively supports a subset of ops (mostly the conv-shaped
ones); YOLO's post-processing (concat / reshape / non-max suppression)
falls back to CPU with cross-device buffer transfers. The roundtrip
swamps any per-op speedup.
A model designed end-to-end for the APU (no reshape/concat in the
inference graph) would land much faster on mtk-neuron_shim. Doesn't
apply to YOLOv8n as exported.
Standalone bench CLI
The Android benchmark used to produce all the numbers above lives in
scripts/bench_android/ (a single C file bench.c + the TFLite AAR's
.so and headers). Recipe:
# Pull TFLite 2.16.1 AAR (smaller than building TFLite from source).
mkdir -p /tmp/tflite && cd /tmp/tflite
curl -sLO https://repo1.maven.org/maven2/org/tensorflow/tensorflow-lite/2.16.1/tensorflow-lite-2.16.1.aar
unzip -q tensorflow-lite-2.16.1.aar -d aar
# Patch in two missing headers (AAR bug):
bash -c '
for path in "tensorflow/lite/core/c/registration_external.h" \
"tensorflow/lite/core/async/c/types.h"; do
mkdir -p "aar/headers/$(dirname "$path")"
curl -sL -o "aar/headers/$path" \
"https://raw.githubusercontent.com/tensorflow/tensorflow/v2.16.1/$path"
done
'
# Cross-compile bench.c (in this repo's scripts/bench_android/).
ANDROID_NDK=/path/to/ndk/27.2.12479018
$ANDROID_NDK/toolchains/llvm/prebuilt/darwin-x86_64/bin/aarch64-linux-android29-clang \
-O2 -I aar/headers \
bench.c aar/jni/arm64-v8a/libtensorflowlite_jni.so \
-ldl -llog \
-o bench
# Push + run on phone.
adb shell mkdir -p /data/local/tmp/tflite
adb push bench aar/jni/arm64-v8a/libtensorflowlite_jni.so yolov8n_full_integer_quant.tflite input_int8.bin /data/local/tmp/tflite/
adb shell "cd /data/local/tmp/tflite && LD_LIBRARY_PATH=. ./bench \
yolov8n_full_integer_quant.tflite input_int8.bin nnapi:mtk-gpu_shim"
# List available NNAPI accelerators on the device:
adb shell "cd /data/local/tmp/tflite && LD_LIBRARY_PATH=. ./bench list-nnapi"

The Mob NIF integration gap (same as nx_iree_mob)
Loading the NIF .so into Mob's running BEAM hits the same Bionic linker-namespace isolation that NxIree did:
dlopen failed: cannot locate symbol "enif_open_resource_type"
referenced by libtflite_nif.so in namespace clns-7
The launcher (libnxeigen_probe.so for our test app) does export all
176 enif_* symbols (llvm-nm -D confirms), but a NIF loaded
dynamically into the app's private namespace can't reach them. Same
fix patterns:
- Mob static-NIF integration (recommended): extend mob_dev's rustler/Zig pipeline to handle a C-NIF-with-extra-libs entry. The C source + libtensorflowlite_jni.so get linked into the app launcher binary, and ERTS symbols resolve at link time. Same pattern as nx_vulkan.
- Pre-load libtensorflowlite_jni.so from the app launcher's JNI_OnLoad so it's already in the global namespace when our NIF tries to load.
Until either path lands, the standalone bench CLI is the way to measure
the perf number, and the Elixir-side wrapper (lib/nx_tflite_mob.ex)
is design-only for Mob integration.
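The symbol check mentioned above (`llvm-nm -D`) can be scripted to tell apart "the launcher does not export the symbols" from "the symbols exist but the namespace hides them". This is a rough sketch: the helper names are illustrative, and it assumes the usual `llvm-nm -D` line shapes, `<address> <type> <name>` for defined symbols and `U <name>` for undefined ones. The sample lines fed in at the end are made up for demonstration.

```shell
#!/bin/sh
# Count enif_* symbols that a library defines and exports.
# Reads `llvm-nm -D` output on stdin.
count_enif_exports() {
    awk '$2 ~ /^[TtWw]$/ && $3 ~ /^enif_/ { n++ } END { print n + 0 }'
}

# List enif_* symbols that a library references but does not define.
list_enif_undefined() {
    awk '$1 == "U" && $2 ~ /^enif_/ { print $2 }'
}

# Intended use against real binaries (paths depend on your build):
#   llvm-nm -D libnxeigen_probe.so | count_enif_exports
#   llvm-nm -D priv/android/libtflite_nif.so | list_enif_undefined

# Demonstration on made-up sample lines:
printf '%s\n' \
    '0000000000012340 T enif_open_resource_type' \
    '                 U enif_alloc' \
    '0000000000012400 T JNI_OnLoad' | count_enif_exports
```

A non-zero export count on the launcher together with a non-empty undefined list on the NIF matches the failure described here: the symbols exist, but the dynamically loaded NIF cannot see them at runtime.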
Layout
c_src/tflite_nif.c - NIF: load_module, call, release_module
lib/nx_tflite_mob.ex - Elixir API + the @on_load NIF loader stub
Makefile - Android arm64 cross-compile
priv/android/libtflite_nif.so - built artifact (after `make android`)
scripts/bxm_tflite_sweep.sh - reproduce the full perf table on a device
docs/perf_history.md - the per-stack numbers + analysis

License
Apache 2.0.