nx_tflite_mob

Cross-platform TensorFlow Lite NIF for Mob apps. Loads a .tflite model, attaches the right per-platform delegate (NNAPI on Android, CoreML on iOS), and runs inference. Same Elixir API, same .tflite model file on both OSes.

On the Moto G Power 5G (2024), with a Dimensity 7020 and an IMG PowerVR BXM-8-256 GPU, a YOLOv8n forward pass takes 155 ms through the MediaTek mtk-gpu_shim accelerator. This is the result that finally puts the chip's GPU to use, after both NxVulkan (3.9 s) and IREE (2.07 s) stopped short.

Platforms

Target	Delegate path	Output
android_arm64	XNNPACK CPU / NNAPI (vendor GPU + NPU HALs)	`priv/android_arm64/libtflite_nif.{so,a}`
ios_device	XNNPACK CPU / CoreML (ANE) / Metal (GPU)	`priv/ios_device/libtflite_nif.a`
ios_sim	XNNPACK CPU (no ANE on simulator)	`priv/ios_sim/libtflite_nif.a`
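
The per-platform default in the table can be pictured as a compile-time switch. The sketch below is only an illustration of that idea: the enum, the function names, and the choice of CoreML as the iOS default are assumptions, and this is not the selection logic in c_src/tflite_nif.c.

```c
#include <assert.h>
#if defined(__APPLE__)
#include <TargetConditionals.h>  /* defines TARGET_OS_SIMULATOR on Apple SDKs */
#endif

/* Hypothetical names, used only for this illustration. */
typedef enum { DELEGATE_XNNPACK, DELEGATE_NNAPI, DELEGATE_COREML } delegate_kind;

/* Pick a default delegate for the platform this file is compiled for. */
delegate_kind default_delegate(void) {
#if defined(__ANDROID__)
    return DELEGATE_NNAPI;      /* vendor GPU / NPU HALs */
#elif defined(__APPLE__) && !TARGET_OS_SIMULATOR
    return DELEGATE_COREML;     /* Apple device build */
#else
    return DELEGATE_XNNPACK;    /* simulator and everything else: CPU */
#endif
}

/* Human-readable label for logging. */
const char *delegate_label(delegate_kind kind) {
    switch (kind) {
    case DELEGATE_NNAPI:   return "nnapi";
    case DELEGATE_COREML:  return "coreml";
    case DELEGATE_XNNPACK: return "xnnpack";
    }
    assert(0 && "unknown delegate_kind");
    return "xnnpack";
}
```

Compiled for android_arm64, default_delegate() returns DELEGATE_NNAPI; for ios_sim it falls through to the XNNPACK CPU path, matching the table.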

Status

Surface	State
Standalone Android `bench` CLI: YOLOv8n via TFLite XNNPACK CPU INT8	✅ 273 ms mean / 202 ms min
Standalone Android `bench` CLI: YOLOv8n via TFLite + NNAPI(`mtk-gpu_shim`)	✅ 155 ms mean / 118 ms min
Standalone Android `bench` CLI: YOLOv8n via TFLite + NNAPI(`mtk-neuron_shim`) NPU	618 ms (partial NPU coverage, falls back)
C NIF compiles + cross-compiles for Android arm64	✅ `priv/android/libtflite_nif.so` (~16 KB)
NIF loads inside Mob's running BEAM	⏸ blocked: same `enif_*` namespace isolation we hit in nx_iree_mob

The full perf ladder on a budget Moto (PowerVR BXM-8-256)

Stack	Median	vs original
Original NxVulkan (hand-rolled Vulkan compute)	3.9 s	1×
IREE CPU (f32, LLVM autovectorized)	2.07 s	1.9×
TFLite XNNPACK CPU (f32)	525 ms	7.4×
TFLite XNNPACK INT8 (QDQ, f32 input)	411 ms	9.5×
TFLite XNNPACK full_integer_quant (INT8 input)	273 ms	14×
TFLite + NNAPI → `mtk-gpu_shim`	155 ms	25×
TFLite + NNAPI → `mtk-neuron_shim` (APU NPU)	618 ms	6× — only partial op coverage

~155 ms per frame is about 6.5 FPS sustained on a budget Android phone, fully GPU-accelerated through the device's NNAPI HAL. This is the number Mob+Android+YOLO has been chasing across three backends.
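
The FPS and speedup figures follow directly from the latencies in the table. A minimal sketch of that arithmetic in C; the function names are illustrative and not part of this repo:

```c
#include <assert.h>

/* Frames per second for a given per-frame latency in milliseconds. */
double fps_from_latency_ms(double latency_ms) {
    assert(latency_ms > 0.0);
    return 1000.0 / latency_ms;
}

/* Speedup of a stack relative to the baseline; both latencies in the same unit. */
double speedup_vs_baseline(double baseline_ms, double stack_ms) {
    assert(baseline_ms > 0.0 && stack_ms > 0.0);
    return baseline_ms / stack_ms;
}
```

With the table's values, fps_from_latency_ms(155.0) is about 6.45 and speedup_vs_baseline(3900.0, 155.0) is about 25.2, which the table rounds to 6.5 FPS and 25×.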

Why TFLite + NNAPI wins where NxVulkan + IREE Vulkan didn't

The BXM-8-256's PowerVR driver caps at Vulkan 1.1. IREE's Vulkan HAL requires Vulkan 1.3 plus timeline semaphores, scalarBlockLayout, and synchronization2. So the IREE Vulkan compute path was blocked at the runtime baseline check, regardless of how good the shaders were.
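
To make the baseline check concrete: Vulkan packs an API version into a 32-bit integer (major from bit 22 up, minor in bits 12 to 21, patch in the low 12 bits), and a runtime can compare the device's reported apiVersion against its minimum. The sketch below reproduces only that version comparison, with hypothetical helper names and without the Vulkan headers. It does not model IREE's actual check or the additional feature requirements.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same bit layout as Vulkan's VK_MAKE_API_VERSION with variant 0:
   major in bits 22..28, minor in bits 12..21, patch in bits 0..11. */
uint32_t make_api_version(uint32_t major, uint32_t minor, uint32_t patch) {
    assert(major < 128u && minor < 1024u && patch < 4096u);
    return (major << 22) | (minor << 12) | patch;
}

/* Hypothetical helper: true when the device's reported API version is at
   least the runtime's minimum. Packed versions compare correctly as plain
   integers because major occupies the highest bits. */
bool meets_vulkan_baseline(uint32_t device_api_version, uint32_t required_api_version) {
    return device_api_version >= required_api_version;
}
```

A device reporting 1.1.x fails meets_vulkan_baseline(..., make_api_version(1, 3, 0)) whatever its patch level, which is why shader quality never enters the picture.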

NNAPI sidesteps that — it routes through MediaTek's own NN HAL driver (mtk-gpu_shim), which talks to the PowerVR using vendor-specific code paths that don't go through the public Vulkan 1.3 surface.

Lesson: for non-flagship Android phones, the cleanest "use the GPU" path isn't to write Vulkan compute yourself — it's to compile to TFLite, run with NNAPI, and let the vendor's HAL choose the kernel.

Where the NPU went

mtk-neuron_shim exists and is the actual APU/MDLA, but TFLite + NNAPI running YOLOv8n on it lands at 618 ms, slower than the GPU path. The NPU natively supports only a subset of ops (mostly the conv-shaped ones); YOLO's post-processing (concat / reshape / non-max suppression) falls back to CPU with cross-device buffer transfers. The round trips swamp any per-op speedup.

A model designed end-to-end for the APU (no reshape/concat in the inference graph) should land much faster on mtk-neuron_shim, but that does not describe YOLOv8n as exported.
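
One way to picture the fallback cost is to count how often execution switches between the accelerator and the CPU. The sketch below is a simplified illustration: it treats the graph as a flat sequence of ops with a made-up supported/unsupported flag per op, and counts each change as one buffer handoff. It is not how NNAPI partitions a graph, and the flags are not YOLOv8n's real op coverage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Count device handoffs in a linearized graph. supported[i] is true when
   op i can run on the accelerator and false when it falls back to CPU.
   Every change between adjacent ops implies a cross-device buffer transfer. */
size_t count_device_handoffs(const bool *supported, size_t n_ops) {
    assert(supported != NULL || n_ops == 0);
    size_t handoffs = 0;
    for (size_t i = 1; i < n_ops; ++i) {
        if (supported[i] != supported[i - 1]) {
            ++handoffs;
        }
    }
    return handoffs;
}
```

A graph whose ops are all supported yields zero handoffs. Interleaving unsupported ops between supported ones adds two handoffs per interruption, so a handful of reshape or concat ops in the middle of the graph can dominate the runtime.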

Standalone `bench` CLI

The Android benchmark used to produce the TFLite numbers above lives in scripts/bench_android/ (a single C file, bench.c, plus the TFLite AAR's .so and headers). Recipe:

# Pull TFLite 2.16.1 AAR (smaller than building TFLite from source).
mkdir -p /tmp/tflite && cd /tmp/tflite
curl -sLO https://repo1.maven.org/maven2/org/tensorflow/tensorflow-lite/2.16.1/tensorflow-lite-2.16.1.aar
unzip -q tensorflow-lite-2.16.1.aar -d aar
# Patch in two missing headers (AAR bug):
bash -c '
for path in "tensorflow/lite/core/c/registration_external.h" \
            "tensorflow/lite/core/async/c/types.h"; do
  mkdir -p "aar/headers/$(dirname "$path")"
  curl -sL -o "aar/headers/$path" \
    "https://raw.githubusercontent.com/tensorflow/tensorflow/v2.16.1/$path"
done
'

# Cross-compile bench.c (from this repo's scripts/bench_android/).
# The prebuilt host directory below is for macOS; on Linux it is linux-x86_64.
ANDROID_NDK=/path/to/ndk/27.2.12479018
"$ANDROID_NDK"/toolchains/llvm/prebuilt/darwin-x86_64/bin/aarch64-linux-android29-clang \
  -O2 -I aar/headers \
  bench.c aar/jni/arm64-v8a/libtensorflowlite_jni.so \
  -ldl -llog \
  -o bench

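# Push + run on phone. The model pushed must be the one the bench command loads.
adb shell mkdir -p /data/local/tmp/tflite
adb push bench aar/jni/arm64-v8a/libtensorflowlite_jni.so \
  yolov8n_full_integer_quant.tflite input_int8.bin /data/local/tmp/tflite/
adb shell "cd /data/local/tmp/tflite && LD_LIBRARY_PATH=. ./bench \
  yolov8n_full_integer_quant.tflite input_int8.bin nnapi:mtk-gpu_shim"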

# List available NNAPI accelerators on the device:
adb shell "cd /data/local/tmp/tflite && LD_LIBRARY_PATH=. ./bench list-nnapi"
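
The third argument selects the backend, for example nnapi:mtk-gpu_shim. How bench.c parses it is not shown here. The following is a hypothetical sketch of extracting the accelerator name from such a spec, assuming the form nnapi:<accelerator>; the function name is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return a pointer to the accelerator name inside a spec of the form
   "nnapi:<accelerator>", or NULL when the spec has a different prefix
   or the accelerator name is empty. The result points into spec itself. */
const char *nnapi_accelerator_from_spec(const char *spec) {
    static const char prefix[] = "nnapi:";
    const size_t prefix_len = sizeof(prefix) - 1;
    assert(spec != NULL);
    if (strncmp(spec, prefix, prefix_len) != 0) {
        return NULL;  /* not an NNAPI spec */
    }
    if (spec[prefix_len] == '\0') {
        return NULL;  /* "nnapi:" with nothing after the colon */
    }
    return spec + prefix_len;
}
```

Given "nnapi:mtk-gpu_shim" this returns "mtk-gpu_shim", the kind of name a caller would pass on as the NNAPI accelerator to target.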

The Mob NIF integration gap (same as nx_iree_mob)

Loading the NIF .so into Mob's running BEAM hits the same Bionic linker-namespace isolation that nx_iree_mob did:

dlopen failed: cannot locate symbol "enif_open_resource_type"
   referenced by libtflite_nif.so in namespace clns-7

The launcher (libnxeigen_probe.so for our test app) does export all 176 enif_* symbols (llvm-nm -D confirms), but a NIF loaded dynamically into the app's private namespace can't reach them. The same two fix patterns apply:

1. Mob static-NIF integration (recommended): extend mob_dev's rustler/Zig pipeline to handle a C-NIF-with-extra-libs entry. The C source + libtensorflowlite_jni.so get linked into the app launcher binary, so ERTS symbols resolve at link time. Same pattern as nx_vulkan.
2. Pre-load libtensorflowlite_jni.so from the app launcher's JNI_OnLoad so it's already in the global namespace when our NIF tries to load.

Until either path lands, the standalone bench CLI is the way to measure the perf number, and the Elixir-side wrapper (lib/nx_tflite_mob.ex) is design-only for Mob integration.

Layout

c_src/tflite_nif.c          — NIF: load_module, call, release_module
lib/nx_tflite_mob.ex        — Elixir API + the @on_load NIF loader stub
Makefile                    — Android arm64 cross-compile
priv/android/libtflite_nif.so — built artifact (after `make android`)
scripts/bxm_tflite_sweep.sh — reproduce the full perf table on a device
docs/perf_history.md        — the per-stack numbers + analysis

License

Apache 2.0.