nx_tflite_mob

Cross-platform TensorFlow Lite NIF for Mob apps. Loads a .tflite model, attaches the right per-platform delegate (NNAPI on Android, CoreML on iOS), and runs inference. Same Elixir API, same .tflite model file on both OSes.

On the Moto G Power 5G (2024) (Dimensity 7020, IMG PowerVR BXM-8-256), a YOLOv8n forward pass takes 155 ms through the MediaTek mtk-gpu_shim NNAPI accelerator. This is the headline result that finally unlocks the chip after both NxVulkan (3.9 s) and IREE (1.7 s) stopped short.

Platforms

| Target | Delegate path | Output |
|---|---|---|
| android_arm64 | XNNPACK CPU / NNAPI (vendor GPU + NPU HALs) | priv/android_arm64/libtflite_nif.{so,a} |
| ios_device | XNNPACK CPU / CoreML (ANE) / Metal (GPU) | priv/ios_device/libtflite_nif.a |
| ios_sim | XNNPACK CPU (no ANE on simulator) | priv/ios_sim/libtflite_nif.a |

Status

| Surface | State |
|---|---|
| Standalone Android bench CLI: YOLOv8n via TFLite XNNPACK CPU INT8 | 273 ms mean / 202 ms min |
| Standalone Android bench CLI: YOLOv8n via TFLite + NNAPI(mtk-gpu_shim) | 155 ms mean / 118 ms min |
| Standalone Android bench CLI: YOLOv8n via TFLite + NNAPI(mtk-neuron_shim) NPU | 618 ms (partial NPU coverage, falls back) |
| C NIF compiles + cross-compiles for Android arm64 | priv/android/libtflite_nif.so (~16 KB) |
| NIF loads inside Mob's running BEAM | ⏸ blocked: same enif_* namespace isolation we hit in nx_iree_mob |
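
The mean and min figures above are summary statistics over repeated forward passes. As a minimal sketch of how such figures could be derived from per-iteration timings (this is not the actual bench.c; the struct and function names are hypothetical), assuming the latencies were collected in milliseconds:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical summary of per-iteration latencies, all in milliseconds. */
typedef struct {
    double mean;
    double min;
    double median;
} latency_stats;

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Computes mean, min and median of n samples.
   Returns 0 on success, -1 on bad input or allocation failure. */
int summarize_latencies(const double *ms, size_t n, latency_stats *out) {
    if (ms == NULL || out == NULL || n == 0) return -1;

    /* Sort a copy so the caller's sample order is preserved. */
    double *sorted = malloc(n * sizeof *sorted);
    if (sorted == NULL) return -1;
    memcpy(sorted, ms, n * sizeof *sorted);
    qsort(sorted, n, sizeof *sorted, cmp_double);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += sorted[i];

    out->mean = sum / (double)n;
    out->min = sorted[0];
    out->median = (n % 2) ? sorted[n / 2]
                          : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);
    free(sorted);
    return 0;
}
```

Reporting min alongside mean helps separate the best-case kernel time from scheduling and thermal noise on the device.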

The full perf ladder on a budget Moto BXM-8-256

| Stack | Median | vs original |
|---|---|---|
| Original NxVulkan (hand-rolled Vulkan compute) | 3.9 s | |
| IREE CPU (f32, LLVM autovectorized) | 2.07 s | 1.9× |
| TFLite XNNPACK CPU (f32) | 525 ms | 7.4× |
| TFLite XNNPACK INT8 (QDQ, f32 input) | 411 ms | 9.5× |
| TFLite XNNPACK full_integer_quant (INT8 input) | 273 ms | 14× |
| TFLite + NNAPI → mtk-gpu_shim | 155 ms | 25× |
| TFLite + NNAPI → mtk-neuron_shim (APU NPU) | 618 ms | 6× (only partial op coverage) |

~155 ms = 6.5 FPS sustained on a budget Android, fully GPU-accelerated through the device's NNAPI HAL. This is the number Mob+Android+YOLO has been chasing across three backends.
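
The "vs original" column and the FPS figure follow from two simple ratios. A small sketch that reproduces them from the table's latencies (the function names are illustrative and not taken from the repo):

```c
/* Speedup of a stack relative to the baseline:
   baseline latency divided by the stack's latency.
   Both arguments must be positive and in the same unit. */
double speedup_vs_baseline(double baseline_ms, double stack_ms) {
    return baseline_ms / stack_ms;
}

/* Frames per second if every forward pass takes latency_ms
   and passes run back to back with no other per-frame work. */
double fps_from_latency_ms(double latency_ms) {
    return 1000.0 / latency_ms;
}
```

With the table's values, `speedup_vs_baseline(3900.0, 155.0)` is about 25.2 and `fps_from_latency_ms(155.0)` is about 6.45, which round to the 25× and 6.5 FPS quoted here.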

Why TFLite + NNAPI wins where NxVulkan + IREE Vulkan didn't

The BXM-8-256's PowerVR driver caps at Vulkan 1.1. IREE's Vulkan HAL requires Vulkan 1.3 + timeline semaphores + scalarBlockLayout + synchronization2. So our direct Vulkan compute path was blocked at the runtime baseline check, regardless of how good the shaders were.

NNAPI sidesteps that — it routes through MediaTek's own NN HAL driver (mtk-gpu_shim), which talks to the PowerVR using vendor-specific code paths that don't go through the public Vulkan 1.3 surface.

Lesson: for non-flagship Android phones, the cleanest "use the GPU" path isn't to write Vulkan compute yourself — it's to compile to TFLite, run with NNAPI, and let the vendor's HAL choose the kernel.
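
The baseline check that blocked the direct Vulkan path can be illustrated with the version comparison alone. The sketch below is not IREE's code: the helper names are hypothetical, and it models only the API version, not the feature requirements (timeline semaphores, scalarBlockLayout, synchronization2). It assumes the standard Vulkan packing of `apiVersion`, which in a real program would come from `VkPhysicalDeviceProperties.apiVersion`:

```c
#include <stdbool.h>
#include <stdint.h>

/* Pack major.minor.patch the way Vulkan's apiVersion field does, with the
   variant bits left at 0: major in bits 22-28, minor in 12-21, patch in 0-11. */
uint32_t pack_api_version(uint32_t major, uint32_t minor, uint32_t patch) {
    return (major << 22) | (minor << 12) | patch;
}

/* Extract the major and minor fields from a packed version. */
uint32_t api_major(uint32_t v) { return (v >> 22) & 0x7Fu; }
uint32_t api_minor(uint32_t v) { return (v >> 12) & 0x3FFu; }

/* True if the device's reported version is at least the required
   major.minor. The patch number is ignored. */
bool meets_api_baseline(uint32_t device, uint32_t required) {
    if (api_major(device) != api_major(required))
        return api_major(device) > api_major(required);
    return api_minor(device) >= api_minor(required);
}
```

A driver reporting 1.1 fails `meets_api_baseline(device, pack_api_version(1, 3, 0))` before any shader is compiled, which is why shader quality never entered into it.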

Where the NPU went

mtk-neuron_shim exists and is the actual APU/MDLA, but TFLite + NNAPI running YOLOv8n on it lands at 618 ms — slower than the GPU path. The NPU only natively supports a subset of ops (mostly the conv-shaped ones); YOLO's post-processing (concat / reshape / non-max suppression) falls back to CPU with cross-device buffer transfers. The roundtrip swamps any per-op speedup.

A model designed end-to-end for the APU (no reshape/concat in the inference graph) would likely run much faster on mtk-neuron_shim, but YOLOv8n as exported is not such a model.
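
To make the fallback cost concrete, the sketch below counts how often execution would switch between the accelerator and the CPU for a given op sequence. It is a simplification under stated assumptions: real TFLite graphs are DAGs partitioned by the delegate, not linear lists, and the supported-op set passed in is hypothetical rather than the actual mtk-neuron_shim coverage. The function names are illustrative.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if op appears in the accelerator's supported-op list. */
bool op_is_supported(const char *op, const char *const *supported,
                     size_t n_supported) {
    for (size_t i = 0; i < n_supported; i++)
        if (strcmp(op, supported[i]) == 0) return true;
    return false;
}

/* Counts how many times execution switches between accelerator and CPU
   when the ops run in the given order. Each switch implies a
   cross-device buffer transfer. */
size_t count_device_handoffs(const char *const *ops, size_t n_ops,
                             const char *const *supported,
                             size_t n_supported) {
    size_t handoffs = 0;
    for (size_t i = 1; i < n_ops; i++) {
        bool prev = op_is_supported(ops[i - 1], supported, n_supported);
        bool cur = op_is_supported(ops[i], supported, n_supported);
        if (prev != cur) handoffs++;
    }
    return handoffs;
}
```

For the sequence CONV_2D, CONV_2D, CONCATENATION, CONV_2D, RESHAPE with only CONV_2D supported, the count is 3. A graph with no unsupported ops gives 0, which is the case an APU-native model would aim for.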

Standalone bench CLI

The Android benchmark used to produce all the numbers above lives in scripts/bench_android/ (a single C file bench.c + the TFLite AAR's .so and headers). Recipe:

# Pull TFLite 2.16.1 AAR (smaller than building TFLite from source).
mkdir -p /tmp/tflite && cd /tmp/tflite
curl -sLO https://repo1.maven.org/maven2/org/tensorflow/tensorflow-lite/2.16.1/tensorflow-lite-2.16.1.aar
unzip -q tensorflow-lite-2.16.1.aar -d aar
# Patch in two missing headers (AAR bug):
bash -c '
for path in "tensorflow/lite/core/c/registration_external.h" \
            "tensorflow/lite/core/async/c/types.h"; do
  mkdir -p "aar/headers/$(dirname $path)"
  curl -sL -o "aar/headers/$path" \
    "https://raw.githubusercontent.com/tensorflow/tensorflow/v2.16.1/$path"
done
'

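# Cross-compile bench.c (copy it here from this repo's scripts/bench_android/, or pass its full path).
# The prebuilt directory below is for macOS hosts; on Linux it is linux-x86_64.
ANDROID_NDK=/path/to/ndk/27.2.12479018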
$ANDROID_NDK/toolchains/llvm/prebuilt/darwin-x86_64/bin/aarch64-linux-android29-clang \
  -O2 -I aar/headers \
  bench.c aar/jni/arm64-v8a/libtensorflowlite_jni.so \
  -ldl -llog \
  -o bench

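# Push + run on phone. The .so comes from the unpacked AAR; the model pushed
# must be the same file the bench command loads.
adb shell mkdir -p /data/local/tmp/tflite
adb push bench aar/jni/arm64-v8a/libtensorflowlite_jni.so \
  yolov8n_full_integer_quant.tflite input_int8.bin /data/local/tmp/tflite/
adb shell "cd /data/local/tmp/tflite && LD_LIBRARY_PATH=. ./bench \
  yolov8n_full_integer_quant.tflite input_int8.bin nnapi:mtk-gpu_shim"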

# List available NNAPI accelerators on the device:
adb shell "cd /data/local/tmp/tflite && LD_LIBRARY_PATH=. ./bench list-nnapi"
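
The third argument in the recipe (`nnapi:mtk-gpu_shim`) selects the delegate and the NNAPI accelerator by name. How bench.c parses it is not shown here, so the following is only a sketch of one way to split such a spec. The type and function names are hypothetical, and the `cpu` and bare `nnapi` forms are assumptions that do not appear in the commands above.

```c
#include <stddef.h>
#include <string.h>

typedef enum { BACKEND_CPU, BACKEND_NNAPI, BACKEND_INVALID } backend_kind;

typedef struct {
    backend_kind kind;
    const char *accelerator; /* points into the argument; NULL if none given */
} backend_spec;

/* Parses "cpu", "nnapi", or "nnapi:<accelerator-name>".
   Anything else, including "nnapi:" with an empty name, is invalid. */
backend_spec parse_backend(const char *arg) {
    backend_spec spec = { BACKEND_INVALID, NULL };
    if (arg == NULL) return spec;

    if (strcmp(arg, "cpu") == 0) {
        spec.kind = BACKEND_CPU;
        return spec;
    }
    if (strncmp(arg, "nnapi", 5) == 0) {
        if (arg[5] == '\0') {
            /* No accelerator named: leave the device choice to NNAPI. */
            spec.kind = BACKEND_NNAPI;
        } else if (arg[5] == ':' && arg[6] != '\0') {
            spec.kind = BACKEND_NNAPI;
            spec.accelerator = arg + 6;
        }
    }
    return spec;
}
```

Keeping the accelerator as a plain string means names reported by `list-nnapi` can be passed through unchanged.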

The Mob NIF integration gap (same as nx_iree_mob)

Loading the NIF .so into Mob's running BEAM hits the same Bionic linker-namespace isolation that NxIree did:

dlopen failed: cannot locate symbol "enif_open_resource_type"
   referenced by libtflite_nif.so in namespace clns-7

The launcher (libnxeigen_probe.so for our test app) does export all 176 enif_* symbols (llvm-nm -D confirms), but a NIF loaded dynamically into the app's private namespace can't reach them. The same fix patterns as in nx_iree_mob apply.

Until either path lands, the standalone bench CLI is the way to measure the perf number, and the Elixir-side wrapper (lib/nx_tflite_mob.ex) is design-only for Mob integration.

Layout

c_src/tflite_nif.c          — NIF: load_module, call, release_module
lib/nx_tflite_mob.ex        — Elixir API + the @on_load NIF loader stub
Makefile                    — Android arm64 cross-compile
priv/android/libtflite_nif.so — built artifact (after `make android`)
scripts/bxm_tflite_sweep.sh — reproduce the full perf table on a device
docs/perf_history.md        — the per-stack numbers + analysis

License

Apache 2.0.