DeepMetal
edge ai compiler for high-level deep learning frameworks
we built deepmetal because we kept running into the same wall: you can’t really run machine learning on tiny microcontrollers without hitting memory and speed limits. most of the time you’re stuck with big frameworks that don’t fit, or you end up hand-coding tiny models just to make them run. we wanted something in between, a way to take a small pytorch model and turn it into code that actually works on chips like the stm32f446re.
the pipeline is simple. we train or export a model with export_model.py, usually an mnist-scale linear, conv, or hybrid network, which gives us .pth files and state dicts. then we push it through one of three converters: the c converter emits pure c with static memory and ping-pong buffers tuned for cortex-m4; the llvm ir converter emits .ll and object files so the same model can run cross-platform on x86, aarch64, or risc-v; and the c++ converter builds templates plus a model_config.json for metadata. once the code is generated, we compile it with the right flags for the mcu and drop it into the stm32 demo projects we wrote, which cover led blink, uart, and simple nn inference.
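to make the export step concrete, here is a minimal sketch of what handing a model to the pipeline can look like in plain pytorch. the network, its layer sizes, and the file name are illustrative; the real export_model.py may structure this differently.

```python
import torch
import torch.nn as nn

class TinyLinearNet(nn.Module):
    # mnist-scale linear model: 784 -> 32 -> 10 (sizes are illustrative)
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, 32)
        self.fc2 = nn.Linear(32, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x.flatten(1)))
        return self.fc2(x)

model = TinyLinearNet().eval()
# save the state dict so a converter can walk the layers and weights
torch.save(model.state_dict(), "tiny_linear.pth")
```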
the repo is organized around python scripts and shell helpers for training, exporting, and compiling. the embedded demo sits in backend/src/stm32_project with linker scripts, startup files, uart, led, and demo variants. we also added a small react/vite frontend to show results, but it’s a nice extra rather than a core piece.
we made some design choices early. no malloc: everything is statically allocated with ping-pong buffers, capped at max_buffer_size = 2048. operators are limited to linear, conv2d, and relu; no pooling, batch norm, or softmax yet. cortex-m4 is the main target, but the llvm ir path works across platforms. the demo firmware sets up gpio, usart2, and enables the fpu so inference runs at full speed.
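to illustrate the ping-pong idea (not the actual codegen), here is a rough python sketch of how a converter could alternate layer outputs between two static buffers and enforce the cap. the function name, the buffer labels, and the assumption that max_buffer_size counts elements are ours, not the repo’s.

```python
MAX_BUFFER_SIZE = 2048  # cap from the design notes above; treated here as elements per buffer

def plan_buffers(layer_output_sizes):
    """Alternate each layer's output between two static buffers so a layer
    reads its input from one buffer and writes its output to the other."""
    plan = []
    for i, size in enumerate(layer_output_sizes):
        if size > MAX_BUFFER_SIZE:
            raise ValueError(f"layer {i} output ({size}) exceeds MAX_BUFFER_SIZE")
        plan.append("buffer_a" if i % 2 == 0 else "buffer_b")
    return plan

# e.g. a 784 -> 32 -> 10 linear net: first output lands in buffer_a, second in buffer_b
print(plan_buffers([32, 10]))  # ['buffer_a', 'buffer_b']
```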
it wasn’t easy. memory planning was the hardest part: static allocation works, but it’s clunky, and a real planner would do better. operator coverage is thin, so more ops like pooling and batch norm folding are needed. quantization and cmsis-nn aren’t in yet, but they would cut ram and flash and speed things up. benchmarks are rough too; we only have ballpark numbers. even the frontend could use better docs to explain how it ties into the converters.
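quantization isn’t in yet, but the core of a q7 path is small. here is a hedged numpy sketch of symmetric per-tensor weight quantization; the function name and scale choice are ours and not tied to anything in the repo or to cmsis-nn’s exact conventions.

```python
import numpy as np

def quantize_q7(weights):
    """Symmetric per-tensor quantization of float weights to int8 (q7)."""
    scale = max(float(np.abs(weights).max()) / 127.0, 1e-12)
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale

w = np.random.randn(32, 784).astype(np.float32)
q, scale = quantize_q7(w)
# int8 storage is 4x smaller than float32, which is where the ram/flash savings come from
print(q.nbytes, "bytes vs", w.nbytes)
```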
our roadmap is clear. add onnx export with a fixed op subset, fold batch norm into conv weights, support simple pooling, build a smarter memory planner, and add quantization (q7/q15 and cmsis-nn). we also want parity tests with checked-in tensors and expected logits, plus a timing harness. the docs need a design page with diagrams showing the flow and memory layout.
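the batch norm fold on the roadmap is the standard identity: scale each conv output channel by gamma / sqrt(var + eps) and adjust the bias with the running mean and beta. here is a small pytorch sketch of what that converter pass could look like; the function is hypothetical and not in the repo yet.

```python
import torch

def fold_bn_into_conv(conv_w, conv_b, gamma, beta, mean, var, eps=1e-5):
    """Return conv weights/bias equivalent to conv2d followed by batch norm."""
    scale = gamma / torch.sqrt(var + eps)        # one factor per output channel
    w = conv_w * scale.reshape(-1, 1, 1, 1)      # scale each output filter
    b = (conv_b - mean) * scale + beta
    return w, b

# e.g. an 8-channel 3x3 conv over a single input channel
w, b = fold_bn_into_conv(
    torch.randn(8, 1, 3, 3), torch.zeros(8),
    gamma=torch.ones(8), beta=torch.zeros(8),
    mean=torch.zeros(8), var=torch.ones(8),
)
print(w.shape, b.shape)  # torch.Size([8, 1, 3, 3]) torch.Size([8])
```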
so far the models hold up. linear nets hit around 95 percent accuracy with ~107k params, conv nets 98 percent with ~23k, and hybrids 97 percent with ~15k. on an 80 mhz cortex-m4, latency is roughly 2 ms for linear, 5 ms for hybrid, and 8 ms for conv. ram use runs 2–64 kb and flash 15–400 kb.
the first working version landed on june 21 with the full backend, stm32 project, and scripts. on june 22 we cleaned up the pipeline, added frontend updates, and shipped a small mnist flask demo. by june 26 we had polished the readme and added repo badges. the code and history are all up on github if you want to take a look.