Binary Optimization
Minimizing binary size to fit within flash memory constraints.
Prerequisites: This chapter builds on Performance from Part 4. Review that chapter for general Rust optimization techniques before diving into embedded-specific strategies.
Embedded targets have strict flash memory budgets. The STM32F769 provides 2 MB of flash, which sounds generous until you add a display driver, networking stack, and cryptography library. Every kilobyte matters. This chapter covers systematic techniques for reducing binary size from Cargo profiles down to individual lines of code.
Cargo Profile Tuning
Cargo profiles control how rustc compiles your code. The right settings can reduce binary size by 50% or more with no code changes.
Profile Options Reference
| Option | Values | Effect | Size Impact |
|---|---|---|---|
| `opt-level` | `0` | No optimization (fast compile) | Baseline (largest) |
| | `1` | Basic optimizations | ~30% smaller than 0 |
| | `2` | Most optimizations | ~40% smaller than 0 |
| | `3` | All optimizations (speed priority) | Similar to 2, may be larger |
| | `"s"` | Optimize for size | ~45% smaller than 0 |
| | `"z"` | Aggressively optimize for size | ~50% smaller than 0 |
| `lto` | `false` / `"off"` | No link-time optimization | Baseline |
| | `"thin"` | Fast LTO across crates | 10-20% smaller |
| | `true` / `"fat"` | Full LTO (slow compile) | 15-25% smaller |
| `codegen-units` | `16` (release default; dev defaults to 256) | Parallel compilation | Baseline |
| | `1` | Single codegen unit | 5-10% smaller (enables more inlining) |
| `debug` | `true` / `2` | Full debug info | No flash impact (debug info in ELF only) |
| | `false` / `0` | No debug info | Smaller ELF file on disk |
| `strip` | `false` | Keep symbols | Baseline |
| | `true` | Strip all symbols | Smaller ELF, no symbol names in debugger |
Recommended Profiles
# Cargo.toml
# Development: fast compile, debuggable
[profile.dev]
opt-level = 1 # Some optimization (default 0 is too slow on target)
debug = true # Full debug info for probe-rs / GDB
lto = false # Fast compilation
codegen-units = 16 # Parallel codegen units for faster builds
# Release: smallest binary
[profile.release]
opt-level = "z" # Aggressively optimize for size
lto = "fat" # Full link-time optimization
codegen-units = 1 # Single unit for maximum optimization
debug = true # Keep debug info (does NOT increase flash usage)
strip = false # Keep symbols for cargo-size analysis
Always set `debug = true`, even in release builds. Debug information is stored in ELF sections that are not flashed to the target. You get full debugging capability with zero flash cost.
Real-World Size Comparison
Typical sizes for a minimal #[entry] blinky program on STM32F769:
| Profile | opt-level | LTO | codegen-units | Flash Usage |
|---|---|---|---|---|
| dev (default) | 0 | off | 256 | ~48 KB |
| dev (tuned) | 1 | off | 16 | ~22 KB |
| release (default) | 3 | off | 16 | ~18 KB |
| release (size) | z | fat | 1 | ~8 KB |
| release (size + strip) | z | fat | 1 | ~8 KB (same flash, smaller ELF) |
Panic Handler Strategies
The panic handler is a mandatory component in no_std programs. Your choice directly affects binary size and debuggability.
Panic Handler Comparison
| Crate | Behavior | Typical Size | Debuggability | Use Case |
|---|---|---|---|---|
| `panic-halt` | Infinite loop (`loop {}`) | ~0 bytes | None (silent hang) | Production, size-critical |
| `panic-abort` | Calls `core::intrinsics::abort` | ~0 bytes | Triggers fault (catchable by debugger) | Production |
| `panic-probe` | Hits breakpoint + defmt log | ~200-500 bytes | Excellent (message + location) | Development with probe-rs |
| `panic-semihosting` | Prints to debug host | ~2-4 KB | Good (full message on host) | Development with OpenOCD |
Configuration
// Choose ONE panic handler per binary
// Option 1: Minimal size — silent halt
use panic_halt as _;
// Option 2: Development — breakpoint with defmt message
use panic_probe as _;
// Option 3: Semihosting — full panic message to host console
use panic_semihosting as _;
Never use `panic-semihosting` in production. If no debugger is attached, the semihosting call triggers a HardFault. Use `panic-halt` or `panic-abort` for deployed firmware.
Binary Size Analysis Tools
Before optimizing, you need to measure. The cargo-binutils tools (cargo size, cargo nm) and the separate cargo-bloat tool show exactly where your bytes are going.
cargo size — Section Breakdown
$ cargo size --release -- -A
binary-optimization :
section size addr
.vector_table 1024 0x8000000
.text 6472 0x8000400
.rodata 832 0x8001d48
.data 16 0x20000000
.bss 268 0x20000010
.uninit 1024 0x2000011c
Total 9636
Interpreting the output:
| Section | Location | What to Check |
|---|---|---|
| `.text` | Flash | Your code + library code. Largest section to optimize. |
| `.rodata` | Flash | String literals, constants, vtables. Watch for format strings. |
| `.vector_table` | Flash | Fixed size (varies by interrupt count). Cannot reduce. |
| `.data` | Flash + RAM | Initialized statics. Small is normal. |
| `.bss` | RAM only | Zero-initialized statics. Does not consume flash. |
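The placement rules in the table can be turned into a quick budget check. The following is a host-side sketch, not part of cargo-binutils: the `Placement` and `Section` types and both helper functions are hypothetical names, the sizes are copied from the sample `cargo size` output above, and `.uninit` is assumed to occupy RAM only.

```rust
/// Where a linker section consumes memory on the target.
#[derive(Clone, Copy, PartialEq)]
enum Placement {
    Flash,       // stored and used in flash
    FlashAndRam, // initial values in flash, copied to RAM at startup
    Ram,         // occupies RAM only, no flash cost
}

/// One row of `cargo size -- -A` output plus its placement.
struct Section {
    name: &'static str,
    size: u32,
    placement: Placement,
}

/// Sizes copied from the sample output; `.uninit` is assumed RAM-only.
const SAMPLE: [Section; 6] = [
    Section { name: ".vector_table", size: 1024, placement: Placement::Flash },
    Section { name: ".text", size: 6472, placement: Placement::Flash },
    Section { name: ".rodata", size: 832, placement: Placement::Flash },
    Section { name: ".data", size: 16, placement: Placement::FlashAndRam },
    Section { name: ".bss", size: 268, placement: Placement::Ram },
    Section { name: ".uninit", size: 1024, placement: Placement::Ram },
];

/// Returns (flash_bytes, ram_bytes) for a set of sections.
fn usage(sections: &[Section]) -> (u32, u32) {
    sections.iter().fold((0, 0), |(flash, ram), s| match s.placement {
        Placement::Flash => (flash + s.size, ram),
        Placement::FlashAndRam => (flash + s.size, ram + s.size),
        Placement::Ram => (flash, ram + s.size),
    })
}

/// Name of the largest section that costs flash, if any.
fn largest_flash_section(sections: &[Section]) -> Option<&'static str> {
    sections
        .iter()
        .filter(|s| s.placement != Placement::Ram)
        .max_by_key(|s| s.size)
        .map(|s| s.name)
}
```

For the sample output, `usage(&SAMPLE)` gives 8,344 bytes of flash and 1,308 bytes of RAM, whereas the `Total` line (9,636) mixes flash and RAM sections together.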
cargo bloat — Per-Function Analysis
$ cargo bloat --release -n 10
File .text Size Crate Name
5.8% 28.3% 1,832B core core::fmt::write
3.1% 15.1% 976B core core::fmt::Formatter::pad
2.4% 11.7% 756B my_crate my_crate::main
1.2% 5.9% 380B core core::fmt::num::<impl core::fmt::Display for u32>::fmt
0.8% 3.9% 252B std core::str::slice_error_fail
...
If `core::fmt` functions dominate your bloat report, you have format string bloat. See Code-Level Optimizations below.
cargo nm — Symbol Listing
# List all symbols sorted by size (largest first)
$ cargo nm --release -- --size-sort --reverse | head -20
00000728 T core::fmt::write
000003d0 T core::fmt::Formatter::pad
000002f4 T my_crate::main
Use cargo nm to find specific symbols when cargo bloat does not provide enough detail.
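When comparing this listing against cargo bloat, it helps to convert the hexadecimal column to decimal bytes. The helper below is a small host-side sketch; `parse_nm_line` is a hypothetical name, and it assumes the first column is the symbol size in hex, as the sample output above suggests.

```rust
/// Parses one line such as "00000728 T core::fmt::write" into
/// (size_in_bytes, symbol_name). Returns None for malformed lines.
fn parse_nm_line(line: &str) -> Option<(u32, &str)> {
    let line = line.trim();
    // First column: hexadecimal size (assumption based on the sample output).
    let (size_hex, rest) = line.split_once(char::is_whitespace)?;
    // Second column: symbol type, e.g. "T" for code in .text.
    let (_kind, name) = rest.trim_start().split_once(char::is_whitespace)?;
    let size = u32::from_str_radix(size_hex, 16).ok()?;
    // The remainder is the symbol name, which may itself contain spaces.
    Some((size, name.trim_start()))
}
```

Applied to the three sample lines, this yields 1,832, 976, and 756 bytes, which match the sizes reported by cargo bloat for the same symbols.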
Linker Script Optimization
The default cortex-m-rt linker script covers most use cases, but custom sections and careful placement can improve both size and performance.
ITCM Placement for Critical Code
On the STM32F769, ITCM (Instruction Tightly Coupled Memory) provides zero-wait-state instruction fetch. Place interrupt handlers and hot loops there:
/* Additions to memory.x for STM32F769 */
MEMORY
{
  FLASH : ORIGIN = 0x08000000, LENGTH = 2M
  RAM : ORIGIN = 0x20020000, LENGTH = 368K
  ITCM : ORIGIN = 0x00000000, LENGTH = 16K
  DTCM : ORIGIN = 0x20000000, LENGTH = 128K
}
SECTIONS
{
  /* Place performance-critical code in ITCM: it runs from ITCM (VMA)
     but is stored in flash (LMA) */
  .itcm :
  {
    . = ALIGN(4);
    __sitcm = .;
    *(.itcm .itcm.*);
    . = ALIGN(4);
    __eitcm = .;
  } > ITCM AT > FLASH
  /* Load address in flash for ITCM initialization. Startup code must copy
     (__eitcm - __sitcm) bytes from __sitcm_load to __sitcm before any
     ITCM-resident function is called. */
  __sitcm_load = LOADADDR(.itcm);
}
KEEP for Interrupt Vectors
The KEEP directive prevents the linker from discarding sections that appear unused:
SECTIONS
{
.vector_table ORIGIN(FLASH) :
{
KEEP(*(.vector_table));
KEEP(*(.vector_table.exceptions));
KEEP(*(.vector_table.interrupts));
} > FLASH
}
Without KEEP, the linker's garbage collection of unused sections might remove interrupt vectors that are only referenced by hardware, not by code.
Removing Unused Sections
Enable garbage collection of unused sections in .cargo/config.toml:
[target.thumbv7em-none-eabihf]
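rustflags = [
  # Passed directly to the linker (rust-lld by default on this target).
  # Use "link-arg=-Wl,--gc-sections" only when linking through a gcc driver.
  "-C", "link-arg=--gc-sections",
]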
This works with LTO to remove dead code that survives Rust-level optimization but can be identified at link time.
Code-Level Optimizations
Profile settings only go so far. The biggest wins often come from avoiding patterns that pull in large chunks of core.
core::fmt Bloat
The core::fmt formatting machinery is the single largest source of unexpected binary size in embedded Rust. A single panic!("{}", x) can add 20 KB or more to your binary.
Why it happens:
// This pulls in core::fmt::write, Display for u32, padding, alignment...
panic!("Temperature {} exceeds max {}", temp, MAX_TEMP);
// Adds: ~20 KB of formatting code
// This uses no formatting machinery
panic!("Temperature exceeds maximum");
// Adds: ~0 bytes (just a string literal in .rodata)
| Pattern | Approximate Cost | Alternative |
|---|---|---|
| `panic!("{}", x)` | ~20 KB | `panic!("static message")` |
| `write!(buf, "{}", x)` | ~20 KB | `defmt::write!` or manual conversion |
| `format!("{}", x)` | N/A (requires alloc) | Avoid entirely in no_std |
| `defmt::info!("{}", x)` | ~200 bytes | Use defmt for all logging |
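As an illustration of the "manual conversion" alternative in the table, the sketch below turns a `u32` into decimal text without touching `core::fmt`. The function name `u32_to_decimal` is hypothetical, and the code uses only `core`, so it should also compile in a no_std crate.

```rust
/// Writes the decimal digits of `value` into `buf` and returns them as a &str.
/// A u32 has at most 10 decimal digits, so the buffer cannot overflow.
fn u32_to_decimal(mut value: u32, buf: &mut [u8; 10]) -> &str {
    let mut pos = buf.len();
    loop {
        pos -= 1;
        // Fill the buffer from the right, least significant digit first.
        buf[pos] = b'0' + (value % 10) as u8;
        value /= 10;
        if value == 0 {
            break;
        }
    }
    // Only ASCII digits were written, so the conversion cannot fail;
    // unwrap_or avoids introducing a formatted panic message.
    core::str::from_utf8(&buf[pos..]).unwrap_or("")
}
```

The returned slice can be handed to any byte-oriented writer, for example a UART driver's blocking write method, so that printing a number does not require `write!`.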
Using defmt Instead of core::fmt
defmt (deferred formatting) moves formatting work to the host PC. The target only sends compact tokens:
// Instead of this (20+ KB):
use core::fmt::Write;
writeln!(uart, "Sensor reading: {}", value).ok();
// Use this (~200 bytes):
defmt::info!("Sensor reading: {}", value);
defmt achieves this by:
- Storing format strings in a separate ELF section (not flashed)
- Sending only a string index + raw argument bytes over the probe
- Reconstructing the full message on the host with `defmt-print` or `probe-rs`
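The three points above can be modelled on the host in a few lines. This is a deliberately simplified sketch of the idea, not defmt's actual wire format or API: `FORMAT_TABLE`, `encode_frame`, and `decode_frame` are hypothetical names, and the one-byte index plus little-endian argument layout is an assumption made for illustration.

```rust
/// Stands in for the format strings that stay in the ELF file on the host.
const FORMAT_TABLE: [&str; 2] = [
    "Sensor reading: {}",
    "Battery level: {}",
];

/// "Target side": a frame is a 1-byte string index followed by the raw
/// little-endian bytes of the argument. No text formatting happens here.
fn encode_frame(index: u8, arg: u32) -> [u8; 5] {
    let mut frame = [0u8; 5];
    frame[0] = index;
    frame[1..].copy_from_slice(&arg.to_le_bytes());
    frame
}

/// "Host side": look up the format string and substitute the argument.
/// Returns None if the index does not name a known format string.
fn decode_frame(frame: &[u8; 5]) -> Option<String> {
    let template = FORMAT_TABLE.get(frame[0] as usize)?;
    let arg = u32::from_le_bytes([frame[1], frame[2], frame[3], frame[4]]);
    Some(template.replacen("{}", &arg.to_string(), 1))
}
```

In this model the target transmits five bytes per message regardless of how long the format string is, and all string handling (`to_string`, `replacen`) runs on the host, which is the property that keeps the on-target footprint small.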
Minimizing Generic Monomorphization
Every distinct type parameter creates a new copy of a generic function:
// This generic function is compiled THREE times:
fn process<T: Sensor>(sensor: &T) { /* ... */ }
process(&temperature_sensor); // process::<TemperatureSensor>
process(&humidity_sensor); // process::<HumiditySensor>
process(&pressure_sensor); // process::<PressureSensor>
Strategies to reduce monomorphization:
// Strategy 1: Use trait objects for non-hot-path code
fn process(sensor: &dyn Sensor) { /* ... */ }
// One copy, small vtable overhead, slight runtime cost
// Strategy 2: Extract non-generic inner function
fn process<T: Sensor>(sensor: &T) {
let reading = sensor.read(); // Generic (small, inlined)
process_reading(reading); // Non-generic (one copy)
}
fn process_reading(value: f32) {
// All the heavy logic lives here — compiled once
}
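The fragments above leave out the `Sensor` trait and the sensor types. A self-contained version of both strategies might look like the sketch below, assuming a hypothetical `Sensor` trait with a single `read` method, placeholder readings, and an arbitrary plausibility range; none of these come from a real HAL.

```rust
/// Hypothetical sensor abstraction used only for this example.
trait Sensor {
    fn read(&self) -> f32;
}

struct TemperatureSensor;
struct HumiditySensor;
struct PressureSensor;

// Placeholder readings; a real driver would talk to hardware here.
impl Sensor for TemperatureSensor {
    fn read(&self) -> f32 { 21.5 }
}
impl Sensor for HumiditySensor {
    fn read(&self) -> f32 { 40.0 }
}
impl Sensor for PressureSensor {
    fn read(&self) -> f32 { 1013.25 }
}

/// Strategy 1: dynamic dispatch. One compiled copy serves every sensor type.
fn process_dyn(sensor: &dyn Sensor) -> bool {
    process_reading(sensor.read())
}

/// Strategy 2: thin generic wrapper. Each instantiation only contains the
/// call to `read` and a call into the shared non-generic function.
fn process_generic<T: Sensor>(sensor: &T) -> bool {
    process_reading(sensor.read())
}

/// Non-generic core: compiled once no matter how many sensor types exist.
/// Returns true if the value is finite and inside an assumed valid range.
fn process_reading(value: f32) -> bool {
    value.is_finite() && value >= 0.0 && value <= 2000.0
}
```

Both entry points can coexist in one binary: `process_dyn` suits code that iterates over a mixed list such as `[&dyn Sensor; 3]`, while `process_generic` keeps static dispatch where the call is on a hot path.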
Inline Control
// Rarely-called error handlers: prevent inlining to keep callers small
#[inline(never)]
fn handle_error(code: u32) {
// Error handling logic — one copy in .text
}
// Hot inner loops: hint the compiler to inline
#[inline(always)]
fn fast_checksum(byte: u8, acc: u32) -> u32 {
acc.wrapping_add(byte as u32)
}
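To show how the two attributes combine in one call site, here is a small variant of the functions above in which the error handler returns a value. The `#[cold]` attribute is an extra hint not used in the original snippet, and the error policy (an empty frame maps to a sentinel value) is purely an assumption for illustration.

```rust
/// Hot path: tiny and called once per byte, so inlining removes call overhead.
#[inline(always)]
fn fast_checksum(byte: u8, acc: u32) -> u32 {
    acc.wrapping_add(byte as u32)
}

/// Cold path: kept out of line so each caller only pays for a call.
#[cold]
#[inline(never)]
fn handle_error(code: u32) -> u32 {
    // Hypothetical policy: fold the error code into a sentinel value.
    0xDEAD_0000 | (code & 0xFFFF)
}

/// Checksums a frame; an empty frame is treated as an error (assumption).
fn checksum_frame(frame: &[u8]) -> u32 {
    if frame.is_empty() {
        return handle_error(1);
    }
    frame.iter().fold(0, |acc, &b| fast_checksum(b, acc))
}
```

Whether these attributes actually shrink the binary depends on the surrounding code, so it is worth confirming the effect with cargo bloat before and after.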
Patterns to Avoid
| Pattern | Problem | Alternative |
|---|---|---|
| `String` / `format!()` | Requires allocator, pulls in alloc | Fixed buffers, `heapless::String` |
| `panic!("{}", val)` | Pulls in core::fmt (~20 KB) | `panic!("static message")` or `defmt::panic!` |
| Deep generic nesting | Exponential monomorphization | Trait objects, extract inner functions |
| Large match arms in generics | Each variant monomorphized per type | Factor out common logic |
| `Debug` derive on large enums | Format machinery for every variant | Manual `defmt::Format` impl |
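To make the "fixed buffers" alternative concrete, the sketch below implements a tiny fixed-capacity string on the stack. `FixedStr` is a hypothetical type written for illustration; it is a simplified stand-in for the kind of container `heapless::String` provides, not a reproduction of that crate's API.

```rust
/// Hypothetical fixed-capacity string: N bytes of storage, no allocator.
struct FixedStr<const N: usize> {
    buf: [u8; N],
    len: usize,
}

impl<const N: usize> FixedStr<N> {
    const fn new() -> Self {
        Self { buf: [0; N], len: 0 }
    }

    /// Appends `s`, or returns Err(()) and leaves the buffer unchanged
    /// if the result would exceed the capacity.
    fn push_str(&mut self, s: &str) -> Result<(), ()> {
        let bytes = s.as_bytes();
        let end = self.len.checked_add(bytes.len()).ok_or(())?;
        if end > N {
            return Err(());
        }
        self.buf[self.len..end].copy_from_slice(bytes);
        self.len = end;
        Ok(())
    }

    fn as_str(&self) -> &str {
        // Only whole &str values are appended, so the contents stay valid
        // UTF-8; unwrap_or avoids a formatted panic message.
        core::str::from_utf8(&self.buf[..self.len]).unwrap_or("")
    }
}
```

Capacity overflow surfaces as a `Result` rather than a reallocation, so the worst-case memory use is visible in the type (`FixedStr<16>` always occupies 16 bytes plus the length field).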
Release Build Checklist
Follow this checklist before flashing a release build to verify your binary fits in flash and is production-ready.
Step 1: Profile Settings
# Verify Cargo.toml [profile.release]
[profile.release]
opt-level = "z"
lto = "fat"
codegen-units = 1
strip = false # Keep for analysis; enable for final production
Step 2: Panic Handler
// Development → switch to production handler
// use panic_probe as _; // Development
use panic_halt as _; // Production
Step 3: Dependency Audit
# Check which crates contribute the most to binary size
$ cargo bloat --release --crates
File .text Size Crate
10.2% 49.8% 3,224B core
4.1% 20.0% 1,296B stm32f7xx_hal
3.3% 16.1% 1,044B my_firmware
1.1% 5.4% 348B cortex_m
Review each dependency: is it pulling in more than expected? Check for feature flags that can disable unused functionality.
Step 4: Feature Flags
[dependencies]
# Disable default features and enable only what you need
stm32f7xx-hal = { version = "0.7", default-features = false, features = ["stm32f769", "rt"] }
Step 5: Size Verification
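# Print the section total; note that "Total" also counts RAM-only sections (.bss, .uninit)
$ cargo size --release -- -A | grep Total
Total 9636
# Flash usage = .vector_table + .text + .rodata + .data
#             = 1,024 + 6,472 + 832 + 16 = 8,344 bytes
# Compare against flash capacity
# STM32F769: 2,097,152 bytes (2 MB)
# Usage: 8,344 bytes (0.40%)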
Quick Reference Table
| Step | Command / Action | Target |
|---|---|---|
| Profile | Set `opt-level = "z"`, `lto = "fat"`, `codegen-units = 1` | Cargo.toml |
| Panic handler | Switch to `panic-halt` or `panic-abort` | src/main.rs |
| Dependency audit | `cargo bloat --release --crates` | Terminal |
| Feature flags | Disable default features, enable selectively | Cargo.toml |
| Size check | `cargo size --release -- -A` | Terminal |
| Flash capacity | Verify total < target flash size | Datasheet |
Best Practices
- Measure before optimizing: use `cargo size` and `cargo bloat` to identify the actual largest contributors before making code changes
- Set `debug = true` in release: debug info stays in the ELF file and is not written to flash, so you get free debuggability
- Use `defmt` for all logging: it replaces `core::fmt` with a minimal on-target footprint and full formatting on the host
- Prefer `opt-level = "z"` over `"s"`: `"z"` is almost always smaller for embedded targets; `"s"` occasionally produces faster code at slightly larger size
- Enable LTO for release: `"fat"` LTO enables cross-crate dead code elimination that is impossible without it
- Audit dependencies regularly: a single dependency pulling in `core::fmt` or an allocator can undo all your optimization work
- Use `#[inline(never)]` on error paths: error handling code is rarely executed; preventing inlining keeps hot paths compact
- Avoid `String` and `format!` entirely: use `heapless::String` or fixed-size buffers for any string manipulation in `no_std`
Next Steps
With your binary optimized for size and performance, learn how to handle concurrency without an OS in Async and Concurrency.