Binary Optimization

Minimizing binary size to fit within flash memory constraints.

Prerequisites: This chapter builds on Performance from Part 4. Review that chapter for general Rust optimization techniques before diving into embedded-specific strategies.

Embedded targets have strict flash memory budgets. The STM32F769 provides 2 MB of flash, which sounds generous until you add a display driver, networking stack, and cryptography library. Every kilobyte matters. This chapter covers systematic techniques for reducing binary size from Cargo profiles down to individual lines of code.

Cargo Profile Tuning

Cargo profiles control how rustc compiles your code. The right settings can reduce binary size by 50% or more with no code changes.

Profile Options Reference

Option	Values	Effect	Size Impact
`opt-level`	`0`	No optimization (fast compile)	Baseline (largest)
	`1`	Basic optimizations	~30% smaller than `0`
	`2`	Most optimizations	~40% smaller than `0`
	`3`	All optimizations (speed priority)	Similar to `2`, may be larger
	`s`	Optimize for size	~45% smaller than `0`
	`z`	Aggressively optimize for size	~50% smaller than `0`
`lto`	`false` / `"off"`	No link-time optimization	Baseline
	`"thin"`	Fast LTO across crates	10-20% smaller
	`true` / `"fat"`	Full LTO (slow compile)	15-25% smaller
`codegen-units`	`16` (default)	Parallel compilation	Baseline
	`1`	Single codegen unit	5-10% smaller (enables more inlining)
`debug`	`true` / `2`	Full debug info	No flash impact (debug info in ELF only)
	`false` / `0`	No debug info	Smaller ELF file on disk
`strip`	`false`	Keep symbols	Baseline
	`true`	Strip all symbols	Smaller ELF, no symbol names in debugger

Recommended Profiles

# Cargo.toml

# Development: fast compile, debuggable
[profile.dev]
opt-level = 1          # Some optimization (default 0 is too slow on target)
debug = true           # Full debug info for probe-rs / GDB
lto = false            # Fast compilation
codegen-units = 16     # Maximum parallelism

# Release: smallest binary
[profile.release]
opt-level = "z"        # Aggressively optimize for size
lto = "fat"            # Full link-time optimization
codegen-units = 1      # Single unit for maximum optimization
debug = true           # Keep debug info (does NOT increase flash usage)
strip = false          # Keep symbols for cargo-size analysis

Always set debug = true even in release builds. Debug information is stored in ELF sections that are not flashed to the target. You get full debugging capability with zero flash cost.

Real-World Size Comparison

Typical sizes for a minimal #[entry] blinky program on STM32F769:

Profile	opt-level	LTO	codegen-units	Flash Usage
dev (default)	0	off	16	~48 KB
dev (tuned)	1	off	16	~22 KB
release (default)	3	off	16	~18 KB
release (size)	z	fat	1	~8 KB
release (size + strip)	z	fat	1	~8 KB (same flash, smaller ELF)

Panic Handler Strategies

The panic handler is a mandatory component in no_std programs. Your choice directly affects binary size and debuggability.

Panic Handler Comparison

Crate	Behavior	Typical Size	Debuggability	Use Case
`panic-halt`	Infinite loop (`loop {}`)	~0 bytes	None (silent hang)	Production, size-critical
`panic-abort`	Calls `core::intrinsics::abort`	~0 bytes	Triggers fault (catchable by debugger)	Production
`panic-probe`	Hits breakpoint + `defmt` log	~200-500 bytes	Excellent (message + location)	Development with probe-rs
`panic-semihosting`	Prints to debug host	~2-4 KB	Good (full message on host)	Development with OpenOCD

Configuration

// Choose ONE panic handler per binary

// Option 1: Minimal size — silent halt
use panic_halt as _;

// Option 2: Development — breakpoint with defmt message
use panic_probe as _;

// Option 3: Semihosting — full panic message to host console
use panic_semihosting as _;

Never use panic-semihosting in production. If no debugger is attached, the semihosting call triggers a HardFault. Use panic-halt or panic-abort for deployed firmware.

Binary Size Analysis Tools

Before optimizing, you need to measure. These cargo-binutils tools show exactly where your bytes are going.

cargo size — Section Breakdown

$ cargo size --release -- -A
binary-optimization  :
section               size      addr
.vector_table          1024  0x8000000
.text                  6472  0x8000400
.rodata                 832  0x8001d38
.data                    16  0x20000000
.bss                    268  0x20000010
.uninit                1024  0x2000011c
Total                  9636

Interpreting the output:

Section	Location	What to Check
`.text`	Flash	Your code + library code. Largest section to optimize.
`.rodata`	Flash	String literals, constants, vtables. Watch for format strings.
`.vector_table`	Flash	Fixed size (varies by interrupt count). Cannot reduce.
`.data`	Flash + RAM	Initialized statics. Small is normal.
`.bss`	RAM only	Uninitialized statics. Does not consume flash.

cargo bloat — Per-Function Analysis

$ cargo bloat --release -n 10
 File  .text    Size      Crate Name
 5.8%  28.3%  1,832B      core  core::fmt::write
 3.1%  15.1%    976B      core  core::fmt::Formatter::pad
 2.4%  11.7%    756B  my_crate  my_crate::main
 1.2%   5.9%    380B      core  core::fmt::num::<impl core::fmt::Display for u32>::fmt
 0.8%   3.9%    252B       std  core::str::slice_error_fail
 ...

If core::fmt functions dominate your bloat report, you have format string bloat. See Code-Level Optimizations below.

cargo nm — Symbol Listing

# List all symbols sorted by size (largest first)
$ cargo nm --release -- --size-sort --reverse | head -20
00000728 T core::fmt::write
000003d0 T core::fmt::Formatter::pad
000002f4 T my_crate::main

Use cargo nm to find specific symbols when cargo bloat does not provide enough detail.

Linker Script Optimization

The default cortex-m-rt linker script covers most use cases, but custom sections and careful placement can improve both size and performance.

ITCM Placement for Critical Code

On the STM32F769, ITCM (Instruction Tightly Coupled Memory) provides zero-wait-state instruction fetch. Place interrupt handlers and hot loops there:

/* Additions to memory.x for STM32F769 */
MEMORY
{
    FLASH  : ORIGIN = 0x08000000, LENGTH = 2M
    RAM    : ORIGIN = 0x20020000, LENGTH = 368K
    ITCM   : ORIGIN = 0x00000000, LENGTH = 16K
    DTCM   : ORIGIN = 0x20000000, LENGTH = 128K
}

SECTIONS
{
    /* Place performance-critical code in ITCM */
    .itcm : AT(__eitcm_load)
    {
        __sitcm = .;
        *(.itcm .itcm.*);
        . = ALIGN(4);
        __eitcm = .;
    } > ITCM

    /* Load address in flash for ITCM initialization */
    __eitcm_load = LOADADDR(.itcm) + SIZEOF(.itcm);
}

KEEP for Interrupt Vectors

The KEEP directive prevents the linker from discarding sections that appear unused:

SECTIONS
{
    .vector_table ORIGIN(FLASH) :
    {
        KEEP(*(.vector_table));
        KEEP(*(.vector_table.exceptions));
        KEEP(*(.vector_table.interrupts));
    } > FLASH
}

Without KEEP, link-time optimization might remove interrupt vectors that are only referenced by hardware, not by code.

Removing Unused Sections

Enable garbage collection of unused sections in .cargo/config.toml:

[target.thumbv7em-none-eabihf]
rustflags = [
    "-C", "link-arg=-Wl,--gc-sections",
]

This works with LTO to remove dead code that survives Rust-level optimization but can be identified at link time.

Code-Level Optimizations

Profile settings only go so far. The biggest wins often come from avoiding patterns that pull in large chunks of core.

core::fmt Bloat

The core::fmt formatting machinery is the single largest source of unexpected binary size in embedded Rust. A single panic!("{}", x) can add 20 KB or more to your binary.

Why it happens:

// This pulls in core::fmt::write, Display for u32, padding, alignment...
panic!("Temperature {} exceeds max {}", temp, MAX_TEMP);
// Adds: ~20 KB of formatting code

// This uses no formatting machinery
panic!("Temperature exceeds maximum");
// Adds: ~0 bytes (just a string literal in .rodata)

Pattern	Approximate Cost	Alternative
`panic!("{}", x)`	~20 KB	`panic!("static message")`
`write!(buf, "{}", x)`	~20 KB	`defmt::write!` or manual conversion
`format!("{}", x)`	N/A (requires alloc)	Avoid entirely in `no_std`
`defmt::info!("{}", x)`	~200 bytes	Use `defmt` for all logging

Using defmt Instead of core::fmt

defmt (deferred formatting) moves formatting work to the host PC. The target only sends compact tokens:

// Instead of this (20+ KB):
use core::fmt::Write;
writeln!(uart, "Sensor reading: {}", value).ok();

// Use this (~200 bytes):
defmt::info!("Sensor reading: {}", value);

defmt achieves this by:

Storing format strings in a separate ELF section (not flashed)
Sending only a string index + raw argument bytes over the probe
Reconstructing the full message on the host with defmt-print or probe-rs

Minimizing Generic Monomorphization

Every distinct type parameter creates a new copy of a generic function:

// This generic function is compiled THREE times:
fn process<T: Sensor>(sensor: &T) { /* ... */ }

process(&temperature_sensor);  // process::<TemperatureSensor>
process(&humidity_sensor);     // process::<HumiditySensor>
process(&pressure_sensor);     // process::<PressureSensor>

Strategies to reduce monomorphization:

// Strategy 1: Use trait objects for non-hot-path code
fn process(sensor: &dyn Sensor) { /* ... */ }
// One copy, small vtable overhead, slight runtime cost

// Strategy 2: Extract non-generic inner function
fn process<T: Sensor>(sensor: &T) {
    let reading = sensor.read();     // Generic (small, inlined)
    process_reading(reading);        // Non-generic (one copy)
}

fn process_reading(value: f32) {
    // All the heavy logic lives here — compiled once
}

Inline Control

// Rarely-called error handlers: prevent inlining to keep callers small
#[inline(never)]
fn handle_error(code: u32) {
    // Error handling logic — one copy in .text
}

// Hot inner loops: hint the compiler to inline
#[inline(always)]
fn fast_checksum(byte: u8, acc: u32) -> u32 {
    acc.wrapping_add(byte as u32)
}

Patterns to Avoid

Pattern	Problem	Alternative
`String` / `format!()`	Requires allocator, pulls in alloc	Fixed buffers, `heapless::String`
`panic!("{}", val)`	Pulls in `core::fmt` (~20 KB)	`panic!("static message")` or `defmt::panic!`
Deep generic nesting	Exponential monomorphization	Trait objects, extract inner functions
Large match arms in generics	Each variant monomorphized per type	Factor out common logic
`Debug` derive on large enums	Format machinery for every variant	Manual `defmt::Format` impl

Release Build Checklist

Follow this checklist before flashing a release build to verify your binary fits in flash and is production-ready.

Step 1: Profile Settings

# Verify Cargo.toml [profile.release]
[profile.release]
opt-level = "z"
lto = "fat"
codegen-units = 1
strip = false          # Keep for analysis; enable for final production

Step 2: Panic Handler

// Development → switch to production handler
// use panic_probe as _;      // Development
use panic_halt as _;          // Production

Step 3: Dependency Audit

# Check which crates contribute the most to binary size
$ cargo bloat --release --crates
 File  .text    Size Crate
10.2%  49.8%  3,224B core
 4.1%  20.0%  1,296B stm32f7xx_hal
 3.3%  16.1%  1,044B my_firmware
 1.1%   5.4%    348B cortex_m

Review each dependency: is it pulling in more than expected? Check for feature flags that can disable unused functionality.

Step 4: Feature Flags

[dependencies]
# Disable default features and enable only what you need
stm32f7xx-hal = { version = "0.7", default-features = false, features = ["stm32f769", "rt"] }

Step 5: Size Verification

# Check total flash usage
$ cargo size --release -- -A | grep Total
Total                  9636

# Compare against flash capacity
# STM32F769: 2,097,152 bytes (2 MB)
# Usage: 9,636 bytes (0.46%)

Quick Reference Table

Step	Command / Action	Target
Profile	Set `opt-level = "z"`, `lto = "fat"`, `codegen-units = 1`	Cargo.toml
Panic handler	Switch to `panic-halt` or `panic-abort`	`src/main.rs`
Dependency audit	`cargo bloat --release --crates`	Terminal
Feature flags	Disable default features, enable selectively	Cargo.toml
Size check	`cargo size --release -- -A`	Terminal
Flash capacity	Verify total < target flash size	Datasheet

Best Practices

Measure before optimizing — use cargo size and cargo bloat to identify the actual largest contributors before making code changes
Set debug = true in release — debug info stays in the ELF file and is not written to flash, so you get free debuggability
Use defmt for all logging — it replaces core::fmt with a minimal on-target footprint and full formatting on the host
Prefer opt-level = "z" over "s" — "z" is almost always smaller for embedded targets; "s" occasionally produces faster code at slightly larger size
Enable LTO for release — "fat" LTO enables cross-crate dead code elimination that is impossible without it
Audit dependencies regularly — a single dependency pulling in core::fmt or an allocator can undo all your optimization work
Use #[inline(never)] on error paths — error handling code is rarely executed; preventing inlining keeps hot paths compact
Avoid String and format! entirely — use heapless::String or fixed-size buffers for any string manipulation in no_std

Next Steps

With your binary optimized for size and performance, learn how to handle concurrency without an OS in Async and Concurrency.

Example Code