Performance
Profiling, benchmarking, optimization techniques, and performance best practices.
Overview
Rust’s zero-cost abstractions and control over memory layout enable high-performance code. Understanding profiling, benchmarking, and optimization techniques helps you write fast, efficient programs.
flowchart TB
subgraph "Performance Workflow"
M[Measure]
I[Identify Bottlenecks]
O[Optimize]
V[Verify]
end
M -->|Profile| I
I -->|Target| O
O -->|Benchmark| V
V -->|Repeat| M
style M fill:#e3f2fd
style I fill:#fff3e0
style O fill:#c8e6c9
style V fill:#f3e5f5
When to Optimize
“Premature optimization is the root of all evil” - Donald Knuth
- First: Write clear, correct code
- Then: Measure to find actual bottlenecks
- Finally: Optimize the hot paths
flowchart TD
A[Is it fast enough?] -->|Yes| B[Don't optimize!]
A -->|No| C[Profile first]
C --> D{Where is time spent?}
D --> E[Hot path identified]
E --> F[Optimize that code]
F --> G[Measure again]
G --> A
style B fill:#c8e6c9
style C fill:#fff3e0
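Before reaching for a profiler, a rough wall-clock measurement can tell you whether a code path is even worth investigating. A minimal sketch using std::time::Instant (expensive_work is a placeholder for the code under suspicion):
use std::time::Instant;
fn expensive_work() -> u64 {
    // Placeholder for the code path you suspect is slow
    (0..1_000_000u64).sum()
}
fn main() {
    let start = Instant::now();
    let result = expensive_work();
    println!("took {:?}, result = {}", start.elapsed(), result);
}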
Benchmarking with Criterion
Criterion provides statistical benchmarking with confidence intervals.
Setup
# Cargo.toml
[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }
[[bench]]
name = "my_benchmark"
harness = false
Basic Benchmark
// benches/my_benchmark.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};
fn fibonacci_recursive(n: u64) -> u64 {
match n {
0 => 0,
1 => 1,
n => fibonacci_recursive(n - 1) + fibonacci_recursive(n - 2),
}
}
fn fibonacci_iterative(n: u64) -> u64 {
let mut a = 0;
let mut b = 1;
for _ in 0..n {
let temp = a;
a = b;
b = temp + b;
}
a
}
fn benchmark_fibonacci(c: &mut Criterion) {
let mut group = c.benchmark_group("Fibonacci");
group.bench_function("recursive_20", |b| {
b.iter(|| fibonacci_recursive(black_box(20)))
});
group.bench_function("iterative_20", |b| {
b.iter(|| fibonacci_iterative(black_box(20)))
});
group.finish();
}
criterion_group!(benches, benchmark_fibonacci);
criterion_main!(benches);
Running Benchmarks
# Run benchmarks
cargo bench
# Run a specific benchmark (the filter matches benchmark IDs, e.g. "Fibonacci/recursive_20")
cargo bench -- Fibonacci
# Save a baseline to compare future runs against
cargo bench -- --save-baseline main
Comparing Implementations
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
fn bench_sorting(c: &mut Criterion) {
let mut group = c.benchmark_group("Sorting");
for size in [100, 1000, 10000].iter() {
let data: Vec<i32> = (0..*size).rev().collect();
group.bench_with_input(
BenchmarkId::new("std_sort", size),
&data,
|b, data| {
b.iter(|| {
let mut d = data.clone();
d.sort();
d
})
},
);
group.bench_with_input(
BenchmarkId::new("std_sort_unstable", size),
&data,
|b, data| {
b.iter(|| {
let mut d = data.clone();
d.sort_unstable();
d
})
},
);
}
group.finish();
}
criterion_group!(benches, bench_sorting);
criterion_main!(benches);
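Criterion can also report throughput alongside timings; a sketch (names are illustrative) that attaches an element count to each input size:
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};
fn bench_sum(c: &mut Criterion) {
    let mut group = c.benchmark_group("Sum");
    for size in [1_000u64, 100_000, 1_000_000] {
        let data: Vec<u64> = (0..size).collect();
        // Report elements/second in addition to time per iteration
        group.throughput(Throughput::Elements(size));
        group.bench_with_input(BenchmarkId::from_parameter(size), &data, |b, data| {
            b.iter(|| data.iter().sum::<u64>())
        });
    }
    group.finish();
}
criterion_group!(benches, bench_sum);
criterion_main!(benches);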
Profiling Tools
flowchart LR
subgraph "Profiling Tools"
P[perf]
F[flamegraph]
V[valgrind]
H[heaptrack]
I[Instruments]
end
P --> P1["CPU sampling<br/>Linux"]
F --> F1["Visual call stacks<br/>Cross-platform"]
V --> V1["Memory analysis<br/>Linux/macOS"]
H --> H1["Heap profiling<br/>Linux"]
I --> I1["Full profiling<br/>macOS"]
style F fill:#c8e6c9
Flamegraph
# Install
cargo install flamegraph
# Profile (needs debug info in release)
# Add to Cargo.toml:
# [profile.release]
# debug = true
cargo flamegraph --bin my_app
# Opens interactive SVG
perf (Linux)
# Record (profile the built binary directly so cargo itself isn't measured)
cargo build --release
perf record -g --call-graph=dwarf ./target/release/my_app
# Report
perf report
# Or use hotspot for GUI
hotspot perf.data
Using DHAT for Heap Profiling
// Enable the DHAT allocator (the dhat crate is an optional dependency,
// gated behind a user-defined "dhat-heap" cargo feature)
#[cfg(feature = "dhat-heap")]
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;
fn main() {
#[cfg(feature = "dhat-heap")]
let _profiler = dhat::Profiler::new_heap();
// Your code here
}
Compiler Optimizations
Release Profile Configuration
# Cargo.toml
[profile.release]
opt-level = 3 # Maximum optimization
lto = true # Link-time optimization
codegen-units = 1 # Better optimization (slower compile)
panic = "abort" # Smaller binary, no unwinding
strip = true # Remove symbols
[profile.release-with-debug]
inherits = "release"
debug = true      # Debug info for profiling
strip = false     # Don't strip the debug info we just enabled
[profile.bench]
inherits = "release"
debug = true # Debug info for benchmarks
Optimization Levels
| Level | Cargo setting | Description |
|---|---|---|
| 0 | opt-level = 0 | No optimization (dev default) |
| 1 | opt-level = 1 | Basic optimization |
| 2 | opt-level = 2 | Most optimizations |
| 3 | opt-level = 3 | Aggressive optimization (release default) |
| s | opt-level = "s" | Optimize for binary size |
| z | opt-level = "z" | Optimize for size aggressively (also disables loop vectorization) |
Link-Time Optimization (LTO)
flowchart LR
subgraph "Without LTO"
A1[Compile Unit 1] --> O1[Object 1]
A2[Compile Unit 2] --> O2[Object 2]
O1 --> L1[Link]
O2 --> L1
L1 --> B1[Binary]
end
subgraph "With LTO"
A3[Compile Unit 1] --> O3[Object 1]
A4[Compile Unit 2] --> O4[Object 2]
O3 --> L2[Link + Optimize]
O4 --> L2
L2 --> B2[Optimized Binary]
end
style B2 fill:#c8e6c9
[profile.release]
lto = "fat"    # Full LTO (slowest compile, best optimization)
# lto = "thin" # Thin LTO (faster compile, good optimization)
# lto = true   # Same as "fat"
Memory Optimization
Stack vs Heap
flowchart TB
subgraph "Stack (Fast)"
S1["Fixed size types"]
S2["Function locals"]
S3["Small arrays"]
end
subgraph "Heap (Flexible)"
H1["Dynamic size (Vec, String)"]
H2["Large data"]
H3["Shared ownership (Rc, Arc)"]
end
S1 --> F[Fast allocation<br/>Cache friendly]
H1 --> SL[Slower allocation<br/>More flexible]
style S1 fill:#c8e6c9
style S2 fill:#c8e6c9
style S3 fill:#c8e6c9
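A small illustration of the difference (sizes are illustrative):
// Stack: fixed-size array, no allocator involved
fn stack_buffer() -> i32 {
    let buf = [1i32; 64]; // lives entirely on the stack
    buf.iter().sum()
}
// Heap: Vec must allocate because its length is only known at run time
fn heap_buffer(n: usize) -> i32 {
    let buf = vec![1i32; n]; // allocation plus pointer indirection
    buf.iter().sum()
}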
Avoiding Allocations
// Bad: Allocates on every call
fn process_bad(items: &[i32]) -> Vec<i32> {
items.iter().map(|x| x * 2).collect()
}
// Good: Reuse buffer
fn process_good(items: &[i32], output: &mut Vec<i32>) {
output.clear();
output.extend(items.iter().map(|x| x * 2));
}
// Good: Pre-allocate
fn process_preallocated(items: &[i32]) -> Vec<i32> {
let mut result = Vec::with_capacity(items.len());
result.extend(items.iter().map(|x| x * 2));
result
}
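A typical way to use the buffer-reusing variant is to allocate once outside a loop and let each call overwrite the previous contents (a sketch building on process_good above; batches stands in for your input stream):
fn process_batches(batches: &[Vec<i32>]) {
    let mut output = Vec::new(); // allocated once, capacity grows as needed
    for batch in batches {
        process_good(batch, &mut output); // clear + refill, capacity is reused
        // ... consume `output` here before the next iteration overwrites it
    }
}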
String Optimization
// Bad: Multiple allocations
fn build_string_bad(parts: &[&str]) -> String {
let mut result = String::new();
for part in parts {
result.push_str(part);
result.push(' ');
}
result
}
// Good: Pre-calculate size
fn build_string_good(parts: &[&str]) -> String {
let total_len: usize = parts.iter().map(|s| s.len() + 1).sum();
let mut result = String::with_capacity(total_len);
for part in parts {
result.push_str(part);
result.push(' ');
}
result
}
// Best: Use join when applicable
fn build_string_best(parts: &[&str]) -> String {
parts.join(" ")
}
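For formatted output, writing into an existing String with write! avoids the temporary allocation that format! would create on every iteration (a small sketch using std::fmt::Write; the capacity estimate is a guess):
use std::fmt::Write;
fn build_report(values: &[i32]) -> String {
    let mut out = String::with_capacity(values.len() * 8); // rough size estimate
    for v in values {
        // write! appends in place; format! would allocate a new String each time
        let _ = write!(out, "{v},");
    }
    out
}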
Small String Optimization with SmartString
use smartstring::alias::String as SmartString;
// SmartString stores small strings inline (no heap allocation)
let small: SmartString = "hello".into(); // Stored inline
let large: SmartString = "this is a much longer string".into(); // Heap allocated
Iterator Optimization
Lazy Evaluation
// Iterators are lazy - no work until consumed
let numbers = vec![1, 2, 3, 4, 5];
// This does nothing yet
let iter = numbers.iter()
.map(|x| {
println!("Processing {}", x);
x * 2
})
.filter(|x| x > &4);
// Work happens here
let result: Vec<_> = iter.collect();
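Laziness also enables early exit: combined with an adapter like find, only as many elements as needed are ever processed (a small sketch):
fn first_large_square(numbers: &[i32]) -> Option<i32> {
    numbers.iter()
        .map(|x| x * x)
        .find(|&sq| sq > 100) // stops at the first match; later elements are never squared
}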
Avoiding Intermediate Collections
// Bad: Creates intermediate Vec
fn sum_of_squares_bad(numbers: &[i32]) -> i32 {
let squared: Vec<i32> = numbers.iter().map(|x| x * x).collect();
squared.iter().sum()
}
// Good: Chain iterators
fn sum_of_squares_good(numbers: &[i32]) -> i32 {
numbers.iter().map(|x| x * x).sum()
}
Iterator vs Loop
// With optimizations enabled, both typically compile to the same assembly
// Iterator style
fn sum_iter(data: &[i32]) -> i32 {
data.iter().sum()
}
// Loop style
fn sum_loop(data: &[i32]) -> i32 {
let mut sum = 0;
for &x in data {
sum += x;
}
sum
}
SIMD and Parallelism
Auto-vectorization
// The compiler can auto-vectorize this
fn multiply_add(a: &[f32], b: &[f32], c: &mut [f32]) {
for i in 0..a.len() {
c[i] = a[i] * b[i] + c[i];
}
}
// Asserting equal lengths up front lets the compiler drop per-iteration bounds checks
fn multiply_add_optimized(a: &[f32], b: &[f32], c: &mut [f32]) {
assert_eq!(a.len(), b.len());
assert_eq!(b.len(), c.len());
for i in 0..a.len() {
c[i] = a[i] * b[i] + c[i];
}
}
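The same loop can be written with zipped iterators, which removes the indexing (and its bounds checks) and usually vectorizes just as well; note that zip silently stops at the shortest slice, so the length asserts above are still worth keeping (a sketch):
fn multiply_add_zip(a: &[f32], b: &[f32], c: &mut [f32]) {
    for ((a, b), c) in a.iter().zip(b).zip(c.iter_mut()) {
        *c += a * b;
    }
}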
Rayon for Parallelism
use rayon::prelude::*;
fn parallel_sum(data: &[i32]) -> i32 {
data.par_iter().sum()
}
fn parallel_map(data: &[i32]) -> Vec<i32> {
data.par_iter().map(|x| x * x).collect()
}
// Parallel sort
fn parallel_sort(data: &mut [i32]) {
data.par_sort();
}
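Rayon's work-stealing has per-task overhead, so parallelism only pays off when each unit of work is large enough; for cheap per-element operations you can hint at a minimum chunk size with with_min_len (a sketch, the 4096 threshold is arbitrary):
use rayon::prelude::*;
fn parallel_sum_chunked(data: &[i32]) -> i64 {
    data.par_iter()
        .with_min_len(4096) // avoid splitting the slice into tiny tasks
        .map(|&x| x as i64)
        .sum()
}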
Cache Optimization
flowchart TB
subgraph "Memory Hierarchy"
R[Registers<br/>~1 cycle]
L1[L1 Cache<br/>~4 cycles]
L2[L2 Cache<br/>~12 cycles]
L3[L3 Cache<br/>~40 cycles]
M[Main Memory<br/>~200 cycles]
end
R --> L1 --> L2 --> L3 --> M
style R fill:#c8e6c9
style L1 fill:#c8e6c9
style M fill:#ffcdd2
Data Layout
// Array of Structs (AoS): every field of a particle is interleaved, so touching
// only some fields still drags the rest into cache
struct ParticleAoS {
x: f32,
y: f32,
z: f32,
mass: f32,
velocity_x: f32,
velocity_y: f32,
velocity_z: f32,
}
// Struct of Arrays (SoA): each field is contiguous, better cache use when
// fields are processed separately
struct ParticlesSoA {
x: Vec<f32>,
y: Vec<f32>,
z: Vec<f32>,
mass: Vec<f32>,
velocity_x: Vec<f32>,
velocity_y: Vec<f32>,
velocity_z: Vec<f32>,
}
// When you need all fields together, AoS is better
// When you process fields separately, SoA is better
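For example, a position update on the SoA layout streams through six tightly packed arrays and never touches mass, whereas the AoS layout would pull every field of every particle into cache (a sketch using the struct above; dt is the time step):
impl ParticlesSoA {
    fn integrate(&mut self, dt: f32) {
        // Only the position and velocity arrays are read; `mass` stays cold
        for (x, vx) in self.x.iter_mut().zip(&self.velocity_x) {
            *x += vx * dt;
        }
        for (y, vy) in self.y.iter_mut().zip(&self.velocity_y) {
            *y += vy * dt;
        }
        for (z, vz) in self.z.iter_mut().zip(&self.velocity_z) {
            *z += vz * dt;
        }
    }
}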
Sequential Access
// Bad: Random access pattern
fn sum_random(data: &[i32], indices: &[usize]) -> i32 {
indices.iter().map(|&i| data[i]).sum()
}
// Good: Sequential access pattern
fn sum_sequential(data: &[i32]) -> i32 {
data.iter().sum()
}
Common Optimizations
Avoiding Bounds Checks
// With bounds checks (safe, slightly slower)
fn sum_checked(data: &[i32]) -> i32 {
let mut sum = 0;
for i in 0..data.len() {
sum += data[i]; // Bounds check on each access
}
sum
}
// Iterator (no bounds checks, compiler knows bounds)
fn sum_iterator(data: &[i32]) -> i32 {
data.iter().sum()
}
// Unsafe (no bounds checks, use carefully)
fn sum_unchecked(data: &[i32]) -> i32 {
let mut sum = 0;
for i in 0..data.len() {
// SAFETY: i is always < data.len()
sum += unsafe { *data.get_unchecked(i) };
}
sum
}
Inline Hints
// Suggest inlining
#[inline]
fn small_function(x: i32) -> i32 {
x + 1
}
// Force inlining
#[inline(always)]
fn critical_function(x: i32) -> i32 {
x * 2
}
// Prevent inlining
#[inline(never)]
fn large_function(x: i32) -> i32 {
// ... lots of code
x
}
Branch Prediction Hints
// Help the compiler with likely branches
#[cold]
fn handle_error() {
// This is rarely called
}
fn process(value: Option<i32>) -> i32 {
match value {
Some(v) => v * 2, // Common case
None => {
handle_error();
0
}
}
}
Performance Checklist
Before Optimizing:
- ✅ Code is correct and tested
- ✅ Profiled to identify hot spots
- ✅ Measured baseline performance
Optimization Strategies:
- 🔄 Reduce allocations (reuse buffers, pre-allocate)
- 🔄 Use iterators instead of intermediate collections
- 🔄 Consider data layout (SoA vs AoS)
- 🔄 Enable LTO for release builds
- 🔄 Use sort_unstable when stability is not needed
- 🔄 Consider parallelism with Rayon
After Optimizing:
- ✅ Verify correctness (run tests)
- ✅ Measure improvement
- ✅ Document why optimization was needed
Tools Summary
| Tool | Purpose | Platform |
|---|---|---|
| Criterion | Benchmarking | Cross-platform |
| flamegraph | CPU profiling | Cross-platform |
| perf | System profiling | Linux |
| Instruments | Full profiling | macOS |
| heaptrack | Heap analysis | Linux |
| cargo-bloat | Binary size | Cross-platform |
See Also
- Memory Layout - Understanding data layout
- Example Code
Next Steps
Learn about Memory Layout.