Rust: Iterators are not faster; Anyway, you should use them

June 15, 2024 · 6 min read

Senior Forensic Analyst

Since some time, I'm wondering whether using iterators in Rust has some benefits over traditional for loops.

The idea

Consider the following loop:

let mut sum = 0;
for i in 1..=100 {
    sum += i;
}

which expands during execution to

let mut sum = 0;
sum += 1;
sum += 2;
sum += 3;
// ...

The way we expressed the sum also defines in which order the summation has to be done. But, to be honest, we are not interested in the concrete order. In fact, the single summations could be done on multiple cores simultaneously.

An alternative way of implementing the above algorithm using iterators could be

let sum: u32 = (1..=100).sum();

Can this be faster? I don't know. So, let's use a more complicated example and check it out

The test case

To test which approach is faster, we need to prevent the compiler to optimize too simple things. For example, the above code using the for loop compiles to

    xor     eax, eax
    mov     ecx, 100
    xor     edx, edx
.LBB45_1:
    mov     esi, edx
    lea     edx, [rsi, +, 1]
    cmp     esi, 100
    cmovae  edx, ecx
    add     eax, esi
    cmp     esi, 99
    ja      .LBB45_3
    cmp     edx, 101
    jb      .LBB45_1
.LBB45_3:
    ret

while the iterator based approach compiles to

    mov     eax, 5050
    ret

Of course, this results from the fact that the compiler knows a lot about the code to be optimized. So, again, we need a more complicated code.

We'll do the following:

iterate through the first 1_000_000 integers, stored as String
convert the String into u64
filter out numbers which are greater or equal to 100_000
filter out numbers which are no prime numbers
calculate the sum of those numbers

We use the following implementations:

Loop approach

#[inline(never)]
fn sum1<'s>(data: impl Iterator<Item = &'s String>) -> u64 {
    let mut s = 0;
    for i in data {
        if let Ok(i) = i.parse::<u64>() {
            if i < 100_000 {
                if i.is_prime() {
                    s += i;
                }
            }
        }
    }
    s
}

Iterator approach

#[inline(never)]
fn sum2<'s>(data: impl Iterator<Item = &'s String>) -> u64 {
    data.filter_map(|s| s.parse::<u64>().ok())
        .filter(|i| i < &100_000)
        .filter(|i| i.is_prime())
        .sum()
}

Common code

To implement the is_prime method, I use the following simple code, which is using the primes crate:

trait IsPrime {
    fn is_prime(&self) -> bool;
}

impl IsPrime for u64 {
    fn is_prime(&self) -> bool {
        primes::is_prime(*self)
    }
}

In addition, I generate a static data array NUMBERS:

lazy_static! {
    static ref NUMBERS: Vec<String> = (1..1_000_000).map(|i| format!("{i}")).collect();
}

Benchmarking code

To test the speed of the above code, I'm using cargo bench with the following test code

#[cfg(test)]
mod tests {
    use super::*;
    use test::Bencher;

    #[bench]
    fn test_sum1(b: &mut Bencher) {
        b.iter(|| assert_eq!(sum1(NUMBERS.iter()), 454396537))
    }

    #[bench]
    fn test_sum2(b: &mut Bencher) {
        b.iter(|| assert_eq!(sum2(NUMBERS.iter()), 454396537))
    }
}

Benchmarking results

Running the benchmark yields the following result:

test tests::test_sum1 ... bench:  25,637,873.10 ns/iter (+/- 1,921,695.73)
test tests::test_sum2 ... bench:  25,639,836.30 ns/iter (+/- 1,405,975.61)

OMG, what has happened???!!? Both approaches are equally performant. There seems to be no important difference between using a loop or iterator.

That's our first result: It doesn't matter whether you use traditional loops or iterators. You can expect both to be similar fast.

Iterators on steroids

Let's change out iterator based sum function a little bit: replacing Iterator by ParallelIterator, which is a trait from the rayon crate. rayon allows you to do anything in parallel what you can do with normal iterators. So, if you have some code which is based on using iterators, you can simply switch to parallel iterators.

We add an additional sum function:

#[inline(never)]
fn sum3<'s>(data: impl ParallelIterator<Item = &'s String>) -> u64 {
    data.filter_map(|s| s.parse::<u64>().ok())
        .filter(|i| i < &100_000)
        .filter(|i| i.is_prime())
        .sum()
}

Be aware that the only difference to sum2 is the trait ParallelIterator. No other code change is necessary.

Also, we need a benchmarking function:

#[bench]
fn test_sum3(b: &mut Bencher) {
    b.iter(|| assert_eq!(sum3(NUMBERS.par_iter()), 454396537))
}

As you can see, there is another difference: to create a ParallelIterator, we use the function par_iter() instead of iter(). So, you have the full control whether you want to use parallel iterators or not.

Benchmarking parallel iterators

The benchmarking result (on a system with 4 virtual CPUs) is very informative:

test tests::test_sum1 ... bench:  25,621,303.60 ns/iter (+/- 1,528,567.84)
test tests::test_sum2 ... bench:  25,752,965.60 ns/iter (+/- 2,927,367.52)
test tests::test_sum3 ... bench:  11,761,348.20 ns/iter (+/- 2,472,399.91)

We can conclude:

using the parallel iterator in our test case is significantly faster
migrating the code from traditional iterators to parallel iterators is simple

Iterators as design pattern

Using iterators is an implementation of the Pipes-and-Filters design pattern:

This approach has a lot of practical benefits:

it is easy to
- insert a processor into the pipeline
- remove one processor from the pipeline
- change the order of processors in the pipeline
every single processor is independant of the other, which allows to distribute them over multiple cores or even machines
it can be easier to read (if you're familiar with the concept)

Conclusion

It doesn't matter if you use traditional loops or iterators. There will be no significant performance impact. But, if you want to be able to do significant changes, such as distributing loop iterations over multiple CPUs, without breaking your code, use iterators.

One more thing

It is very important to implement you algorithm in the right way. But it is more important to understand the algorithm in detail, and to keep your implementation as neat to that as possible.

For example, let's change sum1 by inserting a break:

#[inline(never)]
fn sum1<'s>(data: impl Iterator<Item = &'s String>) -> u64 {
    let mut s = 0;
    for i in data {
        if let Ok(i) = i.parse::<u64>() {
            if i >= 100_000 {
                break;
            }
            if i.is_prime() {
                s += i;
            }
        }
    }
    s
}

and again benchmarking:

test tests::test_sum1 ... bench:  11,603,891.70 ns/iter (+/- 742,577.75)
test tests::test_sum2 ... bench:  25,899,039.20 ns/iter (+/- 1,916,178.50)
test tests::test_sum3 ... bench:  11,899,570.60 ns/iter (+/- 5,477,771.27)

What has happened? Clearly, it doesn't make sense to iterate through 1_000_000 numbers, while we're only interested in the first 100_000. Because when using a loop we know the order of the execution, we can abort as soon as we crossed the border to uninteresting numbers (those above 100_000). In our test case, this has a similar effect to the performance such as using parallel iterators.

So, again: First you need to completely understand your problem, choose the best matching algorithm, and implement it correctly. Then, you can optimize.

The idea​

The test case​

Loop approach​

Iterator approach​

Common code​

Benchmarking code​

Benchmarking results​

Iterators on steroids​

Benchmarking parallel iterators​

Iterators as design pattern​

Conclusion​

One more thing​