The previous benchmark function had a few flaws. First of all, it wasn't
idiomatic Rust, because we used a loop construct that you would expect in C.
Revamped that by using an iterator. Also, the previous benchmark got heavily
optimized by the compiler, which unrolled the inner loop it into a huge sequence
of consecutive loads and stores, resulting in lots of instructions that needed
to be fetched from DRAM. Additionally, instruction caching was not turned on.
The new code compiles into two tight loops, fully leveraging the power of the I
and D caches, and providing an great showcase.