Does the order in which you run two 'for' loops matter?


Discussion Overview

The discussion revolves around whether the order of nested 'for' loops affects the performance of a script, particularly in relation to array traversal in programming. It explores concepts of memory storage, optimization techniques, and the impact of cache behavior on execution time.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Exploratory

Main Points Raised

  • Some participants propose that the order of loops does not matter in terms of the total number of cells processed, as both approaches involve iterating through the same number of elements.
  • Others argue that memory storage order (row major vs. column major) can influence performance, suggesting that accessing neighboring elements in memory may yield a CPU performance boost.
  • A participant mentions that using a one-dimensional array with a specific access method can provide speed advantages over nested loops, although this may depend on the programming language and specific requirements for accessing row and column data.
  • One post highlights that optimization techniques can lead to unexpected performance results, recommending timing tests for different approaches if speed is a priority.
  • There is a discussion about the role of the instruction cache and data cache in performance, noting that accessing consecutive memory locations can enhance efficiency due to cache line behavior.
  • Another participant points out that compiler optimizations may also play a role, particularly when processing elements in memory order, potentially reducing indexing overhead.
  • It is mentioned that the ability of the compiler to utilize vector instructions may also be affected by the order of loop execution.

Areas of Agreement / Disagreement

Participants express a range of views on the impact of loop order on performance, with no clear consensus reached. Some agree on the importance of memory access patterns, while others emphasize the theoretical equivalence of loop execution order.

Contextual Notes

Limitations include varying definitions of performance based on specific programming languages, the potential influence of compiler optimizations, and the specific characteristics of the hardware being discussed.

Eclair_de_XII
TL;DR
For example, suppose I wanted to run through a table with m rows and n columns. Is it faster to check the values of the table row-wise or column-wise?
It's a simple question, and I feel like the order in which you run two loops would not matter with regard to how long the script takes to run. You'd still have to run through ##m \cdot n## cells no matter the order. If you check column-wise, you check all m entries in each of the n columns; similarly, if you check row-wise, you check all n entries in each of the m rows. In other words, I feel it would take the same amount of time to run the code regardless of the order in which you nest one 'for' loop within another, for the sole reason that multiplication is commutative.
 
Your array is stored in memory in either row-major or column-major order, depending on the programming language. Consequently, you might get a CPU performance boost by processing successive elements of the array when they are neighbors in memory.
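To make the point concrete, here is a minimal C sketch (the function names are mine, not from the thread): C stores arrays in row-major order, so the first version's inner loop walks neighboring addresses, while the second jumps a whole row's width between accesses. Both compute the same sum.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 256
#define COLS 256

/* Sum the matrix touching memory contiguously: C stores a[i][j]
 * row by row, so the inner loop walks neighboring addresses. */
double sum_row_wise(double a[ROWS][COLS]) {
    double s = 0.0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            s += a[i][j];
    return s;
}

/* Same arithmetic, but the inner loop jumps COLS * sizeof(double)
 * bytes between accesses -- a cache-unfriendly stride in C. */
double sum_col_wise(double a[ROWS][COLS]) {
    double s = 0.0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            s += a[i][j];
    return s;
}
```

On arrays too large for the caches, the row-wise version is typically the faster of the two on cached hardware, though only timing on your own machine settles it.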
 
As jedishrfu says, maybe. One way of making an M×N array is to define a one-dimensional array with MN elements and provide some method of accessing the first N elements as (1,m), the next N elements as (2,m), and so on. You can do this explicitly in languages like C and C++. In that case you get a slight speed boost from having one loop that runs MN times instead of nested loops, because you don't have to do the inner loop initialisation and don't have to translate (n,m) into n×N+m.

But sometimes you need to know the row and column location, or maybe the language doesn't let you play games with array structures and pointers. In that case you have to use two loops, and the order of the loops probably doesn't matter.
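A sketch of the flattened-array idea in C (the Matrix type and helper names are illustrative, not from the thread): element (row, col) lives at flat index row·ncols + col, and a whole-array operation can then use a single loop over all M·N cells with no per-element index translation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* An M x N matrix stored as one flat block of M*N doubles.
 * Element (row, col) lives at index row * ncols + col. */
typedef struct {
    size_t nrows, ncols;
    double *data;
} Matrix;

Matrix mat_alloc(size_t nrows, size_t ncols) {
    Matrix m = { nrows, ncols, calloc(nrows * ncols, sizeof(double)) };
    return m;
}

double mat_get(const Matrix *m, size_t row, size_t col) {
    return m->data[row * m->ncols + col];
}

void mat_set(Matrix *m, size_t row, size_t col, double v) {
    m->data[row * m->ncols + col] = v;
}

/* A single loop over all M*N cells: no inner-loop setup and no
 * row * ncols + col arithmetic per element. */
double mat_sum(const Matrix *m) {
    double s = 0.0;
    size_t total = m->nrows * m->ncols;
    for (size_t k = 0; k < total; k++)
        s += m->data[k];
    return s;
}
```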
 
The wiki article covers this and notes which languages are row major or column major:

https://en.wikipedia.org/wiki/Row-_and_column-major_order

In the case of hardware, when a fixed-size matrix, such as an n-row by 512-byte-column matrix, needs to be efficient in both row and column order, the memory addressing is set up so that both row-major and column-major order end up as column access order for the RAM.
 
Optimization techniques can cause unexpected timing results. I once thought that I was being smart by traversing an array of data first one way, then backward, alternating after that. My idea was that the most recently used data would still be in memory when the algorithm turned around. It ran slower. Optimization techniques include "look-ahead" retrieval of data. I recommend that you conduct timing tests of the alternative approaches if speed is an important issue. And if speed is important, select the highest compiler optimization level that you can.
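In that spirit, a minimal C timing harness, assuming clock() from <time.h> is precise enough for your workload (for short workloads, repeat the call and take the best of several runs; the function names are mine):

```c
#include <assert.h>
#include <time.h>

/* Time one call of fn() in CPU seconds.  clock() measures processor
 * time, not wall time; resolution varies by platform. */
double time_call(double (*fn)(void), double *result) {
    clock_t t0 = clock();
    double r = fn();
    clock_t t1 = clock();
    if (result)
        *result = r;  /* return the value so the work is not dead code */
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

/* A toy workload: sum the integers 0 .. 999999. */
double workload(void) {
    double s = 0.0;
    for (long i = 0; i < 1000000; i++)
        s += (double)i;
    return s;
}
```

Compare the two candidate traversal orders by passing each one to time_call and looking at the returned times rather than guessing.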
 
The idea of neighboring values speeding up your computation is due to the instruction cache used by some processors anticipating the locations you will access next.
 
jedishrfu said:
The idea of neighboring values speeding up your computation is due to the instruction cache used by some processors anticipating the locations you will access next.
It's really the data cache. The main effect is that data is moved from main memory to the caches in cache lines.
A common size of a cache line on 64 bit processors is 64 bytes.
If you do not read from consecutive locations, you might use only one double-precision number (8 bytes) from each 64-byte cache line you read, so traversing a large block of memory could be up to 8 times slower.

If you have only a small array that fits in the smallest data cache, it might not matter however.
A 32k level one cache can contain a 64x64 array of double precision values.

If you do more calculations with each array value, the extra delay might also disappear. There is a data prefetcher that can recognize regular access patterns and load your data into the cache before you need it.
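The arithmetic above can be sketched in C, assuming 64-byte cache lines, a 32 KiB L1 data cache, and 8-byte doubles (typical but platform-dependent values):

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64        /* a common line size on 64-bit CPUs */
#define L1_DATA_BYTES (32 * 1024)  /* a typical level-one data cache */

/* How many doubles one cache-line fill delivers: 64 / 8 = 8,
 * hence the worst-case factor of 8 for strided access. */
size_t doubles_per_line(void) {
    return CACHE_LINE_BYTES / sizeof(double);
}

/* Side length of the largest square array of doubles that fits
 * entirely in the L1 data cache: 64 x 64 x 8 bytes = 32 KiB. */
size_t max_square_side(void) {
    size_t n = 0;
    while ((n + 1) * (n + 1) * sizeof(double) <= L1_DATA_BYTES)
        n++;
    return n;
}
```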
 
Another potential factor is optimization performed by the compiler.

If you process the elements in the order they occur in memory, the compiler is more likely to recognize that you are treating the array as one-dimensional. The result could be the saving of some of the array indexing arithmetic and the availability of one more register.

But this would be a very minor effect.
 
It can also affect whether the compiler can use vector instructions (e.g. SSE).
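As an illustration (the axpy name is mine), a unit-stride loop of the kind auto-vectorizers handle well:

```c
#include <assert.h>
#include <stddef.h>

/* y[i] += a * x[i] over contiguous memory.  Unit-stride loops like
 * this are the pattern auto-vectorizers turn into SIMD instructions
 * (e.g. SSE/AVX on x86); a column-wise, strided version of the same
 * update is much harder for the vectorizer. */
void axpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

With GCC or Clang, compiling at -O2/-O3 and inspecting the assembly (or using -fopt-info-vec with GCC) shows whether the loop was actually vectorized.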
 
