Sky Watch

Impact of cache prefetching

I am not always interested in the low-level nonsense. But when I am, I write a blog post about it.

A while ago a colleague of mine had a discussion with me about our data structure. I was considering using lists instead of arrays for a matrix-like object, and he was objective about it. Eventually he convinced me by reasoning that cache prefetching is advantageous which is very hard to achieve with a linked list. I believed him and put the idea aside. But I knew I would not be really convinced without any experimental data.

So I wrote this quick program (part of it):

int main()
    const unsigned long NElements = 100000000;
    const unsigned int NRuns = 100;

    const unsigned int SizeInMB = NElements * sizeof(Type) / 1024 / 1024;
    std::cout << "Size of structure: " << SizeInMB << "MB" << std::endl;

    unsigned long* Indices = new unsigned long[NElements];

    // Reference in order
    for(unsigned long i = 0; i < NElements; i++)
        Indices[i] = i;
    Measurement Result = test(NElements, NRuns, Indices);
    std::cout << "Time of reference in order: " << Result.Value
              << " ± " << Result.Error << "ms ("
              << NRuns << " samples)" << std::endl;

    // Reference at random
    for(unsigned long i = 0; i < NElements; i++)
        Indices[i] = randInt(0, NElements - 1);
    Result = test(NElements, NRuns, Indices);
    std::cout << "Time of reference at random: " << Result.Value
              << " ± " << Result.Error << "ms ("
              << NRuns << " samples)" << std::endl;

    delete[] Indices;
    return 0;

I wanted to see how a loop over array elements can benefit from cache prefetching. First the program loops over an array in an ordered fashion, and then refers to elements in another array randomly. Since random number generation may be slow, I generate the random sequence of array indices before the test. The test function is just the loop that should be measured,

Measurement test(const unsigned long n_data, const unsigned long n_run,
                 const unsigned long* indices)
    Type* Data = new Type[n_data];
    unsigned long Times[n_run];
    timeval Begin, End;
    for(unsigned int Run = 0; Run < n_run; Run++)
        Type a;
        gettimeofday(&Begin, 0);
        for(unsigned long i = 0; i < n_data; i++)
            a += Data[indices[i]];
        gettimeofday(&End, 0);
        Times[Run] = (End.tv_sec - Begin.tv_sec) * 1000000 +
            (End.tv_usec - Begin.tv_usec);
        // Explicit side effect so that the loop will not be optimized
        // out by the compiler.
        std::cerr << a;
    delete[] Data;

    double Avg = average(Times, n_run);
    double Err = stdDev(Times, n_run, Avg);

    Measurement Result;
    Result.Value = Avg / 1000.0;
    Result.Error = Err / 1000.0;
    return Result;

On my 2011 Macbook Pro, the test results are

Size of structure: 381MB
Time of reference in order: 107.156 ± 18.2239ms (100 samples)
Time of reference at random: 2204.8 ± 124.422ms (100 samples)

These numbers, my friends, is a textbook example of what are “significantly different”. And also kids, a measurement without error is not meaningful in anyway; remember that. I compiled the program with clang++ -O2. I got similar results with -O1 and no optimization, as well as with GCC on Linux.

UPDATE: I made a statistical mistake in my program. I should have used standard error instead of standard deviation as the error estimator of the average run time. As a result the error on those numbers should be smaller by a factor of 10.