Make it work, Profile, Optimise. Simple rule that is so hard to follow. In my project I need to carry out a lot of floating point arithmetic operations along the critical path, so I decided to optimise my program with SIMD (Single Instruction Multiple Data) extensions. That's when a processor simulates vector processor operations by sticking 4 32-bit values into one 128-bit register and allows you to carry out operations on all 4 values at once.
SSE2 support is not easy to integrate into a program, especially if you have never worked with SSE2 instructions. The function names were so confusing that I had to create my own wrapper. The SIMD shuffle instruction deserves a separate mention: it took me at least an hour to understand how it works. A shuffle operation is a way of rearranging items inside a SIMD register.
Shuffle instruction looks something like this:
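float a[4] = {1, 2, 3, 4};
float b[4] = {5, 6, 7, 8};
__m128 sa = _mm_loadu_ps( a );
__m128 sb = _mm_loadu_ps( b );
// shufps needs an XMM register as its destination, so go through xmm0/xmm1
// (inline __asm is only available in 32-bit VC++ builds)
// sa ends up as {1, 2, 7, 8}
__asm {
    movups xmm0, sa
    movups xmm1, sb
    shufps xmm0, xmm1, 0xe4
    movups sa, xmm0
}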
Note that the third operand is an immediate, i.e. you can't stick a variable in there. Basically, the operation shuffles the items in a and b according to an 8-bit mask. Let's take the mask in my code as an example.
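0xe4 corresponds to 11100100 in binary (number 228). It is more convenient to view this string in bit pairs: 11 10 01 00. Each pair is an index (0 to 3) into one of the source registers. Starting from the least significant bits, this mask is interpreted as follows:
    bits 1:0 pick the item of the first operand that goes into position 0 of the result
    bits 3:2 pick the item of the first operand that goes into position 1 of the result
    bits 5:4 pick the item of the second operand that goes into position 2 of the result
    bits 7:6 pick the item of the second operand that goes into position 3 of the result
I.e. in my example 0xe4 corresponds to:
    00 -> result[0] = a[0] = 1
    01 -> result[1] = a[1] = 2
    10 -> result[2] = b[2] = 7
    11 -> result[3] = b[3] = 8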
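If the bit fiddling is easier to follow as code, here is a minimal scalar sketch of what the shuffle computes. It does not use SSE at all; `Vec4` and `shuffle_model` are names I made up for illustration, not part of any intrinsics header.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;

// Hypothetical helper: a plain C++ model of the shufps lane selection.
// Every 2-bit field of the mask is an index (0..3) into a source vector.
Vec4 shuffle_model(const Vec4& a, const Vec4& b, unsigned mask) {
    // The real instruction takes an 8-bit immediate, so reject anything larger.
    assert(mask <= 0xFFu);

    Vec4 r{};
    r[0] = a[(mask >> 0) & 3];  // bits 1:0 index into the first operand
    r[1] = a[(mask >> 2) & 3];  // bits 3:2 index into the first operand
    r[2] = b[(mask >> 4) & 3];  // bits 5:4 index into the second operand
    r[3] = b[(mask >> 6) & 3];  // bits 7:6 index into the second operand
    return r;
}
```

With a = {1, 2, 3, 4} and b = {5, 6, 7, 8}, calling `shuffle_model(a, b, 0xe4)` gives {1, 2, 7, 8}, while a mask of 0xe8 (bit pairs 11 10 10 00) gives {1, 3, 7, 8}.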
It took me a while to figure this out. Luckily, you don't need to work out the mask by hand; it can be generated for you using a macro (VC++):
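float a[4] = {1, 2, 3, 4};
float b[4] = {5, 6, 7, 8};
__m128 sa = _mm_loadu_ps( a );
__m128 sb = _mm_loadu_ps( b );
// same result as shufps with mask 0xe4: _MM_SHUFFLE( 3, 2, 1, 0 ) expands to 0xe4
__m128 r = _mm_shuffle_ps( sa, sb, _MM_SHUFFLE( 3, 2, 1, 0 ) );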
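The macro itself is nothing magical: it just packs four 2-bit indices into one byte, highest position first. A small sketch of the same arithmetic is below. `shuffle_mask` is a hypothetical stand-in I wrote for illustration, so in real code you would still use _MM_SHUFFLE from the intrinsics header.

```cpp
// Hypothetical stand-in for _MM_SHUFFLE, written as a constexpr function
// so the arithmetic is visible. Arguments are listed from result position 3
// down to result position 0, and each one must be in the range 0..3.
constexpr unsigned shuffle_mask(unsigned p3, unsigned p2, unsigned p1, unsigned p0) {
    return (p3 << 6) | (p2 << 4) | (p1 << 2) | p0;
}

// Checked at compile time: the argument order mirrors the bit pairs 11 10 01 00.
static_assert(shuffle_mask(3, 2, 1, 0) == 0xe4, "3,2,1,0 packs to 0xe4");
static_assert(shuffle_mask(3, 2, 2, 0) == 0xe8, "3,2,2,0 packs to 0xe8");
```

Since the result is a compile-time constant, it can still be used where the instruction expects an immediate.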
Anyway, using SIMD optimisation made my algorithm run twice as fast but... Then I found the /arch:SSE2 switch inside the project properties (C/C++ -> Code Generation -> Enable Enhanced Instruction Set). Yes, when I removed all my optimisations and simply compiled with this flag, the performance was identical. Damn me getting carried away!
Therefore, never forget: compilers nowadays are often more intelligent than you are. :) It is extremely likely that over-optimising can actually have a negative effect on your program (it will waste your time too). So what was it again? Make it work, Profile, Optimise.
More about SSE and Inline Assembly: if you really care about this sort of stuff I found this post extremely useful. You might find it useful too. :)