Sunday, March 23, 2014

Auto - Vectorization with little help from GCC!

This tutorial helps the programmers to benefit the progress of the auto-vectorization algorithms that are implemented in modern compilers, such as gcc. Before you start playing with the vectorization of your code i assume that you don't have any bottleneck in you code (like dynamic memory allocation etc) in the critical path. In this tutorial we will use the gcc 4.4.1, but the same steps can be applied to newer or older versions. 

First of all there are two issues with auto vectorization: 1) gcc must know the architecture (eg what SIMD instructions are available) 2) The data structures must by properly aligned in memory

The first step is to find the architecture of your processor and point it to gcc using the flags -mtune=... / -march=... you specify the architecture.  For example, my laptop is core2Duo so i put -march=core2. You can find more more information hereThe next problem we must solve is knowledge of memory alignment.  You can help the compiler by allocating everything with alignment at least 16 bytes For example: 

1) Initialized static arrays : 

static const uint8_t key_perm[0x40] __attribute__(( aligned(16) )) = { 0x12,... }; 

2) Try to use even space allocate in stack to be aliment (warning: there is a bug with some gcc versions and when running the gcc on cygwin for windows).

3) For the dynamic allocation you should use: memalign (or posix_memalign ) instead of malloc.  


For steps 1 and 2 note that recent versions of gcc automatically align the arrays in the memory.
For example: 

$ cat test.c 
#define N 200000
int a[N] __attribute__(( aligned(16) ));
int b[N] __attribute__(( aligned(16) ));
int c[N] __attribute__(( aligned(16) ));

double  f2(){    
        int i;   
        for (i = 0; i < N; i++)       
                a[i] = b[i] + c[i];
        }

$ gcc -O3 -ftree-vectorize -march=core2 -ftree-vectorizer-verbose=1 -c ./test.c 
./test.c:8: note: LOOP VECTORIZED../test.c:6: note: vectorized 1 loops in function.

The verbose flag should give some info of what vectorized and what not. Even if the arrays are declared in stack, gcc is able to vectorize the loop. Checking the size of the object file we have:


$ size test.o    text    data     bss     dec     hex filename     94       0       0      94      5e test.o

94 Bytes of text, not so bad. Now lets make the example more complicated. Note that i use gcc-4.4, the behavior of older versions may be different.


double  f2(int *a, int *b, int *c, int N){    
     int i;   
     for (i = 0; i < N; i++)       
         a[i] = b[i] + c[i];
}



Compiler must solve two issues in the above example: first the number of iterations is unknown and second the pointers a,b, and may be to point on the same memory location. Dependency analysis on pointer is also called alias analysis. Compiling the example we are getting this:


$ gcc -O3 -ftree-vectorize -march=core2 -ftree-vectorizer-verbose=3 -c ./test.c 
./test.c:6: note: vect_model_load_cost: unaligned supported by hardware.
./test.c:6: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
./test.c:6: note: vect_model_load_cost: unaligned supported by hardware.
./test.c:6: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
./test.c:6: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
./test.c:6: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
./test.c:6: note: cost model: Adding cost of checks for loop versioning to treat misalignment.
./test.c:6: note: cost model: Adding cost of checks for loop versioning aliasing.
./test.c:6: note: cost model: epilogue peel iters set to vf/2 because loop iterations are unknown .

./test.c:6: note: Cost model analysis:   
Vector inside of loop cost: 6  
Vector outside of loop cost: 20 
Scalar iteration cost: 4  
Scalar outside cost: 1  
prologue iterations: 0  
epilogue iterations: 2  
Calculated minimum iters for profitability: 7
./test.c:6: note:   Profitability threshold = 6
./test.c:6: note: Vectorization may not be profitable.
./test.c:6: note: created 2 versioning for alias checks.
./test.c:6: note: LOOP VECTORIZED.
./test.c:3: note: vectorized 1 loops in function.

$ size test.o    text    data     bss     dec     hex filename    262       0       0     262     106 test.o


As a result, we must have at least 7 iterations and gcc inserts runtime calls. We can help the situation by using the keyword restrict that is part of C99 standard. More information about gcc and restrict keyword you can find here and here.  For example for the same loop you must change to:

double  f2(int   * restrict a, int * restrict b, int *restrict c, int N)


And the compile:
$ gcc -O3 -ftree-vectorize -march=core2 -ftree-vectorizer-verbose=3 -std=c99 -c ./test.c 
./test.c:11: note: vect_model_load_cost: unaligned supported by hardware.
./test.c:11: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
./test.c:11: note: vect_model_load_cost: unaligned supported by hardware.
./test.c:11: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
./test.c:11: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
./test.c:11: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
./test.c:11: note: cost model: prologue peel iters set to vf/2.
./test.c:11: note: cost model: epilogue peel iters set to vf/2 because peeling for alignment is unknown .
./test.c:11: note: Cost model analysis:   
Vector inside of loop cost: 6  
Vector outside of loop cost: 24  
Scalar iteration cost: 4  
Scalar outside cost: 7 
prologue iterations: 2  
epilogue iterations: 2  
Calculated minimum iters for profitability: 5

./test.c:11: note:   Profitability threshold = 4
./test.c:11: note: Vectorization may not be profitable.
./test.c:11: note: LOOP VECTORIZED.
./test.c:6: note: vectorized 1 loops in function.


We are able to decrease the minimum profitability to 5 iterations by disabling the aliasing analysis.


Hard stuff (next level):


1) Eliminate branches in your code: good for vectorization but the impact in x86 is low - branch predictors are working good. You can use build in unlikely to set hints about branches. For manual branch elimination you can try:http://cellperformance.beyond3d.com/articles/2006/04/benefits-to-branch-elimination.html

2) Write some SIMD assembly directly  ( look here and here <http://gcc.gnu.org/onlinedocs/gcc/X86-Built_002din-Functions.html>). Of course the manuals of intel is also a good start (many pages... but there are good :)).