Sunday, March 23, 2014

Auto - Vectorization with little help from GCC!

This tutorial helps the programmers to benefit the progress of the auto-vectorization algorithms that are implemented in modern compilers, such as gcc. Before you start playing with the vectorization of your code i assume that you don't have any bottleneck in you code (like dynamic memory allocation etc) in the critical path. In this tutorial we will use the gcc 4.4.1, but the same steps can be applied to newer or older versions. 

First of all there are two issues with auto vectorization: 1) gcc must know the architecture (eg what SIMD instructions are available) 2) The data structures must by properly aligned in memory

The first step is to find the architecture of your processor and point it to gcc using the flags -mtune=... / -march=... you specify the architecture.  For example, my laptop is core2Duo so i put -march=core2. You can find more more information hereThe next problem we must solve is knowledge of memory alignment.  You can help the compiler by allocating everything with alignment at least 16 bytes For example: 

1) Initialized static arrays : 

static const uint8_t key_perm[0x40] __attribute__(( aligned(16) )) = { 0x12,... }; 

2) Try to use even space allocate in stack to be aliment (warning: there is a bug with some gcc versions and when running the gcc on cygwin for windows).

3) For the dynamic allocation you should use: memalign (or posix_memalign ) instead of malloc.  

For steps 1 and 2 note that recent versions of gcc automatically align the arrays in the memory.
For example: 

$ cat test.c 
#define N 200000
int a[N] __attribute__(( aligned(16) ));
int b[N] __attribute__(( aligned(16) ));
int c[N] __attribute__(( aligned(16) ));

double  f2(){    
        int i;   
        for (i = 0; i < N; i++)       
                a[i] = b[i] + c[i];

$ gcc -O3 -ftree-vectorize -march=core2 -ftree-vectorizer-verbose=1 -c ./test.c 
./test.c:8: note: LOOP VECTORIZED../test.c:6: note: vectorized 1 loops in function.

The verbose flag should give some info of what vectorized and what not. Even if the arrays are declared in stack, gcc is able to vectorize the loop. Checking the size of the object file we have:

$ size test.o    text    data     bss     dec     hex filename     94       0       0      94      5e test.o

94 Bytes of text, not so bad. Now lets make the example more complicated. Note that i use gcc-4.4, the behavior of older versions may be different.

double  f2(int *a, int *b, int *c, int N){    
     int i;   
     for (i = 0; i < N; i++)       
         a[i] = b[i] + c[i];

Compiler must solve two issues in the above example: first the number of iterations is unknown and second the pointers a,b, and may be to point on the same memory location. Dependency analysis on pointer is also called alias analysis. Compiling the example we are getting this:

$ gcc -O3 -ftree-vectorize -march=core2 -ftree-vectorizer-verbose=3 -c ./test.c 
./test.c:6: note: vect_model_load_cost: unaligned supported by hardware.
./test.c:6: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
./test.c:6: note: vect_model_load_cost: unaligned supported by hardware.
./test.c:6: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
./test.c:6: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
./test.c:6: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
./test.c:6: note: cost model: Adding cost of checks for loop versioning to treat misalignment.
./test.c:6: note: cost model: Adding cost of checks for loop versioning aliasing.
./test.c:6: note: cost model: epilogue peel iters set to vf/2 because loop iterations are unknown .

./test.c:6: note: Cost model analysis:   
Vector inside of loop cost: 6  
Vector outside of loop cost: 20 
Scalar iteration cost: 4  
Scalar outside cost: 1  
prologue iterations: 0  
epilogue iterations: 2  
Calculated minimum iters for profitability: 7
./test.c:6: note:   Profitability threshold = 6
./test.c:6: note: Vectorization may not be profitable.
./test.c:6: note: created 2 versioning for alias checks.
./test.c:6: note: LOOP VECTORIZED.
./test.c:3: note: vectorized 1 loops in function.

$ size test.o    text    data     bss     dec     hex filename    262       0       0     262     106 test.o

As a result, we must have at least 7 iterations and gcc inserts runtime calls. We can help the situation by using the keyword restrict that is part of C99 standard. More information about gcc and restrict keyword you can find here and here.  For example for the same loop you must change to:

double  f2(int   * restrict a, int * restrict b, int *restrict c, int N)

And the compile:
$ gcc -O3 -ftree-vectorize -march=core2 -ftree-vectorizer-verbose=3 -std=c99 -c ./test.c 
./test.c:11: note: vect_model_load_cost: unaligned supported by hardware.
./test.c:11: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
./test.c:11: note: vect_model_load_cost: unaligned supported by hardware.
./test.c:11: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
./test.c:11: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
./test.c:11: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
./test.c:11: note: cost model: prologue peel iters set to vf/2.
./test.c:11: note: cost model: epilogue peel iters set to vf/2 because peeling for alignment is unknown .
./test.c:11: note: Cost model analysis:   
Vector inside of loop cost: 6  
Vector outside of loop cost: 24  
Scalar iteration cost: 4  
Scalar outside cost: 7 
prologue iterations: 2  
epilogue iterations: 2  
Calculated minimum iters for profitability: 5

./test.c:11: note:   Profitability threshold = 4
./test.c:11: note: Vectorization may not be profitable.
./test.c:11: note: LOOP VECTORIZED.
./test.c:6: note: vectorized 1 loops in function.

We are able to decrease the minimum profitability to 5 iterations by disabling the aliasing analysis.

Hard stuff (next level):

1) Eliminate branches in your code: good for vectorization but the impact in x86 is low - branch predictors are working good. You can use build in unlikely to set hints about branches. For manual branch elimination you can try:

2) Write some SIMD assembly directly  ( look here and here <>). Of course the manuals of intel is also a good start (many pages... but there are good :)). 

Saturday, March 15, 2014

Turning on the LED at ZedBoard

The ZedBoard has many available leds, but the one of them can be accessed direct when you are in bare metal without any address translation. This tool can be very helpful specially if you debug the first steps of a new operating system and you don't have console output. The led belongs to the General Purpose I/O (GPIO) mechanism.

Here is an example of the code:

    asm volatile (
   "ldr     r7, =0xE000A204\n"
   "ldr     r8, =0x00000080\n"
   "str     r8,[r7],#4\n"
   "str     r8,[r7]\n"

   "ldr     r7, =0xE000A040\n"
   "ldr     r8, [r7]\n"
   "orr     r8, r8, #0x00000080\n"
   "str     r8,[r7]\n"

Of course someone can convert the code to the pure assembly, but again we have to use these addresses. The first part of the code set the register controls whether the IO pin is acting as an input or an output. Since the input logic is always enabled, this effectively enables/disables the output driver. Next we store again the same value 4 bytes later (note the #4). The second store is done in 0xE000A208 and sets output is enabled. The second store on 0xE000A040 controls the value being output when the GPIO signal is configured as an output. In this case i think a simple store with the value 0xFFFFFFFF can also work.

You can find more information about the control registers here.

Porting XEN (3.X) on zedboard

Recently i started working in a new project: porting of an old version of XEN (3.x) with a 2.8.29 linux kernel on the ZedBoard. There are two problems: first only new kernels are able to run in the board (3.4 or newer) and they are provided from Xilinx with modifications. The second problem is that the newest kernels have support for newer XEN version, that requires ARM virtualization. No matter we choose we have to port XEN. This means we have to rewrite some basic support for the architecture. The TEGRA port can be really helpful in this case.

Someone can port the XEN on zedboard with two ways:

1. Using a Samsung PV port for the Tegra board: this requires the porting of the PV kernel in terms of network/terminal drivers. We need to backport the latest Xilinx drivers from 3.8 kernel to 2.8.29. On the good side, we don't need to xenize the linux kernel and it saves a lot of time in terms of debugging XEN.

2. The second choice was to use the latest Xilinx kernel with the old XEN. This approach one problem: the kernel supports XEN but only the latest version. Thus, we have to backport the XEN support for the XEN 3.x version. On the good side, we don't need to spend time for backporting the drivers.

Anyway, I choosed the first option and things seems to go well, but slow. I 'm able to use the terminal, but having issues with the network driver. I also having some issues with the second core, but at least it is a small step to the goal of running. When the project finishes we are going to release the code.

Saturday, March 8, 2014

Static linking with gcc and g++

In this tutorial, we will explain what the static linking is, how this affect the size of final binary, and why statically linking with g++ sometimes is pain. By definition, a statically compiled binary is a group of programmer ‘s routines, external functions, and variables which are packed into the final binary executable. The compiler or the linker produces the final object and embeds all the functions and variables and the linking phase. 

There are two reasons of using dynamic linking and shared libraries: 1) Avoid creating a huge binary, if all the programs use a standard set of libraries why not having the operating system providing to them 2) Compatibility on operating system and machine dependant characteristics: sometimes the libraries must be implemented based on the architecture or the operating system and using dynamic linking is an easy way to avoid this catch. On the other hand, static linking is the ideal way of distributing one software product, paying of course the cost of larger file size compared with the dynamic linking. A programmer can create a binary that has no dependencies in different library editions. New linker options for g++ 4.5 make this information applicable also in g++.

We will use three other different tools except gcc/g++: the nm that list symbols from object files, the ldd that print shared library dependencies and the size that list section sizes and total size.

For this tutorial we will start by using a simple c test file:

:~/toys/c-tests$ cat test3.c 
#include <stdio.h>
#include <stdlib.h>

#define MAX  43
long long unsigned int array[MAX];

int main(int argc, char **argv){
   int i;
   for (i=0;i<MAX;i++)

   for (i=0;i<MAX;i++)
     printf(" %4llu",array[i]);

   return 0;

The file includes one call to rand and one call to printf function. Let's compile the file without any flag:

:~/toys/c-tests$ gcc test3.c 

Now using the nm we can find the undefined functions.

:~/toys/c-tests$ nm -g a.out 
00000000004006e8 R _IO_stdin_used
                 w _Jv_RegisterClasses
0000000000600e30 D __DTOR_END__
0000000000601028 A __bss_start
0000000000601018 D __data_start
0000000000601020 D __dso_handle
                 w __gmon_start__
0000000000400600 T __libc_csu_fini
0000000000400610 T __libc_csu_init
                 U [email protected]@GLIBC_2.2.5
0000000000601028 A _edata
00000000006011b8 A _end
00000000004006d8 T _fini
0000000000400428 T _init
0000000000400480 T _start
0000000000601060 B array
0000000000601018 W data_start
0000000000400564 T main
                 U [email protected]@GLIBC_2.2.5
                 U [email protected]@GLIBC_2.2.5

We can see that the functions __libc_start_main (entry point of the program), printf, and rand are undefined. Next we use the ldd to see the shared files dependencies:

:~/toys/c-tests$ ldd a.out =>  (0x00007fffec5ff000) => /lib/ (0x00007f528d2ec000)
/lib64/ (0x00007f528d65c000)

Finally using the size tool we see that the size of the executable is actually something more that 2KBytes.

:~/toys/c-tests$ size a.out 
   text    data     bss     dec     hex filename
   1363     528     376    2267     8db a.out

Next step is to create a static linking binary. We use information from the GCC linker manual.

:~/toys/c-tests$ gcc -static test3.c 
:~/toys/c-tests$ ldd a.out 
not a dynamic executable

The printout of the nm is much bigger because of libc liking. Unfortunately no the binary is much bigger around 624KBytes:

:~/toys/c-tests$ size a.out 
   text    data     bss     dec     hex filename
 608112    3648   12824  624584   987c8 a.out

To decrease the file size of final binary you can use some linking flags such as -Wl,-dead_strip,-gc-sections  after using -ffunction-sections -fdata-sections compiler flags. However, gc-sections is gcc and architecture depentant so use it carefully. Sometimes it  does not even run! For example:

:~/toys/c-tests$ gcc -static  -ffunction-sections -fdata-sections -Wl,-dead_strip,-gc-sections   test3.c 
/usr/bin/ld: warning: cannot find entry symbol ad_strip; defaulting to 00000000004001d0

A warning... who cares?

:~/toys/c-tests$ size a.out 
   text    data     bss     dec     hex filename
 577481    3392   11016  591889   90811 a.out

Good! We decrease the file size to 577 KBytes!

:~/toys/c-tests$ ./a.out 
Segmentation fault

Oups, it seems that we must be more careful with the warnings! Tip: remove the '-dead_strip' flag!

The story of g++ is a bit different. g++ uses also the libstdc++ except of libc, so a simple -static-libgcc  or -static is not enough. Fortunately a new option introduced in gcc-4.5, the -static-libstdc++ that solves this problem. You can find more information about this issue in this blog, dated from 2005. Let's make the same experiment with the g++ 4.6.
:~/toys/c-tests$ cat test4.cpp 
#include <iostream>
#include <vector>
#include <cstdio>

using namespace std;

int main(){
  vector<int> coll;    // vector container for int

  // append elements 

  for (int i=1; i<=6; ++i) {

  // print all elements 

  for (int i=0; i<coll.size(); ++i) {
    cout << coll[i] << ' ';
  cout << endl;  

Let's  try again to see if we are missing any symbol:

:~/toys/c-tests$ g++-4.6 test4.cpp 
:~/toys/c-tests$ nm a.out | grep " U "
                 U _Unwind_Resume
                 U _ZNSolsEPFRSoS_E
                 U _ZNSolsEi
                 U _ZNSt8ios_base4InitC1Ev
                 U _ZNSt8ios_base4InitD1Ev
                 U _ZSt17__throw_bad_allocv
                 U _ZSt20__throw_length_errorPKc
                 U _ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_
                 U _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_c
                 U _ZdlPv
                 U _Znwm
                 U __cxa_atexit
                 U __cxa_begin_catch
                 U __cxa_end_catch
                 U __cxa_rethrow
                 U __gxx_personality_v0
                 U __libc_start_main
                 U memmove
                 U puts

Many, let's size the size:

:~/toys/c-tests$ size a.out 
   text   data    bss    dec    hex filename
   8711    712    304   9727   25ff a.out

8,7 KB is a nice number, but what happens if we statically link the binary? In latest versions of g++ usually only "--static" flag is necessary:

:~/toys/c-tests$ g++-4.6 test4.cpp --static
:~/toys/c-tests$ size   a.out 
   text   data    bss    dec    hex filename
1313651  10672  94441 1418764 15a60c a.out

Huge! Lets see if we decrease a bit the size. We are going to use the '-gc-sections' of the linker. 

:~/toys/c-tests$ g++-4.6 test4.cpp --static  -Wl,-gc-sections,--print-gc-sections    -ffunction-sections -fdata-sections 2>&1 | grep removing | wc -l


We removed 2585 symbols not bad. Lets calculate the size of the binary:

:~/toys/c-tests$ size a.out 
   text   data    bss    dec    hex filename

1094767  10392  85673 1190832 122bb0 a.out

We saved 219 KB. It is not munch, but it is the best that we can do.

A new blog... a new start.

I 'm working as storage systems engineer at OnApp. I'm conducting research (eg. playing) in ARM micro-server architectures in the topic of the virtualization (XEN). I 'm also working on different small projects when i have time, mostly systems engineering. This blog contains short stories of systems development as some experiences from kernel hacking and research community. Basically i post whatever i'm working... even useless :)