Today I'll be explaining about two well used macros in the kernel for better branch prediction via gcc.
but first of all lets starts from the basics and fundamentals, here is a short refresh. As we may know each instruction in c is translated to assembler language, as I have mentioned in the past post Get familiar with gcc compiler, few years ago.
Each c instruction is translated into assembler instructions,
which pushed into the pipeline. I'll elaborate more about the meaning of pipeline:
Pipeline is processor's mechanism for executing the instructions in parallel.
Moreover as more stages there would be (In the picture I drew, n = 4), we would increase the parallelization property.
on each CPU cycle we shift to the right (the next chain) the current instruction.
So via the parallelization we speed up the execution by fetching the next instruction while the other instruction are getting decoded and executed.
if the pipeline is full, on each cycle tick an instruction will get executed.
of course the number of stages depends on the architecture, for example:
On my BeagleBoard xM (ARM Cortex-A8) implements ARM v7 (32-bit) instruction set architecture consist of 5 stages.
So probably you are asking yourself so how come there are no more stages in nowadays cpu's core. Well although the level of parallelism increases there are few drawbacks which I'll be discuss with you now:
The actual time (latency) to execute the instruction would get larger as we add more stages cause in more cycles we would fill the pipeline.
First instruction: INC_REGISTER R1
Second instruction: INC_REGISTER R1
I'll demonstrate here let's say register R1 consists the value 0x4447, after 3 cycles the value would be 0x4448, and then in the next cycle,
the second instruction gets executed and would still hold the value 0x4448 since the initial value was 0x4447 for the second instruction
too. So my conclusion was we/compiler should avoid instructions which have dependencies from the previous cycle.
In case we have a branch and the condition is satisfied the consecutive instructions are already preloaded into the pipeline we will execute those instructions, but eventually after few loop iterations we would fail on the condition branch, so we should get rid of all the instructions which were
loaded into the pipe, and fetch all new sequential instructions which appear along our new flow.
For getting rid of the irrelevant instructions you
the core flushes those instructions. This kind of operation of changing the program counter unpredictably can easily reduce the performance of the processor.
So now after I clarified the third problematic scenario, I'll now talk about
In the kernel there are two well-known macros, which I use quite often:
likely and unlikely, those macros take advantage of the gcc compiler that can optimize the compilation of the code based on that information.
In case you are quite curios you're more then welcome to check the outcome of using those macros, In case you were wondering how the assembly code would be set for getting an optimization for the processor pipeline. write down a code snippet, and afterwards compiled it via gcc with optimization flag on: gcc –O2
For example I wrote down in my vim editor the following short snippet:
Afterwards you can take a look of the disassembled the binary file via:
objdump -S .
modify the code to the likely macro from the unlikely instance.
So here is the neat results which I got, the comparison between the two is presented in the meld window (gnu diff program):
Likely macro Vs Unlikely macro |
We can easily see the compiler have generated the assembly code (x86) with arranging the code according to the likelihood of the branch,
(I have marked the different assembly lines with colourful rectangle)
So here above we got a simple nice demonstration of avoiding the penalty of flushing the processor pipeline.
I hope you enjoyed today's session, next time I'll be more lifting the hood about the kernel stuff! enjoy!
No comments:
Post a Comment