- Reordering of functions
- Posted by Tim Frink on February 19th, 2008
Hi,
I've a question about the influence of compiler
optimizations that reorder functions on the
system performance.
Assume a modern DSP with all state-of-the art
features like prefetching, branch prediction and
a superscalar pipeline. Further assume that all caches
are disabled. Will the program runtime change when
just the order of functions is changed (without any
other code transformation)?
I'm of the opinion that a reordering of function should
have little influence on the program execution, maybe
due to some prefetch effects but thes should be marginal.
Of course, with caches this situation would look different.
How do you see that? Do you see any other side-effects that
might influence the program?
Regards,
Tim
- Posted by David Brown on February 19th, 2008
Tim Frink wrote:
There can be many small caches or buffers between the program memory and
the cpu core - it all depends on the type of memory and the type of cpu
you are using. For example, if you have 32-bit wide memory and 16-bit
wide instructions, the alignment of code to 32-bit boundaries may have
an effect. If you are using dynamic memory, you'll have paging effects
which will slow down access if you the memory controller has to switch
pages. The memory controller might also fetch 16 byte lines (for
example) even if you are not caching the results.
You will also find that some cpus have smaller instructions for local
jumps compared to far jumps - if the compiler/linker is smart enough,
the jumps may be faster for local code compared to far code.
But for most cpus, if you have disabled caches and are using internal
memory (or at least, memory with consistent access times), there should
be no difference in speed when you reorder the functions.
- Posted by Wilco Dijkstra on February 19th, 2008
"Tim Frink" <plfriko@yahoo.de> wrote in message news
an.2008.02.19.09.09.24.815351@yahoo.de...
Of course! All of the micro architectural features you mention critically
rely on code layout. Instruction fetch usually reads more than one instruction,
so branch target alignment matters. A branch predictor works like a cache,
so has the usual conflicts due to branch alignment and aliasing. Even with
caches disbled instruction and data fetches are done using wide bursts.
And unless you disable the TLB as well, you get different TLB misses.
I've seen a factor of 2 difference in execution time of small loops on a
superscalar CPU while running from tightly coupled SRAM (not cached,
nor translated). Using a higher associativity (cache and branch predictor),
wider fetch and larger instruction buffers (between fetch and decode)
tends to reduce these effects significantly.
The really interesting question is whether it is feasible to optimise code
alignment in a compiler. The problem is you often don't know the final
alignment of the code. But even if you do, you can't easily deduce which
branches/cachelines will actually conflict at runtime. Even profile feedback
does not help here...
Wilco
- Posted by badal_akr on February 19th, 2008
reorder the functions. So you never know what is going to happen.
- Posted by Ali on February 19th, 2008
On Feb 19, 5:09 pm, Tim Frink <plfr...@yahoo.de> wrote:
Lets assume that all functions are doing the same stuff with same
style, now there should be no difference in reordering. but, putting
functions with different scopes SHOULD effect the app throughput.
For example TLB miss, far jumps and fetching the large array are
noticeable examples.
ali
- Posted by Tim Frink on February 19th, 2008
I agree, that was what I meant with the little impact of prefetching, i.e.
it might happen that the number of fetched instructions at a branch
target differs.
Ok, that's true. I assume that you mean branch target buffer which stores
computed branch targets to avoid redundant recomputations.
What do you mean with wide bursts?
But how do compilers handle that? There are alignment optimizations.
Do they statically align all frequently accessed branch targets to an
appropriate address?
- Posted by Didi on February 19th, 2008
This can be answered by the programmer - in this case,
the author of the compiler and/or the library functions you
will have linked after writing your lines of HLL code. Or you
can do some reverse engineering on the compiler
you use and get the answer yourself.
I suspect you will get as many different answers as there are
programmers who are out there offering their compiler code.
A much easier way out of this is to program the section
of interest yourself, of course.
Dimiter
------------------------------------------------------
Dimiter Popoff Transgalactic Instruments
http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tg...7600228621276/
Tim Frink wrote:
- Posted by Wilco Dijkstra on February 19th, 2008
"Tim Frink" <plfriko@yahoo.de> wrote in message news
an.2008.02.19.16.42.21.975121@yahoo.de...
I wouldn't call it a small impact - as badal pointed out, it also affects
instruction pairing. This can have a major effect if the CPU pairs
instructions based on their position in a fetch slot (a bad idea but
it is common in older in-order cores).
Correct. A BTB is typically indexed by the fetch address, so the number of
branches and their position in the fetch block matter. BTB's usually have low
associativity and are relatively small, so the aliasing effects can be large
(I've seen 5-10% performance difference on Dhrystone).
Modern DDR DRAMs are always accessed in burst mode, so even with
the cache disabled, a memory access will read 32 or 64 bytes. Usually
the fill buffers act as a mini cache, so further accesses are fast. Executing
a loop inside a single fillbuffer rather than spread over 2 bursts gives a
huge speedup.
Compilers can align code to fetch boundaries and do instruction
scheduling/pairing based on that. Some compilers align function entry
and critical loops to a cache line. However overaligning code slows things
down due to being larger and having to execute large numbers of NOPs.
So my advice to hardware designers has always been to design micro
architectures that minimise the effects of fetch and cache alignment.
Wilco
- Posted by Vladimir Vassilevsky on February 20th, 2008
Tim Frink wrote:
....and the errata of 100 pages, with the half of bugs due to the
incorrect operation of the pipeline in the certain cases.
Not much, unless the functions are very short so their length is
comparable with the depth of the pipeline, and the time of execution is
comparable with T_RAS.
It can significantly affect the interrupt latency though. Interrupt
handlers should be aligned to the prefetch size.
Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com