- Optimization
- Posted by Martin Eisenberg on February 9th, 2004
Hi!
I'm developing a code that processes a sampled audio stream in
realtime. I use MSVC 6 and an Athlon 800. It's DSP folk wisdom that
per-sample conditionals are bad for the pipeline, therefor I've
thought about using this function:
float min1(float x, float y)
{ // By Laurent de Soras.
x -= y;
x -= fabs(x);
x *= 0.5;
x += y;
return x;
}
Now it occurred to me that the following is much simpler--
float min2(float x, float y)
{ return x + (y - x) * (y < x); }
--so I made a small test timing these, and for comparison
float min3(float x, float y)
{ return (y < x) ? y : x; }
as well. I found that, while VC actually does put a branch in the
min2 assembly, it's appreciably faster than min1 but min3 in turn
beats that by a significant amount. In debug mode the order is the
same, although the relative gap between min1 and min2 is larger while
min2 and min3 perform quite similary. Maybe things would look
different when using the routines in context rather than isolated
like this? I haven't done anything yet that needed decision functions
in the inner loop, though.
As I'm basically ignorant* of optimization so far I'd appreciate an
explanation of those results. Is the advice to avoid branching bogus
or does another issue come into play here? My timing procedure calls
clock() before and after a loop applying the functions to a long
vector of (pseudo-)random float values. I'm aware that clock() is not
very accurate in VC, but the number of samples is large and the
results don't change when varying it.
* I do understand the First Rule, etc. But speed does count here.
Martin
--
Please help refine my English usage!
-= Send your critique by email. =-
Quidquid latine dictum sit, altum viditur.
- Posted by Papadopoulos Giannis on February 9th, 2004
Martin Eisenberg wrote:
They are right I think.. Conditionals are feard in DSP designs...
I think fabs() is using a conditional....
It is more likely that is implement something like
double fabs(double x) {
return ((x<0)? -x: x);
}
Yes, but there is also a conditional here... And it does some unecessary
things...
But min3 results in the smallest code... That is why it is faster..
min2 performs 2 additions, a conditional jump and a multiplication when
min3 performs only a conditional jump...
First Rule? Teach me plz...
--
#include <stdio.h>
#define p(s) printf(#s" endian")
int main(void){int v=1;*(char*)&v?p(Little)
(Big);return 0;}
Giannis Papadopoulos
http://dop.users.uth.gr/
University of Thessaly
Computer & Communications Engineering dept.
- Posted by Willem on February 9th, 2004
Martin wrote:
) As I'm basically ignorant* of optimization so far I'd appreciate an
) explanation of those results. Is the advice to avoid branching bogus
) or does another issue come into play here? My timing procedure calls
The advice to avoid *unpredictable* branching is valid, but I'm guessing if
the branching paths are very short, like a couple of instructions, then the
impact isn't that large, while the benefit of not having to rely on a
float-operation (I assume fabs() compiles to a float instruction, and not
to a function call (that may well have a conditional branch itself)) could
very well outweigh that.
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
- Posted by xbunny on February 9th, 2004
Martin Eisenberg wrote:
I think it would be pretty easy to go into the finer points of cache
hits, inlining, preexecution, compiler tricks and loads of other stuff
that certainly other people can argue better. But the fact remains the
min3 works best so unless its a bottle neck I would stick with it!
What is the first rule btw? there are so many 'first rules' lol
- Posted by Papadopoulos Giannis on February 9th, 2004
Martin Eisenberg wrote:
If the args were ints it could look something like this :
/**
* If x>y, then y-x<0 and sign is 1, so a[1] is the smallest.
* If x<y, then y-x>0 and sign is 0, so a[0] is the smallest.
*/
int imin(int x, int y) {
const int a[] = {x, y};
return a[ ((y-x)>>(8*sizeof(int)-1))&0x01 ];
}
(I am trying to make it work with doubles, too)
two loads and 1 subtraction, 1 shift and 1 bitwise AND (8*sizeof(int)-1
is resolved during compile time) (yet is bigger than return (y < x) ? y
: x; )
--
#include <stdio.h>
#define p(s) printf(#s" endian")
int main(void){int v=1;*(char*)&v?p(Little)
(Big);return 0;}
Giannis Papadopoulos
http://dop.users.uth.gr/
University of Thessaly
Computer & Communications Engineering dept.
- Posted by Martin Eisenberg on February 9th, 2004
Hi,
thanks for your replies. Allow me to respond collectively.
Papadopoulos Giannis wrote:
That may well be true in Debug mode, I don't have library source.
With optimizations, it compiles to the fabs instruction.
"Don't." All the introductory articles I've perused state that so I
thought it was idiomatic...
xbunny wrote:
Since the data don't fit into the cache and the code is small in each
case, should each of my consecutive loops not meet about the same
conditions?
So that might be what lends significance to the fact that the
ternary-op version is smallest, as Giannis mentioned.
I am not bent on using any one of them -- just wanted to clear up the
surprise.
Willem wrote:
For instance, because the CPU might take both branches if they are
small? So there's a bad "middle ground" below which the conditional
is circumvented, and above which it's amortized. Does that stand to
reason? If so, can the boundaries of this size range be quantified?
Martin
--
Please help refine my English usage!
-= Send your critique by email. =-
Quidquid latine dictum sit, altum viditur.
- Posted by Willem on February 9th, 2004
Martin wrote:
) Willem wrote:
)> The advice to avoid *unpredictable* branching is valid, but I'm
)> guessing if the branching paths are very short, like a couple of
)> instructions, then the impact isn't that large
)
) For instance, because the CPU might take both branches if they are
) small? So there's a bad "middle ground" below which the conditional
) is circumvented, and above which it's amortized. Does that stand to
) reason? If so, can the boundaries of this size range be quantified?
I have no idea. Like I said, it's just a guess.
Another guess is that the PRNG is such that the processor can predict the
branch correctly most of the time, but that's not very likely.
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
- Posted by Kenneth Chiu on February 9th, 2004
In article <c07lan$2ups$1@ulysses.noc.ntua.gr>,
Papadopoulos Giannis <ipapadop@inf.uth.gr> wrote:
....
Is the concern over conditionals, or conditional branches?
- Posted by Willem on February 9th, 2004
Kenneth wrote:
) Is the concern over conditionals, or conditional branches?
A conditional is almost always a conditional branch.
Unless you have a compiler that is so cunning that it uses some weird sign
extension trick to make a non-branching version of (x < y)
Question/puzzle: How would you implement the C expression (x < y) on an
intel pentium (or later, if you want) processor, without branching ?
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
- Posted by Kenneth Chiu on February 9th, 2004
In article <slrnc2fa9k.2lla.willem@toad.stack.nl>,
Willem <willem@stack.nl> wrote:
gcc 3.2.2 with "gcc -S -O foo.c" on:
int foo(int x, int y) {
return x < y;
}
produces:
.file "foo.c"
.text
.globl foo
.type foo,@function
foo:
pushl %ebp
movl %esp, %ebp
movl 12(%ebp), %eax
cmpl %eax, 8(%ebp)
setl %al
movzbl %al, %eax
leave
ret
.Lfe1:
.size foo,.Lfe1-foo
.ident "GCC: (GNU) 3.2.2 20030222 (Red Hat Linux 3.2.2-5)"
I'm not fluent in x86 assembler, but some Googling on the
opcodes seems to suggest that there are no branches. The
setl instruction means "set byte on condition", supposedly.
- Posted by Willem on February 9th, 2004
Kenneth wrote:
) produces:
)
) .file "foo.c"
) .text
) .globl foo
) .type foo,@function
) foo:
) pushl %ebp
) movl %esp, %ebp
) movl 12(%ebp), %eax
) cmpl %eax, 8(%ebp)
) setl %al
) movzbl %al, %eax
) leave
) ret
) .Lfe1:
) .size foo,.Lfe1-foo
) .ident "GCC: (GNU) 3.2.2 20030222 (Red Hat Linux 3.2.2-5)"
)
) I'm not fluent in x86 assembler, but some Googling on the
) opcodes seems to suggest that there are no branches. The
) setl instruction means "set byte on condition", supposedly.
Yup, that's 'set on lower-than', which is an instruction that (looks like
it) is tailored specifically for such conditionals.
Note to the OP: Compile all three of your routines to assembly, and check
what kind of code you're getting on each of them. That might prove useful.
I think that 'a < b ? a : b' will not generate a conditional branch, by the
way, it should compile to something involving a conditional opcode.
Maybe you can even hack it with setl:
% eax and ebx contain x and y
movl %ecx, %eax
cmpl %eax, %ebx
setl %al
movzbl %eax
decl %eax
xorl %ebx, %ecx
andl %ebx, %eax
xorl %ecx, %ebx
% ecx contains the maximum of x and y
% (or the minimum, I can never tell, I usually just testrun it..)
But that's very likely not the best way to do it.
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
- Posted by Papadopoulos Giannis on February 9th, 2004
Martin Eisenberg wrote:
you can use some kind of data prefetching is this helps your code.. Is
the code running on a mainstream CPU or on embedded??
Branch prediction could let hell loose... Although modern machines have
high branch prediction hits, I do not know what happens to DSP code..
they put a lot of stress on the pipeline... I suggest you should use
assembly in all critical sections and if you can, make a symbolic loop
unrolling on the code...
--
#include <stdio.h>
#define p(s) printf(#s" endian")
int main(void){int v=1;*(char*)&v?p(Little)
(Big);return 0;}
Giannis Papadopoulos
http://dop.users.uth.gr/
University of Thessaly
Computer & Communications Engineering dept.
- Posted by CBFalconer on February 9th, 2004
Willem wrote:
Probably totally unnecessary in most architectures. Consider:
sbb ax,ax
which results in -1 or 0, per incoming carry.
--
Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!
- Posted by Niklas Borson on February 9th, 2004
Willem <willem@stack.nl> wrote in message news:<slrnc2fa9k.2lla.willem@toad.stack.nl>...
Which compilers may very well do in certain cases, e.g., when
the result of the comparison is just used as a number.
Here's a simple C++ function:
bool __fastcall test(int x, int y)
{
return x < y;
}
And here's the ASM listing I get when compiling using VC++ 7.1
with /O2:
?test@@YI_NHH@Z PROC NEAR
; _x$ = ecx
; _y$ = edx
; Line 3
xor eax, eax
cmp ecx, edx
setl al
; Line 4
ret 0
?test@@YI_NHH@Z ENDP
Do I win a prize, or did I cheat by using a compiler? :-)
- Posted by Martin Eisenberg on February 9th, 2004
Papadopoulos Giannis wrote:
My target is x86 floating point. In the framework I'm working in (VST
music software plugin architecture) my code is handed data in blocks
that supposedly fit into the cache, and are often already there due
to the preceding processing chain. The comment above concerns my
simple timing program.
I will consider that when it seems necessary, but so far I have
little incentive to give up the better maintainability of my C++
code.
--
Quidquid latine dictum sit, altum viditur.
- Posted by Papadopoulos Giannis on February 9th, 2004
Martin Eisenberg wrote:
So write everything in C++ / C and let the compiler do everything for
you... For x86 architectures
Use return x>y? y: x; and even better make it a DEFINE or an inline (C++)...
--
#include <stdio.h>
#define p(s) printf(#s" endian")
int main(void){int v=1;*(char*)&v?p(Little)
(Big);return 0;}
Giannis Papadopoulos
http://dop.users.uth.gr/
University of Thessaly
Computer & Communications Engineering dept.
- Posted by Willem on February 9th, 2004
)> Question/puzzle: How would you implement the C expression (x < y) on an
)> intel pentium (or later, if you want) processor, without branching ?
Niklas wrote:
) And here's the ASM listing I get when compiling using VC++ 7.1
) with /O2:
)
) ?test@@YI_NHH@Z PROC NEAR
) ; _x$ = ecx
) ; _y$ = edx
) ; Line 3
) xor eax, eax
) cmp ecx, edx
) setl al
) ; Line 4
) ret 0
) ?test@@YI_NHH@Z ENDP
)
) Do I win a prize, or did I cheat by using a compiler? :-)
I was rather hoping someone would come up with a piece of code that did
*not* use a conditional statement..
Something like the subtract-with-carry mentioned crossthread.
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
- Posted by Martin Eisenberg on February 9th, 2004
Willem wrote:
I've done that. Incidentally, remember that I'm using floating point.
Here's what VC6 spat out with inlining off -- the ternary operator
does branch. I observed that it benefits less from inlining than the
other two functions.
float min1(float x, float y)
{ // By Laurent de Soras.
x = y - x;
x += fabs(x);
x *= 0.5f;
x = y - x;
return x;
}
00404810 fld dword ptr [esp+8]
00404814 fsub dword ptr [esp+4]
00404818 fld st(0)
0040481A fabs
0040481C fxch st(1)
0040481E faddp st(1),st
00404820 fmul dword ptr ds:[422168h]
00404826 fsubr dword ptr [esp+8]
0040482A ret
min2 -- return x + (y - x) * (y < x);
00404830 push ecx
00404831 fld dword ptr [esp+0Ch]
00404835 fcomp dword ptr [esp+8]
00404839 mov dword ptr [esp],1
00404841 fnstsw ax
00404843 test ah,1
00404846 jne 00404850
00404848 mov dword ptr [esp],0
00404850 fld dword ptr [esp+0Ch]
00404854 fsub dword ptr [esp+8]
00404858 fimul dword ptr [esp]
0040485C fadd dword ptr [esp+8]
00404860 pop ecx
00404861 ret
min3 -- return (y < x) ? y : x;
004010A0 fld dword ptr [esp+4]
004010A4 fcomp dword ptr [esp+8]
004010A8 fnstsw ax
004010AA test ah,1
004010AD je 004010B4
004010AF fld dword ptr [esp+4]
004010B3 ret
004010B4 fld dword ptr [esp+8]
004010B8 ret
I'm not fluent in x86 asm either, so that doesnt' tell me all that
much about the relation of min1 and min3. But after looking looking
at the Intel manuals for some time it seems that a min2 without
branch doesn't get any shorter than it is now. But this isn't
something I need to get vastly faster right now, anyway. I'll fiddle
more with it for kicks when I'm sharper.
Martin
--
Please help refine my English usage!
-= Send your critique by email. =-
Quidquid latine dictum sit, altum viditur.
- Posted by Willem on February 10th, 2004
Martin wrote:
) Willem wrote:
)
)> Note to the OP: Compile all three of your routines to assembly,
)> and check what kind of code you're getting on each of them.
)> That might prove useful.
)
) I've done that. Incidentally, remember that I'm using floating point.
) Here's what VC6 spat out with inlining off -- the ternary operator
) does branch. I observed that it benefits less from inlining than the
) other two functions.
Did you also turn optimizations off ?
Did you tell the compiler you're compiling for pentium-pro ?
I compiled the same stuff, and I also get a conditional branch in the third
version. I checked, and if I use ints, I still get a branch.
However, when I told the compiler I was compiling for i686, suddenly the
branches disappeared, and were replaced by a conditional move.
But only in min3, and not in min2.
This is what I got with gcc -march=i686 -O3:
# return ((y < x) ? y : x)
flds 8(%ebp)
flds 12(%ebp)
fxch %st(1)
fucomi %st(1), %st
fcmovnbe %st(1), %st
fstp %st(1)
Which looks pretty near optimal to me.
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
- Posted by Martin Eisenberg on February 10th, 2004
Willem wrote:
As stated in my first post, it did not change the speed ordering.
Why?
I've seen the FP conditional moves in the Intel docs but I can't do
that. My targets (first of all, my own box) don't have them.
Martin