- Hi, how can I optimize the following code?
- Posted by kenneth on December 11th, 2006
I'm trying to use multiple thread to optimize my following code, but
can not find the way, anyone can help me?
double k[3][500] are initialized.
int computePot() {
int i, j;
for( i=0; i<500; i++ ) {
for( j=0; j<i-1; j++ ) {
distx = pow( (k[0][j] - k[0][i]), 2 );
disty = pow( (k[1][j] - k[1][i]), 2 );
distz = pow( (k[2][j] - k[2][i]), 2 );
dist = sqrt( distx + disty + distz );
pot += 1.0 / dist;
}
}
I doubt that If this code really can be optimized?
- Posted by Pascal Bourguignon on December 11th, 2006
"kenneth" <james.hunter.gm@gmail.com> writes:
And if you write it as:
for( i=0; i<500; i++ ) {
pots[i]=0;
for( j=0; j<i-1; j++ ) {
distx = pow( (k[0][j] - k[0][i]), 2 );
disty = pow( (k[1][j] - k[1][i]), 2 );
distz = pow( (k[2][j] - k[2][i]), 2 );
dist = sqrt( distx + disty + distz );
pots[i] += 1.0 / dist;
}
pot+=pots[i];
}
?
You can fork 500 threads, each doing it's own j loop, and when all the
threads are done you can sum the results of the threads.
(Or, more complically, you could sum as soon as each thread is
finished, since they're of different length, it'd be faster).
If you've got a fixed finite number of processors, you could schedule
the j loops to them. A first step could be to notice that if you do
in a thread both the j and (i-j) loops, then all threads will do the
same amount of work, so you can more easily schedule these n/2 threads
than the original n (=500).
So, with p processors, you could schedule:
/* pseudo-code */
const int n=500;
const int p=4;
double pots[p];
int complete;
double compute_single_pot(i){
double pot=0;
for( j=0; j<i-1; j++ ) {
distx = pow( (k[0][j] - k[0][i]), 2 );
disty = pow( (k[1][j] - k[1][i]), 2 );
distz = pow( (k[2][j] - k[2][i]), 2 );
dist = sqrt( distx + disty + distz );
pot += 1.0 / dist;
}
return(pot);
}
void compute_quarter_pot(int quarter){
for(i=quater;i<n/2;i+=p){
pots[quater]+=compute_single_pot(i)+compute_single_pot(n-1-i);
}
with_mutex(complete++);
}
double compute_pot(void){
complete=0;
for(k=0;k<p;k++){
fork_thread(compute_quarter_pot(k));
}
wait_for(complete==p);
double pot=0;
for(k=0;k<p;k++){
pot+=pots[k];
}
return(pot);
}
You need to watch again sw4.
--
__Pascal Bourguignon__ http://www.informatimago.com/
Un chat errant
se soulage
dans le jardin d'hiver
Shiki
- Posted by pablo reda on December 11th, 2006
You can optimize the pow function , the sqrt function and de division.
converto to fixed point number sure more speed, or integer for best.
I'm think...thread is not a good idea
realtime 3d graphics has good web pages !!
sure you usefull.
Godd Luck, Pablo.
- Posted by Arthur J. O'Dwyer on December 11th, 2006
On Mon, 11 Dec 2006, kenneth wrote:
Pascal's given you a very detailed answer (although I haven't tried
to understand the second half of it, so I won't vouch for its total
accuracy). However, if you really just care about efficiency, the first
things to look at are:
* You use 'pow' instead of simply multiplying two numbers. That is
slow and stupid.
* Your inner loop seems to have an off-by-one bug. Keep in mind the
Zeroth Rule of Optimization: Do it correctly, /then/ do it fast.
(Or, as I've seen it stated somewhere before but can't recall where,
"I'd rather be right than dead.")
* You use 'sqrt' plus a division rather than just using 'pow', which
might actually be a good thing if you have a funny math library, but
I'd expect to see an explanatory comment if that were the case. More
likely, you were so burned out after pounding that nail in with a
wrench, you didn't notice you were trying to drive a screw with a
hammer.
-Arthur
- Posted by dcorbit@connx.com on December 12th, 2006
Maybe something like this:
#include <math.h>
double computePot(const double k[3][500])
{
int i,
j;
double distx,
disty,
distz,
dist,
pot = 0;
for (i = 0; i < 500; i++) {
for (j = 0; j < i - 1; j++) {
double diff;
diff = (k[0][j] - k[0][i]);
distx = diff * diff;
diff = (k[1][j] - k[1][i]);
disty = diff * diff;
diff = (k[2][j] - k[2][i]);
distz = diff * diff;
dist = sqrt(distx + disty + distz);
pot += 1.0 / dist;
}
}
return pot;
}
- Posted by tugboat90 on December 12th, 2006
First you can simplify your array access. This by itself should
drastically improve your code. Especially if you are trying to access
different parts of the array, the processor will have to keep moving
different parts of a very large array in and out of memory. (Less calls
to random parts of large arrays = faster program)
for (i = 0; i<500; i++) {
int k0 = k[0][i];
int k1 = k[1][i];
int k2 = k[2][i];
for (j = 0; j < i-1; j++).... // What happens when i == 0? You can't
access k[1][-1]...
Also, get rid of the 'pow ()' function. Since you are just raising
squaring the value, use simple multiplication to help you. Function
calls are horribly expensive and if you don't need them, don't use
them.
distx = k[0][j] - k0; // using my code from above
distx *= distx;
You can get rid of your 'sqrt ()' call and following division and
replace it with "pow ( (x+y+z), -0.5)" which is the mathmatical
equilivant. (Assuming 'pow ()' will let you take -0.5 as a parameter,
if it doesn't then stick with what you have)
For a simple program like this, I'm not sure that you even need to have
multi-threading. If you did, your program might even run slower.
You're still going to have to do the same amount of work, plus on top
of that you'll have all of the overhead of locking critical sections
(i.e. the incrementation of 'pot') and then you'll have to wait until
all of the threads are done to combine answers. If the different
threads would work on different processors, it would speed up, but for
multi-threading on the same processor, I don't believe it is worth your
time.
Hope that helps!
- Brian
kenneth wrote:
- Posted by dcorbit@connx.com on December 12th, 2006
tugboat90 wrote:
[snip]
This is almost certainly bad advice. Generally speaking, a sqrt()
operation can be performed in hardware for slightly more cost than a
single division. The pow() function will be considerably more costly
{though I will admit that a clever enough compiler could certainly
substitute sqrt(f) for pow(f,0.5)}.
- Posted by Willem on December 12th, 2006
dcorbit@connx.com wrote:
) tugboat90 wrote:
) [snip]
)> You can get rid of your 'sqrt ()' call and following division and
)> replace it with "pow ( (x+y+z), -0.5)" which is the mathmatical
)> equilivant. (Assuming 'pow ()' will let you take -0.5 as a parameter,
)> if it doesn't then stick with what you have)
)
) This is almost certainly bad advice. Generally speaking, a sqrt()
) operation can be performed in hardware for slightly more cost than a
) single division. The pow() function will be considerably more costly
) {though I will admit that a clever enough compiler could certainly
) substitute sqrt(f) for pow(f,0.5)}.
I think the idea of using the pow() function is to incorporate the sqrt and
the 1/x into one function call. pow(f, -0.5) == 1.0 / sqrt(f)
It would depend on the processor (and its FPU) which is faster, I guess.
Benchmarking would be in order, I guess.
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
- Posted by tugboat90 on December 12th, 2006
Yep - that was what I was thinking.
Benchmarking would be the only way to see if it would actually work.
I'm not quite sure how fast a call to 'sqrt ()' is compared to 'pow ()'
or compared to the original code was. Does anybody happen to know by
chance or have tried it before?
- Posted by Alan Morgan on December 12th, 2006
In article <1165854253.478885.142550@f1g2000cwa.googlegroups. com>,
kenneth <james.hunter.gm@gmail.com> wrote:
I don't really see that multiple threads would help that much.
Change pow(x,2) to x*x and see if that helps. The famous (?) fast
inverse sqrt trick might come in handy here, although it was designed
for floats, not doubles, and isn't quite as accurate. It may be good
enough. Check to see if the compiler is generating SSE (or whatever
your cpu's equivalent is) instructions. You might be able to convince
it to do so with some manual loop unrolling.
Failing that, crank the optimization level of the compiler as high as
it will go and declare victory.
Alan
--
Defendit numerus
- Posted by rossum on December 12th, 2006
On 11 Dec 2006 08:24:13 -0800, "kenneth" <james.hunter.gm@gmail.com>
wrote:
structure. A one dimensional array reduces the calculation needed to
access each array element and the three coordinates are at fixed
offsets within the structure. This removes an integer multiplication
from each array access.
You are using three distance variables where only one is needed. One
may well be faster as the compiler can keep the single variable in a
register without having to store it out to memory as it would with
three separate variables:
sum_dist_sq = pow(...);
sum_dist_sq += pow(...);
sum_dist_sq += pow(...);
dist = sqrt(sum_dist_sq);
Time it and see. This assumes that you do not need the values of
distx etc. elsewhere.
Arranging the array to maximise cache hits might also help if that is
possible.
Using floats in place of doubles can speed things up if the reduced
precision of the result is acceptable.
rossum
- Posted by Willem on December 13th, 2006
rossum wrote:
) I would be inclined to have a one dimensional array of a 3D-Coordinate
) structure. A one dimensional array reduces the calculation needed to
) access each array element and the three coordinates are at fixed
) offsets within the structure. This removes an integer multiplication
) from each array access.
Many optimizing compilers will do this for you.
) You are using three distance variables where only one is needed. One
) may well be faster as the compiler can keep the single variable in a
) register without having to store it out to memory as it would with
) three separate variables:
)
) sum_dist_sq = pow(...);
) sum_dist_sq += pow(...);
) sum_dist_sq += pow(...);
) dist = sqrt(sum_dist_sq);
Many optimizing compilers will do this for you.
But in any case, you could just as well write the whole thing as a single
expression, and have the compiler figure out the best way to implement it.
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
- Posted by kenneth on December 14th, 2006
Thanks for your guys help. I try to use multiply thread just because if
the platform is 2 core or more core CPU, it will be more efficiency.
The following are my last version:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <windows.h>
#include <time.h>
#include <process.h>
#define _MT
#define NPARTS 1000
#define NITER 201
#define DIMS 3
#define POW2(r) (r)*(r)
#define RAND 0.5 + ( (double) rand() / (double) RAND_MAX )
#define p 4
int rand( void );
int computePot(void);
void initPositions(void);
void updatePositions(void);
double compute_single_pot(int i);
DWORD WINAPI compute_quarter_pot(void* quarter);
double r[DIMS][NPARTS];
static double pot;
static double pots[p];
int main() {
int i;
clock_t start, stop;
initPositions();
updatePositions();
start=clock();
for( i=0; i<NITER; i++ ) {
srand(i);
computePot();
if (i%10 == 0) printf("%5d: Potential: %10.3f\n", i, pot);
pot = 0;
updatePositions();
}
stop=clock();
printf ("Seconds = %10.9f\n",(double)(stop-start)/ CLOCKS_PER_SEC);
}
void initPositions() {
int i, j;
for( i=0; i<DIMS; i++ )
for( j=0; j<NPARTS; j++ )
r[i][j] = RAND;
}
void updatePositions() {
int i, j;
for( i=0; i<DIMS; i++ )
for( j=0; j<NPARTS; j++ )
r[i][j] -= RAND;
}
int computePot()
{
int k;
HANDLE hThread[p];
DWORD dwThreadId[p];
int nParam[p];
for(k=0;k<p;k++)
{
nParam[k] = k;
hThread[k] = CreateThread(
NULL,
0,
compute_quarter_pot,
(void*)nParam[k],
0,
&dwThreadId[k]
);
}
::WaitForMultipleObjects(p, hThread, TRUE, INFINITE);
for(k=0;k<p;k++){
pot+=pots[k];
pots[k] = 0;
}
return 0;
}
DWORD WINAPI compute_quarter_pot(void* quarter)
{
int m = (int)quarter;
for(int i = m;i<NPARTS/2;i+=p)
{
pots[m]+=compute_single_pot(i)+compute_single_pot(NPARTS-1-i);
}
return 0;
}
inline double compute_single_pot(int i){
double pot=0,t;
int j;
for( j=0; j<i-1; j++ ) {
t = POW2(r[0][j] - r[0][i])+POW2(r[1][j] - r[1][i])+POW2(r[2][j] -
r[2][i]);
pot += 1.0 / sqrt(t);
}
return(pot);
}
---------------------------
do you think there have more space to speed up?
- Posted by Walter Bright on December 14th, 2006
tugboat90 wrote:
You won't gain anything by doing this, and might even make things worse.
Addressing modes are 'free' on modern CPUs. Check the assembler output
of your compiler before attempting such micro-optimizations.
And as always, use a profiler to make sure your optimizations really do
make things better.
See http://www.digitalmars.com/techtips/timing_code.html
and http://www.digitalmars.com/ctg/trace.html