- How could I program this algorithm?
- Posted by mike3 on June 26th, 2008
Hi.
I was writing a program that uses the algorithm described here:
http://www.csc.lsu.edu/~gb/TCE/Publi...ransposeTR.pdf
to transpose a matrix stored on disk quickly. However, what's the
fastest way to do those bit permutations in there, which seem tricky
to build?
- Posted by Jens Thoms Toerring on June 26th, 2008
mike3 <mike4ty4@yahoo.com> wrote:
That might be an interesting paper if I had the time to read
it carefully. I guess you will increase your chances of recei-
ving an useful answer a lot if you would explain which "bit
permutations" you are talking about instead of expecting
everyone to read through 29 pages (which don't even contain
the phrase "bit permutation" and the word "bits" just twice)
and then to figure out what you mean with that phrase...
Regards, Jens
--
\ Jens Thoms Toerring ___ jt@toerring.de
\__________________________ http://toerring.de
- Posted by mike3 on June 26th, 2008
On Jun 26, 3:23 am, j...@toerring.de (Jens Thoms Toerring) wrote:
By that I mean the stuff in there where it describes transformations
of the bits of the linear address vector (row/column addressing turned
into a linear data address) with matrices. Namely, there are three
transformation matrices used: R, W, and P, which transform the
linear addresses (cycling through the matrix row-by-row) into
locations to read and write data, as well as how to permute the data
in memory. But a direct, naive implementation where one uses these
matrices explicitly to permute the _bits_ would take many CPU
operations, and so may not be the most efficient method. For example
they describe the "block" transposition method in this framework, yet
I know it can be implemented without resorting to manipulating
the address vector with matrix multiplication. So I'm trying to figure
out the best way to implement the algorithm they are presenting in the
paper, which seems to be very efficient.
- Posted by Gene on June 26th, 2008
On Jun 26, 3:31*pm, mike3 <mike4...@yahoo.com> wrote:
It almost certainly doesn't matter. A single disk head seek takes as
much time as tens of millions of instructions. I'd pick a method that
you're most likely to get correct.
- Posted by mike3 on June 27th, 2008
On Jun 26, 5:11*pm, Gene <gene.ress...@gmail.com> wrote:
The rub, though, is that once the data is read in from disk one is
going
to have to permute some in memory, and memory is much faster than
disk, so... But I guess only a measurement can/would tell. Is there
any
source code that implements this algorithm that I could see, perhaps,
to give me some hints though?
- Posted by Jens Thoms Toerring on June 27th, 2008
mike3 <mike4ty4@yahoo.com> wrote:
I guess you're best chance is asking the original authors if they
are willing to share their implementation. It's such a rather spe-
cial problem that the likelyhood of finding someone here who also
implemented the algorithm might be rather small...
Regards, Jens
--
\ Jens Thoms Toerring ___ jt@toerring.de
\__________________________ http://toerring.de
- Posted by mike3 on June 27th, 2008
On Jun 27, 1:57 am, j...@toerring.de (Jens Thoms Toerring) wrote:
Where could I contact the authors, anyway?
- Posted by Jens Thoms Toerring on June 27th, 2008
mike3 <mike4ty4@yahoo.com> wrote:
The paper you cited lists their email addresses on the first page,
directly below the title. I don't want to copy them here to avoid
directing more SPAM to them. The user names are enclosed in curly
braces and the domain is 'cis.ohio-state.edu. Of course, some of
the addresses may not exist anymore but a few are probably still
used.
Regards, Jens
--
\ Jens Thoms Toerring ___ jt@toerring.de
\__________________________ http://toerring.de
- Posted by Moi on June 28th, 2008
On Thu, 26 Jun 2008 23:51:06 -0700, mike3 wrote:
I did a quick read through the paper, and it seems an attempt to construct
a framework for unifying most of the existing methods for matrix-transpose.
The R- P- and W permutation matrices are just their way to specify the
various methods. Most (all?) of the methods are of the "recursively
permute and merge" kind, similar to your earlier attempts.
There is another factor, that is not addressed in the pape (if I read it
correctly ...) : double buffering. The OS keeps a perfect LRU chain of
disk buffers for you, and duplicating these into your userspace memory is
basically a waste of memory. The ultimate form of avoiding double
buffering is of course mmap()ing the file, if mmap is available.
BTW, I still think that Morton-addressing is the way to go, certainly if
your matrix is square.
There are two factors that make disk-I/O slow: latency and bandwidth. You
can compensate for latency by making your algorithm parallel, but once the
bandwidth is saturated, your cpu still has cycles to waste. Or do useful
work.
HTH,
AvK
- Posted by mike3 on June 28th, 2008
On Jun 27, 3:57 pm, j...@toerring.de (Jens Thoms Toerring) wrote:
Hmm. I might give this a try, then.
- Posted by mike3 on June 28th, 2008
On Jun 28, 5:35 am, Moi <r...@invalid.address.org> wrote:
However, they also describe it this way in another paper that
looks devoted specifically to this algorithm:
http://csc.lsu.edu/~gb/TCE//Publicat...Transpose2.pdf
So then is there a better way of implementing it then having bit
permutations?
I'm developing the program under a 64-bit Sun Solaris 10 UNIX
environment
on a 64-bit x86 quad-core parallel processor, so this functionality is
available.
How does that work, anyway? From a previous post of yours I'm guessing
one would store the matrix in square blocks, the blocks stored one
after
the other. So if the biggest thing that fits in RAM is size n/2, and
our matrix
is size n (this is row/column size, the full sizes are n^2/4 and n^2,
respectively), then store blocks ABCD in that order on disk, which
represent
the matrix
AB
CD
?
The thing is though once it is saturated and the disk is thrashing,
the OS
itself slows down waiting for the read-in. And how does parallelizing
the
algorithm help when hard drives are not really parallel devices to
begin
with? For example they don't have several heads that can be used to
read/write different spots on the disk simultaneously. I suppose it
would
work if one had two or more disks (ideally 4 -- one for each core of
the
CPU), and the paper I linked to in this post mentions the usage of
multiple disks, one for each processor unit. In the paper they then
say
the address would be composed not just of row/column indexes, but
also the processor/disk index -- suggesting a "blocked scheme" similar
to that "Morton addressing" you mentioned if I understand the latter
correctly. Unfortunately though I do not have the luxury of multiple
hard
disks in the computer I'm using and am not going to blow cash on
drives just for fluff computations like this. Quickpi and Pifast do
big runs
very efficiently and so far the author(s) have not really divulged
their
secrets. I am only left with half-informed guesses as to how they
work. I
was told about the use of memory-mapping/virtual-memory, for example,
as well as the usage of an n-step Bailey method with transpose --
hence
the reason for asking about transpose here -- but not any of the
really
neat details like what transposition algorithm was employed.
- Posted by Moi on June 29th, 2008
On Sat, 28 Jun 2008 12:24:38 -0700, mike3 wrote:
No. It is just their way of modelling things.
It is not about "bit permutations" It is about calculating the lineair
address, given the row/column index:
{row,column} -> lineair_address --> {pagenumber, offset}
And, there are two such mappings, one for row-major order, and one for
column-major. The transformation between these two mappings is modelled
(by them)as W*P*R, but it could just as easy have been done by only one
matrix. (The transformation that this matrix performs is: map a coordinate
to the coordinate of this element if you go from row-order to column order)
NOTE: You still have to make sure you have some locality of reference.
A naive
for (irow =0; irow < nrow; irow++) {
for (icol=irow+1; icol < NCOL; icol++) {
SWAP ( address(i,j) , address(j,i) );
}}
will effectively fetch more pages than needed (because you access the
pages "vertical" strip sparsely and multiple times.)
Yes. But you can apply it recursively:
AB EF
CD GH
IJ MN
KL OP
You can think of this as mixing the bits off the x- and y coordinates,
for instance cell=G has {x=2,Y=1}
x=2 --> binary 10 stretched = 1000 shifted = 0100
y=1 --> binary 01 stretched = 0010 -->> = 0010
combined (ORed) = 0110
Now the fun thing is that exchanging x and y is just exchanging the odd
bits in the address with the even bits:
new_address =
((old_address & 0xAAAAAA)>>1) | ((old_address & 0x55555555)<<1);
thus 0110 <--> 1001
Even more: if you consider ABCD to fit on one page, (and EFGH on the next)
you can use a bitmask (or division) to find the pagenumber and offset:
offset = (address & 3);
page = address >>2;
For the G-cell (0110) above: page=0x1, offset=2;
For it's mirror-image J (1001) : page-0x2, offset=1
Please define "thrashing".
Also note that a transpose of a 1GB matrix needs *at least* 1GB to be read
+ 1 GB to be written to disk. So there will be some grinding of teeth...
No. the OS does not slow cause of disk-I/O. Your read() system call blocks
your process. But the OS may put other processes to work.
You don't need CPU's. For a typical I/O-bound process like this, even a
single CPU spins idle, because all tasks / processes are waiting for I/O
to become ready. A single CPU can easily keep ten or twenty disks busy.
Multiple hard disks on the same bus still share I/O bandwidth.
But they *can* do multiple seeks concurrently.
It just depends whether your I/O is latency-bound or bandwidth-bound.
(Modern disks can do "concurrent seeks" (don't know how it is
called ...) : you ask for a few blocks, and whenever a block is present/
fetched, it is passed back to the OS) The OS _can_ wakeup a process or
thread. Using more processes or threads will allow some (all -Ncpu) of
them to be blocked on a read, while others proceed.
I don't believe in N-step. It assumes that "sequential reads" are faster
than random reads. This may be true for a raw disk. For a standard
filesystem, I consider all reads as random. (=seek+read)
I'll post the demonstration code for a "tiled" transpose tomorrow.
It typically needs 2 * (buffersize/elementsize) buffers fo the two tiles
to be swapped. If you have more memory, it can be parallelized.
HTH,
AvK
- Posted by mike3 on June 29th, 2008
On Jun 28, 6:06*pm, Moi <r...@invalid.address.org> wrote:
So is there any other way to _implement_ that address transformation
_without_ permuting the bits? Then again since this is an I/O-bound
problem, not a compute-bound one, it should not matter, no?
Yeah, that's right, it faults again and again and so the disk is
thrashed like
a stuck pig.
Hmm.
Well, yeah, it's going to be working but where it's taking _waaaaaaay_
longer
than it takes to duplicate the 1GB of data and the system is getting
unresponsive.
I can still get some responsiveness even when copying a 1GB file in
the
background or unzipping a huge archive file.
I'm just commenting on the behavior I've observed -- that the system
slows
and loses responsiveness (sometimes a LOT) when the disks are being
hit
hard by a crappy transposition method. It certainly hits the disk
harder than
copying files as when a file copy is going I can still get some
responsiveness
out of the system.
But more disks, though. Regardless of disks/CPU, just more disks.
I've got 4 separate ports *on the board*, isn't that 4 buses?
So what is the best way then to do an FFT on the disk? Since that is
ultimately the problem I am concerned with here: how to do a high-
efficiency FFT/NTT on disk. The author of Quickpi obviously figured
this
out, but apparently it must be tough...
- Posted by Moi on June 29th, 2008
On Sat, 28 Jun 2008 18:27:59 -0700, mike3 wrote:
If you don't want to "permute bits", don't use a computer.
Or buy some ROM ...
Please be more specific. What does your program do. Again: what does
"thrashing" mean ?
*What* is taking way longer. Be more specific.
As for cp: look up what cp does. Use the source, Luke.
My guess is that you are "poisoning" the LRU cache. Every disk page that
you access is placed on the front of the LRU. Others are pushed to the
back.
Find a way to inform the OS that -once written- you are no longer
interested in the block. (maybe madvise(), maybe msync(), maybe even
synchronous writes) Look up what cp does. Or: make your own cp (one buffer
at a time, sequentially) and see what it does to the system.
That is probably more bandwidth than one port. They probably still share a
DMA channel, which has it's limits, too.
I don't do FFT.
I do know, that if you access data on a disk, you'd better organize things
in such a way that you minimize the number of required disk-I/O operations.
The author of Quickpi probably found a way to reshuffle things nicely.
HTH,
AvK
- Posted by mike3 on June 29th, 2008
On Jun 29, 4:15*am, Moi <r...@invalid.address.org> wrote:
I guess so, what I mean is making routines that actually
swap around the bits in an address using quite a few
CPU operations.
Right now I'm trying to put together routines to do an FFT (Fast
Fourier
Transform)/NTT (Number Theoretic Transform) on disk. Their use is to
make a program that computes a lot of digits of pi and other numbers
in non-decimal bases.
The overall disk transform routine.
It makes a copy, no? How many times longer than a copy
can one expect this to be? 20 or more? I was only expecting a few
(3 or 4, maybe). Maybe that was too optimistic, but then again
I'm puzzled over the Quickpi thing.
Hmm. I might give this a try.
Hmm. However this is probably not an option since I'd rather not blow
money
on buying extra hard disks just for "fluff" programs like this.
Although if I get
additional drives for other purposes I suppose I'd put them to use.
You don't do FFT? Then why advise against the usage of Bailey's, etc.
2/4/5/6-step
method ("I don't believe in N-step")? I thought you must have done at
least something
with it... I guess I was wrong... 
- Posted by Moi on June 29th, 2008
On Sun, 29 Jun 2008 12:33:52 -0700, mike3 wrote:
Well, a naive cp has a memory footprint of 2 * size. (it pulls a buffer
into the LRU, and pushes the coied buffer into LRU as well)
Find out what cp does better.
This thread is not about FFT. It is about reading / permuting / writing
data from / to disk. The transpose job is just O(N), FFT is O(N log(N)),
which only means it is even more I/O bound than transpose.
That's all I need to now at this stage. FFT *does* need "bit
permutations", but that is not relevant here.
HTH,
AvK