- Synchronization routine
- Posted by Joris Dobbelsteen on October 30th, 2005
I'm looking into a writitng a efficient routine and am looking for different
views of the problem:
Basically I have two lists, one on my firewall and one in a database. The
list on the firewall must become the same as the list in the database. The
current sizes are over 1 million entries and I expect to see 2 million rows
in the future.
Unfortuanlly the firewall vendor (Microsoft) made loading that list very
very slow when it gets large enough. Loading this list will take several
days on a very decent machine. Unfortuanlly I have only a slow system for
todays standards (would be top notch 6-7 years ago) but for the purpose.
So I have:
from the firewall:
Set A, which I can randomly access (though slowly)
from the database:
Set B, which is ordered, but can only be traversed sequentially once(!).
I expect the sets to scale up to 2 million elements.
The results should be:
C = A / B = all elements from A that are not in B
D = B / A = all elements from B that are not in A
I do not update set A until after retreiving the results. I have absolutely
no clue how they remove or add an enty to the list. So I don't want to jump
to make any assumptions in this regard.
Of course this must be done with only very limited computing power and I
mind the memory use.
My current solutions to the issue were:
Under the assumption that set A was ordered I used a 'merge join' which was
very fast. Unfortunally the assumption was flawed.
Using a sort routine has not yet worked out. I used quicksort, but I don't
believe its the correct solution. The entries in A are largely ordered so
quicksort wasn't too efficient and finally caused a stack overflow.
Next is using a 'hash join'. In my previous (probably inefficient)
implementation I saw memory usage reach over 100 MB for only 40000 entries.
Maybe anyone has a good suggestion which direction I should look for?
I thought it is a very nice problem, but hard to get it right.
- Joris
- Posted by Willem on October 30th, 2005
Joris wrote:
) I'm looking into a writitng a efficient routine and am looking for different
) views of the problem:
)
) Basically I have two lists, one on my firewall and one in a database. The
) list on the firewall must become the same as the list in the database. The
) current sizes are over 1 million entries and I expect to see 2 million rows
) in the future.
) Unfortuanlly the firewall vendor (Microsoft) made loading that list very
) very slow when it gets large enough. Loading this list will take several
) days on a very decent machine. Unfortuanlly I have only a slow system for
) todays standards (would be top notch 6-7 years ago) but for the purpose.
)
)
) So I have:
)
) from the firewall:
) Set A, which I can randomly access (though slowly)
How do you randomly access it ? Do you access the Nth entry, or can you
access matching entries or something ?
) from the database:
) Set B, which is ordered, but can only be traversed sequentially once(!).
) I expect the sets to scale up to 2 million elements.
)
) The results should be:
) C = A / B = all elements from A that are not in B
) D = B / A = all elements from B that are not in A
)
) I do not update set A until after retreiving the results. I have absolutely
) no clue how they remove or add an enty to the list. So I don't want to jump
) to make any assumptions in this regard.
)
) Of course this must be done with only very limited computing power and I
) mind the memory use.
)
)
) My current solutions to the issue were:
)
) Under the assumption that set A was ordered I used a 'merge join' which was
) very fast. Unfortunally the assumption was flawed.
)
) Using a sort routine has not yet worked out. I used quicksort, but I don't
) believe its the correct solution. The entries in A are largely ordered so
) quicksort wasn't too efficient and finally caused a stack overflow.
You should tweak your quicksort so that it is efficient on largely ordered
sets(1). If it causes a stack overflow, you implemented it badly enyhow(2).
Or did you use some standard function ?
1) To make quicksort behave on almost-ordered sets, pick several items
from your sublist and take the median one as your pivot. Three should
suffice.
2) If you want to make sure that quicksort doesn't use more than O(log N)
stackspace, you have to implement it as tail recursion, and recurse into
the smaller partition first. Alternatively, you could implement your
own stack, allocating fixed space beforehand in the order log N entries.
) Next is using a 'hash join'. In my previous (probably inefficient)
) implementation I saw memory usage reach over 100 MB for only 40000 entries.
How much space would one entry take normally ? And how similar are your
entries ? If they often share the same prefix, you should look into tries.
) Maybe anyone has a good suggestion which direction I should look for?
) I thought it is a very nice problem, but hard to get it right.
It depends a lot on what your data looks like exactly.
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
- Posted by Ertugrul Soeylemez on October 31st, 2005
"Joris Dobbelsteen" <joris.dobbelsteen@mail.com> (Sun, 30 Oct 2005 21:30:48+0100):
Taking into account that you have an old machine, you might be very low
on memory, so this isn't practical. For about one million entries, the
memory consumtion is the same as the size of the individual entries, in
MB! If every entry is 128 bytes in size, then a million entries will
consume roughly 128 MB. And the data size is well under-estimated. You
should consider this.
Regards.
-----
Public key "Ertugrul Soeylemez <never@drwxr-xr-x.org>" (id: CE402012)
Fingerprint: 0F12 0912 DFC8 2FC5 E2B8 A23E 6BAC 998E CE40 2012
HKP: hkp://subkeys.pgp.net/
LDAP: ldap://keyserver.pgp.com/
HTTP: http://www.keyserver.de/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
iD8DBQFDZZGJa6yZjs5AIBIRAnSjAJ9l5LOGfkwS9ygBwafnFq MK+nRZ7QCeJQgR
zle64AgayrlAEWqeenrEf+w=
=qs00
-----END PGP SIGNATURE-----
- Posted by Joris Dobbelsteen on October 31st, 2005
For some funny reason I see your text as a attachment in outlook express...
"Ertugrul Soeylemez" <never@drwxr-xr-x.org> wrote in message
news:20051031043743.0328ed08@kill.mine.nu...
I did consider this.
The average entry size is 24 bytes. So this makes 24 MB for 1 M entries.
With pointers and such it will grow probably several MBs too, but not that
far. The data from the database was pulled for a sequential scan only, not
to comsume too much memory.
The previous (probably flawed) implementation used over 128 MB for only 40k
entries. The dataset was very similair, so I would have expected only
several MB, not growing fast too. Probably something here went wrong.
- Joris
- Posted by Joris Dobbelsteen on October 31st, 2005
"Willem" <willem@stack.nl> wrote in message
news:slrndmap85.1eho.willem@toad.stack.nl...
Looks like TU/e 
I took an example from a book, the "C Handbook". They noted its not the
fasted, but not too much information. Wikipedia I probably didn't fully
follow.
Its more that its consist of a lot of ordered sequences. So a large part was
ordered, after which came some new additions (the additions are inserted in
ascending order).
Probably I missed that I needed tail recursion. I probably should look for a
better source of information on this.
How would it compare with e.g. merge sorting, as very large blocks of the
data are already sorted and therefor must only be merged and only require
very little sorting..
All entries are guarenteed to be unique. Space will probably 24 byte string
per entry (only the data).
Hope this makes it a bit clear.
- Joris
- Posted by Willem on October 31st, 2005
Joris wrote:
) "Willem" <willem@stack.nl> wrote in message
) news:slrndmap85.1eho.willem@toad.stack.nl...
)> You should tweak your quicksort so that it is efficient on largely ordered
)> sets(1). If it causes a stack overflow, you implemented it badly
)> enyhow(2).
)>
)> Or did you use some standard function ?
)
) I took an example from a book, the "C Handbook". They noted its not the
) fasted, but not too much information. Wikipedia I probably didn't fully
) follow.
Ah, thought as much..
)> 1) To make quicksort behave on almost-ordered sets, pick several items
)> from your sublist and take the median one as your pivot. Three should
)> suffice.
)
) Its more that its consist of a lot of ordered sequences. So a large part was
) ordered, after which came some new additions (the additions are inserted in
) ascending order).
The same reasoning still applies. If at your current recursion step, the
sequence you have is ordered, and you pick the first item as your pivot,
then it will be divided into an empty part and a part that is only 1 item
smaller. If you pick the first, middle and last items and then use the
middle one as pivot, you've got a much better chance to get a reasonable
division.
)> 2) If you want to make sure that quicksort doesn't use more than O(log N)
)> stackspace, you have to implement it as tail recursion, and recurse into
)> the smaller partition first. Alternatively, you could implement your
)> own stack, allocating fixed space beforehand in the order log N entries.
)
) Probably I missed that I needed tail recursion. I probably should look for a
) better source of information on this.
If you don't tail recurse, then stack space is O(N) in the worst case.
With a badly designed algorithm, sorted data *is* the worst case.
) How would it compare with e.g. merge sorting, as very large blocks of the
) data are already sorted and therefor must only be merged and only require
) very little sorting..
Mergesort would still descend into smaller and smaller partitions.
It doesn't magically know that the data happens to be sorted.
In fact, the running time of mergesort is always the same(*).
*) If you were using linked lists, there would be one optimization that
would give a speedup of upto a constant factor of 2.
)> How much space would one entry take normally ? And how similar are your
)> entries ? If they often share the same prefix, you should look into
)> tries.
)
) All entries are guarenteed to be unique. Space will probably 24 byte string
) per entry (only the data).
Hmm, then there's something wrong with your hash table implementation.
24 bytes of string, plus 4 bytes for a pointer entry, and maybe another
4 bytes if you're using the chain-link hash approach.
So, for 40000 entries, that would be somewhat over a megabyte.
But the way a hashtable works is that you only fill it halfway,
so you'd actually be using two megabytes. Sorting is a lot more
memory-friendly, because you only need room for the entries you have.
)> It depends a lot on what your data looks like exactly.
)
) Hope this makes it a bit clear.
Well, you still didn't say if the 24-byte strings tend to have similar
prefixes or not.
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
- Posted by Richard Harter on October 31st, 2005
On Mon, 31 Oct 2005 13:36:17 +0000 (UTC), Willem <willem@stack.nl>
wrote:
[snip much that is good]
It isn't quite true that the running time of mergesort is always the
same. When merging two runs of length n the number of comparisons
required may be anything between n and 2n-1.
Richard Harter, cri@tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
I started out in life with nothing.
I still have most of it left.
- Posted by CTips on October 31st, 2005
Question: have you considered bringing the records over as a text file,
and then using textutils (sort, join in particular) to do your
processing for you?
- Posted by moi on November 2nd, 2005
Joris Dobbelsteen wrote:
How ? By key ? By index or slotnumber ?
An open cursor, I presume ?
A point of concern is the access time. If you are forced to use
Row-At-A-Time processing, this will be about 1ms (network)
to 10ms (disk) *per record*. At best, that is.
So, your method should IMHO be focussing on getting the data first.
If you are only interested in
A & ~B
and
B & ~A
and *not* in A & B, then you could
* put A in a 'table' (trie/tree/hashtable/skiplist)
* for every element of B:
* look it up in table
* if it exists: delete it
* if not: create it, and flag it.
After this the table contains unflagged (only in A)
or flagged (only in B) records.
Note that given the "table" neither A nor B need to be sorted
in advance. A hashtable for 1M * (24+xx) entries seems possible to me.
HTH.
AvK