- Efficient algorithm for finding duplicates in 56-bit number
- Posted by Heiner Steven on January 28th, 2005
Hello,
I'm searching for an memory-efficient (and hopefully fast)
algorithm for detecting duplicate 56-bit numbers.
The input consists of millions of lines like this:
0X01234567890abcdef01234567890abcd
Each line is a textual representation of a 56-bit hex number.
The file should be read into memory and stored there in a
memory-efficient way.
Afterwards new numbers are read from somewhere else, and
the program needs to know if the number exists already
or if it is a new, unique one.
What would be a good data-structure for this purpose?
"gzip" shows that there's a lot of redundancy in the text file
(factor >20). However, keeping the compressed data in
memory is no option, because searching it would be too slow.
Heiner
--
___ _
/ __| |_ _____ _____ _ _ Heiner STEVEN <heiner.steven@nexgo.de>
\__ \ _/ -_) V / -_) ' \ Shell Script Programmers: visit
|___/\__\___|\_/\___|_||_| http://www.shelldorado.com/
- Posted by spinoza1111@yahoo.com on January 28th, 2005
A "hash" data structure, accessed for example on the first n bits of
each number, would store only unique numbers, and, as you have found
from gzip, there is a lot of redundancy. Look up "hashing".
- Posted by Heiner Steven on January 29th, 2005
spinoza1111@yahoo.com wrote:
I know how hashing works. But in that case I would have to
store all numbers + a hashing table in memory. Storing the
numbers in a more concise format would be better.
Additionally, hashing would be fast, but the hashing table would take
additional memory.
Heiner
--
___ _
/ __| |_ _____ _____ _ _ Heiner STEVEN <heiner.steven@nexgo.de>
\__ \ _/ -_) V / -_) ' \ Shell Script Programmers: visit
|___/\__\___|\_/\___|_||_| http://www.shelldorado.com/
- Posted by CBFalconer on January 29th, 2005
Heiner Steven wrote:
A hashtable. You just happen to be in luck, because I have
published a free hashtable implementation under GPL. If you happen
to want it for commercial purposes that can be discussed. It is
available, with examples and tests, at:
<http://cbfalconer.home.att.net/download/hashlib.zip>
You can decide for yourself whether or not to convert your input
strings into binary numbers internally. I would suggest making
your initial tries by storing the actual string and not worrying
about it yet. Since they are hex the conversions will be trivial.
--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
- Posted by Chris Williams on January 29th, 2005
Heiner Steven wrote:
Perhaps a trie? The idea is that you create a special sort of tree with
a node-type that has 16 slots and a 16-bit flag item at the top.
struct trie16 {
unsigned short itemPointer;
/* ^ each bit is a boolean value telling us whether or not each below
pointer is a pointer to a string (1) or a pointer to another trie16
structure */
void* _0;
void* _1;
void* _2;
void* _3;
void* _4;
void* _5;
void* _6;
void* _7;
void* _8;
void* _9;
void* _a;
void* _b;
void* _c;
void* _d;
void* _e;
void* _f;
}
You then use each of your strings as a "map" to guide you through the
trie. So if you start with just one empty node, then when you add
0X01234567890abcdef01234567890abcd, you will put a pointer to it in the
_0 slot and set the flag for that item to 1. If we then put in
0x11111111111111111111111111111111, this will go in the _1 slot. Then
we try to add 0x00000000000000000000000000000000, which we will want to
put in _0, but the slot is already filled.
First we compare the strings. If they are exactly the same, then we can
already consider it as stored. Otherwise we create a new node and move
0x01234... into the _1 slot of the second note and 0x00000... into the
_0 slot. The _0 slot of the top node will be pointed to this new node
and the flag set to 0. Note that is the two values were identical for
several characters then we might have had to create three or four more
nodes to extend the trie out to where the difference began.
I haven't done enough playing around with structures like this nor
hashes to know what the relative speed/memory usage rates are like.
Most programmers advocate hashes and I generally have to assume that
they know what they are talking about...but just in case here is one
other thing to look up and try =)
-Chris
- Posted by Chris Williams on January 29th, 2005
And best not to preoptimize, but if you really wanted to get
memory-stingy with your trie, one trick is to drop as many of the
hexadecimal decimals as you already have mapped when you store the
original string. But you would need to make sure that when you have to
move that string to a new, deeper node that you increment the pointer
you saved so that 56 - (node depth) == remaining number of hex values
the pointer points to.
And of course don't save your hex values as 8-bit ascii strings =)
-Chris
- Posted by spinoza1111@yahoo.com on January 29th, 2005
Heiner Steven wrote:
I apologize for the patronizing tone of my reply.
First of all, you don't have to store an extra hashing table in memory
(cf. Cormen et al. 2000: "Introduction to Algorithms", mit press). Open
hashing, which is just as efficient and far more elegant than links,
simply sets the target entry to the number.
IMO, you need to impose "separation of concerns" on hashing versus "a
concise format". Use hashing to search and store. If conciseness is
needed, select a good way to pack the values. If they are, for example,
56 byte strings, restricted to letters and numbers, as appears, they
are in fact base 36 values...which can be represented by a rather small
array of long, 8 byte integers, far less storage in fact than a 56
character string at each location.
Your library may even have an "extra precision" handler.
Nope. 'Twon't. I haven't used links to implement hashing since 1974.
Open addressing is just as fast. For a clearer discussion of the
practicalities, see Ch 8 of my book (Build Your Own .Net Language and
Compiler, Edward Nilges 2004, Apress).
The worst case for both styles of hashing is pretty bad, because in the
worst case, all your values hash to THE SAME LOCATION, and the hash
table is just a random table using linear search. But what this means
is you've selected an improper hash algorithm!
- Posted by spinoza1111@yahoo.com on January 29th, 2005
My recommendation, storage of the strings as a small array of Long
integers, does mean that to "compare" values during your hash search
(or other form of search) you have to compare two small arrays (the
input and the entry in the hash table) in a small loop.
But if you arrange the Long integers in small-endian fashion, then that
loop isn't constant K time, it is K/2 time.
It might be actually better to check "right to left" in the K/2 compare
because if memory serves, many of the values have common front ends.
This also means that your hash function can be efficiently based on the
last digits, perhaps the least significant integer, mod the size of the
hash table.
Also, the size of the hash table should not be considered a constant.
Painfully, while processing a lot of numbers you can even expand the
hash table on the fly by creating a new table and re-hashing each
entry...assuming that your hash function is something mod the size of
the hash table.
The benefit is that your code will almost never snarf out as the size
of the data base increases.
I don't think this expensive reorganization, which is an advanced
feature, can be avoided for the simple reason that IF the hash function
(which converts an entire number to a number in the range of the table)
does not depend on the input number, then there is no reason why
identical numbers would map to the same place...which the entire scheme
sorta depends on.
There are also subtleties, documented in the reference to Cormen, about
your selection of the size of the hash table. Knuth's Sorting and
Searching volume in his series on The Art of Computer Programming
discusses these more clearly, but Cormen has more recent research
results.
In fact, recent research has discovered all sorts of minimal functions
but my experience, based on a combination of pragmatism and laziness,
is to take the end of the input strings as being the "most random" in
practice. The beginning in the case of symbols in a programming
language is "least random" almost universally, especially if the code
uses Hungarian notation.
What you want is a random-appearing distribution of the hash targets,
not strange clusters and blobs, for for each strange cluster, each
blob, there are more and more linear searches. One reason I still use
VB is that it's easy to create a nifty visual demo of clustering.
But the central magic of hashing works as long as the key is input to
the hash function and the relation of the key to possible positions is
one-many. Indeed, the elegance of open hashing is an attractive
combination of precision (hash function applied to key goes to same
place) followed by apparent messiness.
It amuses me to think of an expensive and precise machine tossing
values as the wild sheep defecate yet to recover them precisely. It is
for me an emotional deconstruction of the rhetorical presentation of
digital computers as necessarily rigid.
There's a subtlety in open hashing if you delete entries (doesn't sound
like you do). It is that you must use a special marker for a deleted
entry! This is because an EMPTY entry proves that a key does not exist,
whereas a match may be found in the search (starting from the hash
location, and as needed wrapping, to stop at the original entry) after
the deleted entry.
Of course, today, many hash implementations such as are found in .Net
encapsulate good praxis. The problem is that other efforts encapsulate
Stupid Hash Tricks. It is also that completing a good implementation
after actually reading Cormen (reading Nilges is sadly not enough,
since my book develops a compiler in which we can assume a good hasher)
is itself rewarding in a Promethean sense. In fact, it is a rather
small amount of code.
- Posted by spinoza1111@yahoo.com on January 29th, 2005
The hash table is far more easily explained...a strong argument in its
favor. Its downside is that its worst case behavior sucks and is
equivalent to linear search. Even before approaching worst case,
unexpectedly patterned input will degrade the hash table.
Of course, a self-conscious "hash object", being an object whose state
can readily contain performance information, can "notice" when strange
clusters and blobs start to appear, by maintaining the maximum
"reasonable" search length in object state.
This length should be a percent of the space available so that the code
exhibits sensitivity to the nonconstant and arbitrary nature of the max
size of the table.
When a blob appears, the table can be expanded and reorganized. The
trouble is the time it takes to reorg, but this only occurs as needed.
In my experience with early releases of data bases, such bricolage
created strange pauses in the processing of user data.
Therefore, the hash algorithm's applicability needs to be considered
carefully.
- Posted by Marc Poulin on January 29th, 2005
Heiner Steven wrote:
Here's a Perl script to test hashing speed:
############################################
%names = {};
$n = "0X01234567890abcdef01234567890abcd";
for ($x=0; $x<4000000; $x++)
{
$names{$n} += 1;
}
print "Found $n ", $names{$n}, " times\n";
############################################
Hashing your example string 4 million times
takes approximately 19 seconds on my 66 MHz
Pentium II.
Hashing in Perl is pretty efficient. I
don't think you will find any implementations
significantly faster than this.
- Posted by Daniel Lidström on January 29th, 2005
On Sat, 29 Jan 2005 01:32:13 +0100, Heiner Steven wrote:
If you are using C++, you can try the std::set container. Something like
this:
#include <algorithm>
#include <iterator>
#include <iostream>
#include <vector>
#include <ctime>
#include <set>
using namespace std;
//! for gcc:
typedef unsigned long long uint64;
//! for VC:
//typedef unsigned __int64 uint64;
struct generator
{
mutable vector<uint64>& m_subset;
generator(vector<uint64>& v) : m_subset(v) { }
//! called by generate_n
uint64 operator()() const
{
uint64 num = uint64(rand()) << 32 | rand();
if( !(rand()%100) )
m_subset.push_back(num);
return num;
}
};
//! this is a function object that acts as a counter
struct counter
{
//! keep a reference to the list of unique numbers
const set<uint64>& m_v;
static int numbers;
counter(const set<uint64>& v) : m_v(v) { }
//! called by for_each
//! counts the numbers that exist in m_v
void operator()(uint64 number) const
{
if( m_v.find(number)!=m_v.end() )
numbers ++;
}
};
int counter::numbers = 0;
int main()
{
//! seed random number generator
srand(time(0));
//! this container holds unique numbers
//! the number is both key and data
set<uint64> unique;
//! put one million numbers into container
//! keep 1% of generated numbers here:
vector<uint64> subset;
generator g(subset);
generate_n(inserter(unique, unique.begin()), 1000000, g);
//! now time the lookup of the saved subset
time_t t = time(0);
counter c(unique);
cout << "Size of subset: " << subset.size() << endl;
for_each(subset.begin(), subset.end(), c);
cout << "Lookup took " << time(0)-t << "s" << endl;
cout << "Found " << counter::numbers << " numbers" << endl;
return 0;
}
On my computer (3GHz P4) the output is this:
Size of subset: 9952
Lookup took 0s
Found 9952 numbers
--
Daniel
- Posted by Matt Gregory on January 29th, 2005
Heiner Steven wrote:
Read the numbers in and convert to 7 or 8 byte values, whichever
floats your boat. Put them in an array and sort them. Find them
using binary search.
Matt Gregory
- Posted by gerard46 on January 29th, 2005
| Matt Gregory wrote:
|> Heiner Steven wrote:
|> I'm searching for an memory-efficient (and hopefully fast)
|> algorithm for detecting duplicate 56-bit numbers.
|> The input consists of millions of lines like this:
|> 0X01234567890abcdef01234567890abcd
|> Each line is a textual representation of a 56-bit hex number.
|> The file should be read into memory and stored there in a
|> memory-efficient way.
| Read the numbers in and convert to 7 or 8 byte values, whichever
| floats your boat. Put them in an array and sort them. Find them
| using binary search.
As an aside, I'd be interested to see how long (wall clock and
processor time) it would take to sort millions of those values
(and I assume that 2 million is quite less than what Heiner had
in mind when he said millions).
If the OP is looking for duplicates, all one has to do is just
sort the file and see if any of the records match the previous
one. No need to do a binary search. This boils the problem
down to having some utility do the work sorting the file, and
creating a new file if the old one is to be kept intact.
Presumably, let the utility run overnight or at some other
"slack" time. __________________________________________Gerard S.
- Posted by Willem on January 29th, 2005
gerard46 wrote:
) As an aside, I'd be interested to see how long (wall clock and
) processor time) it would take to sort millions of those values
) (and I assume that 2 million is quite less than what Heiner had
) in mind when he said millions).
)
) If the OP is looking for duplicates, all one has to do is just
) sort the file and see if any of the records match the previous
) one. No need to do a binary search. This boils the problem
) down to having some utility do the work sorting the file, and
) creating a new file if the old one is to be kept intact.
) Presumably, let the utility run overnight or at some other
) "slack" time. __________________________________________Gerard S.
I think you're overestimating the amount of time needed to sort
mollions of numbers. Basically, if it fits in memory, sorting time
is in the order of minutes, or maybe even seconds.
If it doesn't, you're best off with a sort that has very good locality
of reference, or even tailor-make one that sorts in chunks (where each
chunk fits in memory) and then merges the chunks.
You can merge the chunks in one go, if you read those in smaller chunks
so it fits in memory again, and use maybe a heap structure of pointers
to chunks to always output the smallest value.
I seem to remember there's a sorting challenge where you have to sort
like billions or trillions of numbers on big hardware.
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
- Posted by gerard46 on January 29th, 2005
| Willem wrote:
|) gerard46 wrote:
|) As an aside, I'd be interested to see how long (wall clock and
|) processor time) it would take to sort millions of those values
|) (and I assume that 2 million is quite less than what Heiner had
|) in mind when he said millions).
|)
|) If the OP is looking for duplicates, all one has to do is just
|) sort the file and see if any of the records match the previous
|) one. No need to do a binary search. This boils the problem
|) down to having some utility do the work sorting the file, and
|) creating a new file if the old one is to be kept intact.
|) Presumably, let the utility run overnight or at some other
|) "slack" time. __________________________________________Gerard S.
| I think you're overestimating the amount of time needed to sort
| mollions of numbers. Basically, if it fits in memory, sorting time
| is in the order of minutes, or maybe even seconds.
Yes, and that'a a heck of a big IF (if it fits in memory). Millions
(let's pick ten as a "good" number) of numbers, each requiring 56 bytes
(as I understood the OP's numbers being 56-byte hex numbers) would
require 1/2 Gigabyte or so of memory. Even if you could read all those
numbers into real memory, I don't think that it can be done in an order
of minutes, and certainly not seconds. But, I could be wrong, of
course. __________________________________________________ Gerard S.
| If it doesn't, you're best off with a sort that has very good locality
| of reference, or even tailor-make one that sorts in chunks (where each
| chunk fits in memory) and then merges the chunks.
| You can merge the chunks in one go, if you read those in smaller chunks
| so it fits in memory again, and use maybe a heap structure of pointers
| to chunks to always output the smallest value.
|
| I seem to remember there's a sorting challenge where you have to sort
| like billions or trillions of numbers on big hardware.
What size numbers, 32 bits ? _____________________________Gerard S.
- Posted by CBFalconer on January 29th, 2005
Willem wrote:
The problem doesn't require sorting. Here is a test run on my
hashlib package, run on a ten year old 486/80 machine. The values
to insert came from a 32 bit Mersenne Twister random generator.
Somewhere around 250,000 entries the system starts to use virtual
memory, IIRC. I would expect most modern machinery to execute the
same runs in about 1 or 2 seconds.
[1] c:\c\hashlib>timerun hashtest 4 100001
Timer 3 on: 16:24:27
HASHLIB test04
New table, inserting 100001 values
Status: Entries=99999, Deleted=0, Probes=461067, Misses=246947
Walking returned 0
0 items were inserted 0 times
99997 items were inserted 1 times
2 items were inserted 2 times
0 items were inserted 3 times
0 items were inserted 4 times
0 items were inserted 5 times
0 items were inserted 6 times
0 items were inserted 7 times
0 items were inserted 8 times
0 items were inserted 9 or more times
Suppressing hshkill()
Timer 3 off: 16:24:32 Elapsed: 0:00:04.62
[1] c:\c\hashlib>timerun hashtest 4 1000001
Timer 3 on: 16:24:38
HASHLIB test04
New table, inserting 1000001 values
Status: Entries=999877, Deleted=0, Probes=5861138, Misses=3026865
Walking returned 0
0 items were inserted 0 times
999753 items were inserted 1 times
124 items were inserted 2 times
0 items were inserted 3 times
0 items were inserted 4 times
0 items were inserted 5 times
0 items were inserted 6 times
0 items were inserted 7 times
0 items were inserted 8 times
0 items were inserted 9 or more times
Suppressing hshkill()
Timer 3 off: 16:25:50 Elapsed: 0:01:12.01
The runs avoided the final freeing of the data entry storage, which
is an O(N*N) operation on many systems. Finding this out inspired
my nmalloc package for DJGPP, which doesn't have this fault.
hashtest ? gives help and shows how to avoid this.
Here is a test run showing the effect that free operation can
have. One run uses the nmalloc package, and one uses the
original. The time went from 4 seconds to well over 4 minutes.
The inputs are identical because the random package was not seeded.
[1] c:\c\hashlib>timerun hshtstm 4 100000
Timer 3 on: 16:42:38
HASHLIB test04
New table, inserting 100000 values
Status: Entries=99998, Deleted=0, Probes=461060, Misses=246941
Walking returned 0
0 items were inserted 0 times
99996 items were inserted 1 times
2 items were inserted 2 times
0 items were inserted 3 times
0 items were inserted 4 times
0 items were inserted 5 times
0 items were inserted 6 times
0 items were inserted 7 times
0 items were inserted 8 times
0 items were inserted 9 or more times
Timer 3 off: 16:42:44 Elapsed: 0:00:05.55
[1] c:\c\hashlib>timerun hashtest 4 100000
Timer 3 on: 16:43:07
HASHLIB test04
New table, inserting 100000 values
Status: Entries=99998, Deleted=0, Probes=461060, Misses=246941
Walking returned 0
0 items were inserted 0 times
99996 items were inserted 1 times
2 items were inserted 2 times
0 items were inserted 3 times
0 items were inserted 4 times
0 items were inserted 5 times
0 items were inserted 6 times
0 items were inserted 7 times
0 items were inserted 8 times
0 items were inserted 9 or more times
Timer 3 off: 16:47:28 Elapsed: 0:04:20.67
--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
- Posted by Willem on January 29th, 2005
CBFalconer wrote:
) Willem wrote:
)>
) ... snip ...
)>
)> I think you're overestimating the amount of time needed to sort
)> millions of numbers. Basically, if it fits in memory, sorting time
)> is in the order of minutes, or maybe even seconds.
)
) The problem doesn't require sorting. Here is a test run on my
) hashlib package, run on a ten year old 486/80 machine. The values
) to insert came from a 32 bit Mersenne Twister random generator.
) Somewhere around 250,000 entries the system starts to use virtual
) memory, IIRC. I would expect most modern machinery to execute the
) same runs in about 1 or 2 seconds.
Sorting, when done properly, has a lot better locality of reference.
This means that on problem sizes that vastly outgrow available memory,
the sorting solution should outperform the hashing solution significantly.
If I read your test correctly, you were running with a mere million values.
Now, I don't know how many values the OP has to contend with, but if it's
in the order of billions, I think you're going to run out of core on most
modern desktop machines.
As an aside, have you done test runs with a sorting algorithm on the same
test data sets, to get a comparison ?
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
- Posted by Ian Woods on January 29th, 2005
On Sat, 29 Jan 2005 01:32:13 +0100
Heiner Steven <heiner.steven@nexgo.de> wrote:
Any technique used has to compare favourably with the "braindead" approach: load all of the 56-bit entries into an array, sort them and then use something akin to a binary search.
The notable thing here is the memory usage - it's only 56-bits per entry. Compare this to any technique which has to store a value or pointer per entry: there's 32-bits. Also compare to any tree-like data structure which try's to exploit sparseness: if you store 64-bits per entry at the leaf nodes, and only one entry per leaf node is likely (which it is given the sparseness of the data you have), then it's already worse than the braindead approach.
The best I've been able to find, playing around in my head, is a small modification to the braindead approach which exploits the large amount of redundency in the first few bits.
If you have a data structure which does a simple lookup into a table on the value of the first, say, 8 bits to sorted arrays of entries 48-bits long you can reduce the amount of storage and improve the time to search for an entry. We could do the same process again on the 48-bit entries...
However, if you do this all the way down to the last 20 bits or so, you suddently start using more memory per entry than the braindead approach again. It would appear that any kind of tree structure applied all the way to the bottom few bits will require more memory than the braindead apprach no matter how it's organised.
The clue to the point where it's worth stopping is in the sparseness of the data structure. For each 8-bit lookup table we have 256 pointers For 'n' entries which are evenly distributed over the whole of the 0 to (2^56)-1 range we can expect n/256 entries for each array off the first 8-bit lookup table, which uses for 32 bit pointers 1KB of memory and then 6*n octets to store the remaining 48-bits of each entry. We can expect n/65536 entries if we lookup the first 16 bits, the lookup table requiring 256KB, then 5*n octets for the remaining 40 bits.
Or, in otherwords.... if we lookup the first 'b' bits of the entry, we require 4*2^b + ((56-b)*n)/8 octets of storage. If n=1 million...
b storage (octets)
0 7000000
4 6500064
8 6001024
12 5516384
16 5262144
20 4194304+4500000 ... which is big
24 is going to be much bigger..
Hopefully my random whitterings here might be useful. :P
Ian Woods
- Posted by spinoza1111@yahoo.com on January 30th, 2005
The practical problem with the braindead approach is that the guy has
millions of records, and a non-braindead sorting algorithm such as
quick or shell sort has to be used to realistically load the table.
Quick and shell sorting are more psychologically complicated, perhaps,
than open hash.
But if this is only my prejudice, the use of a sort AFTER all entries
are loaded creates a pragmatic issue.
This is that even using a good algorithm there will be a strange pause
after load during sort whereas open hash allows each entry to be loaded
with a time footprint that only degrades slowly.
In the "braindead" approach, you have to wait constant (but large) time
proportional to the size of the data set order 1 while the data set
loads, and a roughly equivalent time during sort until you are
presented with the first useful results.
Whereas an object approach using open hash has the multithread
capability to access items before any drama, before the load is
complete. During the load you can start work because some keys will
already exist, and the purpose of computers is to work us all to death,
isn't it, in paralell.
I am not merely trying to be droll: I am not at all trying to be a
troll. "Brain dead" approaches can be less maintainable and extensible
in reality, even though they are sold as such, because the code, small
and elegant AS CODE, is AS A DYNAMIC PROCESS hard to adapt to real
requirements, such as a sudden rage on the user's part to access the
table while waiting for it to be fully present.
In the brain dead approach, the table must be read-locked until it is
fully loaded and sorted. No such rule applies to open hashing.
Problem breakdown is a Good Thing. But one of the most primitive ways
to implement "separation of concerns" is to create a USELESS data set
and pipe that puppy which creates a USELESS data set, and send it down
the Fordist assembly line to the finisher process, only to have
something USEFUL at the end of the working day.
Many pipe mechanisms are of course asynchronous, but not this one.
I call this Fordist advisedly although my lawyer wants me to stop :-),
because you cannot let the customer drive the car before finishing it
but data structures are different.
Open hashing creates the final and useful product, lacking only closure
in the form of all entries, from the beginning and that is why it MAY
be useful here. It stores the minimal amount of crap (eg., zero crap)
per entry.
One negative remains the case that there is no requirement for deletion
apart from not storing duplicate keys. This means that you do have a
strong case for brain death here.
But my prejudice against "batch" and the step by step production of
Useless Things remains precisely because I recall, with less than
fondness, eternal vigils in mainframe computer rooms, glazedly watching
der blinkenlights.
Of course, real users can switch over to porn sites today or even do
useful work while a "brain dead" approach pipes garbage, only at the
end accomplishing the alchemical step.
But the argument for brain death as simplicity itself has to give way
to the realization that processes also need to be "as simple as
possible, but no simpler".
- Posted by spinoza1111@yahoo.com on January 30th, 2005
Sorting, even when done properly, does not have better locality of
reference. In the initial phases of most sorting algorithms, the
algorithms will compare distant entries.
Whereas open hash has high locality UNTIL the table is almost full. In
moving forward and then "wrapping" mod the size of the table, it stays
as close as possible, at all times, to the starting point. Even in a
nearly full hashtable, I believe, the "locality of reference", if
quantified, would prove to be best-fit.
Which means that sorting solutions will cause page faults early on for
large data sets, while open hashing will cause page faults only when
blobs and clusters exist of values with close keys.
But this problem means that the hashing algorithm was probably poorly
chosen with respect to the key values whereas the problem in most
sorting algorithms is ineradicable. Most start with comparing distant
keys.
In fact, that's how the faster sort algorithms get their power, isn't
it, while the most "brain dead" algorithm has in fact high locality. It
causes page faults, but in a regular way.