Tech Support > Computers & Technology > Programming > hash table idea good or no good?
hash table idea good or no good?
Posted by Michael B Allen on November 17th, 2003


I have an idea for a hash table implementation but it's so simple there
must be something wrong with it.

The usual suspects for handling collisions are chanining, open addressing,
linear probing, etc. But in the event that there's a collision, why
not just try again with an entirely separate table. The design is very
simple and the table can grow and shrink without rehashing every element.

For example, consider inserting an element with the C-ish code below. The
hash is generated and the data is placed into the first table of 31
elements unless it is already occupied in which case the next table of
61 elements is tried and so on. Memory to back each table is allocated
as necessary.

What do you think?

Mike

static const int table_sizes[] = {
31, 61, 127, 251, 509, 1021, 2039, 4093,
8191, 16381, 32749, 65521, 131071, 262139, 524287, 1048573
};

int
hashtable_put(key, data)
{
level = 0

while (level < 16) {
hash = hash_fn(key)
if (table[level] == NULL) {
table[level] = alloc(table_sizes[level])
}
ent = table[level][hash / table_sizes[level]]
if (ent is occupied) {
level++
continue;
}

*ent = data

return 0
}

return -1;
}

Posted by Christian Gollwitzer on November 17th, 2003


Michael B Allen wrote:
I think, it's a bad idea because it performs very bad in the worst case.
If you insert 17 elements that hash to the same value (and your hash
function is independent from the table size !), the fiorst 16 will be
placed in different tables, and the function claims there is no space
for the 17th element. But you've alloced about two million cells so far,
and that's not enough to hold 17 entries.
If you consider linear probing etc. in the worst case (which is
apparently the same), the insertion time goes up to O(n), but the table
is still able to swallow entries. If you want improve the performance in
the worst case, consider organizing the buffers for overflow as binary
trees. This gives you O(log n) in the "average worst case" (hash to same
value, but scattered keys) and you can reach O(log n) in the worst case
always if you choose balanced trees.


--
Vale !
Christianus Auriocus


Posted by Omri Barel on November 17th, 2003


Michael B Allen wrote:

It seems to me that this is completely equivalent to having a hash
function hash(key,idx) for which hash(key,0) is 0..30, hash(key,1) is
31..91 and so on.

However, there is one major disadvantage - after you fill the first
level, you're bound to have at least one collision for each additional
element, even if you've allocated the next 10 levels. After you fill the
second level you'll have 2 guaranteed collisions, and so on. This would
be worse than normal schemes (where there may be collisions, but they're
not guaranteed in the same way).

And one more thing, I suppose you meant

and not

Otherwise most of your keys end up mapped to 0,1,2 (or, alternatively,
you risk getting an out-of-bound index).

In any case, you can always try your scheme with a large number of keys,
and check running times, collision rate etc.


Posted by CBFalconer on November 17th, 2003


Michael B Allen wrote:
Consider the worst case storage use. The first 16 entries all
hash to the same value, thus allocating 31 + 61 + .... + 4093 +
.... + 1048573 entry tables, or roughly 2,000,000 entries. The
system with rehashing is still working on the initial 31 entry
table. With a decent rehash function it probably takes no more
than 3 probes per entry average.

Rehashing to expand the table is not especially onerous. You can
find a complete implementation, with several usage examples, in
hashlib.zip at:

<http://cbfalconer.home.att.net/download/>

and the system keeps statistics on its own performance. Tables
are known to be between roughly 45% and 88% full at all times
after the second expansion.

--
Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!



Posted by Michael B Allen on November 17th, 2003


On Mon, 17 Nov 2003 04:34:26 -0500, Omri Barel wrote:
Not exactly but your point is sound; a single much larger table would
have the same properties. The advantage of having several layers of
tables is that the storage capacity can grow very large. It's also not
difficult to shrink and reclaim memory without rehashing each element.

True. But it's not clear to me how "major" the disadvantage. The asymtotic
growth is still pretty good if it takes 8N to access on in 31 + 61 +
127 + 251 + 509 + 1021 + 2039 + 4093 elements. Also, the table sizes
could increase more agressively to reduce the number of
guarenteed collisions. Some analysis might find a "sweet spot".

Yes I did. Absolutely needs to be modulo.

Exactly.

Mike

Posted by Michael B Allen on November 17th, 2003


On Mon, 17 Nov 2003 04:19:38 -0500, Christian Gollwitzer wrote:
Not quite. First, the chances of "17 elements that hash to the same value"
is extreemly small. There would need to be a 32 bit interger for which
modulo each of the 16 primes generated the same set of values. In fact
you could probably find a set of primes (or better still a set of sets
of primes) for which it is impossible.

However, it is still a not-so-good idea because in the worse case
(provided you worked out the above) it is still possible and likely
that there are 17 elements that could hash to values that are occupied in
each level causing more memory to be allocated even though the previous
levels are poorly utilized. The overall outcome is a perpensity for very
poor memory utilization.

Mike

Posted by Christian Gollwitzer on November 18th, 2003


Michael B Allen wrote:


of hashing), you will always have collisions (simple set theory). Maybe
you can avoid this one extreme case, but I think it is basically no good
idea. Why not use rehashing?

In your pseudo-code, you don't have an entirely different hash function.
All you do is dividing that hash_value of the key modulo some prime. If
all keys have that same primary hash_value, than the first you insert
goes to table1. The second finds this cell obsessed, goes to table2. But
the hash_fn(key) has not changed (it does not depend on the level). What
changed is only hash_fn(key) % prime i.e. the adress in that table. If
you now insert a key with the same hash_value, it will find
hash_fn(key) % prime1 in table1
hash_fn(key) % prime2 in table2
obsessed and so goes to table3. See? The only requirement is that
hash_fn(key) is the same. You'd have to use entirely different hash
functions at each level (*not* first hashing to hash_fn(key))



--
Vale !
Christianus Auriocus


Posted by Michael B Allen on November 19th, 2003


On Tue, 18 Nov 2003 19:08:38 -0500, Christian Gollwitzer wrote:

I meant to say "There would need to be two 32 bit integers for which
....". Meaning if you pick two 32 bit integers at random the chance that
they will generate the same set of modulo values with each of the 16
primes is very very small.

If? It unsigned long is 32 bits and the primes I listed are much smaller
than that then indeed the key-space is clearly much larger than the
number of cells in each level.

Well your right; by itself this is not a good idea. I think it the worst
case can be avoided however if it is combined with another technique. For
example you could just maintain a list to catch all collisions until
the list reaches some fraction of the size of the current level at
which point you allocate the next level and re-insert all items in the
collision list. That way you remove the chance for pathologically poor
memory utilization but when you resize you only reinsert a fraction
(e.g. 1/8th) of the number of items in the current level.

This levels technique does have a very good resizing properties so I'm
not ready to give up just yet. The map can be very small or very large
and grow and shrink as needed without too much overhead.

Because that would not assist with resizing the table. You would
have to either use a fixed size table or realloc memory and reinsert
*everything*. The expect this levels technique to grow and shrink much
faster. The problem is there's a small chance it will create very poor
memory utilization so I will need to find a solution for that.

That was a mistake in my very crude pseudocode. I meant to calculate
the hash *once* and use % rather than / when finding the index in each
table. Also notice it doesn't really matter that different primes are
used. That just permits reaching very high capacity with a small finite
set of tables.

I'm not sure if you're referring to the mistakes in the pseudocode or
if you're trying to tell me there's another problem. Currently I'm not
aware that there's a problem with using the same hash value with each
level so please explain further if you think there is a problem.

Mike

Posted by Christian Gollwitzer on November 19th, 2003



three different uses of your function, for which I'll construct the
worst case storage.

Case a)
You want to use strings as the key. Your hash=hash_fn(key) returns some
hash (say hash_fn(key)=MD5(key)) which than serves as the primary key.
To compute the index for table[i], you use

table[i][hash % prime[i]]

The flaw here is, that different keys may hash to the same hash_fn(key).
If you take 17 strings that have the same hash=hash_fn(key), then you
will end up with the table full with 17 entries. That's what I ment with
my first message

Case b)
You want to store strings, but you use the entire string as one big
number = hash. Then your function accesses the values as

table[i][key % prime[i]]

Because in this case the key is unlimited in size, it is easy to find a
series with the same property as above:
key1=prime1 --> hashes to 0 in table 1
key2=prime1*prime2 --> hashes to 0 in table1, than to 0 in table2
..
..
..
key[n]=product(prime[k], k=1..n)


If the keys are inserted in this order (Note: The performance of your
scheme depends on the insertion order - this asymmetry is one of its
flaws), then again you will have obsessed each cell 0 in each table
causing O(exp(n)) storage

Case c)
Your keys are really only 32 bit integers. In this case the series above
won't work, because if the products are truncated, this will corrupt the
least common multiple scheme used above. An alternate scheme I tried was
much better in storage (O(2^n)), but may be it can be improved to
something polynomial:
Store 0.
Store p1. This will alloc table2, since it is hashed to zero in table 1
store p2%p1 //goes to table1
store p2 // goes to table3, because p2%p1 in t1 and 0 in t2 are used
store p3%p2%p1 //goes to table 1
store p3%p2 // goes to table 2
store p3%p1 // goes to table 1
store p3 // to table 4 , because hashes to zero in t3

As can be derived easily, to alloc level n, I need 2^(n-1) values at
maximum. There is a quite good chance that I don't need them all. This
scheme can be improved with the idea from b) given that the product of
two primes always hashes to zero for these two primes. May be with
number theory magic one can find even worse cases.


So in all cases you gave bad storage performance guaranteed for one
particular insertion order of keys. (OK, to be honest, I could not proof
it in the last case, but I suspect one can get it with O(n^2)).


If you want to write optimize the hash table, *and* combine with another
technique, it may be successful. However, for read optimization it is no
good idea because you have approx. O(log n) steps for reading if the
table is full. (Considering your primes are approx. c^n, which is sensible)



--
Vale !
Christianus Auriocus


Posted by Michael B Allen on November 20th, 2003


On Wed, 19 Nov 2003 12:06:47 -0500, Christian Gollwitzer wrote:


The key is a void * that will be passed to a user defined hash function
for generating the hash value.

Sure, but the likelyhood of the first 17 entries generating the same
hash value is very very small. Still I acknoledge that case will need
to be dealt with to prevent someone from deliberately trying to exploit
the problem.

I don't really understand what you're saying here. If the hash function
is poor that's a different issue. Otherwise considering the below code
my belief is that the different keys will generate an even distribution
of hash values except in pathological cases. Are you trying to tell me
there is some mathmatical property that is preventing an even distribution
of table locations?

int
hashmap_put(struct hashmap *h, void *key, void *data)
{
unsigned long hash;
int level;

hash = h->hash ? h->hash(key) : hash_ptr(key);

for (level = 0; level < 16; level++) {
int table_size = table_sizes[level];
struct entry *table = h->tables[level];
struct entry *e;

if (table == NULL) {
table = calloc(table_size, sizeof *e);
if (table == NULL) {
return -1;
}
h->tables[level] = table;
}

e = table + hash % table_size;

if (e->key == NULL) {
e->key = key;
e->data = data;
h->size++;
return 0;
}

if (h->cmp(key, e->key) == 0) {
errno = EEXIST;
return -1;
}
}

errno = ENOSPC;

return -1;
}

True. The size of the collision list is a problem. That was a bad
idea. The resize operation (meaning the allocation of the next level and
reinsertion of elements in the collision list) should be driven by the
*load factor*. If the load factor is below say 0.75 then an element that
collides in the last level goes in a "collision list". I believe that
would resolve the poor storage property and it would perform well under
normal conditions where hash codes are evenly distributed. The only bad
property is it will perform in O(log n) time with the pathologically bad
cases you described. But again under practical conditions that should
not be observed.

Mike

Posted by Christian Gollwitzer on November 20th, 2003


Michael B Allen wrote:
is what you do.

your key is. Now it is clear. Two comments to your code:

1. You claim it is unlikely that the first 17 keys hash to the same
value. As this is probably true, it is much more likely that (given the
user chooses a bad hash function) e.g. from 2000 entries inserted some
16 hash to the same value. In this case, two million entries are alloced
causing a storage ratio 1:1000. You cannot safely assume the user has a
very good hash function. The standard technique requires O(n) time in
this case, but also O(n) space to store all. Your technique requires
O(n) time but O(exp(n)) space in this worst case, which is clearly
inferior.


2. In the *best case*, your storage is O(n). But the time required for
storing *one entry* is not O(1), but *O(log(n))*, where n is the number
of entries.

Proof: Keys are evenly distributed, your table sizes follow
approximately the scheme c*2^k, k is the number of the table-level.
(This is a reasonable choice)
Now in the best case, the first c*2^0 items fill up table1.
The next c*2^1 entries fill table2. The next c*2^2 fill table3.
Now consider I want to store an item, and the tables 1..k are already
full. Then the for-loop runs n times to find an empty cell. Since the
items already contained in your map are

n=sum(c*2^i, i=0..k-1)=c*(2^k-1),

you need k~log(n) steps to find an empty cell, if the table contains n
items. This means storing a new element is logarithmic in the number of
items already contained, not O(1) as expected from the hash map. The
reason is simply, that you do not exploit the fact that table1 is full
and won't except further items.

Another try:


int
hashmap_put(struct hashmap *h, void *key, void *data)
{
const double loadfactor=0.75;
unsigned long hash;
// level is an item of struct hashmap, initially = 0
// also items_in_this_level, initially = 0


int table_size = table_sizes[h->level];
struct entry *table;
struct entry *e;

hash = h->hash ? h->hash(key) : hash_ptr(key);


// resize only if loadfactor reached
if (table_size*loadfactor < h->items_in_this_level) {
h->level++;
if (h->level>16) return -1;

h->items_in_this_level=0;
table_size = table_sizes[level];
table = calloc(table_size, sizeof *e);
if (table == NULL) {
return -1;
}
h->tables[h->level] = table;
// now insert all items from the overflow list into the table,
// perhaps getting new collisions
for (int i=0; i<h->overflow_list_size; i++) {
entry = h->overflow_list[i];
hash = hash_fn (entry->key);
e = table + hash % table_size;

if (e->key == NULL) {
*e = entry;
h->overflow_delete(i);
}
}

} else {
table= h->tables[h->level];
}


e = table + hash % table_size;

if (e->key != NULL) {
//insert into overflow buffer
h-> overflow_insert(e);
h->items_in_this_level++;
return 0;
}


e->key = key;
e->data = data;
h->size++;
h->items_in_this_level++;
return 0;
}

Note: this is pseudocode, I've not tried to compile.
It has another problem: It does not notice when one key is inserted
twice. However, the read performance is really bad. You'll get O(log n)
again for searching one item. However, it does not suffer from the
exponential storage problem: If you insert the same key many times, the
overflow list simply grows, but it does not alloc exponentially
increasing memory, which is your biggest problem. But searching an item
isn't linear as in hashing with one table.

Have fun,


--
Vale !
Christianus Auriocus


Posted by Michael B Allen on November 20th, 2003


On Thu, 20 Nov 2003 07:58:11 -0500, Christian Gollwitzer wrote:

I don't understand what you're saying. Clearly the time required to store
one entry is better than O(log(n)). A linked list would be O(log(n)). The
technique I described would be the number of levels (nl) plus the log
of the number of elements in the collision list (nc). The size of the
collision list is the number of collisions occuring as the load factor
grows from ~0.37 to 0.75 before the next level is allocated and the
load factor drops to 0.37 (1/2 of 0.75 because the total cumulative
size of the map (n) doubles). I don't know what that size is yet but
it's certainly not n.

I don't think this is a very good idea either. It will only fill 75%
of the table at each level which is a pretty bad waste of space in itself.

If the same key is inserted twice I will return an error.

Again, I don't see where you get linked list asymtotic growth from this.

I was thinking more along the following:

int
hashmap_put(struct hashmap *h, void *key, void *data)
{
unsigned long hash;
int level;

hash = h->hash ? h->hash(key) : hash_ptr(key);

for (level = 0; level < 16; level++) {
int table_size = table_sizes[level];
struct entry *table = h->tables[level];
struct entry *e;

if (table == NULL) {
iter_t iter;
struct entry *e;
struct linkedlist *col;

table = allocator_alloc(h->al, table_size * sizeof *e, 1);
if (table == NULL) {
return -1;
}
h->tables[level] = table;

col = (struct linkedlist *)&h->col[h->whichcol];
h->whichcol ^= 1; /* alternate collision list */
/* reinsert all elements */
if (linkedlist_clear(col, (del_fn)entry_put, h) == -1) {
return -1;
}
}

e = table + hash % table_size;

if (e->key == NULL) {
e->hash = hash;
e->key = key;
e->data = data;
h->size++;

return 0;
}

if (hash == e->hash && h->cmp(key, e->key) == 0) {
errno = EEXIST;
return -1;
}

if (h->tables[level + 1] == NULL) {
unsigned int lf = h->size * 100 / total_sizes[level];

if (lf < h->load_factor) { /* add to collision list */
struct linkedlist *col = (struct linkedlist *)&h->col[h->whichcol];
struct entry *e = allocator_alloc(h->al, sizeof *e, 1);

if (e == NULL || linkedlist_add(col, e)) {
return -1;
}
e->hash = hash;
e->key = key;
e->data = data;
h->size++;

return 0;
}
}
}

errno = ENOSPC;

return -1;
}
void *
hashmap_get(const struct hashmap *h, const void *key)
{
unsigned long hash;
int level;

if (h->tables[0] == NULL) {
return NULL;
}

hash = h->hash ? h->hash(key) : hash_ptr(key);

for (level = 0; level < 16; level++) {
int table_size = table_sizes[level];
struct entry *table = h->tables[level];
struct entry *e;

e = table + hash % table_size;

if (e->key && hash == e->hash && h->cmp(key, e->key) == 0) {
return e->data;
}

if (h->tables[level + 1] == NULL) {
iter_t iter;
struct entry *e;
struct linkedlist *col = (struct linkedlist *)&h->col[h->whichcol];

/* search the collision list */
linkedlist_iterate(col, &iter);
while ((e = linkedlist_next(col, &iter))) {
if (hash == e->hash && h->cmp(key, e->key) == 0) {
return e->data;
}
}

break;
}
}

return NULL;
}

Posted by Christian Gollwitzer on November 21st, 2003


Michael B Allen wrote:
A binary tree has O(log(n)) in the average case and O(n) in the worst case.
And the number of levels is approximately c*log(n), as I worked out
above. The constant c may be very small, in this case you can forget
about it for only few levels.



Assume I use the hashmap once for inserting a dictionary and then for
looking up values many times. Assume I've inserted roughly two million
items and there were no or only few collision. Then about half of the
items, i.e. one million, is in table[15]. Because you look first in
table[0], then table[1] etc. you need 16 accesses until the item is
found for 50% of the items. 15 accesses for 25% and so on.
So the average number of accesses for an item is:

sum(k*2^(k-17), k=1..16) = 15.00001525....

So the average number of levels consulted is approximately fifteen. If
you'd used rehashing, than all elements not in the collision list would
be found after only one access, which is clearly faster. So for this
kind of use your approach is not suitable. It maybe for others. Try
running it on a large example (canonical one is counting the frequency
of all words in a large textfile) and compare with other
implementations, like the hashmap from STL or CBFalconers
implementation. Hard to say from theory, which one is the best in these
cases. Of course you need to use the same hash function for such a test.
Measure time and storage used. And try with one particularly bad "hash
function" that simply always returns one to see what happens in case of
many collisions.


--
Vale !
Christianus Auriocus


Posted by Michael B Allen on November 23rd, 2003


On Fri, 21 Nov 2003 05:59:44 -0500, Christian Gollwitzer wrote:
You're right. I was thinking of average time.

Ok, I understand this.

This I do not understand. It may be necessary to search ever position
with rehashing whereas with the levels scheme you're guaranteed to
eliminate half of the table positions with each test of a level.

Well numbers are king of course. I tried to run some tests like
you described but unfortunately I could not get my levels scheme to
actually work. It was dropping elements and leaking like a sieve. I have
implemented rehashing with resizing and it does work. It is consideribly
simpler too. Even if the levels scheme turned out to be faster (and
I'm still not convinced that it's not) I greatly prefer simplicity so
I would probably stick with the rehashing implementation anyway.

You win!

Mike


Similar Posts