Tech Support > Computers & Technology > Programming > hash function to produce 4 byte number out of 16 byte number
hash function to produce 4 byte number out of 16 byte number
Posted by dmitry.sychov@mail.ru on October 10th, 2006


Hi

I'am looking for the best hash fn to produce 4 byte number out of 16
byte number

16 byte source number is GUID, ie it has no properties just complete
random number.

speed is not an issue here.

thank you, dmitry

Posted by Chris McDonald on October 10th, 2006


dmitry.sychov@mail.ru writes:

Best? Only you will know what that means to you.
Do you require a cryptographic hash?

--
Chris,

Posted by dmitry.sychov@mail.ru on October 10th, 2006


Hi Chris,

not at all - the 4 byte number will be used as a key in map (b-tree).

just wont to put more keys in a tree node, while not exceeding CPU
cache line.

thanks, dmitry


Posted by dcorbit@connx.com on October 10th, 2006


dmitry.sychov@mail.ru wrote:
/*
Any 32 bit hash will do then.

This algorithm (Featuring Ozan Yigit's hash algorithm) assumes 32 bits
for unsigned. Season to taste:
*/
unsigned uses_sdbm(const unsigned char *data16b)
{
unsigned hash = 0;
int c;

c = *data16b++;
hash = c + (hash << 6) + (hash << 16) - hash;
c = *data16b++;
hash = c + (hash << 6) + (hash << 16) - hash;
c = *data16b++;
hash = c + (hash << 6) + (hash << 16) - hash;
c = *data16b++;
hash = c + (hash << 6) + (hash << 16) - hash;
c = *data16b++;
hash = c + (hash << 6) + (hash << 16) - hash;
c = *data16b++;
hash = c + (hash << 6) + (hash << 16) - hash;
c = *data16b++;
hash = c + (hash << 6) + (hash << 16) - hash;
c = *data16b++;
hash = c + (hash << 6) + (hash << 16) - hash;
c = *data16b++;
hash = c + (hash << 6) + (hash << 16) - hash;
c = *data16b++;
hash = c + (hash << 6) + (hash << 16) - hash;
c = *data16b++;
hash = c + (hash << 6) + (hash << 16) - hash;
c = *data16b++;
hash = c + (hash << 6) + (hash << 16) - hash;
c = *data16b++;
hash = c + (hash << 6) + (hash << 16) - hash;
c = *data16b++;
hash = c + (hash << 6) + (hash << 16) - hash;
c = *data16b++;
hash = c + (hash << 6) + (hash << 16) - hash;
c = *data16b++;
hash = c + (hash << 6) + (hash << 16) - hash;

return hash;
}
#ifdef UNIT_TEST
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

const unsigned char data[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16};
int main(void)
{
unsigned val = uses_sdbm(data);
printf("Hash is %u\n", val);
return 0;
}
#endif


Posted by Ben Pfaff on October 10th, 2006


dcorbit@connx.com writes:

It seems terribly wasteful, even though the OP said that speed is
not an issue, to use a bytewise hash for this. Why not use a
hash that operates on 32 bits at a time?
--
"I admire him, I frankly confess it; and when his time comes
I shall buy a piece of the rope for a keepsake."
--Mark Twain

Posted by Logan Shaw on October 10th, 2006


dmitry.sychov@mail.ru wrote:
Since all the bits of the source number are random, there is no
probability distribution that you can take advantage or that you
must worry about. The numbers will be evenly distributed. So,
as long as you don't *create* a distribution that's biased in
favor of certain values, your task is pretty easy. I would
personally use a hash based on the XOR operator, since if its
two operands are evenly distributed, so is its value. So,
maybe this:

uint32_t best_hash_fn_1 (uint8_t *in)
{
uint32_t temp[4];

temp[0] = in[0] ^ in[1] ^ in[2] ^ in[3];
temp[1] = in[4] ^ in[5] ^ in[6] ^ in[7];
temp[2] = in[8] ^ in[9] ^ in[10] ^ in[11];
temp[3] = in[12] ^ in[13] ^ in[14] ^ in[15];

return
temp[0] << 24
| temp[1] << 16
| temp[2] << 8
| temp[3];
}

However, thinking about this a little more, since you have said the
16 bytes are completely random, and since XOR-ing random bits
together doesn't give you a "more random" number, a hash that
would perform equally well (in terms of collisions) would be this
one:

uint32 best_hash_fn_2 (uint8_t *in)
{
return
(uint32_t) in[0] << 24
| (uint32_t) in[1] << 16
| (uint32_t) in[2] << 8
| (uint32_t) in[3];
}

Basically, you have 128 bits of randomness available to you. As long
as you grab 32 bits from somewhere, it doesn't matter where.

- Logan

Posted by Chris Uppal on October 10th, 2006


dmitry.sychov@mail.ru wrote:

GUIDs come in several varieties, not all even attempt to be "random"
(the idea is to be /unique/ and randomness is only one strategy towards
achieving that).

But, it you have confidence that your input GUIDs are sufficiently
random, then just mask the number down to 32-bits.

-- chris

Posted by David Spencer on October 10th, 2006


dmitry.sychov@mail.ru writes:

If it's truly random, why not just take the low (or high or ...) 4
bytes?

--
dhs spencer@panix.com

Posted by toby on October 10th, 2006



David Spencer wrote:
After all, 3 posts can't be wrong!


Posted by dcorbit@connx.com on October 10th, 2006



toby wrote:
GUIDs are not random. For some GUIDs, the MAC address is part of it.
http://en.wikipedia.org/wiki/GUID

Anyway, a simple hash will surely be sufficient for the OP's needs.

Posted by dcorbit@connx.com on October 10th, 2006



Ben Pfaff wrote:
If a profile shows a problem, then I would agree with you. I guess
that a change to a hash based on unsigned integers will not be
discernable in the grand scheme of things. The SDBM hash is very well
tested and has an excellent dispersion according to every study that I
have seen.


Posted by toby on October 11th, 2006



dcorbit@connx.com wrote:
Yes, another point made above. Everything hinges on what the OP meant
by "no properties just complete random number". It's possible this is a
misunderstanding of what the GUID really comprises. In which case, yes,
a simple hash is indicated.


Posted by bob_jenkins@burtleburtle.net on November 3rd, 2006


dmitry.sychov@mail.ru wrote:
I usually use
a += d; d += a; a ^= (a>>7);
b += a; a += b; b ^= (b<<13);
c += b; b += c; c ^= (c>>17);
d += c; c += d; d ^= (d<<9);
a += d; d += a; a ^= (a>>3);
b += a; a += b; b ^= (b<<7);
c += b; b += c; c ^= (c>>15);
d += c; c += d; d ^= (d<<11);
for that, using d as the result, but alas the reason I usually do that
is speed.