- Musings on uniqueness
- Posted by Richard Heathfield on March 3rd, 2007
This article is really just me thinking aloud, but I'd appreciate any
helpful suggestions.
Consider a very large collection of items, which needs to become a set
(i.e. no item should duplicate any other item).
We need to identify which items can safely be thrown away. But no item
should be discarded unless it is redundant (i.e. is the duplicate of
some other item that will not be discarded).
Items are of arbitrary size, and can be very, very large indeed, so we
don't wish to copy them around the place - but of course they can be
described uniquely, and we can manipulate these descriptions.
It is trivial to sort the items by size, and disregard any item that
doesn't share a size with some other item. By "disregard" I mean
"disregard as a candidate for removal" - i.e. it is known that the item
cannot be discarded.
It is also trivial to hash the items, up to a point. (I have arbitrarily
chosen to hash up to the first 2KB.) Thus, items that share a size with
other items but have a unique hash within that group can again be
disregarded.
When all that is done, we have as candidates for removal all items that
share both a size and a hash with at least one other item. But of
course just because A is the same size as B and gives the same hash
over 2KB, that doesn't mean they're identical.
What I'm wondering is how best to proceed after this initial filtering
is done. My first - and, so far, only - idea is as follows:
1) a "bag" is a collection of 1 or more items that are guaranteed to be
identical. It can present one such item for comparison purposes.
2) a "box" is a collection of 0 or more bags that all contain (0 or
more) items of a given size and hash value.
3) a "shelf" is a collection of 0 or more boxes.
So, for each item in turn, I do the following:
a) look along the shelf for a box with the right item size and hash
value. If one doesn't exist, create it.
b) for each bag in the box, compare this item with the bag's comparison
item (and remember, they may be many megabytes or even several
gigabytes in size, so this is the really costly bit). If they're
exactly the same, put this item into the bag and stop. Otherwise, go on
to the next bag. If this item doesn't belong in any existing bag, make
a new bag for it, and put it in. It becomes the comparison item for
that bag.
Once that process is complete, I visit each box in turn, and each bag
within each box in turn. For each bag, the comparison item is retained,
and everything else in that bag is removed from the collection.
Once /that/ process is complete, I have my set.
As far as I can see, this will work just fine. But does anyone have any
better ideas?
--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.
- Posted by Chris Uppal on March 3rd, 2007
Richard Heathfield wrote:
[Above quoted just as a random sample]
A couple of thoughts.
You could compute and store full hashes (including every byte, not
just an initial subsequence) for these files (or whatever they are) as
you create them in the first place, which would be cheap. Or, if it's
too late for that, you could compute them (starting today) as a
background task so that they'd be ready before you need to run your
program.
If you use a good quality (say crypto-quality) hash with a sufficiently
large number of bits, then you can. with high confidence, skip all the
other tests. You can definitely skip the length test, and unless the
consequences of throwing away a non-duplicate are too dire even to
consider, then you can skip the costly byte-for-byte comparison too.
-- chris
- Posted by Richard Heathfield on March 3rd, 2007
Chris Uppal said:
<snip>
And very welcome.
That's an excellent idea but, alas, pre-computation is out of the
question in this case, as I won't have access to the item repository
until runtime, and the results will be required very rapidly.
Well, I actually get the length for free, and the initial sort* on
length minimises the number of comparisons by discarding definite
non-duplicates quickly.
* ...actually, the item descriptions are poured into a binary search
tree as they become available, so the input and sort steps are
combined.
A 12kg garden gnome hitting my head at 2m/s almost certainly qualifies
as "too dire to consider", so I think the last step really is
unavoidable.
--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.
- Posted by Richard Harter on March 3rd, 2007
On Sat, 03 Mar 2007 12:32:18 +0000, Richard Heathfield
<rjh@see.sig.invalid> wrote:
[snip construction]
The obvious way to handle this is to use a "large" hash, i.e., one with
enough bits so that the probability of two items hashing to the same
value is minute. Use a hash table with bucket lists (size ~ 2 * number
of items) and indices computed raw hash module table size. Then when
an item is added we check to see if it matches size and hash with any
items in the bucket list. Only if there is a double match do we check
to see they actually are duplicates; duplicates aren't entered.
This scheme is pretty good. Each item is examined only once (to get the
hash) if it is not a duplicate. Each duplicate costs three
examinations, one for its hash and two for the comparisons. The code
and data structures are simple. But can we do better? Yes, much
better.
The general idea is that we record for each item (of the same length) a
byte position where that item differs from some other item. When we
insert a new item we check each recorded byte position to see the
current item matches at that position. When it is established that
there is no conflict between the current item and some existing item
then and only then do we do a full comparison.
Why is this better? To begin with it, it eliminates the need for
computing a hash, that being much more expensive than the comparisons.
Secondly entries of non-duplicates require on average the examination of
a small number of bytes. For large items the savings are significant.
Just to expand on this a bit in case the above isn't clear, suppose we
have strings S1, S2, S3, .... We enter S1. When S2 is entered we
compare it with S1 until a difference is found at position k1. For k1
we create a list containing S1 and S2. When S3 is entered we check
position k1. If it doesn't match either S1 or S2 then we add S3 to k1's
list. If it matches one of them, say S2, then we start a comparison of
S2 and S3 until a difference is found at position k2. For k2 we create
a list containing S2 and S3. Finally we have a list k of lists k_i.
Run through list k to check each k_i in turn. There are some obvious
refinements.
On a personal note, I am embarrassed. I have implemented the hash table
approach a number of times over the years in a number of contexts. It
wasn't until someone else posed the problem that it occurred to me to
consider whether there was a better way.
- Posted by Logan Shaw on March 3rd, 2007
[ snip problem of finding unique items, and hashes not being good enough
to guarantee uniqueness, thus requiring full comparisons ]
Richard Heathfield wrote:
The way I see it, once you have used heuristics (hashes) to prove what
you can about uniqueness, you are left with groups of things that
are probably equal (with very high probability if you run some
cryptographic hash one them). Since by definition at this point you
have used up all your heuristic ideas, you must now do full comparisons.
And, here are the three insights at that point that determine the
running time:
(1) it's obvious every item must be compared to at least one other
item; otherwise, you have no information about whether it's
equivalent to anything.
(2) every subset of items must somehow, at some point, be compared
to every other (disjoint) subset of items; otherwise, you could
end up with cliques. (hmm, this is a more general form of #1.)
(3) equality is transitive and symmetric (and reflexive); therefore,
comparing A to B and B to C gives you information about the
relationship of A and C.
Because of #1, if you have N items, you need a minimum of N-1
comparisons, absolutely. Because of #3, you can do a chain of
comparisons (where you group the items on a line and compare
each to its neighbor), and that chain can be size N-1 and still
succeed.
Therefore, N-1 is sufficient to determine whether they are all
equivalent, and there is an easy way of doing N-1 comparisons.
As for the question of what happens when all N items are not equal,
is it really that important? If you've already done, say, even MD5
on them, the chances of all N items not being equal are really quite
small. The worst case is that all N items are distinct even though
they have the same MD5 checksum, and in that case, I think your
algorithm is O(N^2), but the probability of having N MD5 collisions
out of N items is so tiny it's not even funny.
Nevertheless, if you do reach this point, this problem can be
reduced to sorting. If you sort the items using some O(N log N)
sort, then you can compare each one to its neighbor using a
total of N-1 comparisons. So, you can solve the problem by doing
a sort. That is, you can improve the worst case by doing a sort.
Come to think of it, another way to solve this problem is to forget
the bags and boxes and shelves. Once you have a bunch of things
you think are probably identical, try to put them into some kind
of balanced tree (like a B-tree), doing the full comparison of
the new item with every node you visit. If you have proved based
on full comparison that your new item is a duplicate of an existing
node, discard it at that point. In the best (and common) case,
this will do only N-1 comparisons, because the tree will have only
one node. But if you have the bad luck to have N distinct items,
it's still only O(N log N).
Of course, with all this sorting business, I'm assuming there is a
full ordering defined for the items. But it sounds like they are
ordered sequences of bytes, so there probably is a full ordering
defined. :-)
- Logan
- Posted by Martin Eisenberg on March 3rd, 2007
Richard Heathfield wrote:
At this point you have a set of singleton bags. Unless new items
might enter later on, you can flatten that indirection right away and
put a handle that would otherwise go in a nonempty bag on a plain
deletion list instead. Right?
Martin
--
When he had finally achieved a position that
allowed him to say everything he thought, he
only thought of his position anymore.
--Gabriel Laub
- Posted by CBFalconer on March 3rd, 2007
Richard Heathfield wrote:
It seems to me that the fundamental data characteristic is
bigness. I am assuming that this also implies external storage,
i.e. the complete thing will not fit into memory. That means that
the cost of processing is NOT cpu time, but i/o time.
I would suggest that you compute your hash and attach the size to
it as the high order digits. Now sort the result on that composite
hash. To minimize i/o look up the polyphase sort (Wirth has a good
description in "Algorithms + Data Structures = Programs"). At this
point you can immediately copy all the unique values to the output
file, and need only deal with the remainder, which should be
greatly reduced. You may want to replace the actual data in the
various records with pointers to the appropriate place in the
source file, for further reduction in i/o.
The data item that you deal with might be described as:
struct foo {
fileposition datalocation;
someunsignedtype size;
long hash; /* possibly long long */
}
One read through the input data will set up these values. After
processing one more read will allow splitting into keepers and
checkers files. Now you reduce the processing to the checkers
files, possibly by lengthing the max size hashed, or by direct
comparisons. With any luck you are now down to something that can
be handled in memory, and the total i/o is approaching two reads
and one write.
--
Chuck F (cbfalconer at maineline dot net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net>
- Posted by Richard Heathfield on March 3rd, 2007
Martin Eisenberg said:
Yes, absolutely. The reason I'm not doing it that way is only that the
deletion process is subject to user confirmation - so I need to present
full information to the user: "here is a list of items that are
identical - I propose removing all except this one - do you agree?"
The potential cost of wrongful deletion is high (garden gnomes at 20
paces), which is why I need the user's sign-off on each item.
--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.
- Posted by Lane Straatman on March 3rd, 2007
"Richard Heathfield" <rjh@see.sig.invalid> wrote in message
news:QbSdnQNAMNrh93TYRVnyvgA@bt.com...
There are downthread posts that seem relevant to moving your question
forward.
kicks, we might want to say something formal about the set. Lunar eclipse
in England now. LS
- Posted by Richard Heathfield on March 4th, 2007
CBFalconer said:
Yes.
Not quite that big. Fortunately, I will have humungous amounts of
storage available on the target machines (at least, in comparison to
what I'm used to!).
<excellent idea snipped>
I'd like to thank everyone who made constructive suggestions.
--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.
- Posted by Zeljko Vrba on March 4th, 2007
On 2007-03-03, Richard Heathfield <rjh@see.sig.invalid> wrote:
If you use 256-bit cryptographic hash, the probability of collision is 2^-128.
According to some sources, which I forgot,this is less than the probability of
hardware (CPU) failure. But OK, you know best the application domain 
- Posted by Richard Heathfield on March 4th, 2007
Zeljko Vrba said:
<snip>
That's the probability of collision between two arbitrary object hashes.
The Birthday Paradox reduces it considerably when you're dealing with a
large number of objects. I think you'd need 2^64 objects for a 50%
chance of a collision. Clearly that's still pretty humungous, but it
isn't anywhere near as humungous as 2^128, and in any case even a 0.01%
probability of collision would be unacceptable. You can't say to your
unlucky user "don't be daft, the odds against that are 10,000 to 1, and
in any case none of my other 9,999 users has complained" - well, okay,
you *can*, but you shouldn't.
--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.
- Posted by Randy Howard on March 4th, 2007
On Sun, 4 Mar 2007 02:07:46 -0600, Richard Heathfield wrote
(in article <v7GdnTS_W51j4HfYRVnysAA@bt.com>):
Isn't that exactly what MS does, but with much worse odds?
--
Randy Howard (2reply remove FOOBAR)
"The power of accurate observation is called cynicism by those
who have not got it." - George Bernard Shaw
- Posted by Chris Uppal on March 4th, 2007
Richard Heathfield wrote:
I think Zeljko had already allowed for the Birthday Effect when he cut
down from the postulated 256 bits of hash to (2^128) entities needed to
expect ~1 collision.
BTW, another way of formulating this problem is to think in terms of
equivalence classes. The equivalence relation (which is progressively
refined as you go on) is:
F1 and F2 are equivalent iff they are the same length and the first
min(length, N) bytes of F1 and F2 are the same.
Initially N is zero, so your equivalnce classes just contain files of
the same length. Then in each step you increase N by some fixed amount
(something filesystem friendly) and subdivide all the current
equivalence classes which have more than one member accordingly. You
can use a temporary table of hashes of the bytes in [N, N+8K) for each
file in the equvalence class you are working on (an approximation to
save having to keep all 8K bytes in memory for each member of that
class). There is a tiny possibility that you'll fail to "realise" that
two files differ at this stage, but since you'll be doing an explicit
byte-for-byte comparison at the end anyway, the "error" doesn't cause
the program to fail, and only increases the expected runtime by an
unimaginably small amount.
But I'm not sure about the IO behaviour of that way of doing things.
It seems to require a lot of disk head movement (at least 1 seek for
every 8K read), which, I think, is likely to be a killer for
performance, even if the expected /amount/ of daya read is near-minimal.
-- chris
- Posted by Richard Heathfield on March 4th, 2007
Chris Uppal said:
Oh, so he had. Obviously I had MD5 on the brain. :-)
<snip>
I just finished a test run for a few hundred thousand arbitrary-sized
items, which contained a few tens of thousands of duplicates. It took
about 6000 seconds, which is longer than I would have liked but not as
bad as I'd feared. Target machines are likely to be far faster than the
test machine but, even so, I'll be looking for ways to bring that time
down. It's the slowest program I've written in years!
--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.
- Posted by CBFalconer on March 4th, 2007
Richard Heathfield wrote:
Did you measure the time to just copy the input to output?
Obviously that is a limit for the attainable speed. This may need
to include time to parse the records.
--
<http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt>
<http://www.securityfocus.com/columnists/423>
"A man who is right every time is not likely to do very much."
-- Francis Crick, co-discover of DNA
"There is nothing more amazing than stupidity in action."
-- Thomas Matthews
- Posted by Richard Heathfield on March 5th, 2007
CBFalconer said:
No, that's not actually a limit, because no items actually need to be
copied. In the best case (all items have a unique size), runtime is as
close to 0 as makes no odds - well, okay, there's the n log n
behaviour, but that's a few tens of seconds at most.
--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.
- Posted by wade@stoner.com on March 5th, 2007
Minor variations:
1) Keep hashes for the first block, and each initial 2^i blocks of a
file/bag.
2) Look for patterns in where differences occur. In many domains,
differences would be most likely at the beginning or end of a document
(for instance, the last eight bytes of a document might be a
checksum). Check those locations first.
3) If two or more documents have the same hash, designate one of
them as the "reference" document. Record the location of the first
difference in the non-reference document(s), as a place to check
first.
4) Don't bother with computing new hashes when you've already
narrowed a potential equivalence class down to just two items.
5) Against a smart adversary (he wants you to burn cpu time) check
blocks in a random order. If you use a fixed order, he'll make sure
the only difference is in the last place you check (and if he is
really smart, he'll make sure the difference is transparent to your
hash function).
- Posted by Richard Heathfield on March 5th, 2007
wade@stoner.com said:
<good ideas noted and snipped>
Fortunately, my only adversary is Lady Luck, who - contrary to popular
belief - is not actually out to get me. :-)
--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.
- Posted by Randy Howard on March 5th, 2007
On Mon, 5 Mar 2007 09:44:53 -0600, Richard Heathfield wrote
(in article <XcudnQmJu8Q-p3HYnZ2dnUVZ8tfinZ2d@bt.com>):
Even the truly paranoid have real enemies. :-)
--
Randy Howard (2reply remove FOOBAR)
"The power of accurate observation is called cynicism by those
who have not got it." - George Bernard Shaw