- Improving a dupe check algorithm
- Posted by Stefan Schmidt on September 4th, 2003
Hi!
I recently wrote my first real C++ program to scan for duplicate files
in a tree and output a list to the standard output.
Any opinions on my algorithm would be great! It seems it scales "OK",
but I wonder how some of you would do it?
Here's what I did:
Step #1: Put filesize(key)/filename pairs into chained hash table,
large number of buckets (prime).
The insertion checks if a node already exists for size, and adds the
filename to the list in that node (STL std::list). Otherwise, create
a new node.
Step #2: Eliminate unique entries.
Traverse the table and wipe out nodes with only 1 filename in its list.
Step #3: Calculate CRC32 for nodes with more than 2 filenames. Remove
unique CRC - this awkward point I had not forseen in my design: multiple
sets of dupes within the one filesize. I coded special case to handle
it, but it is awkward.
Step #4: Do byte-by-byte comparison on files to eliminate a
false-positive.
I wrote a custom hash table container which used a derived STL list
class to handle the special insertion condition and hit counter.
Interested to hear from you!
Anyone can have my source, just email me!
BTW I'm not very experience with STL so be gentle!! :-)
- Posted by Ben Pfaff on September 4th, 2003
stefanschmidt82@yahoo.co.uk (Stefan Schmidt) writes:
You could use SHA-1 instead of CRC-32 and skip the byte-by-byte
comparisons. You won't find any files that differ but have
identical SHA-1 checksums.
--
"Premature optimization is the root of all evil."
--D. E. Knuth, "Structured Programming with go to Statements"
- Posted by John L on September 4th, 2003
"Stefan Schmidt" <stefanschmidt82@yahoo.co.uk> wrote in message news:da605119.0309031911.2cf8ab69@posting.google.c om...
.... and then add the filename, presumably.
Could you not handle this in the same way you dealt with
sizes: calculate CRCs, sort CRCs, eliminate unique CRCs?
This can be an issue in programming btw. If after some time
you cannot find an elegant solution, code the inelegant one
so you at least have something that works rather than nothing.
The trick is (and this is often easier said than done in
real projects) to remember to go back and polish it up.
In real life, you may be moved to another project before you
can do this, so be sure to comment it properly so someone
else knows to look for an elegant replacement.
What you do not do is distinguish between duplicate files
and filenames that refer to the same (single) file -- in a Unix
system, for instance, hard links and soft (or symbolic) links
would need to be dealt with.
Problems like permissions, and the filesystem changing while
your program is running, are probably not very important to you.
John.
- Posted by Arthur J. O'Dwyer on September 4th, 2003
On Thu, 3 Sep 2003, Ben Pfaff wrote:
Clarification: You *might* find two files that differ but
have identical SHA-1 checksums, but it is supposed to be
"computationally infeasible" for anyone to actually produce
two such files. (Murphy's Law doesn't care about infeasibility,
last I checked. ;-)
Using SHA-1 instead of CRC32 doesn't mean you *shouldn't*
check for false positives. It just means you won't have
to check as often.
-Arthur
- Posted by Corey Murtagh on September 4th, 2003
Ben Pfaff wrote:
SHA-1 produces a 160-bit message digest. If you examine 2^160+1 files
with identical length but unique contents, at least two of them will
have the same SHA-1 message digest. In fact it's highly likely that a
large number of them will have the same SHA-1 message digest.
In other words, regardless of which checksum/message digest method you
use, you still have to do a byte-by-byte comparison of the files. Using
SHA-1, or even MD5, will of course reduce the number of false positives,
but can never actually /eliminate/ them.
In practice it's highly unlikely that two files with the same length
will have the same SHA-1 digest. It's a bit less likely to produce a
false positive than MD5, not only because of the extra 32 bits but
apparently because the algorithm is 'better'.
--
Corey Murtagh
The Electric Monk
"Quidquid latine dictum sit, altum viditur!"
- Posted by Ben Pfaff on September 4th, 2003
Corey Murtagh <emonk@slingshot.co.nz.no.uce> writes:
You don't have enough time to examine 2^160 files. You don't
even have enough time to examine 2^80 files, which if I recall
correctly due to the "birthday paradox" is the point at which
there is a 50% chance you will have found a collision.
In practice, chances are good that we will never find a collision
in SHA-1 in our lifetimes, barring a crack found in the SHA-1
algorithm itself.
--
blp@cs.stanford.edu - blp@gnu.org - pfaffben@debian.org - blp@acm.org
- Posted by Corey Murtagh on September 4th, 2003
Ben Pfaff wrote:
Whether you can check that many files or not is not relevant. I was
just providing the simplest proof that collisions /can/ happen.
Regardless of how small, the chance still exists that any two arbitrary
files with non-identical contents will have the same SHA-1 digest. You
can *not* simply assume that those two files are identical.
I did agree that the chances of it happening are remote, but since it
can still happen then you still have to check for it.
--
Corey Murtagh
The Electric Monk
"Quidquid latine dictum sit, altum viditur!"
- Posted by Stefan Schmidt on September 4th, 2003
Hi!
Thank you all for your responses on SHA-1 vs CRC, it has been quite
interesting.
I wonder does anyone have a comment on the data structure I chose... it
is something I spend much time thinking how to improve.. or is it good
solution?
Thank you!
Stef.
- Posted by Thad Smith on September 4th, 2003
Ben Pfaff wrote:
As noted by others, using SHA-1 doesn't prohibit identical checksums of
different files, just makes it much more unlikely. In the normal case
that difference, though, is basically due to the number of resulting
bits, not the algorithm. If you are going to check full file contents
with matching CRCs, you might as well use something efficient for the
CRC calculation.
If you are going to rely on CRC matches to indicate file matches, I
suspect you can compute a larger CRC faster than the SHA-1 version.
Thad
- Posted by Programmer Dude on September 4th, 2003
Corey Murtagh wrote:
[BWG] And if you can generate one million SHA-1 digests per *second*,
it'll take you a mere 46343912903694283301740386628497051 years!
--
|_ CJSonnack <Chris@Sonnack.com> _____________| How's my programming? |
|_ http://www.Sonnack.com/ ___________________| Call: 1-800-DEV-NULL |
|_____________________________________________|___ ____________________|
- Posted by John W. Krahn on September 4th, 2003
Stefan Schmidt wrote:
Are you assuming that two identical files will have the same file name?
John
--
use Perl;
program
fulfillment
- Posted by Calum on September 4th, 2003
Stefan Schmidt wrote:
This would be an excellent opportunity to learn how to use a profiler.
(e.g. gprof). That will tell you for sure the parts of the program
which are the slowest.
I'd be very surprised if it was the data structures that would slow you
down since it is all in memory, and you have taken a sensible approach.
I would guess the slow part is scanning the file contents for equality,
so that's the part you should think about optimizing.
For example, reading the files in byte-by-byte is considerably slower
than reading it in large blocks, and doing a memcmp(). Also, how many
times do you read in the contents of each file? More than once is
excessive. However this could get quite fiddly, and in my opinion is
the harder part to implement. You'd have to compare the files block by
block, and keep a list of which files were equal to which others.
Another point - you don't need to compute the CRC/hash if you discover
your file is unique half way through reading it, just stop.
Personally I would have relied more on an off-the-shelf container, e.g.
std::multimap<int,string>, or std::hash_set<> and written my own
hash_compare functor. It may not be faster, but much quicker to set up
and more thoroughly debugged, as I suspect it's not your container that
is the bottleneck.
As a general rule, IO is much slower than your in-memory data
structures, so minimize and optimize the IO first.
Well done!
Calum
- Posted by Corey Murtagh on September 5th, 2003
Thad Smith wrote:
<snip>
Never having played with SHA-1 I have no idea of the throughput of the
algrorithm. Is it slow enough to siginificantly impact performance when
the data is being read from a hard drive? If you're reading a lot of
files - scanning an average hard drive partition for example - I guess
the end result could be several seconds difference. But percentage of
the overall time is taken up by the SHA-1?
I guess I should download some code and try it, huh? :>
--
Corey Murtagh
The Electric Monk
"Quidquid latine dictum sit, altum viditur!"
- Posted by TLOlczyk on September 5th, 2003
On 03 Sep 2003 22:27:02 -0700, Ben Pfaff <blp@cs.stanford.edu> wrote:
the floating point error bug".
- Posted by TLOlczyk on September 5th, 2003
On 3 Sep 2003 20:11:46 -0700, stefanschmidt82@yahoo.co.uk (Stefan
Schmidt) wrote:
Modify the CRC to generate n numbers in the following sense:
the first number is the CRC of the first length of file/n bytes,
the second is the CRC of the first 2*length of file/n bytes
....
On others calculate a "running CRC".
Then compare the CRC's of the files in descending
order ( the more bytes the more likely you are to get
a miss, so by doing it in descending order, you
make it more likely that you will discover the discrepency
sooner ).
This cost little extra to do, just caching some values, while
gauranteeing to eliminate a large fraction of the false positives.
- Posted by Vlad Tepes on September 5th, 2003
Corey Murtagh <emonk@slingshot.co.nz.no.uce> wrote:
This might give some feel for the difference in performance between the
two methods:
In a directory with 77 files, totalling 47 MB, I got these results when
timing sha1sum and crc32:
sha1sum * 2>& 1 >| /dev/null 4.57s user 0.28s system 100% cpu 4.833 total
crc32 * 2>& 1 >| /dev/null 1.27s user 0.23s system 104% cpu 1.437 total
It appears crc32 is about three times faster than sha1sum.
Regards,
--
Vlad
- Posted by LibraryUser on September 5th, 2003
Stefan Schmidt wrote:
I think you could simplify the implementation by using hashlib,
available at:
<http://cbfalconer.home.att.net/download/>
in that it will eliminate any considerations of the size of the
hashtable, and allow you to freely modify your data structures to
suit. You could fabricate keys out of the combination of
filename (whatever that may be) and file CRC. This does not
require that you calculate the CRC on initial entry, only that
you jam the value to, say, 0 (which will not recur with a
suitable real CRC computation). The library contains several
demonstrations, some of which maintain chains (such as the markov
lists), and illustrates walking the complete table contents at a
suitable point.
--
Replies should be to the newsgroup
Chuck Falconer, on vacation.
- Posted by Ben Pfaff on September 5th, 2003
TLOlczyk <olczyk2002@yahoo.com> writes:
The situations aren't analogous. Intel used a bad algorithm,
whereas SHA-1 is, as far as anyone knows, a good algorithm.
--
"doe not call up Any that you can not put downe."
--H. P. Lovecraft
- Posted by Arthur J. O'Dwyer on September 5th, 2003
On Fri, 5 Sep 2003, Vlad Tepes wrote:
True; but whether the CRC32 *algorithm* is faster than SHA-1
remains to be seen (in this thread; of course *someone's* probably
compared all these hashes at some point and posted the results, with
source code, to the Web. We don't need to rehash them here).
Incidentally, my school's Linux system has something called
'sha1sum', but nothing called 'crc32'; and I don't think it
has the source code for either readily accessible. So your
numbers are meaningless to me. Better to post some actual
analysis, if any can be done; or better, a URL or two pointing
to the actual analysis that's already been done.
[BTW, how many CPUs does your machine have, that sha1sum
is reported as using 104%? :-) ]
-Arthur
- Posted by Vlad Tepes on September 6th, 2003
Arthur J. O'Dwyer <ajo@andrew.cmu.edu> wrote:
Corey Murtagh wondered wether there could be a difference in seconds
between using crc32 or sha-1 on an average hard drive partition. My
informal benchmark show that this is very likely.
A more formal table comparing commonly used cryptographic algorithms
can be found at http://www.eskimo.com/~weidai/benchmarks.html .
That page suggests crc32 being about 4.5 times faster.
Hope this helps,
--
Vlad