- Byte to byte compare, duplicate file finder/killer
- Posted by penang on September 18th, 2007
Hi !
I am thinking of creating a duplicate file finder ( and/or killer). I
have downloaded and played with several of such utilities, each has
its own strengths and weaknesses. Among the common file comparison
features are checksum, MD4, MD5 and byte-by-byte.
Before I give this project a go, I need to humbly ask all the gurus
here if (any of) you know of any pre-existing similar project, and/or
code (or even part of the source code) that I can take a look see.
I ain't good at programming, in fact, the main reason I want to do
this project is to learn in ins and outs of assembly language. Plus,
imho, assembly language really fits this project.
I thank you all !
Good Day !
- Posted by Bjarni Juliusson on September 18th, 2007
penang wrote:
I'd say shell script really fits it.
But I suppose it isn't much harder in assembly, and it might be useful
to get to do some system calls, file handling and dynamic memory
allocation and all that.
What should be done? Traverse the directory tree and stat all files to
get the sizes and store names and sizes in an array, then sort the array
by size, then find identical sizes and checksum the files and sort by
cheksum within each length, then compare all pairs with identical size
and cheksum?
The checksums are good in the cases where there are many files of the
same size, to reduce the number of pairs to compare. There is some room
for optimisation there.
I didn't really answer your request. I don't know of any particular
program to do this. :-)
Bjarni
--
INFORMATION WANTS TO BE FREE
- Posted by Terje Mathisen on September 18th, 2007
penang wrote:
Asm has absolutely _nothing_ to recommend it for this project, except
the sufficient reason you've already stated: You want to learn!
Anyway, the basic design should be something like this:
Stage 1) For each file collect meta data: Name, Size, Timestamp.
Stage 2) Sort this first list by file size. This is an important
optimization because there's no need to actually open and read every
byte of a file which has a unique size: It obviously cannot be a true
dupe of any other file,so we can remove it from the list1
Stage 3) Go through the sorted list, for each file size that remains all
the potential duplicates must be checked. If there are N files with size
S, then a byte-to-byte compare will need to process each file N-1 times,
so something faster is needed:
I would use a reasonably fast checksum algorithm (CRC-32) instead of a
cryptographically strong hash (MD5 or SHA-160), simply because the CRC
calculation is much faster to calculate.
Stage 4) For any pair of files that survive the previous stages, we do a
full byte-to-byte compare, this will confirm that they are indeed
identical, with odds of about 1:4e9 against a false positive.
By using a secure hash instead of CRC, the actual byte-to-byte compare
can be skipped, but since both files will probably be in the system file
cache at this point, so the compare will run at memory bandwidth speed.
The final runtime of this process will be totally limited by the disk IO
rate, so you can write it in asm, C, Java, Basic or perl: The timing
will be more or less the same.
(In fact, were I to write such a program I would almost certainly use
Perl, since it has all the infrastructure needed. :-)
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
- Posted by robertwessel2@yahoo.com on September 18th, 2007
On Sep 18, 2:50 pm, Terje Mathisen <spamt...@crayne.org> wrote:
Well, a decent implementation of a secure hash function (say SHA-1) is
still going to be fast enough that you're basically going to be I/O
limited. SHA-1 should process well in excess of 100MB/s on any semi-
modern PC. An advantage of a cryptographically secure hash is that
you're not at risk for a severe performance problem, potentially to
the level of a DoS attack, from encountering a maliciously constructed
set of files, namely (different) files which generate the same
checksum (files with identical CRCs or checksums are trivial to
generate). Consider what happens if you have a thousand megabyte
files generating the same checksum - you're now stuck doing half a
million file comparisons.
If you're not worried about malicious attackers, a simple 32-bit
checksum is probably just as good as CRC-32 for screening purposes,
and certainly easier to implement.
- Posted by Gianni Mariani on September 18th, 2007
I removed all cross posting and left comp.programming.
penang wrote:
Yes - sounds like a nice tool - I have wanted to write this one myself
for a while.
If C++ is a language you want to use, I have a option for you, the
latest alpha of austria c++ has all the tools you need, it should be
fairly trivial and you can make a cross platform (win32/linux) version
fairly quickly.
The issues you will find is that when byte for byte comparing files, you
start thrashing the disk, so you want to read large chunks of files, and
if you have large numbers of small files, you may want to issue multiple
reads simultaneously.
Anyhow, you probably want to run a very low cost hash function across
all your files and only bother comparing the hits. One thought is that
you only hash the first 64K of each file and hash that and the size.
This would probably eliminate most of the hits. Then, if you have hits,
you can either do a more expensive hash or compare files byte by byte
(or both is much better).
If you're interested in doing this, I will help. We can add it to the
austria C++ project on sourceforge if you like...
- Posted by Logan Shaw on September 19th, 2007
Bjarni Juliusson wrote:
I agree. In the past, I've tended to use shell scripts for this.
The first time I did it, I had about 100 Unix machines and needed to
run a program on all of them, then check the output of all of every
run to make sure there were no errors. Often, the output would be
identical among lots of hosts, but not always.
So, I'd write a little script to run the program on every host and
store the output in one file per host:
mkdir output
while read host
do
rsh -n $host /path/to/program > output/$host 2>&1 &
done < host-list
wait
(Yes, I used good old insecure rsh -- it was a while ago.)
Then it was time to group output into groups of equivalent files so
I didn't have to read them all:
cd output
for f in *
do
sum=`sum $f | awk '{print $1}'`
mkdir $sum
mv $f $sum
done
That works for a set of disposable files. A more general approach
for files strewn all over everywhere would be something like this:
find starting-directory -type f -exec sha1sum '{}' '+' |
sort |
less
Pretty concise and it does the trick (although it reports all files,
not just those with duplicates).
- Logan
- Posted by Terje Mathisen on September 19th, 2007
robertwessel2@yahoo.com wrote:
You're right, the best approach does depend on the number of equal-sized
files, and how paranoid you want to be:
If there are just two files of the same size, we should simply compare
them until the first difference, skipping the (secure) hash stage.
Since MD5/SHA-160 can indeed keep up with disk IO rates, they can be
used instead of a simpler initial checksum. I would probably go for
using two or three orthogonal 32-bit checksums instead:
CRC-32, simple ADD of all dwords (possibly with carry wraparound like
the TCPIP checksum), and an XOR of the same dwords.
This is just to limit the amount of metadata that must be saved, but
using MD5 directly would be fine, except that it would make it harder to
write everything from scratch, in asm.
If the files are large (larger than a MB or so), I would split the
checking into two stages:
a) Calculate the checksums/hashes for the first MB, then compare these.
If they are different, then there is no need to process the rest of a
650 MB ISO image, right?
b) Starting from the data generated by (a), do the full hash calculation.
Right, see above.
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
- Posted by Phil Carmody on September 19th, 2007
Terje Mathisen <spamtrap@crayne.org> writes:
It's not actually necessary to sort, one simply takes the size
as the single most important discriminator between files.
This presumes you have the RAM to remember every file you've
already seen (presumably in an a structure indexed by size)
so that as soon as the second with that size appears you can
md5sum (or whatever) it.
But I/O bound in general. Taking the largest file I could find,
cksum and md5sum were assymptotically identical in user/sys time.
Without the files in the cache, disk I/O would dominate both
procedures. The fact that CRCs can be stored in 32 bits doesn't
matter, one could simply truncate the md5 if memory was an issue.
Perl also has the advantage that if you're trying to compact,
for example, Eric Weisstein's MathWorld's multiply repeated
images you can use the builtin regular expression functions
to rewrite all the IMG SRC urls in the .html files at the
same time to point to the copy of the file that you actually
want to keep.
Phil
--
Dear aunt, let's set so double the killer delete select all.
-- Microsoft voice recognition live demonstration
- Posted by hutch-- on September 19th, 2007
hmmmm,
The above statement is probably the most profound for the
circumstances. After checking file length and perhaps something to do
with file name the main factor for speed is how fast you can load the
two files to compare them. It would be a toss up between how large a
chunk to load versus how fast you find a mismatch with the smaller
chunk suiting the mismatch and the larger suiting the match.
Apart from testing them with the largest data size possible (QWORD
OWORD etc ...) the next problem is page thrashing between two large
files that don't both fit into cache. Testing a couple of 1 gig VOB
files would easily create this problem.
As the operating is a READ READ type I don't think off the top of my
head that cache pollution avoidance methods are any use here and the
final limit may simply be page thrashing.
Regards,
hutch at movsd dot com
- Posted by (Richard Harter) on September 19th, 2007
On Tue, 18 Sep 2007 07:55:16 -0700, penang <spamtrap@crayne.org>
wrote:
I have some good news and some bad news for you. The good news
is that there are some really nifty things you can do to make a
duplicate file finder. The bad news is that you will need to be
good at programming to implement them.
That said, you might look at Patricia trees or unordered radix
trees instead of using hashes. Why? Because when you use a
radix tree you do not have to (usually) look at the entire file
to discover that it fails to match an existing file. The savings
in I/O time can be enormous.
There are a number of public implementations of Patricia trees
available - you could search for one of them. However I will
suggest that you look at unordered radix trees. They are quite
new - as far as I know I am the inventor. There is an article
describing them at http://home.tiac.net/~cri/2007/urtree.html.
There are links to two implementations. They are set up for
key/value pairs; in this case the keys are the actual contents
and the values are the file ids. It also assumes that the keys
are in memory; you will have to modify the code to do file reads.
The licenses are BSD licenses.
Richard Harter, cri@tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
But the rhetoric of holistic harmony can generate into a kind of
dotty, Prince Charles-style mysticism. -- Richard Dawkins
- Posted by Mark F on September 20th, 2007
On Tue, 18 Sep 2007 07:55:16 -0700, penang <spamtrap@crayne.org>
wrote:
Salty Brine's FolderMatch, among others.
However, with NTFS there are things called "alternate streams"
that are ignored in all of the tree comparison programs that I found.
If you are interested NTFS you should look be sure to support
alternate streams.
- Posted by Mark F on September 20th, 2007
On Tue, 18 Sep 2007 14:37:03 -0700, "robertwessel2@yahoo.com"
<spamtrap@crayne.org> wrote:
1. compute a few different checksums at the same time.
2. Make a checksum that varies the parameters of its computation based
upon a starting random number. (For example, the weight of each
of the 8 32-bit fields used in making a 256-bit checksum would
depend on the starting seed.) (Determine if a second pass with
a different seed should be done before proceeding to the next
major step. Compare with using a 512-bit checksum to begin with.)
3. read and compare random parts of a bunch of files at the same time
Maybe pick 1% of each of 1000 files to compare with one "skip
pass" over each file. (The exact order of things here could
greatly effect performance, so your task is to determine the best
way.)
Pick the amount you compare at each spot
in each file to be a nice size compared to the disk organization.
Pick the samples randomly.
Determine what a good performance goal is and what worst case is.
- Posted by Lionel Delafosse on September 21st, 2007
"Richard Harter" <spamtrap@crayne.org> a écrit dans le message de news:
46f1447d.647566656@news.sbtc.net...
I have to disappoint you a bit. This data structure exists since 1981:
search for "trie hashing" or "hachage digital' (in french). Although given
very fast access and space efficience, the initial paper (Witold Litwin,
Trie hashing, Proceedings of the 1981 ACM SIGMOD international) has confined
this DS to the area of database access systems. There is a lot of different
versions (different radix, etc.) and cost studies: a search on Google give
you pages of links.
Best regards,
Lionel Delafosse
- Posted by Lionel Delafosse on September 21st, 2007
"Richard Harter" <spamtrap@crayne.org> a écrit dans le message de news:
46f1447d.647566656@news.sbtc.net...
I have to disappoint you a bit. ;-) This data structure exists since 1981:
search for "trie hashing" or "hachage digital' (in french).
Although given very fast access and space efficience, the initial paper
(Witold Litwin, Trie hashing, Proceedings of the 1981 ACM SIGMOD
international) has confined this data structure to the area of database
access systems.
There is a lot of different versions (different radix, etc.), improvments
and cost studies: a search on Google give you pages of links.
Best regards,
Lionel Delafosse
- Posted by Richard Harter on September 21st, 2007
On Fri, 21 Sep 2007 14:15:15 +0200, "Lionel Delafosse"
<lionel.delafosse@numericable.fr> wrote:
Thanks for the comment and the suggestion to look at trie
hashing. I wasn't familiar with trie hashing so I am just now
going through Leen Torenvliet and P. van Emde Boas's paper which
seems to be a good one for my purposes.
If my initial impression is correct, the unordered radix tree
data structure is quite different from trie hashing data
structures even though they do share some features. In
particular the UR tree preserves neither order nor prefixes
whereas tries and trie hashing do preserve order. My impression
is that trie hashing is much more closely related to Patricia
trees than it is to UR trees.
Again, thanks for the comment and the pointer.
Richard Harter, cri@tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
But the rhetoric of holistic harmony can generate into a kind of
dotty, Prince Charles-style mysticism. -- Richard Dawkins
- Posted by robertwessel2@yahoo.com on September 21st, 2007
On Sep 20, 4:16 pm, Mark F <mark49...@gmail.com> wrote:
Instead of all that trouble, just use SHA-1 or SHA-256. It's solid,
and will be more than fast enough.
Well, you can't really pick the samples randomly, or you won't get
comparable hashes. But it would not be unreasonable to checksum, say
the first 64KB of each file, and then 64KB every megabyte or ten
thereafter, to generate a filtering hash value. You need to skip
fairly lightly though, since you're replacing a sequential read of the
file (with a little luck the disk won't be too fragmented), with a
bunch of seeks. And then you'd still have to go back and hash the
complete file if it matches on the filter hash. It would probably be
just as good to base the filter has on just some fixed beginning
portion of the file (as suggested elsewhere in the thread). That
should be big enough to fully hash most files (perhaps something on
the order of a megabyte), leaving only really big files as needing a
final pass (or with the sampled hash).