- Would this be a useful p2p utility?
- Posted by anthonyberet on February 5th, 2005
I am learning to program in Python
www.python.org
I am quite new at it, but am just getting to the point where I want a
project which has a real purpose, and I have an idea for a useful file
utility which doesn't seem to exist out there already.
I want to make a fuzzy filename-matcher.
By that I mean that it would scan a directory, or directories, and make
a listing of the files therein, grouped according to the similarity of
their filenames. The purpose would be specifically for the kind of files
that p2p users tend to collect, in order to help weed out duplicates
which are different rips and therefore not detectable by CRC matching or
direct bit-level comparison.
A lot has been written about fuzzy-matching - most of which is quite
technical, but it seems to me that these existing routines are weak in
that they will strongly mark a match down if the words appear in a
different order in the compared strings. - This isn't much good for
filenames of media files.
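To illustrate the word-order point only: Python's standard difflib.SequenceMatcher is one order-sensitive matcher (picked here as an example, not necessarily one of the routines referred to above). The two strings are the same artist and title in swapped order. This small sketch compares its ratio with a plain letter-count check.

```python
from collections import Counter
from difflib import SequenceMatcher

a = "lovewilltearusapartjoydivision"
b = "joydivisionlovewilltearusapart"

# Order-sensitive: difflib scores runs of characters that match in sequence,
# so swapping artist and title pulls the ratio below 1.0.
sequence_ratio = SequenceMatcher(None, a, b).ratio()

# Order-insensitive: only the letter counts are compared.
same_letters = Counter(a) == Counter(b)

print(f"SequenceMatcher ratio: {sequence_ratio:.2f}")
print(f"Same letters regardless of order: {same_letters}")
```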
The routine I am thinking about using to make matches would be to first
copy only the alpha characters from the filenames of two files (lower-cased,
with the extension dropped), and to count the length of the resulting strings.
Thus The_Beatles - Eleanor_Rigby.mp3 and Elenor_Rigby-The_Beatles.ogg
would become 'thebeatleseleanorrigby' and 'elenorrigbythebeatles'
I would then give the shorter string a point for every character it
contains that is also contained in the longer string. I would also need
to code it so that it would allow for duplicate letters - i.e. there
should be 5 points awarded for the 'e's in the above example, but no
more than the longer string contains.
I would do this by looking for each letter in the shorter string in
turn, then removing the first occurrence of it from the longer string,
and repeating until all the shorter string has been processed.
I would then divide the score for the shorter string by the length of
the longer string, giving 21/22 in our example, and convert to a
percentage, or 95.45% - a close match for this one which is clearly a
duplicate.
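A rough sketch of the routine described above, in Python. The function names letters_only and similarity are made up for illustration, and lower-casing and dropping the extension are assumptions read off the example strings rather than anything stated outright.

```python
import os


def letters_only(filename):
    """Return the lower-cased alphabetic characters of a filename, minus its extension."""
    stem, _ext = os.path.splitext(filename)
    return "".join(ch for ch in stem.lower() if ch.isalpha())


def similarity(name_a, name_b):
    """Score two filenames as a percentage, ignoring letter order."""
    a, b = letters_only(name_a), letters_only(name_b)
    shorter, longer = sorted((a, b), key=len)
    if not longer:
        return 0.0  # neither name contains any letters

    remaining = list(longer)
    score = 0
    for ch in shorter:
        # One point per letter of the shorter string that is still available
        # in the longer one; removing it stops duplicates being over-counted.
        if ch in remaining:
            remaining.remove(ch)
            score += 1

    return 100.0 * score / len(longer)


print(round(similarity("The_Beatles - Eleanor_Rigby.mp3",
                       "Elenor_Rigby-The_Beatles.ogg"), 2))
```

For the Eleanor Rigby pair this gives 21 points out of 22. The removal loop is equivalent to counting the letters both strings share with duplicates respected, so collections.Counter could do the same job in fewer lines.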
Some more results from my algorithm:
Beatles - Nowhere man.mp3 vs The Beatles - Yellow Submarine.wma
'beatlesnowhereman', 'thebeatlesyellowsubmarine' =56%
'elvispresleybluesueadeshoes' Vs 'lovewilltearusapartjoydivision' =50%
But of course:
'presleyelvisbluesuedeshoes' Vs 'elvisbluesuedeshoes' = 76%
'lovewilltearusapartjoydivision' Vs 'joydivisionlovewilltearusapart' =100%
Of the strings I have examined by hand I think a threshold for a
probable match of about 70% should be useful. Of course it would cycle
through all the files in the directories to compare all with each other,
then present them in descending percentage order.
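The compare-everything step might look something like the sketch below. The names score, list_files and rank_pairs are hypothetical, the 70% default is just the threshold suggested above, and the scoring is a compact letter-count version of the routine described earlier. Every file is compared with every other, so the work grows with the square of the number of files.

```python
import os
from collections import Counter
from itertools import combinations


def score(path_a, path_b):
    """Percentage of shared letters, relative to the longer stripped name."""
    def letters(path):
        stem, _ext = os.path.splitext(os.path.basename(path))
        return [ch for ch in stem.lower() if ch.isalpha()]

    a, b = letters(path_a), letters(path_b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    # Counter & Counter keeps the minimum count of each letter,
    # i.e. the letters both names share, duplicates respected.
    shared = sum((Counter(a) & Counter(b)).values())
    return 100.0 * shared / longest


def list_files(root):
    """Collect the paths of all files under a directory, recursively."""
    paths = []
    for folder, _subdirs, files in os.walk(root):
        paths.extend(os.path.join(folder, name) for name in files)
    return paths


def rank_pairs(paths, threshold=70.0):
    """Return (score, path_a, path_b) for pairs at or above the threshold, best first."""
    scored = [(score(a, b), a, b) for a, b in combinations(paths, 2)]
    matches = [m for m in scored if m[0] >= threshold]
    return sorted(matches, key=lambda m: m[0], reverse=True)


names = [
    "The_Beatles - Eleanor_Rigby.mp3",
    "Elenor_Rigby-The_Beatles.ogg",
    "Love Will Tear Us Apart - Joy Division.mp3",
    "Joy Division - Love Will Tear Us Apart.wma",
]
result = rank_pairs(names)
for pct, a, b in result:
    print(f"{pct:6.2f}%  {a}  <->  {b}")
```

To run it over a real collection, pass rank_pairs(list_files(some_directory)) instead of the hard-coded list.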
A bit later I will find out how to add routines so that users can play
the matched files and delete/move them, via a right-click.
What do you guys think? - Would you find a utility like this useful?
- If you know of one, then do let me know, but I might well go ahead
with this anyway.
What I *really* want is for somebody to come up with a better
comparative algorithm - as the above is by no means perfect, but from my
reading on the subject, I think this is hard to do.
Any suggestions welcome, anyway.
Any other ideas for utilities in fact? (try to keep them simple).
- Posted by Triffid on February 6th, 2005
anthonyberet wiffled:
<my head hurts>
You are up against the Turing Test. Suppose search 'thebeatleseleanorrigby'
and 'elenorrigbythebeatles' = 0 results. Try search eleanor -lindisfarne.
Zero results? Try 'The Beatles', that should get something, even if it's
wrong.
--
Despite appearances, it is still legal to put sugar on cornflakes.
- Posted by anthonyberet on February 6th, 2005
Triffid wrote:
searches (whether Winmx or otherwise).
The purpose of my utility will be to help manage collections of files
stored locally - i.e. if you collect an artist avidly, you will at times
download the same track in multiple rips - My utility is to help weed
the extras out and prevent them taking up HD space or causing repetition
in playlists.
I do this periodically with the windoze find-file facility, but I have
to search my HDs for each artist in turn, and then study the results for
duplicates - this is intended to bring such dupes to one's attention
with a minimum of effort.
- Posted by Triffid on February 6th, 2005
anthonyberet wiffled:
I use NoClone http://noclone.net/ for that, but you've gotta be really
*really* careful.
- Posted by Don M. on February 6th, 2005
"anthonyberet" <nospam@me.invalid> wrote in message
news:36l7npF52lmmjU1@individual.net...
Trying to reinvent the wheel there? Why not use a database? There are many
free ones for handling music collections. The one I use can search
duplicates looking at tags and/or name and/or property (such as bitrate).
Don
- Posted by anthonyberet on February 6th, 2005
Don M. wrote:
particular role - comparing id3 tags is all well and good if the files
are tagged properly, but software to do this that I have tried has had
issues with being case-sensitive, and still couldn't recognise that
'beatles', 'The_Beatles', 'Beatles, The' etc are all the same.
Noclone is good for matching actual identical files (its fuzzy-matcher
is shit), but I prefer Duplic8 for that.
Bitrate shouldn't be an issue at all, apart from when choosing which to
delete.
Which database do you use though? - I will have a play with it.
- Posted by Don M. on February 6th, 2005
"anthonyberet" <nospam@me.invalid> wrote in message
news:36laanF52ib94U1@individual.net...
Mpeg Audio Collection http://mac.sourceforge.net/
Searches can be made case sensitive, it's an option; and you can search for
"eatles" if in doubt.
People who put the artist in the folder name but not in the filenames may want to
search tags.
Don
- Posted by anthonyberet on February 6th, 2005
Don M. wrote:
<snip>
though?
- It doesn't search whole drives for possible matches that you haven't
noticed?
- Posted by Don M. on February 6th, 2005
"anthonyberet" <nospam@me.invalid> wrote in message
news:36llohF3b10feU1@individual.net...
You can scan your whole drives into the database as "volumes" and search the
database instead of doing a direct search on each drive.
I use it mostly for organizing offline CDR's and searching those; all 5 of
them.
Not sure what you mean by "only looks for files that you specify". I hope
it doesn't look for files you don't specify... 
Don
- Posted by anthonyberet on February 6th, 2005
Don M. wrote:
Yes I do want one that looks for files I don't specify - I want to be
able to specify the directory and filetype, but the point of my project
is to inform me of duplicates that I am not aware of.
- Posted by Don M. on February 6th, 2005
"anthonyberet" <nospam@me.invalid> wrote in message
news:36mianF52qdv6U1@individual.net...
That's where looking for properties and/or tags would be useful, you know,
stuff that's not obvious when reading folder/file names.
Of all the gin joints, I had to walk into this one. Being in the same
groups as you must qualify as a random event (thanks, BJ, for explaining
randomness)........ 
Don
- Posted by anthonyberet on February 6th, 2005
Don M. wrote:
Anyway, what file utilities do you dream of?
- Maybe we could actually create them...
- Posted by Don M. on February 6th, 2005
"anthonyberet" <nospam@me.invalid> wrote in message
news:36n4olF53c2k5U2@individual.net...
Oh boy, you love to ask tough questions, don't you?
It already exists, a file utility that is everything Windows Explorer should
have been. It's called EF Commander, http://www.efsoftware.com/cw/e.htm ,
the Windows version of the old DOS file utility, Norton Commander. Some of
the same hot keys, e.g. F3 to view a file, F4 to edit, F5 to copy, F6 to
move, F7 to create a new folder, F8 to delete, + to select files, - to
de-select. Then you can set it to copy files from CD without the annoying
"read only" flag, open a ZIP/RAR file and copy/move contents as if it were a
folder, auto save current view so you always open the file manager where you
left it or always open it with the same view (auto save disabled), compare
directories, ftp just like you're using a local drive, navigate shares on
your LAN, type in a command (in the line box at the bottom), show folder
sizes (press Alt+F9), and still double-click to run a program plus all the
right-click tasks you do with Windows Explorer, and much more. To show
folder size is especially useful when you want to determine which folders
(e.g. albums) and/or files to fit on a CDR, so you do an Alt+F9 and press
the space bar to tag/untag folders (the total tagged appears on the gray
status bar at the bottom); then shrink the Commander window, drag & drop
selection onto your burning application's project window. Once you get used
to Commander, you'll never open Windows Explorer again.
And look! They have a "Duplicate Files Manager", which I hadn't noticed
before.
Don