Finding similar file names
This is a story about the creation of a nice script...
A friend of mine needed a script to find files which are similar, without reading their content. She suggested comparing the file attributes, like the file size and some other attributes.
So, I gave her this script. It works great when there not many files which have the same size. So, you can safely say that the files are similar if they have the same attributes.
But in some cases, this may be a real disaster because there may be lots of files that have the same attributes and are not actually similar. So, I came up with a new idea. I suggested grouping the files by size and compare their names.
If the shortest name is contained in the biggest name, then the files are similar. But not so fast. We need some rules, because we can have two files like: 'A happy file name.txt' and 'a.txt'! They are not similar, even if the shortest name 'a' is found in the longest name 'A happy file name'.
So, how to solve this?
Well, if the length of the shortest filename is lower than the half of the length of the longest filename, then they are not similar.
In other words, only when a substring of the shortest filename which has a length greater or equal than the half length of the longest filename and is a substring of the longest filename, then the files are similar.
Example:
'zzzgooglezzz' is similar with 'google', but not similar with 'googl'.
We all know that matching approximately is really slow, so I avoided using any approximation, even if I did this once in a very old-script.
This time, I wanted something fast and reliable. So, after many head-scratches, I wrote a pretty nice algorithm. It finds the longest substrings in the shortest filename, and when a substring is found in the the longest filename, it just returns the code 0. This is the script which I gave to my friend. It groups the files by size and compare their file names.
But, after this, a new idea came into mind. What about not grouping the files by size, but let's put them all together and compare the file names of each other.
This turned out to be extremely useful. I've found and deleted many MP3 files which had almost the same name, but different sizes - which didn't allowed me to find them when searching for duplicated files with various tools, like fdf.
To make it more interesting, all the substrings are compared case-insensitively, and the file names are UTF-8 decoded, so the character 'ș' is similar with 'Ș'.
We can, also make 'ă' to be similar with 'a', by converting the unicode characters to ASCII, using the Text::Unidecode module, but this is a little bit overkill, so I will leave this for the reader to implement. :)
Here is a screenshot:
Original Perl script:
https://github.com/trizen/perl-scripts/blob/master/Finders/find_similar_filenames.pl
Improved version in Perl (+ Levenshtein distance algorithm)
https://github.com/trizen/perl-scripts/blob/master/Finders/fsfn.pl
Faster version in C++:
https://github.com/trizen/cpp-learning/blob/master/fsfn.cpp
Even faster version in Go:
https://github.com/trizen/go-learning/blob/master/fsfn.go
Improved version in Perl (+ Levenshtein distance algorithm)
https://github.com/trizen/perl-scripts/blob/master/Finders/fsfn.pl
Faster version in C++:
https://github.com/trizen/cpp-learning/blob/master/fsfn.cpp
Even faster version in Go:
https://github.com/trizen/go-learning/blob/master/fsfn.go
Awesomeness!
ReplyDeleteI would buy you a beer if you add to it the ability NOT to track files as simular ones if they contain specific substring or word:
While looking on your finders on github account i noticed 2 of them:
find_similar_filenames.pl
find_similar_filenames_unidec.pl
Can you clarify what's the difference?
Thanks and keep up the good work!
P.S. Will there Python implementation after C++ and Go? ;-)
By 'ability NOT to track files as simular ones if they contain specific substring or word' I mean the following:
ReplyDeleteLet's say we want to exclude the following words: 'apple' and 'pear'. But we still don't want to have dups even with these keywords.
So, from the first subset, they should be divided on 2 subcategories and then processed regularly:
What we are having now (1 category):
-----------------
Super apple 1 filename
Super apple 2 filename
Super pear 1 filename
Super pear 2 filename
-----------------
with -f or -l option we will have only 'apple' or 'pear' in result. But I need both, since love fruits :)
What should be accomplished (2 subcategories):
-----------------
Super apple 1 filename
Super apple 2 filename
-----------------
Super pear 1 filename
Super pear 2 filename
-----------------
Thanks!
Also I'll buy you 2nd beer if you implement the ability to do comparison with minimum percentage similarity.
ReplyDeleteLike - mark the files as similar if they at least 50% // 70% // 95% the same.
Kudos to you!
Yay! Today I'm gonna drink two strong dark beers :D
DeleteTwo (actually three) new options flashed into existence:
--words [one] [or] [more] [words] [here] [...]
This option will group all the files in distinct groups which contain the specified words. Each word is actually a regular expression. For the words 'apple', written case-insensitive, it would be: '(?i:apple)'
--percentage [integer]
This option will mark the files as similar if they at least this percentage the same.
--round-up
For example: 75% of 21 is 15.75; this means that a sub-string of with a length >= 15 letters must exists in both filenames in order to be maked as similar. When '--round-up' option is specified, 15.75 will get round up to 16.5 (which is 16 in integer form). All it does is adding (percentage/100) to (len * percentage/100), giving stricter results.
The improved script can be found here: https://github.com/trizen/perl-scripts/blob/master/Finders/fsfn.pl
The script just became better and more flexible, and I would like to thank you for your awesome ideas. If you have more, don't hesitate to post them here, anytime. Thanks :D
Cheers,
Trizen
Sorry for the late response, Daniel. All I can say - wow! That's impressive >>
DeleteOctober 18, 2014 at 6:46 AM
October 18, 2014 at 8:34 AM
Thank u so much for super fast implementation! ...And welcome to the dark side! http://th06.deviantart.net/fs71/PRE/i/2012/135/0/a/guinness_poster_by_alwayswinter838-d4zxrsq.jpg
1923 ./fsim.pl --words=s,s .
ReplyDelete1924 ./fsim.pl --words=apple .
1925 ./fsim.pl --words apple .
1926 ./fsim.pl -w apple .
wasn't able to get work -w option - all the time getting back script start-screen:
usage: ./fsim.pl [options] /my/path [...]
Options:
-f --first! : keep only the first file from each group
-l --last! : keep only the last file from each group
-w --words=s,s : group individually files which contain this words
-s --size! : group files by size (default: off)
-p --percentage=i : mark the files as similar based on this percent
-r --round-up! : round up the percentange (default: off)
Example:
./fsim.pl --percentage=75 ~/Pictures
WARNING:
Options '-f' and '-l' will, permanently, delete your files!
Try to specify the directory before the -w option. Otherwise, the directory will get eaten by the greedines of this option, because it can take one or more arguments, and the directory will get considered to be a word.
DeleteWorking :) Thanks! I tried to put -w at the end yesterday as well, but didn't test it carefully.
DeleteIt would be awesome to have -w case insensitive optionally with -i flag. What do you think? I believe full-stack program already has this in place?
Hey Daniel,
ReplyDeleteJust few ##### comments:
$ ls -1
Super_apple_1_filename
Super_apple_2_filename
Super_pear_1_filename
Super_pear_2_filename
fsim.pl
$ ./fsim.pl .
./Super_apple_1_filename
./Super_apple_2_filename
./Super_pear_1_filename
--------------------------------------------------------------------------------
#### Super_pear_2_filename loosed somehow (?)
$ ./fsim.pl . -w apple
./Super_apple_1_filename
./Super_apple_2_filename
./Super_pear_1_filename
--------------------------------------------------------------------------------
#### Nicer would be to force create subset even with 1 keyword
#### i.e. all strings with apple substring should be processed in individual subset
$ ./fsim.pl . -w apple pear
./Super_apple_1_filename
./Super_apple_2_filename
--------------------------------------------------------------------------------
./Super_pear_1_filename
./Super_pear_2_filename
--------------------------------------------------------------------------------
#### Expected output :)
$ ./fsim.pl . -w APPLE pear
./Super_apple_1_filename
./Super_apple_2_filename
./Super_pear_1_filename
--------------------------------------------------------------------------------
#### Optional case insensitivity, please, pLeAsE, PLEASE :)
##### Also, I missing in result output list of files that are not similar:
these_files.jpg
are_not_similiar.pl
and_never.mp3
be_deleted.txt
--------------------------------------------------------------------------------
Thanks!
'Super_pear_2_filename' is omitted with 50% approximation because 'Super_apple_1_filename' and 'Super_pear_2_filename' differ by more than 50%. This is because of the algorithm comparing technique; it will not compare all the files between each other, rather it takes one file and tries to find other files similar to it. (in this case, 'Super_apple_1_filename' is compared with the rest of the files)
DeleteBy changing the percent, we get can different results.
#
## Procent 45%
#
$ perl fsfn.pl -p 45 .
./Super_apple_1_filename
./Super_apple_2_filename
./Super_pear_1_filename
./Super_pear_2_filename
--------------------------------------------------------------------------------
#
## Procent: 55%
#
$ perl fsfn.pl -p 55 .
./Super_apple_1_filename
./Super_apple_2_filename
--------------------------------------------------------------------------------
./Super_pear_1_filename
./Super_pear_2_filename
--------------------------------------------------------------------------------
#
## Added the case-insensitive option '-i' and support for '-w' with one word.
#
perl fsfn.pl . -i -w AppLe
./Super_pear_1_filename
./Super_pear_2_filename
--------------------------------------------------------------------------------
./Super_apple_1_filename
./Super_apple_2_filename
--------------------------------------------------------------------------------
### The files that are not similar are not listed by the script. I need to think more about this feature...
Improved script can be found at the same address: https://github.com/trizen/perl-scripts/blob/master/Finders/fsfn.pl
Thanks much, sir! Grabbed updated version
DeleteNo problem. Thank you for sharing this ideas.
DeleteBonus: https://github.com/trizen/perl-scripts/blob/master/Finders/fsfn_lev.pl (a fuzzy finder, using the Levenshtein distance)