Finding similar file names


This is a story about the creation of a nice script...

A friend of mine needed a script to find files which are similar, without reading their content. She suggested comparing the file attributes, like the file size and some other attributes.

So, I gave her this script. It works great when there are not many files which have the same size. In that case, you can safely say that files with the same attributes are similar.

But in some cases this can be a real disaster, because there may be lots of files that have the same attributes and are not actually similar. So, I came up with a new idea. I suggested grouping the files by size and comparing their names.
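Grouping by size can be sketched roughly like this. This is a Python illustration, not the original Perl script, and the function name group_by_size is made up for the example:

```python
import os
from collections import defaultdict

def group_by_size(paths):
    """Map each file size to the paths having that size, keeping only
    sizes shared by at least two files (hypothetical helper)."""
    groups = defaultdict(list)
    for path in paths:
        groups[os.path.getsize(path)].append(path)
    # A file with a unique size has nothing to be compared against.
    return {size: names for size, names in groups.items() if len(names) > 1}
```

Only the names inside one group then need to be compared with each other.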

If the shorter name is contained in the longer name, then the files are similar. But not so fast. We need some rules, because we can have two files like 'A happy file name.txt' and 'a.txt'! They are not similar, even though the shorter name 'a' is found in the longer name 'A happy file name'.

So, how to solve this?

Well, if the length of the shorter filename is less than half the length of the longer filename, then they are not similar.
In other words, the files are similar only when some substring of the shorter filename is at least half as long as the longer filename and also occurs in the longer filename.

Example:
     'zzzgooglezzz' is similar to 'google', but not similar to 'googl'.
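The rule can be written down as a small predicate. What follows is a Python sketch for illustration only (the original is Perl). The name is_similar is invented here, and rounding "half" up for odd lengths is my assumption, since the text does not say:

```python
def is_similar(name_a: str, name_b: str) -> bool:
    """True if a long enough piece of the shorter name occurs in the longer one."""
    shorter, longer = sorted((name_a, name_b), key=len)
    # Assumption: half the longer name's length, rounded up.
    min_len = max(1, (len(longer) + 1) // 2)
    if len(shorter) < min_len:
        return False
    # Any common substring of at least min_len characters contains one of
    # exactly min_len characters, so checking that window size is enough.
    return any(shorter[i:i + min_len] in longer
               for i in range(len(shorter) - min_len + 1))

print(is_similar("zzzgooglezzz", "google"))  # True
print(is_similar("zzzgooglezzz", "googl"))   # False
print(is_similar("A happy file name", "a"))  # False
```

The names are compared without their extensions here, as in the 'a.txt' example.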

We all know that approximate matching can be really slow, so I avoided any approximation, even though I did use it once in a very old script.

This time, I wanted something fast and reliable. So, after much head-scratching, I wrote a pretty nice algorithm. It looks for the longest substrings of the shorter filename, and as soon as one of them is found in the longer filename, it just returns the code 0. This is the script which I gave to my friend. It groups the files by size and compares their file names.
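A longest-first search of this kind might look roughly as follows. This is a Python sketch and not the Perl code itself. The name match_code, the return value 1 for "no match", and the rounding up of "half" are all assumptions of mine:

```python
def match_code(name_a: str, name_b: str) -> int:
    """Return 0 when a long enough substring of the shorter name is found
    in the longer name, and 1 otherwise (the 1 is an assumption)."""
    shorter, longer = sorted((name_a, name_b), key=len)
    # Assumption: the minimum length is half the longer name, rounded up.
    min_len = max(1, (len(longer) + 1) // 2)
    # Try the longest candidate substrings first and stop at the first hit.
    for size in range(len(shorter), min_len - 1, -1):
        for start in range(len(shorter) - size + 1):
            if shorter[start:start + size] in longer:
                return 0
    return 1

print(match_code("zzzgooglezzz", "google"))  # 0
print(match_code("zzzgooglezzz", "googl"))   # 1
```

Stopping at the first hit means that the common case, where the whole shorter name is contained in the longer one, is decided after a single substring check.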

But after this, a new idea came to mind. What about not grouping the files by size at all, but putting them all together and comparing every file name with every other one?
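Comparing every name with every other one is a loop over all unordered pairs. Here is a minimal Python sketch (again, not the original Perl). The function similar_pairs is hypothetical, and the contains predicate is only a simple stand-in for the real similarity rule:

```python
from itertools import combinations

def similar_pairs(names, similar):
    """Yield every unordered pair of names for which similar(a, b) is true."""
    for a, b in combinations(names, 2):
        if similar(a, b):
            yield a, b

# Stand-in predicate for the demo only: plain case-insensitive containment.
def contains(a, b):
    a, b = a.lower(), b.lower()
    return a in b or b in a

names = ["Artist - Song", "artist - song (live)", "Other track"]
print(list(similar_pairs(names, contains)))
# [('Artist - Song', 'artist - song (live)')]
```

The number of comparisons grows quadratically with the number of files, which is the price paid for no longer grouping by size.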

This turned out to be extremely useful. I've found and deleted many MP3 files which had almost the same name but different sizes, which had prevented me from finding them when searching for duplicate files with various tools, like fdf.

To make it more interesting, all the substrings are compared case-insensitively, and the file names are decoded as UTF-8, so the character 'ș' is treated as similar to 'Ș'.
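Why the decoding matters can be shown in a few lines. This is a Python illustration, not the Perl code, and normalize_name is a name invented for the example:

```python
def normalize_name(raw: bytes) -> str:
    """Decode a raw file name as UTF-8, then fold case for comparison."""
    return raw.decode("utf-8").casefold()

upper = "\u0218.mp3".encode("utf-8")  # 'Ș.mp3' as raw bytes
lower = "\u0219.mp3".encode("utf-8")  # 'ș.mp3' as raw bytes

# Lowercasing the raw bytes only touches ASCII letters, so the names differ.
print(upper.lower() == lower.lower())                  # False
# After decoding, case folding works on characters, so the names agree.
print(normalize_name(upper) == normalize_name(lower))  # True
```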

We can also make 'ă' similar to 'a' by converting the Unicode characters to ASCII using the Text::Unidecode module, but this is a little bit overkill, so I will leave this for the reader to implement. :)
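For the curious, here is one possible starting point in Python. Note that this is not Text::Unidecode: it only removes combining marks after Unicode decomposition, so it handles accented Latin letters but does not transliterate other scripts. The name strip_marks is made up for the example:

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_marks("ă"))        # a
print(strip_marks("Știință"))  # Stiinta
```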

Here is a screenshot:



Original Perl script: