Finding similar file names

By Daniel Șuteu, June 22, 2013

This is a story about the creation of a nice script...

A friend of mine needed a script to find files which are similar, without reading their content. She suggested comparing the file attributes, like the file size and some other attributes.

So, I gave her a script that does exactly that. It works great when there are not many files which have the same size. In that case, you can safely say that the files are similar if they have the same attributes.

But in some cases, this may be a real disaster, because there may be lots of files that have the same attributes and are not actually similar. So, I came up with a new idea. I suggested grouping the files by size and comparing their names.

If the shortest name is contained in the longest name, then the files are similar. But not so fast. We need some rules, because we can have two files like: 'A happy file name.txt' and 'a.txt'! They are not similar, even if the shortest name 'a' is found in the longest name 'A happy file name'.

So, how to solve this?

Well, if the length of the shortest filename is less than half the length of the longest filename, then they are not similar.

In other words, the files are similar only when the shortest filename has a substring whose length is at least half the length of the longest filename, and that substring also occurs in the longest filename.

Example:

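'zzzgooglezzz' is similar to 'google', but not similar to 'googl'. The longest name has 12 characters, so the common substring needs at least 6: 'google' has exactly 6, while 'googl' has only 5.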
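
The rule above can be sketched in a few lines of Python. This is only an illustration, not the code from the Perl script: the function name `is_similar` is made up for this example, rounding half the length up for odd-length names is an assumption, and case handling is left out on purpose.

```python
def is_similar(name_a: str, name_b: str) -> bool:
    """Return True if the two names share a substring whose length is
    at least half the length of the longer name."""
    shorter, longer = sorted((name_a, name_b), key=len)
    # Assumption: half of an odd length is rounded up.
    min_len = -(-len(longer) // 2)
    if min_len == 0 or len(shorter) < min_len:
        return False
    # Try the longest candidate substrings of the shorter name first.
    for size in range(len(shorter), min_len - 1, -1):
        for start in range(len(shorter) - size + 1):
            if shorter[start:start + size] in longer:
                return True
    return False


print(is_similar("zzzgooglezzz", "google"))    # True
print(is_similar("zzzgooglezzz", "googl"))     # False
print(is_similar("A happy file name", "a"))    # False
```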

Approximate matching tends to be really slow, so I avoided using any approximation, even though I did this once in a very old script.

This time, I wanted something fast and reliable. So, after much head-scratching, I wrote a pretty nice algorithm. It takes the longest substrings of the shortest filename, and when one of them is found in the longest filename, it just returns the code 0. This is the script which I gave to my friend. It groups the files by size and compares their file names.
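
The grouping-by-size step could look roughly like this in Python. It is a sketch, not a port of the Perl script: `scan` and `group_by_size` are hypothetical names, and skipping unreadable files is an assumption about what a reasonable tool would do.

```python
import os
from collections import defaultdict


def scan(root):
    """Yield (path, size) for every file name that os.walk reports under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                yield path, os.path.getsize(path)
            except OSError:
                continue  # assumption: unreadable or vanished files are skipped


def group_by_size(entries):
    """Group (path, size) pairs by size; keep only groups with 2+ files."""
    groups = defaultdict(list)
    for path, size in entries:
        groups[size].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling `group_by_size(scan("."))` returns the candidate groups; only the names inside each group would then be compared with each other.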

But, after this, a new idea came to mind. What about not grouping the files by size, but putting them all together and comparing every file name with every other one?
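
As a rough Python sketch of this "compare everything with everything" idea (again, not the original script): `share_half` and `similar_pairs` are hypothetical names, stripping the directory and the extension before comparing is an assumption, and so is rounding half the length up.

```python
import os
from itertools import combinations


def share_half(a: str, b: str) -> bool:
    """True if a and b share a substring at least half as long as the longer one."""
    shorter, longer = sorted((a, b), key=len)
    need = -(-len(longer) // 2)  # assumption: half, rounded up
    if need == 0:
        return False
    # If any long enough common substring exists, one of length `need` does too.
    return any(shorter[i:i + need] in longer
               for i in range(len(shorter) - need + 1))


def similar_pairs(paths):
    """Return every pair of paths whose base names (without extension) are similar."""
    stems = {p: os.path.splitext(os.path.basename(p))[0] for p in paths}
    return [(p, q) for p, q in combinations(paths, 2)
            if share_half(stems[p], stems[q])]
```

Note that this compares each pair once, so the number of comparisons grows quadratically with the number of files.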

This turned out to be extremely useful. I've found and deleted many MP3 files which had almost the same name, but different sizes - which didn't allow me to find them when searching for duplicated files with various tools, like fdf.

To make it more interesting, all the substrings are compared case-insensitively, and the file names are UTF-8 decoded, so the character 'ș' is treated as similar to 'Ș'.
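
In Python, the same effect can be approximated by decoding the raw name and folding its case before any comparison. The helper name `fold_name` is hypothetical, and replacing invalid bytes instead of failing is an assumption, not something the Perl script is known to do.

```python
def fold_name(raw: bytes) -> str:
    """Decode a UTF-8 file name and fold its case for comparison."""
    # errors="replace" keeps going on invalid bytes instead of raising.
    return raw.decode("utf-8", errors="replace").casefold()


upper = fold_name("Ș".encode("utf-8"))
lower = fold_name("ș".encode("utf-8"))
print(upper == lower)  # True
```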

We can also make 'ă' similar to 'a', by converting the Unicode characters to ASCII using the Text::Unidecode module, but this is a little bit overkill, so I will leave this for the reader to implement. :)
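
For readers who want to try this in Python, here is one possible sketch. It does not use Text::Unidecode (that is a Perl module); it swaps in a simpler technique from the standard library: decompose each character and drop the combining marks. The name `strip_marks` is made up for this example.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Remove diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_marks("ă"))       # a
print(strip_marks("Ștefan"))  # Stefan
```

This only handles letters that decompose into a base letter plus marks. Characters such as 'ø' or 'ß' have no such decomposition and pass through unchanged, so it is not a full transliteration.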

Here is a screenshot:

Original Perl script:

https://github.com/trizen/perl-scripts/blob/master/Finders/find_similar_filenames.pl

Improved version in Perl (+ Levenshtein distance algorithm):
https://github.com/trizen/perl-scripts/blob/master/Finders/fsfn.pl

Faster version in C++:
https://github.com/trizen/cpp-learning/blob/master/fsfn.cpp

Even faster version in Go:
https://github.com/trizen/go-learning/blob/master/fsfn.go
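
For readers who have not met the Levenshtein distance mentioned above: it counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The following is only the textbook dynamic-programming form in Python, with the name `levenshtein` chosen for this example; how the linked scripts actually use the distance is not shown here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the working row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


print(levenshtein("google", "googl"))  # 1
```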

Comments

Anonymous, October 18, 2014 at 3:23 AM
Awesomeness!

I would buy you a beer if you add to it the ability NOT to track files as similar ones if they contain a specific substring or word.

While looking at the finders in your GitHub account, I noticed 2 of them:

find_similar_filenames.pl
find_similar_filenames_unidec.pl

Can you clarify what the difference is?

Thanks and keep up the good work!

P.S. Will there be a Python implementation after C++ and Go? ;-)
Anonymous, October 18, 2014 at 3:44 AM
By 'ability NOT to track files as similar ones if they contain specific substring or word' I mean the following:

Let's say we want to exclude the following words: 'apple' and 'pear'. But we still don't want to have duplicates, even with these keywords.
So, the files from the first subset should be divided into 2 subcategories and then processed regularly:

What we are having now (1 category):

-----------------
Super apple 1 filename
Super apple 2 filename
Super pear 1 filename
Super pear 2 filename
-----------------

With the -f or -l option we will have only 'apple' or 'pear' in the result. But I need both, since I love fruits :)

What should be accomplished (2 subcategories):

-----------------
Super apple 1 filename
Super apple 2 filename
-----------------
Super pear 1 filename
Super pear 2 filename
-----------------

Thanks!

Anonymous, October 18, 2014 at 6:46 AM
Also, I'll buy you a 2nd beer if you implement the ability to compare with a minimum percentage of similarity.
For example: mark the files as similar if they are at least 50% // 70% // 95% the same.

Kudos to you!
Anonymous, November 3, 2014 at 4:34 PM
./fsim.pl --words=s,s .
./fsim.pl --words=apple .
./fsim.pl --words apple .
./fsim.pl -w apple .

I wasn't able to get the -w option to work. Every time, I get back the script's usage screen:

usage: ./fsim.pl [options] /my/path [...]

Options:
-f --first! : keep only the first file from each group
-l --last! : keep only the last file from each group
-w --words=s,s : group individually files which contain this words
-s --size! : group files by size (default: off)
-p --percentage=i : mark the files as similar based on this percent
-r --round-up! : round up the percentange (default: off)

Example:
./fsim.pl --percentage=75 ~/Pictures

WARNING:
Options '-f' and '-l' will, permanently, delete your files!
Anonymous, November 4, 2014 at 2:34 AM
Hey Daniel,

Just a few ##### comments:

$ ls -1
Super_apple_1_filename
Super_apple_2_filename
Super_pear_1_filename
Super_pear_2_filename
fsim.pl

$ ./fsim.pl .
./Super_apple_1_filename
./Super_apple_2_filename
./Super_pear_1_filename
--------------------------------------------------------------------------------
#### Super_pear_2_filename got lost somehow (?)

$ ./fsim.pl . -w apple
./Super_apple_1_filename
./Super_apple_2_filename
./Super_pear_1_filename
--------------------------------------------------------------------------------
#### It would be nicer to force the creation of a subset even with 1 keyword,
#### i.e. all strings with the 'apple' substring should be processed in their own subset

$ ./fsim.pl . -w apple pear
./Super_apple_1_filename
./Super_apple_2_filename
--------------------------------------------------------------------------------
./Super_pear_1_filename
./Super_pear_2_filename
--------------------------------------------------------------------------------
#### Expected output :)

$ ./fsim.pl . -w APPLE pear
./Super_apple_1_filename
./Super_apple_2_filename
./Super_pear_1_filename
--------------------------------------------------------------------------------
#### Optional case insensitivity, please, pLeAsE, PLEASE :)

##### Also, the output is missing a list of the files that are not similar:

these_files.jpg
are_not_similiar.pl
and_never.mp3
be_deleted.txt
--------------------------------------------------------------------------------

Thanks!