Finding similar images

In this post we're going to implement a simple algorithm which finds images that look similar to each other.

# Overview

In order to find images that look similar, we first need to normalize them to a simpler form and compute a fingerprint for each image.

The fingerprint is a binary sequence of ones and zeros and it's created from the average color of the image and the average color of each pixel.

In order to make the process faster and more flexible (and also to have fingerprints of the same length) the image is first resized to a fixed resolution, which defaults to 64x64.

# Algorithm

This is the algorithm which creates the fingerprint of an image:

averages = []
for y in range(height) {
   for x in range(width) {
      rgb = img.get_pixel(x, y)
      averages.append(sum(rgb) / len(rgb))
   }
}

avg = sum(averages) / len(averages)
fingerprint = averages.map {|v| v < avg ? 1 : 0 }

It first creates the "averages" array which is populated with the average values of each pixel, from which it takes the average value to represent the average value of the entire image.

In the end, it creates the "fingerprint" array by mapping each pixel average value from "averages" to the image average value ("avg").

To illustrate the conversion, let's consider the following image:

By applying the algorithm, all the pixels get translated into a 1-0 pattern:

00000000000000000000000000000000000001000000000111

00000000000000000000000000000000000001111000011111

00000000000000000000000000000000000001111111111111

00000000000000000000000000000000000000111111111110

00000000000000000000000000000000000000111111111100

00000000000000000000000000000000000011111111111000

00000000000000000000000000000000111111111111111000

00000000000000000000000000111111111111111111110000

00000000000000000000000111111111111111111111110000

00000000000000000000011111111111111111111111110000

00000000000000000000111111111111111111111111110000

00000000000000000001111111111111111111111111000000

00000000000000000011111111111111111111111110000000

00000110000000000111111111111111111111111000000000

00111100000000000111111111111111111111110000000000

01100000000000000011111111111111111111110000000000

11100000000000000011111111111111111011110000000000

11100000000000000111111111111111110001111000000000

01111111111111111111111111111111100000111110000000

00001111111111111111111111111110000000000000000000

The process of comparing two fingerprints can be done in many ways, one of which is the following:

count the number of overlapping bits (XOR)
divide the above value to the number of bits
square the result

Example:

  bits_len = 64;
  overlapping_bits = 50;
  result = (overlapping_bits / bits_len)^2;   # 0.61

A closer value to one will indicate that the fingerprints are more likely similar to each other. To calculate the percentage of similarity, we can simply multiply the result by 100:

  percentage = result * 100;   # 61%

Using the percentage value, we can group together images that are at least a given percent similar to each other and create a final report at the end.

# Conclusion

The described algorithm is very fast and flexible. In addition, the sequence of ones and zeros can be used for other purposes as well, such as generating a similarity-hash for a given image, by creating hexadecimal values from groups of 16 bits.

# Implementations

Find similar-looking images:

GD (fast): https://github.com/trizen/perl-scripts/blob/master/Image/gd_similar_images.pl
ImageMagick (slow): https://github.com/trizen/perl-scripts/blob/master/Image/magick_similar_images.pl

Bonus:

img2ascii: https://github.com/trizen/perl-scripts/blob/master/Image/image2ascii.pl
img2html: https://github.com/trizen/perl-scripts/blob/master/Image/image2html.pl

Mathematics and computer science