Finding similar images


In this post we're going to implement a simple algorithm which finds images that look similar to each other.

# Overview

In order to find images that look similar, we first need to normalize them to a simpler form and compute a fingerprint for each image.

The fingerprint is a binary sequence of ones and zeros and it's created from the average color of the image and the average color of each pixel.

In order to make the process faster and more flexible (and also to have fingerprints of the same length) the image is first resized to a fixed resolution, which defaults to 64x64.

# Algorithm

This is the algorithm which creates the fingerprint of an image:

averages = []
for y in range(height) {
   for x in range(width) {
      rgb = img.get_pixel(x, y)
      averages.append(sum(rgb) / len(rgb))
   }
}

avg = sum(averages) / len(averages)
fingerprint = averages.map {|v| v < avg ? 1 : 0 }

It first creates the "averages" array which is populated with the average values of each pixel, from which it takes the average value to represent the average value of the entire image.

In the end, it creates the "fingerprint" array by mapping each pixel average value from "averages" to the image average value ("avg").

To illustrate the conversion, let's consider the following image:


By applying the algorithm, all the pixels get translated into a 1-0 pattern:

00000000000000000000000000000000000001000000000111
00000000000000000000000000000000000001111000011111
00000000000000000000000000000000000001111111111111
00000000000000000000000000000000000000111111111110
00000000000000000000000000000000000000111111111100
00000000000000000000000000000000000000111111111100
00000000000000000000000000000000000011111111111000
00000000000000000000000000000000111111111111111000
00000000000000000000000000111111111111111111110000
00000000000000000000000111111111111111111111110000
00000000000000000000011111111111111111111111110000
00000000000000000000111111111111111111111111110000
00000000000000000001111111111111111111111111000000
00000000000000000011111111111111111111111110000000
00000110000000000111111111111111111111111000000000
00111100000000000111111111111111111111110000000000
01100000000000000011111111111111111111110000000000
11100000000000000011111111111111111011110000000000
11100000000000000011111111111111111011110000000000
11100000000000000111111111111111110001111000000000
01111111111111111111111111111111100000111110000000
00001111111111111111111111111110000000000000000000


The process of comparing two fingerprints can be done in many ways, one of which is the following:
  1. count the number of overlapping bits (XOR)
  2. divide the above value to the number of bits
  3. square the result

Example:
  bits_len = 64;
  overlapping_bits = 50;
  result = (overlapping_bits / bits_len)^2;   # 0.61

A closer value to one will indicate that the fingerprints are more likely similar to each other. To calculate the percentage of similarity, we can simply multiply the result by 100:

  percentage = result * 100;   # 61%

Using the percentage value, we can group together images that are at least a given percent similar to each other and create a final report at the end.

# Conclusion

The described algorithm is very fast and flexible. In addition, the sequence of ones and zeros can be used for other purposes as well, such as generating a similarity-hash for a given image, by creating hexadecimal values from groups of 16 bits.

# Implementations

Find similar-looking images:
Bonus: