Posts

Showing posts from August, 2013

tzip - file compressor

Image
tzip is (probably) the simplest text-file compressor. It compresses a file by counting the number of different bytes and map those bytes to shorter sequences of bits.

Each byte is composed from 8 bits. A byte value can range from 0 to 255 (inclusive), which means that 8 bits can be mixed up to represent only 256 different bit-sequences. In a text file, however, not all this bytes are used, and for a lower number of patterns, we can use less bits per pattern.

For example, if we take the word 'youtube', it is composed from 7 bytes, but has only 6 different bytes:

'y','o','u','t','b','e'
To represent this bytes, we will need sequences of only three bits. To calculate exactly how many bits are required in a sequence for a given number of different bytes (n) we have the following formulas:
ceil(log(n)/log(2));
or:
p=next_power_of_two(n);log(p)/log(2);
   where, the function next_power_of_two(), is defined as:
2<<int(log(n)/log(2));