tzip - file compressor

tzip is (probably) the simplest text-file compressor. It compresses a file by counting the number of different bytes and map those bytes to shorter sequences of bits.

Each byte is composed from 8 bits. A byte value can range from 0 to 255 (inclusive), which means that 8 bits can be mixed up to represent only 256 different bit-sequences. In a text file, however, not all this bytes are used, and for a lower number of patterns, we can use less bits per pattern.

For example, if we take the word 'youtube', it is composed from 7 bytes, but has only 6 different bytes:

      'y', 'o', 'u', 't', 'b', 'e'

To represent this bytes, we will need sequences of only three bits. To calculate exactly how many bits are required in a sequence for a given number of different bytes (n) we have the following formulas:
    ceil(log(n) / log(2));


    p = next_power_of_two(n);
    log(p) / log(2);

   where, the function next_power_of_two(), is defined as:
    2 << int(log(n) / log(2));

So, using the last formula, the next power of two after 6, is 8. Logarithm of 8 with base 2, is 3. Now, we can represent the bytes in the following way:

  y => "000",
  o => "001",
  u => "010",
  t => "011",
  b => "100",
  e => "101",

By writing the word 'youtube' using the above dictionary, we get the following sequences of bits:

   000 001 010 011 010 100 101

From this groups of bits, we need to create bytes (8 bits together), which leads us to:

  00000101 00110101 00101000

To have exactly 8 bits in the last sequence, it's required to append some zeros at the end to have the byte complete, but those bits will be ignored at decompression.

To compress the bits, we can use the pack function:

    pack "B*", '000001010011010100101';  # equal with: (chr(5).chr(53).chr(40));

pack automatically adds the extra null-bits at the end of the sequence. However, if you want to add them manually, check the length of the sequence to see if it is divisible by 8 (seqLen % 8 == 0). If it's not, add this many zeros:

   8 - (seqLen % 8);

In our case, for seqLen being 21, the number of extra bits is 3, as showed above in yellow.

The pack function returns three characters (bytes). This characters need to be stored inside the compressed file. Also, the compressed file has a header, which tells us some useful information required for decompression, such as the length of the bits, the length of the unique number of bytes, the size of the original file, the sequence of bytes and the sequence of bits.

The header looks like this:

1. | TZP1 |
2. | length of the file size |
3. | the file size bytes |
4. | the number of bits per group |
5. | the number of unique bytes - 1 |
6. | the sequence of the unique bytes |
7. | the sequence of the bits |

After the bytes from the header are unpacked, a dictionary is created, storing the bits as keys and the bytes as values.

Now, the script can read chucks of data, unpack the characters into sequences of bits, get n bits at a time, look-up into dictionary for the corresponding byte, store the byte for efficiency, and when you have more bytes collected, write them inside a new file, which will be the decompressed file.

The decompression goes like this: we have a sequence of bits (000001010011010100101000), we get three bits from the beginning of the sequence, and check the dictionary. It returns the 'y' character for the 000 bit-group. We repeat this process this many times: int(seqLen / bitsNum). Next, we'll have the bit-group 001 which is 'o', and so on. When the size of the original file is reached, we know that the next bits can be safely ignored.

Best usage of this script is to compress files which contains not so many different bytes (for example, DNA-sequences). Better still, I recommend to be used just for fun! :)

Fore more details, please refer to the source code of the script.