Arithmetic Coding

Arithmetic coding is a very interesting form of encoding that can be used in data compression.

In this post, we are going to look at a generalized form of arithmetic coding, as a generalized change of base (or radix).

This type of arithmetic coding works by counting the occurring frequency of each byte in a given message, then it computes the cumulative frequency of each byte from which it creates a polynomial in the base of the length of message, which will result in a large integer as output.

# Encoding

First of all, we have to calculate the occurring frequency of each byte. To illustrate this, let's use the word "DECODED".

This word has the following frequency of occurring bytes:

f = {
    C => 1
    D => 3
    E => 2
    O => 1
}

From the above frequency, we will compute the cumulative frequency, which is the sum of the above frequencies for each byte, in ASCIIbetical order:

cf = {
    C => 0    # ()
    D => 1    # (f['C']               -> 1)
    E => 4    # (f['C']+f['D']        -> 1+3)
    O => 6    # (f['C']+f['D']+f['E'] -> 1+3+2)
}

Before starting the encoding process, let's map each byte to the corresponding cumulative frequency (just for illustration):

[
    D: 1
    E: 4
    C: 0
    O: 6
    D: 1
    E: 4
    D: 1
]

Now, the encoding process begins. We have to create a polynomial in base 7 (which is the length of the message that we want to encode). Each term is multiplied correspondingly by the cumulative frequency of the current byte, which is then multiplied by the product of the frequencies of all previously occurred bytes:

L = 7^6 * cf['D']
  + 7^5 * cf['E'] * f['D']
  + 7^4 * cf['C'] * f['D'] * f['E']
  + 7^3 * cf['O'] * f['D'] * f['E'] * f['C']
  + 7^2 * cf['D'] * f['D'] * f['E'] * f['C'] * f['O']
  + 7^1 * cf['E'] * f['D'] * f['E'] * f['C'] * f['O'] * f['D']
  + 7^0 * cf['D'] * f['D'] * f['E'] * f['C'] * f['O'] * f['D'] * f['E']

which gets evaluated to:

L = 7^6 * 1
  + 7^5 * 4 *  3
  + 7^4 * 0 *  6
  + 7^3 * 6 *  6
  + 7^2 * 1 *  6
  + 7^1 * 4 * 18
  + 7^0 * 1 * 36

Finally resulting in:

L = 332515

The above result represents the lower bound of the encoding. To compress the message even further, we need to calculate the upper bound and pick a number between those that can be compressed well.

U = L + pf

where `pf` is the product of the all frequencies of bytes in the message that we want to encode:

pf = f['D'] * f['E'] * f['C'] * f['O'] * f['D'] * f['E'] * f['D']
pf = 108

giving us:

U = 332515 + 108
U = 332623

Now, we have the possibility to return any number >= L and < U.

A very practical pick would be a number which can be expressed in the base 2. We can't do this, unfortunately, for this particular message, because the next power of two after 332515 (L) is much greater than the upper bound (U). However, for longer messages, the range will tend to be much wider, providing interesting ways for exploitation.

Another reasonable pick would be a number which is a power of ten, so we can remove some trailing zeros.

For this message, it is guaranteed that we can remove two trailing zeros, because 108 (pf) is greater than 10^2. There is only one number in this range with this property, and that is: 332600.

A general formula for removing as many trailing zeros as possible, is:

pow = int(log(pf) / log(10))
enc = int((U-1) / 10^pow)

* where int() is equivalent with the floor() function for positive numbers.

For our example, we will get:

pow = int(log(108) / log(10));
pow = 2

enc = int((332623-1) / 10^2)
enc = 3326

In the end, the encoded message is: 3326 * 10^2 (or just 3326e2)

# Decoding

The decoding process is just the reverse of the encoding process. All we need, is the encoded message and the table of frequencies from the encoding process (f).

enc = 332600

From the table of frequencies, we will create the cumulative frequency table (cf), as explained in the encoding process, then we will reverse it to create the dictionary table:

dict = {
    0 => C
    1 => D
    4 => E
    6 => O
}

The next step is to fill the gaps in the dictionary, so the keys will be in increasing order from 0, up to the sum of all frequencies - 1.

The sum of the frequencies is also the base of the encoded polynomial:

base = 1 + 3 + 2 + 1
base = 7

The filling process starts at 0 and goes up to base-1 (inclusive). When a gap is encountered, it will be filled with the last character encountered, resulting in:

dict = {
    0 => C
    1 => D
    2 => D
    3 => D
    4 => E
    5 => E
    6 => O
}

Now, we can start the decoding process by iterating from i=base-1, down to i=0:

q = int(enc / base^i)
c = dict[q]
enc = int((enc - base^i * cf[c]) / f[c])

Which, for our encoded message, it will get evaluated into:

q = int(332600 / 7^6)
q = 2

c = dict[2]
c = 'D'

enc = int((332600 - 7^6 * 1) / 3)
enc = 71650

As the process continues, with each iteration it decodes another byte, creating the original message: DECODED

# Conclusion

The algorithm is, in my opinion, quite interesting, but in practical use, it has some issues, one of which is the speed; as the message gets longer, the polynomial base increases, as well as the polynomial degree, resulting in very large integers.

# Implementations

As a generalized change of radix:

Other implementations:

Practical implementations:

TAC file compressor: https://github.com/trizen/perl-scripts/blob/master/Encoders/tac-compressor.pl
TACC file compressor (with GMP): https://github.com/trizen/perl-scripts/blob/master/Encoders/tacc-compressor.pl

Mathematics and computer science