The last couple of weeks I have been playing around with compression/decompression algorithms. This is a field that has always intrigued me. It gives me a magical feeling, like a magician you wave some algorithm around and suddenly the files shrink and bytes dissapear. With the same motion you can undo all your actions and re-generate the original files from thin air!
Arithmetic coding is a different method to encode bytes. On itself it doesn’t compress, but it is the backbone of a whole family of compression algorithms. To explain how it works we need some example text we want to compress: “ABACB”
First we assign a probability to each symbol:
- A: 45%
- B: 40%
- C: 15%
Lets change these probabilities into ranges from 0.0 to 1.0:
- A: 0.00 to 0.45
- B: 0.45 to 0.85
- C: 0.85 to 1.00
Now we start reading our input (ABACB). The first character we find is ‘A’. First we do a lookup in the table, what range does ‘A’ have: 0.00 to 0.45. Lets remember these values as ‘low’ and ‘high’.
Time to encode the second character ‘B’. The first thing we need to do is reset our ranges to be between low and high:
- A: 0.0000 to 0.2025
- B: 0.2025 to 0.3805
- C: 0.3805 to 0.4500
We want to encode ‘B’, so we’ll end up with low=0.2025 and high=0.3805. Time to update the table again:
- A: 0.2025 to 0.2826
- B: 0.2826 to 0.3538
- C: 0.3538 to 0.3805
Encode ‘A’, we get low=0.2025 and high=0.2826. And adjust the table again:
- A: 0.202500 to 0.238545
- B: 0.238545 to 0.270585
- C: 0.270585 to 0.282600
Encode ‘C’, we get low=0.270585 and high=0.282600. Adjust the table one last time:
- A: 0.27058500 to 0.27599175
- B: 0.27599175 to 0.28079775
- C: 0.28079775 to 0.28260000
So, lets encode the final symbol ‘B’ and stop right here. Now we save/output a value betwee low (0.27599175) and high (0.28079775) which takes the least amount of bytes, for example 0.28!
Now I we want to decode our message “0.28” we start by building the ranges again, these are the same as the first table above. Now we look at which range our 0.28 fall in. The result is ‘A’ (between 0.0 and 0.45). So we output this symbol. Next we have to update the table, because we outputted ‘A’ we know our encoder encoded ‘A’, so we can follow the same steps as the encoder did and update the ranges in the same way, so we’ll end up with the second table (with values between 0.0 and 0.45). If we look at 0.28 it now falls into range ‘B’, so we output ‘B’ and adjust again.
As you can see this number “0.28” describes our whole message “ABACB”! So we encoded a whole message in one small number. There are a lot of improvements possible to implement this algorithm efficiently, for example look at range encoding. Another website that helped me a lot is this basic arithmetic coding by Arturo Campos.
The previous example works great, but there is a problem… if we assign equal probabilities to each possible symbol (each byte) we won’t compress anything! The example above works because I took bigger probabilities for A and B then for C… So for it to work we need to guess the correct probabilities, and the better we do this, the smaller our file gets! Luckely there are a couple of methods to assign/guess these probabilities.
Read and save table
One thing you can do is to first read the whole file and save the amount you see each character. Then you can scale these amounts into probabilities between 0.0 and 1.0 and save these probabilities to a file. Next we can use these probabilities to encode (and decode) our message! This will take quite a bit of overhead because you have to save the table..
Another thing we can do is to just start with equal probabilities for each symbol and adjust the probabilities during the encoding. If we use the same method in encoding and decoding our probability table will remain the same.
Prediction by Partial Matching (PPM)
So, we can adjust the probabilities at runtime and for example count all the symbols we have seen. But why stop there? For example in a piece of English text, the letter ‘E’ will probably have the highest count/probabilty. But if we encouter the symbols: “COMPRESSI” we know the next character will likely be an “O”, not an “E”! So what if we can enlarge our context? This is what PPM does, looking at the bigger picture. Not only count single symbols, but also combinations. The longer the combinations the more specific our predictions get (but also more memory is needed).
Context Mixing (CM)
With PPM you have one model which predicts the probability of the next symbol by looking at its past statistics. But why stop there? With Context Mixing (CM) you can have multiple models predicting the next symbol work together. When you do this right you and predict the next symbol even better and thus get better compression ratios! This is how the best current compression algorithms work.
This is just one family of compression algorithms, there are a lot more which I won’t describe here (yet?). Instead of processing/guessing the next symbol you could also replace long series of symbols with shorter unique symbols. This is known as dictionary coding, an entire other family of compressors (including LZ, zip, gz).