Learning <=> Compression <=> Prediction
Introduction
Occasionally, I contemplate the meaning of learning. I learned very early that learning is not memorisation. One reason is that I hate memorising facts without understanding them. Also, thanks to the Hong Kong education system, there was little need to memorise facts: equations and formulas were provided in exams, and open-book exams are common at university. But what is understanding? I think I am a fast learner. I studied Maths, Physics and Chemistry in high school. Then I joined the School of Business at university. I didn't like it, so I changed my major to Environmental Science. After working in the environmental field for 1.5 years, I decided to change my career to data analyst. Now I am working in the finance industry again and writing this note about intelligence. Over the years, I have summarised my tips for learning:
- Try to abstract out the underlying mechanisms that lead to the observed phenomena.
- Describe that mechanism in the simplest terms possible.
- Apply that model to similar and different situations.
- Adjust the model based on contradicting evidence, rather than adding exceptions.
I have given similar advice to my subordinates. The common response is that the tips are too abstract to be practical. Indeed, learning about learning is difficult, so I usually give a few examples to illustrate the points. First, if an examination presents you with a problem you have never seen before, as if you were reading the textbook for the first time, isn't that a better test of whether you understand the subject? Second, if ChatGPT performs well on a wide range of tasks, does that mean the model understands the world, or is it just memorising the training data?
I think we can, to some extent, agree that learning is not memorisation. Learning is about recognising patterns and underlying mechanisms. Learning is also about the ability to generalise.
More Formal "Understanding" of Learning
The reader should not treat the following as a formal or true definition of learning as I am not an expert in the field. I am just a curious person who likes to think about things and write them down.
I use the biconditional symbol \(\iff\) to denote that learning, compression, and prediction are equivalent. It is easy to see that if you learn by abstracting out the underlying mechanisms, the description of the mechanisms should be smaller than the observed phenomena (I use the word "smaller" without defining the unit of measurement). That is compression. But does the converse hold: if you can compress the data, have you learned something? Similarly, if we have learned the underlying mechanisms, we should be able to predict the future; but if we can predict the future, does it mean we have learned the underlying mechanisms?
Examples
- Consider the sequence of 1000 digits "14159...01989". If it looks random, you can neither compress it nor predict the next digit [^5]. In fact, these are the first 1000 digits of \(\pi\). Once you know that, you can compress the data down to a rule that calculates the digits of \(\pi\), and you can predict the next digit (a small sketch follows this list).
- Consider the sequence "THERE-IS-NO-REVERSE-ON-A-MOTOCYCLE-" and imagine that the speaker repeatedly attempts to predict the next character[^9]. The number of guesses for each character can be "(1115)(1)(21)(1)(21)(1)("15"1"17"1112)(1)(32)(1)(2)(2)(71111411111)". Notice that in many cases, the next character is guessed immediately with one guess. We can leverage this to compress the data.
Shannon's Source Coding Theorem[^9]
I first read the paper 'Language Modeling Is Compression'[^1] and found that some prerequisite knowledge of information theory, such as lossless compression and arithmetic coding, is needed to understand it. I then spent four days reading Chapters 1, 2, 4, 5, and 6 of David MacKay's book[^9] and watching the accompanying lecture video[^10].
Ensemble and Entropy
This section defines the basic concepts of information theory that we need.
- An ensemble \(X\) is a triple \((x, A_X, P_X)\), where the outcome \(x\) is the value of a random variable, which takes on one of a set of possible values \(A_X = \{a_1, a_2, \ldots, a_I\}\) with probabilities \(P_X = \{p_1, p_2, \ldots, p_I\}\), where \(P(x=a_i) = p_i\), \(p_i \geq 0\), and \(\sum_{i=1}^{I}p_i = 1\). (It looks complicated, but it is just a way to represent the probability distribution of the data.)
- Shannon information content: \(h(x) = \log_2(1/p(x))\) bits. For example, if there are 8 possible outcomes with equal probability, you need \(\log_2 8 = 3\) bits to encode them because \(2^3=8\).
- Entropy \(H(X)\) is a measure of the average information content of an outcome of \(X\). It is defined as \(H(X) = \sum_{i=1}^{I}p_i \log_2(1/p_i)\) bits (the weighted sum of the information content of each outcome).
- Notice that the more improbable an outcome, the more information it carries. Entropy is maximised when all outcomes are equally probable and minimised (zero) when only one outcome is possible. (A small numerical check follows this list.)
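To make these definitions concrete, here is a small numerical check in plain Python (the function name is mine):

```python
# A small numerical check of the entropy definition above: the weighted sum of
# the information content log2(1/p_i) of each outcome.
from math import log2

def entropy(probs):
    """H(X) = sum_i p_i * log2(1 / p_i), in bits; terms with p_i = 0 contribute 0."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

print(entropy([1/8] * 8))             # 3.0 bits  (uniform over 8 outcomes, maximum entropy)
print(entropy([0.4, 0.3, 0.2, 0.1]))  # ~1.846 bits (the ensemble used in the Huffman example below)
print(entropy([1.0]))                 # 0.0 bits  (only one possible outcome, minimum entropy)
```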
The Raw Bit Content of \(X\) and the Essential Bit Content of \(X\)
- One way of measuring the information content of a random variable is simply to count the number of possible outcomes, \(|A_X|\), and take the base-2 logarithm of that number: \(H_0(X)=\log_2|A_X|\). This is called the raw bit content of \(X\).
- \(H_0(X)\) is a lower bound for the number of binary questions that are always guaranteed to identify an outcome of \(X\).
- This measure of information content does not include any probabilistic element, and the encoding rule it corresponds to does not 'compress' the source data; it simply maps each outcome to a constant-length binary string.
- Let \(\delta\) be the probability that there will be no name for an outcome \(x\). The smallest \(\delta\)-sufficient subset \(S_\delta\) is the smallest subset of \(A_X\) satisfying \(P(x \in S_\delta) \geq 1-\delta\). The subset \(S_\delta\) can be constructed by starting with the most probable outcome and adding the next most probable outcome until the total probability is at least \(1-\delta\). The essential bit content of \(X\) is \(H_\delta(X) = \log_2|S_\delta|\) bits (sketched in code below).
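Here is a sketch of that construction in plain Python (the function name is mine):

```python
# A sketch of the construction described above: sort outcomes by probability,
# keep adding the most probable ones until their total probability reaches
# 1 - delta, then H_delta(X) = log2 |S_delta|.
from math import log2

def essential_bit_content(probs, delta):
    """Return H_delta(X) in bits for an ensemble given as a list of probabilities."""
    total, count = 0.0, 0
    for p in sorted(probs, reverse=True):   # most probable outcomes first
        total += p
        count += 1
        if total >= 1 - delta:
            break
    return log2(count)

probs = [0.4, 0.3, 0.2, 0.1]
print(essential_bit_content(probs, 0.0))    # 2.0 -> same as the raw bit content log2(4)
print(essential_bit_content(probs, 0.15))   # log2(3) ~ 1.585: the least probable outcome can be dropped
```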
The Source Coding Theorem
Consider an ensemble \((X_1, X_2, \ldots, X_N)\) of \(N\) independent and identically distributed random variables \(X\) with entropy \(H(X) = H\) bits. Since entropy is additive for independent random variables, the entropy of the whole ensemble is \(NH\) bits. The source coding theorem states that as long as we are allowed a tiny probability of error, we can compress the data down to \(NH\) bits; conversely, even if we are allowed a large probability of error, we still cannot compress below \(NH\) bits. I found the second part surprising, since I had thought that allowing a large probability of error would let us compress further. The entropy \(H\) is thus the lower bound on the number of bits needed per symbol to encode the source without losing information. I will state the theorem without proof (interested readers can refer to the book[^9] or the lecture video[^10]):
Shannon's source coding theorem. Let \(X\) be an ensemble with entropy \(H(X)=H\) bits. Given \(\epsilon > 0\) and \(0 < \delta < 1\), there exists a positive integer \(N_0\) such that for all \(N > N_0\),
\[\left| \frac{1}{N}H_\delta(X_1, X_2, ..., X_N) - H \right| < \epsilon\]
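The theorem can be illustrated numerically for a simple source. The sketch below (plain Python; the function name is mine) builds \(S_\delta\) for \(N\) i.i.d. biased coin flips exactly as described earlier, adding outcomes from most probable to least probable, and reports the per-symbol essential bit content, which decreases towards \(H\) as \(N\) grows.

```python
# A rough numerical illustration of the source coding theorem (a sketch).
# For N i.i.d. Bernoulli(p) bits we build S_delta by adding outcomes from most
# probable to least probable until they cover probability 1 - delta, and report
# (1/N) * H_delta = (1/N) * log2 |S_delta|.
from math import comb, log2, ceil

def h_delta_per_symbol(p: float, N: int, delta: float) -> float:
    """(1/N) H_delta(X_1,...,X_N) for N i.i.d. Bernoulli(p) bits, assuming p < 0.5."""
    covered, size = 0.0, 0
    for k in range(N + 1):                        # outcomes with k ones; fewer ones = more probable
        prob_each = p**k * (1 - p)**(N - k)
        group = comb(N, k)
        still_needed = (1 - delta) - covered
        if prob_each * group >= still_needed:     # only part of this group is needed
            size += ceil(still_needed / prob_each)
            return log2(size) / N
        covered += prob_each * group
        size += group
    return log2(size) / N

p, delta = 0.1, 0.01
H = p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))   # ~0.469 bits per symbol
for N in (10, 100, 1000):
    print(N, round(h_delta_per_symbol(p, N, delta), 3), "vs H =", round(H, 3))
# The per-symbol essential bit content shrinks towards H as N grows,
# which is what the theorem promises.
```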
Minimal Expected Length
The idea is that we can achieve compression, on average, by assigning shorter encodings to the more probable outcomes and longer encodings to the less probable ones.
Symbol Code
A binary symbol code \(C\) for an ensemble \(X\) is a mapping from the range of \(X\), \(A_X = \{a_1, a_2, \ldots, a_I\}\), to \(\{0,1\}^+\). \(c(x)\) denotes the codeword corresponding to \(x\), and \(l(x)\) denotes its length, with \(l_i = l(a_i)\). The extended code \(C^+\) is a mapping from \(A_X^+\) to \(\{0,1\}^+\) obtained by concatenating, without punctuation, the corresponding codewords: \(c^+(x_1 x_2 \ldots x_N) = c(x_1)c(x_2)\ldots c(x_N)\). For example, \(c^+(\text{acdbac}) = c(a)c(c)c(d)c(b)c(a)c(c) = 100000100001010010000010\) (a: 1000, b: 0100, c: 0010, d: 0001).
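The concatenation is easy to verify with a few lines of Python (a throwaway sketch, not from the referenced sources):

```python
# Reproduce the extended-code concatenation above for the code
# a:1000, b:0100, c:0010, d:0001.
code = {"a": "1000", "b": "0100", "c": "0010", "d": "0001"}

def extended_code(symbols: str) -> str:
    """C+: concatenate the codewords of each symbol, with no punctuation."""
    return "".join(code[s] for s in symbols)

print(extended_code("acdbac"))   # 100000100001010010000010
```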
Prefix Code
A prefix code is a symbol code in which no codeword is a prefix of any other codeword. It is also called a self-punctuating code. For example, \(C_1 = \{0, 101\}\) is a prefix code, while \(C_2 = \{1, 101\}\) is not (the codeword 1 is a prefix of 101). A prefix code saves space by eliminating the need for punctuation between codewords.
Expected Length of a Symbol Code
The expected length of a (uniquely decodeable) symbol code is bounded below by the entropy of the ensemble: \(L(C,X) = \sum_{i=1}^{I}p_i l_i \ge \sum_{i=1}^{I}p_i \log_2(1/p_i) = H(X)\) bits.
The source coding theorem for symbol codes states that for any ensemble \(X\) there exists a prefix code \(C\) with expected length satisfying \(H(X) \le L(C,X) < H(X) + 1\) bits.
Example: Huffman Coding
Let's illustrate how to leverage the probability distribution of the data to reduce the number of bits needed to represent it. Huffman coding constructs an optimal prefix code, i.e. one whose expected length satisfies \(H(X) \le L(C,X) < H(X) + 1\). The steps are as follows (a code sketch appears at the end of this example):
- Sort the probabilities in decreasing order.
- Combine the two least probable symbols into a new symbol with probability equal to the sum of the two.
- Repeat the previous step until only one symbol is left.
Let's assume we have a source that produces four different symbols: \(A,B,C,D\) with the probabilities \(P(A)=0.4\), \(P(B)=0.3\), \(P(C)=0.2\), \(P(D)=0.1\).
- Combine \(C\) and \(D\) to form \(CD\) with probability \(0.3\).

  ```
     0.3
     / \
    C   D
  ```

- Combine \(CD\) and \(B\) to form \(BCD\) with probability \(0.6\).

  ```
       0.6
       / \
      B   CD
          / \
         C   D
  ```

- Combine \(BCD\) and \(A\) to form \(ABCD\) with probability \(1\).

  ```
         1
        / \
       A   BCD
           / \
          B   CD
              / \
             C   D
  ```

- Assign \(0\) to the left branch and \(1\) to the right branch, and read each codeword from the root to the leaf:

  ```
  A: 0
  B: 10
  C: 110
  D: 111
  ```
The expected length of the code is \(0.4 \cdot 1 + 0.3\cdot 2 + 0.2\cdot 3 + 0.1\cdot 3 = 1.9\) bits. For a fixed-length code for 4 symbols, we need \(\log_2 4 = 2\) bits, so the Huffman code saves \(0.1\) bits per symbol. The entropy of the ensemble is \(H(X) = 0.4 \cdot \log_2(1/0.4) + 0.3\cdot \log_2(1/0.3) + 0.2 \cdot \log_2(1/0.2) + 0.1\cdot \log_2(1/0.1) \approx 1.846\) bits, so the Huffman code comes close to the entropy of the ensemble.
The example illustrates that by leveraging the probability distribution of the data, we can compress it. Conversely, it seems to me that in order to compress the data, we must have (at least implicitly) leveraged its probability distribution.
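For completeness, here is a minimal sketch of the construction in Python, using only the standard library's `heapq`. The function name and the trick of carrying partial codewords in dictionaries are my own; ties may be broken differently from the hand-worked tree above, but the codeword lengths and the expected length come out the same.

```python
# A minimal Huffman coding sketch.  It follows the steps above: repeatedly merge
# the two least probable nodes until one tree remains, prefixing "0" for the
# left branch and "1" for the right branch as we go.
import heapq
from math import log2

def huffman_code(probs: dict) -> dict:
    """Build a prefix code {symbol: bitstring} from {symbol: probability}."""
    # Each heap entry: (probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)                      # tie-breaker so dicts are never compared
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # the two least probable nodes
        p2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}
code = huffman_code(probs)
expected_length = sum(probs[s] * len(code[s]) for s in probs)
entropy = sum(p * log2(1 / p) for p in probs.values())
print(code)   # codeword lengths A->1, B->2, C->3, D->3 (labels of equal length may be swapped)
print(round(expected_length, 3), round(entropy, 3))   # 1.9 vs ~1.846
```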
Cross-Entropy and Loss Function[^6]
Cross-entropy is the average number of bits needed to encode data coming from a source with distribution \(p\) when we use a model \(q\) to build the code. It is defined as \(H(p,q) = -\sum_{x}p(x)\log_2 q(x)\). It is easily confused with the KL divergence, which is the average number of extra bits needed because we used the distribution \(q\) to encode the data instead of the true distribution \(p\). The KL divergence is defined as \(D_{KL}(p||q) = \sum_{x}p(x)\log_2\frac{p(x)}{q(x)}\), and the two are related by \(H(p,q) = H(p) + D_{KL}(p||q)\).
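The identity \(H(p,q) = H(p) + D_{KL}(p||q)\) is easy to check numerically; the sketch below uses two made-up distributions.

```python
# A quick numerical check of H(p, q) = H(p) + D_KL(p || q), using two made-up
# distributions over three outcomes.
from math import log2

p = [0.5, 0.3, 0.2]   # "true" source distribution (illustrative numbers)
q = [0.4, 0.4, 0.2]   # model distribution used to build the code

H_p  = sum(pi * log2(1 / pi) for pi in p)               # entropy of p
H_pq = sum(pi * log2(1 / qi) for pi, qi in zip(p, q))   # cross-entropy H(p, q)
D_kl = sum(pi * log2(pi / qi) for pi, qi in zip(p, q))  # KL divergence

print(round(H_pq, 6), round(H_p + D_kl, 6))   # the two numbers agree
```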
Observe that the negative log-likelihood for logistic regression is \(NLL(w) = -\sum_{i=1}^{N}\left[y_i\log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right]\), where \(y_i\) is the true label and \(\hat{y}_i\) is the predicted probability[^12]. This is exactly the cross-entropy with \(p = (y_i, 1-y_i)\) and \(q = (\hat{y}_i, 1-\hat{y}_i)\) for each example, summed over the examples.
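The claim can be checked numerically with made-up labels and predictions (a sketch in plain Python; the natural log is used here, which only changes the unit from bits to nats):

```python
# A sketch checking that the summed per-example cross-entropy equals the
# negative log-likelihood of logistic regression.  Labels and predicted
# probabilities are made up.
from math import log

y     = [1, 0, 1, 1, 0]             # true labels y_i
y_hat = [0.9, 0.2, 0.7, 0.6, 0.1]   # predicted probabilities \hat{y}_i

def cross_entropy(p_dist, q_dist):
    """H(p, q) = -sum_x p(x) log q(x); terms with p(x) = 0 contribute 0."""
    return -sum(px * log(qx) for px, qx in zip(p_dist, q_dist) if px > 0)

# Negative log-likelihood, exactly as in the formula above.
nll = -sum(yi * log(pi) + (1 - yi) * log(1 - pi) for yi, pi in zip(y, y_hat))

# Sum of per-example cross-entropies with p = (y_i, 1 - y_i), q = (y_hat_i, 1 - y_hat_i).
total_ce = sum(cross_entropy([yi, 1 - yi], [pi, 1 - pi]) for yi, pi in zip(y, y_hat))

print(round(nll, 10) == round(total_ce, 10))   # True: the two quantities coincide
```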
Prediction is, therefore, equivalent to compression.
Causal Models and Interaction with the Environment
While learning the probability distribution of the data helps compress it, and vice versa, we cannot observe the probability distribution of the counterfactual world. In other words, how much can we learn about the world just by observing data? We may be able to learn the causal model by observing a lot of data, but I think we should also weight data by how informative it is; for example, data from randomised controlled trials should be given more weight. Furthermore, if a model can interact with the world by performing actions, as humans do, it can decide what data to collect and learn from the feedback.
Conclusion
We have demonstrated that by leveraging the probability distribution of the data, we can compress the data, and (I believe) vice versa. We have also shown, using a binary classification example, that prediction is equivalent to compression. Finally, we discussed the importance of weighting data when learning a causal model of the world, and of a model's ability to interact with the world in order to learn that causal model.
Further Study
- Kolmogorov Complexity: The concept has been mentioned in the YouTube video[^3] by Ilya Sutskever and also in the book, An Introduction to Universal Artificial Intelligence[^13]. It seems to be an important concept in the field of AI. I will study it in the future.
- The link between neural networks and compression: I haven't had the time to understand the connection. It is mentioned in the book Information Theory, Inference, and Learning Algorithms[^9].
- The transformer architecture and compression: why does it work so well?
There are many more things to learn. Understanding intelligence is not an easy task.
References
- Delétang, Grégoire, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, et al. ‘Language Modeling Is Compression’. arXiv, 18 March 2024. http://arxiv.org/abs/2309.10668.
- How DeepMind's Language Models Beat PNG & FLAC Compression (codingconfessions.com)
- An Observation on Generalization (youtube.com)
- Learning is Compression | In Search of Great Ideas (eitanporat.github.io)
- http://prize.hutter1.net/hfaq.htm#compai
- A Gentle Introduction to Cross-Entropy for Machine Learning - MachineLearningMastery.com
- Compression for AGI - Jack Rae | Stanford MLSys #76 (youtube.com)
- What Is ChatGPT Doing … and Why Does It Work?—Stephen Wolfram Writings
- MacKay, David J. C. Information Theory, Inference, and Learning Algorithms. 22nd printing. Cambridge: Cambridge University Press, 2019.
- Lecture 1: Introduction to Information Theory (youtube.com)
- Shannon, C. E. ‘A Mathematical Theory of Communication’. Bell System Technical Journal 27, no. 3 (July 1948): 379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x.
- Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. 4th printing. Adaptive Computation and Machine Learning Series. Cambridge, MA: MIT Press, 2013.
- Hutter, Marcus, Elliot Catt, and David Quarel. An Introduction to Universal Artificial Intelligence. First edition. Chapman & Hall/CRC Artificial Intelligence and Robotics Series. Boca Raton: Chapman & Hall, CRC Press, 2024.