[Write to Learn] Minimum Description Length and Kolmogorov Complexity

Recently, the connection between compression and learning has piqued my interest, as discussed in a previous blog post. To explore it further, I've chosen to delve into Peter Grünwald's tutorial note, "A Tutorial Introduction to the Minimum Description Length Principle," with the intention of outlining its key concepts in this post.

Introduction

  • What problem is MDL trying to solve?
    • MDL provides a solution to the model selection problem, that is, how does one decide among competing explanations of data given limited observations?
  • What is the key insight on which MDL bases itself?
    • Any regularity in the data can be used to compress the data, that is, to describe it using fewer symbols than the number of symbols needed to describe the data literally.
    • If learning is equated with finding regularities in data, then the more we can compress the data, the more we have learned about it.
  • What is the key insight of Kolmogorov Complexity?
    • A sequence's Kolmogorov complexity is the length of the shortest program that prints that sequence and then halts.
    • For every two general-purpose programming languages \(A\) and \(B\) and every data sequence \(D\), the length of the shortest program for \(D\) written in language \(A\) and the length of the shortest program for \(D\) written in language \(B\) differ by no more than a constant \(c\) that depends on \(A\) and \(B\) but not on \(D\) (invariance theorem). Consequently, once \(D\) is long enough, the choice of language hardly matters.
  • Why is Kolmogorov complexity not practical?
    • Uncomputable: there exists no computer program that, given any data sequence \(D\) as input, returns the shortest program that prints \(D\).
    • Dependence on syntax: in practice, we are confronted with small data samples for which the invariance theorem does not say much.
  • Can you make precise "learning may be viewed as data compression" using the Crude Two-part Version of the MDL Principle?
    • Let \(H^{(1)}, H^{(2)}, \ldots\) be a list of candidate models, each containing a set of point hypotheses. The best point hypothesis \(H \in H^{(1)} \cup H^{(2)} \cup \ldots\) to explain the data \(D\) is the one that minimizes the sum \(L(H) + L(D \mid H)\), where \(L(H)\) is the length, in bits, of the description of the hypothesis; and \(L(D \mid H)\) is the length, in bits, of the description of the data when encoded with the help of the hypothesis. The best model to explain \(D\) is the smallest model containing the selected \(H\). A minimal code sketch of this criterion follows this list.
  • Why does the Crude MDL need to be refined?
    • The description length \(L(H)\) of any fixed point hypothesis \(H\) can be very large under one description method (code) and quite short under another, so our procedure is in danger of becoming arbitrary.
  • What are refined MDL, stochastic complexity and parametric complexity?
    • Refined MDL: we associate a code for encoding \(D\) not with a single \(H \in \mathcal{H}\) but with the full model \(\mathcal{H}\). Thus, given model \(\mathcal{H}\), we encode the data not in two parts; instead we design a single one-part code with length function \(\bar{L}(D \mid \mathcal{H})\). This is called a universal code because it is designed so that whenever some member of \(\mathcal{H}\) fits the data well, in the sense that \(L(D \mid H)\) is small, the code length \(\bar{L}(D \mid \mathcal{H})\) is also small.
    • Stochastic complexity: \(\bar{L}(D \mid \mathcal{H})\)
    • Parametric complexity: \(\text{COMP}(\mathcal{H})\). This measures the 'richness' of model \(\mathcal{H}\). It is related to the degrees of freedom in \(\mathcal{H}\).
    • Stochastic complexity of \(D\) given \(\mathcal{H}\): \(\bar{L}(D \mid \mathcal{H}) = L(D \mid \hat{H}) + \text{COMP}(\mathcal{H})\), where \(\hat{H}\) is the hypothesis in \(\mathcal{H}\) that best fits the data \(D\).
  • What improvements does the Refined MDL make over the Crude MDL?
    • Removes the arbitrary aspect of crude, two-part code MDL
    • Associates parametric models with an inherent 'complexity' that does not depend on any particular description method for hypotheses.
  • Remember and elaborate on the difference between the two ideas: "Every regularity in data may be used to compress that data", and "learning can be equated with finding regularities in data".
    • The second part of the idea implies that methods for learning from data must have a clear interpretation independent of whether any of the models under consideration is 'true' or not.
    • "We never want to make the false assumption that the observed data is actually generated by a distribution of some kind," says Gaussian.
  • What are the four philosophies of MDL?
    • Regularity as compression: the aim of learning is to find as much regularity as possible in the data; the best model is the one that compresses the data the most.
    • Models as languages: models are nothing more than languages for describing useful data properties. Consider the phrase 'these experimental data are quite noisy'. According to a traditional interpretation, such a statement means that a distribution with high variance generated the data. According to MDL philosophy, such a phrase means only that the data are not compressible with the currently hypothesized model. The regularities are independent of the true state of nature.
    • We have only the data: MDL has a clear interpretation which depends only on the data, and not on the assumption of any underlying 'state of nature'.
    • MDL and consistency: Let \(\mathcal{H}\) be a probabilistic model, such that each \(P \in \mathcal{H}\) is a probability distribution. A statistical procedure is called consistent relative to \(\mathcal{H}\) if, for all \(P^* \in \mathcal{H}\), the following holds: suppose data are distributed according to \(P^*\); then, given enough data, the learning method will learn a good approximation of \(P^*\). In MDL, we do not assume a true distribution, which may suggest that we do not care about statistical consistency. However, making no assumption does not imply that we cannot achieve consistency.
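
To make the crude two-part criterion \(L(H) + L(D \mid H)\) concrete, here is the minimal sketch promised above. It is my own toy illustration, not code from Grünwald's tutorial: each candidate model \(H^{(k)}\) is the class of order-\(k\) Markov chains over bits, \(L(H)\) charges a fixed 8 bits per conditional probability (an arbitrary precision chosen for the example), and \(L(D \mid H)\) is \(-\log_2\) of the data's probability under the quantized maximum-likelihood parameters. A full crude code would also spend a few bits encoding the order \(k\) itself, which this sketch omits.

```python
import math
from collections import Counter

def two_part_codelength(data, order, precision_bits=8):
    """Crude two-part code length (in bits) of a binary string `data`
    under the model class of order-`order` Markov chains.

    L(H): each of the 2**order conditional probabilities is encoded
          with `precision_bits` bits (a simple fixed quantization grid).
    L(D|H): -log2 of the probability of the data under the quantized
            maximum-likelihood parameters; the first `order` symbols
            are treated as given for simplicity.
    """
    counts, totals = Counter(), Counter()
    for i in range(order, len(data)):
        ctx = data[i - order:i]
        counts[(ctx, data[i])] += 1
        totals[ctx] += 1

    grid = 2 ** precision_bits

    def p_one(ctx):
        # Quantized maximum-likelihood estimate of P(next bit = 1 | ctx),
        # kept strictly inside (0, 1) so every symbol gets a finite code length.
        if totals[ctx] == 0:
            return 0.5
        q = round(counts[(ctx, '1')] / totals[ctx] * grid) / grid
        return min(max(q, 1 / (2 * grid)), 1 - 1 / (2 * grid))

    # L(D | H): negative log-likelihood under the quantized parameters.
    data_bits = 0.0
    for i in range(order, len(data)):
        p = p_one(data[i - order:i])
        data_bits -= math.log2(p if data[i] == '1' else 1 - p)

    # L(H): precision_bits per conditional probability, one per context.
    model_bits = (2 ** order) * precision_bits
    return model_bits + data_bits

# A sequence with an obvious order-1 regularity: alternating bits.
data = '01' * 200
for k in range(4):
    print(f"order {k}: {two_part_codelength(data, k):.1f} bits")
```

On this alternating string, order 1 wins by a wide margin: the order-0 model cannot compress the 400 data bits at all, while orders 2 and 3 pay for conditional probabilities they do not need.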

Mathematically Precise Introduction

Preliminaries

  • Probability mass functions correspond to codelength functions (made precise in the display after this list)
  • Definition of code length function
  • Information inequality and codelength
  • Markov and Bernoulli models
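
To preview what these preliminaries establish (my condensed summary of standard facts, stated with the idealized, possibly non-integer codelengths the tutorial works with): by the Kraft inequality, prefix codelength functions and (sub-)probability mass functions are two views of the same object, and the information inequality says that the code built from the data-generating distribution is, in expectation, the shortest one:

\[
\sum_{x \in \mathcal{X}} 2^{-L(x)} \le 1
\qquad\Longleftrightarrow\qquad
L(x) = -\log_2 Q(x) \ \text{for some (sub-)probability mass function } Q,
\]

\[
E_{P}\!\left[-\log_2 Q(X)\right] \;\ge\; E_{P}\!\left[-\log_2 P(X)\right]
\qquad \text{for every distribution } Q, \text{ with equality iff } Q = P.
\]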

Main Topics

  • Crude MDL and an example with Markov chains
    • Keywords: two-part codes, log-likelihood
  • Universal coding
    • Keywords: communications, two-part codes, universal codes, universal models
  • Normalised Maximum Likelihood (NML) as an optimal universal model (a small numerical sketch follows this list)
    • Keywords: regret, complexity, Shtarkov distribution
  • Refined MDL and its four interpretations
    • Keywords: compression, counting, Bayesian, prequential
  • General refined MDL
  • Kolmogorov complexity and ideal MDL
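
As a numerical teaser for the NML topic (again a sketch of my own, with a function name I made up, rather than code from the tutorial): for the one-parameter Bernoulli model, the parametric complexity \(\text{COMP}(\mathcal{H})\) can be computed exactly, because the maximized likelihood of a sequence depends only on the number of ones it contains. The sketch below prints \(\text{COMP}\) in bits for a few sample sizes; it grows roughly like \(\tfrac{1}{2}\log_2 n\), giving a feel for how the 'richness' of a model scales with the amount of data.

```python
import math

def bernoulli_nml_complexity_bits(n):
    """Parametric complexity COMP (in bits) of the Bernoulli model at
    sample size n:

        COMP = log2( sum over all x^n of P_{theta_hat(x^n)}(x^n) )

    The maximized likelihood depends on x^n only through k, the number
    of ones, so the 2^n-term sum collapses to a sum over k = 0..n
    (with the convention 0^0 = 1).  Logs are used throughout to avoid
    overflow in the binomial coefficients.
    """
    total = 0.0
    for k in range(n + 1):
        log_ml = 0.0  # log of (k/n)^k * ((n-k)/n)^(n-k)
        if k > 0:
            log_ml += k * math.log(k / n)
        if k < n:
            log_ml += (n - k) * math.log((n - k) / n)
        log_comb = (math.lgamma(n + 1) - math.lgamma(k + 1)
                    - math.lgamma(n - k + 1))
        total += math.exp(log_comb + log_ml)
    return math.log2(total)

# COMP grows roughly like (1/2) * log2(n) for this one-parameter model.
for n in (10, 100, 1000, 10000):
    print(f"n = {n:5d}: COMP = {bernoulli_nml_complexity_bits(n):.3f} bits")
```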

I will delve deeper into some of the above topics in future blog posts.