p.enthalabs

Kolmogorov complexity

![Image 1](https://en.wikipedia.org/wiki/File:Mandelpart2_red.png)

This image illustrates part of the [Mandelbrot set](https://en.wikipedia.org/wiki/Mandelbrot_set "Mandelbrot set")[fractal](https://en.wikipedia.org/wiki/Fractal "Fractal"). Simply storing the 24-bit color of each pixel in this image would require 23 million bytes, but a small computer program can reproduce these 23 MB using the definition of the Mandelbrot set, the corner coordinates of the image and the parameters of the color mapping. Thus, the Kolmogorov complexity of this image is much less than 23 MB in any pragmatic [model of computation](https://en.wikipedia.org/wiki/Model_of_computation "Model of computation"). [PNG](https://en.wikipedia.org/wiki/Portable_Network_Graphics "Portable Network Graphics")'s general-purpose image compression only reduces it to 1.6 MB, smaller than the raw data but much larger than the Kolmogorov complexity.

In [algorithmic information theory](https://en.wikipedia.org/wiki/Algorithmic_information_theory "Algorithmic information theory") (a subfield of [computer science](https://en.wikipedia.org/wiki/Computer_science "Computer science") and [mathematics](https://en.wikipedia.org/wiki/Mathematics "Mathematics")), the **Kolmogorov complexity** of an object, such as a piece of text, is the length of a shortest [computer program](https://en.wikipedia.org/wiki/Computer_program "Computer program") (in a predetermined [programming language](https://en.wikipedia.org/wiki/Programming_language "Programming language")) that produces the object as output. It is a measure of the [computational](https://en.wikipedia.org/wiki/Computation "Computation") resources needed to specify the object, and is also known as **algorithmic complexity**, **Solomonoff–Kolmogorov–Chaitin complexity**, **program-size complexity**, **descriptive complexity**, or **algorithmic entropy**. It is named after [Andrey Kolmogorov](https://en.wikipedia.org/wiki/Andrey_Kolmogorov "Andrey Kolmogorov"), who first published on the subject in 1963[[1]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-Kolmogorov1963-1)[[note 1]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-2) and is a generalization of classical information theory.

The notion of Kolmogorov complexity can be used to state and [prove impossibility](https://en.wikipedia.org/wiki/Proof_of_impossibility "Proof of impossibility") results akin to [Cantor's diagonal argument](https://en.wikipedia.org/wiki/Cantor%27s_diagonal_argument "Cantor's diagonal argument"), [Gödel's incompleteness theorem](https://en.wikipedia.org/wiki/G%C3%B6del%27s_incompleteness_theorem "Gödel's incompleteness theorem"), and [Turing's halting problem](https://en.wikipedia.org/wiki/Halting_problem "Halting problem"). In particular, no program _P_ computing a [lower bound](https://en.wikipedia.org/wiki/Lower_bound "Lower bound") for each text's Kolmogorov complexity can return a value essentially larger than _P_'s own length (see section §Chaitin's incompleteness theorem); hence no single program can compute the exact Kolmogorov complexity for infinitely many texts.

Consider the following two strings "String (computer science)") of 32 lowercase letters and digits:

`abababababababababababababababab`, and

`4c1j5b2p0cv4w1x8rx2y39umgw5q85s7`

The first string has a short English-language description, namely "write ab 16 times", which consists of **17** characters. The second one has no obvious simple description (using the same character set) other than writing down the string itself, i.e., "write 4c1j5b2p0cv4w1x8rx2y39umgw5q85s7" which has **38** characters. Hence the operation of writing the first string can be said to have "less complexity" than writing the second.

More formally, the [complexity](https://en.wikipedia.org/wiki/Complexity "Complexity") of a string is the length of the shortest possible description of the string in some fixed [universal](https://en.wikipedia.org/wiki/Turing_complete "Turing complete") description language (the sensitivity of complexity relative to the choice of description language is discussed below). It can be shown that the Kolmogorov complexity of any string cannot be more than a few bytes larger than the length of the string itself. Strings like the _abab_ example above, whose Kolmogorov complexity is small relative to the string's size, are not considered to be complex.

The Kolmogorov complexity can be defined for any mathematical object, but for simplicity the scope of this article is restricted to strings. We must first specify a description language for strings. Such a description language can be based on any computer programming language, such as [Lisp](https://en.wikipedia.org/wiki/Lisp_programming_language "Lisp programming language"), Pascal "Pascal (programming language)"), or Java "Java (programming language)"). If **P** is a program which outputs a string _x_, then **P** is a description of _x_. The length of the description is just the length of **P** as a character string, multiplied by the number of bits in a character (e.g., 7 for [ASCII](https://en.wikipedia.org/wiki/ASCII "ASCII")).

We could, alternatively, choose an encoding for [Turing machines](https://en.wikipedia.org/wiki/Turing_machine "Turing machine"), where an _encoding_ is a function which associates to each Turing Machine **M** a bitstring <**M**>. If **M** is a Turing Machine which, on input _w_, outputs string _x_, then the concatenated string <**M**>_w_ is a description of _x_. For theoretical analysis, this approach is more suited for constructing detailed formal proofs and is generally preferred in the research literature. In this article, an informal approach is discussed.

Any string _s_ has at least one description. For example, the second string above is output by the [pseudo-code](https://en.wikipedia.org/wiki/Pseudo-code "Pseudo-code"):

def generate_string2(): return "4c1j5b2p0cv4w1x8rx2y39umgw5q85s7"

whereas the first string is output by the (much shorter) pseudo-code:

def generate_string1(): return "ab" × 16

If a description _d_(_s_) of a string _s_ is of minimal length (i.e., using the fewest bits), it is called a **minimal description** of _s_, and the length of _d_(_s_) (i.e. the number of bits in the minimal description) is the **Kolmogorov complexity** of _s_, written _K_(_s_). Symbolically,

_K_(_s_) = |_d_(_s_)|.

The length of the shortest description will depend on the choice of description language; but the effect of changing languages is bounded (a result called the _invariance theorem_, see below).

Plain Kolmogorov complexity _C_

[[edit](https://en.wikipedia.org/w/index.php?title=Kolmogorov_complexity&action=edit&section=3 "Edit section: Plain Kolmogorov complexity C")]

There are two definitions of Kolmogorov complexity: _plain_ and _prefix-free_. The plain complexity is the minimal description length of any program, and denoted !Image 2: {\displaystyle C(x)} while the prefix-free complexity is the minimal description length of any program encoded in a [prefix-free code](https://en.wikipedia.org/wiki/Prefix_code "Prefix code"), and denoted !Image 3: {\displaystyle K(x)}. The plain complexity is more intuitive, but the prefix-free complexity is easier to study.

By default, all equations hold only up to an additive constant. For example, !Image 4: {\displaystyle f(x)=g(x)} really means that !Image 5: {\displaystyle f(x)=g(x)+O(1)}, that is, !Image 6: {\displaystyle \exists c,\forall x,|f(x)-g(x)|\leq c}.

Let !Image 7: {\displaystyle U:2^{*}\to 2^{*}} be a computable function mapping finite binary strings to binary strings. It is a universal function if, and only if, for any computable !Image 8: {\displaystyle f:2^{*}\to 2^{*}}, we can encode the function in a "program" !Image 9: {\displaystyle s_{f}}, such that !Image 10: {\displaystyle \forall x\in 2^{*},U(s_{f}x)=f(x)}. We can think of !Image 11: {\displaystyle U} as a program interpreter, which takes in an initial segment describing the program, followed by data that the program should process.

One problem with plain complexity is that !Image 12: {\displaystyle C(xy)\not <C(x)+C(y)}, because intuitively speaking, there is no general way to tell where to divide an output string just by looking at the concatenated string. We can divide it by specifying the length of !Image 13: {\displaystyle x} or !Image 14: {\displaystyle y}, but that would take !Image 15: {\displaystyle O(\min(\ln x,\ln y))} extra symbols. Indeed, for any !Image 16: {\displaystyle c>0} there exists !Image 17: {\displaystyle x,y} such that !Image 18: {\displaystyle C(xy)\geq C(x)+C(y)+c}.[[2]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-3)

Typically, inequalities with plain complexity have a term like !Image 19: {\displaystyle O(\min(\ln x,\ln y))} on one side, whereas the same inequalities with prefix-free complexity have only !Image 20: {\displaystyle O(1)}.

The main problem with plain complexity is that there is something extra sneaked into a program. A program not only represents for something with its code, but also represents its own length. In particular, a program !Image 21: {\displaystyle x} may represent a binary number up to !Image 22: {\displaystyle \log _{2}|x|}, simply by its own length. Stated in another way, it is as if we are using a termination symbol to denote where a word ends, and so we are not using 2 symbols, but 3. To fix this defect, we introduce the prefix-free Kolmogorov complexity.[[3]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-4)

Prefix-free Kolmogorov complexity _K_

[[edit](https://en.wikipedia.org/w/index.php?title=Kolmogorov_complexity&action=edit&section=4 "Edit section: Prefix-free Kolmogorov complexity K")]

A **prefix-free universal Turing machine** is a universal partial computable function !Image 23: {\displaystyle U:2^{*}\rightarrow 2^{*}} whose domain is a prefix-free set of binary strings. Equivalently, no valid program for !Image 24: {\displaystyle U} is a prefix of any other, the domain satisfies the [prefix property](https://en.wikipedia.org/wiki/Prefix_code "Prefix code"). For instance, if every valid program for a universal Turing machine !Image 25: {\displaystyle U} ended with a termination string that could not appear elsewhere in the program, !Image 26: {\displaystyle U} would be prefix-free.

The prefix-free Kolmogorov complexity of a string !Image 27: {\displaystyle x} is defined by !Image 28: {\displaystyle K(x):=\min\{|c|:U(c)=x\}}the length of the shortest self-delimiting program that causes !Image 29: {\displaystyle U} to output !Image 30: {\displaystyle x}.

Different choices of prefix-free universal machines change !Image 31: {\displaystyle K(x)} by at most an additive constant.[[4]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-hutter-5)

There are some description languages which are optimal, in the following sense: given any description of an object in a description language, said description may be used in the optimal description language with a constant overhead. The constant depends only on the languages involved, not on the description of the object, nor the object being described.

Here is an example of an optimal description language. A description will have two parts:

- The first part describes another description language.

- The second part is a description of the object in that language.

In more technical terms, the first part of a description is a computer program (specifically: a compiler for the object's language, written in the description language), with the second part being the input to that computer program which produces the object as output.

**The invariance theorem follows:** Given any description language _L_, the optimal description language is at least as efficient as _L_, with some constant overhead.

**Proof:** Any description _D_ in _L_ can be converted into a description in the optimal language by first describing _L_ as a computer program _P_ (part 1), and then using the original description _D_ as input to that program (part 2). The total length of this new description _D′_ is (approximately):

|_D′_ | = |_P_| + |_D_|

The length of _P_ is a constant that doesn't depend on _D_. So, there is at most a constant overhead, regardless of the object described. Therefore, the optimal language is universal [up to](https://en.wikipedia.org/wiki/Up_to "Up to") this additive constant.

A more formal treatment

[[edit](https://en.wikipedia.org/w/index.php?title=Kolmogorov_complexity&action=edit&section=7 "Edit section: A more formal treatment")]

**Theorem**: If _K_ 1 and _K_ 2 are the complexity functions relative to [Turing complete](https://en.wikipedia.org/wiki/Turing_complete "Turing complete") description languages _L_ 1 and _L_ 2, then there is a constant _c_– which depends only on the languages _L_ 1 and _L_ 2 chosen– such that

!Image 32: {\displaystyle \forall s.-c\leq K_{1}(s)-K_{2}(s)\leq c}.

**Proof**: By symmetry, it suffices to prove that there is some constant _c_ such that for all strings _s_

!Image 33: {\displaystyle K_{1}(s)\leq K_{2}(s)+c}.

Now, suppose there is a program in the language _L_ 1 which acts as an interpreter "Interpreter (computing)") for _L_ 2:

def interpret_language(p: str)

where _p_ is a program in _L_ 2. The interpreter is characterized by the following property:

Running `interpret_language` on input _p_ returns the result of running _p_.

Thus, if **P** is a program in _L_ 2 which is a minimal description of _s_, then `interpret_language`(**P**) returns the string _s_. The length of this description of _s_ is the sum of

1. The length of the program `interpret_language`, which we can take to be the constant _c_. 2. The length of **P** which by definition is _K_ 2(_s_).

This proves the desired upper bound.

History and context

[[edit](https://en.wikipedia.org/w/index.php?title=Kolmogorov_complexity&action=edit&section=8 "Edit section: History and context")]

[Algorithmic information theory](https://en.wikipedia.org/wiki/Algorithmic_information_theory "Algorithmic information theory") is the area of computer science that studies Kolmogorov complexity and other complexity measures on strings (or other [data structures](https://en.wikipedia.org/wiki/Data_structure "Data structure")).

The concept and theory of Kolmogorov Complexity is based on a crucial theorem first discovered by [Ray Solomonoff](https://en.wikipedia.org/wiki/Ray_Solomonoff "Ray Solomonoff"), who published it in 1960, describing it in "A Preliminary Report on a General Theory of Inductive Inference"[[5]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-6) as part of his invention of [algorithmic probability](https://en.wikipedia.org/wiki/Algorithmic_probability "Algorithmic probability"). He gave a more complete description in his 1964 publications, "A Formal Theory of Inductive Inference," Part 1 and Part 2 in _Information and Control_.[[6]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-7)[[7]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-8)

Andrey Kolmogorov later [independently published](https://en.wikipedia.org/wiki/Multiple_discovery "Multiple discovery") this theorem in _Problems Inform. Transmission_ in 1965.[[8]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-9)[Gregory Chaitin](https://en.wikipedia.org/wiki/Gregory_Chaitin "Gregory Chaitin") also presents this theorem in the _[Journal of the ACM](https://en.wikipedia.org/wiki/Journal\_of\_the\_ACM "Journal of the ACM")_– Chaitin's paper was submitted October 1966 and revised in December 1968, and cites both Solomonoff's and Kolmogorov's papers.[[9]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-10)

The theorem says that, among algorithms that decode strings from their descriptions (codes), there exists an optimal one. This algorithm, for all strings, allows codes as short as allowed by any other algorithm up to an additive constant that depends on the algorithms, but not on the strings themselves. Solomonoff used this algorithm and the code lengths it allows to define a "universal probability" of a string on which inductive inference of the subsequent digits of the string can be based. Kolmogorov used this theorem to define several functions of strings, including complexity, randomness, and information.

When Kolmogorov became aware of Solomonoff's work, he acknowledged Solomonoff's priority.[[10]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-11) For several years, Solomonoff's work was better known in the Soviet Union than in the West. The general consensus in the scientific community, however, was to associate this type of complexity with Kolmogorov, who was concerned with randomness of a sequence, while Algorithmic Probability became associated with Solomonoff, who focused on prediction using his invention of the universal prior probability distribution. The broader area encompassing descriptional complexity and probability is often called Kolmogorov complexity. The computer scientist [Ming Li](https://en.wikipedia.org/wiki/Ming_Li "Ming Li") considers this an example of the Matthew effect "Matthew effect (sociology)"): "...to everyone who has, more will be given..."[[11]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-12)

There are several other variants of Kolmogorov complexity or algorithmic information. The most widely used one is based on [self-delimiting programs](https://en.wikipedia.org/w/index.php?title=Self-delimiting_program&action=edit&redlink=1 "Self-delimiting program (page does not exist)"), and is mainly due to [Leonid Levin](https://en.wikipedia.org/wiki/Leonid_Levin "Leonid Levin") (1974).

An axiomatic approach to Kolmogorov complexity based on [Blum axioms](https://en.wikipedia.org/wiki/Blum_axioms "Blum axioms") (Blum 1967) was introduced by Mark Burgin in the paper presented for publication by Andrey Kolmogorov.[[12]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-Burgin1982-13)

We write !Image 34: {\displaystyle K(x,y)} to be !Image 35: {\displaystyle K((x,y))}, where !Image 36: {\displaystyle (x,y)} means some fixed way to code for a tuple of strings x and y.

We omit additive factors of !Image 37: {\displaystyle O(1)}. This section is based on.[[4]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-hutter-5)

**Theorem.**!Image 38: {\displaystyle K(x)\leq C(x)+2\log _{2}C(x)}

**Proof.** Take any program for the universal Turing machine used to define plain complexity, and convert it to a prefix-free program by first coding the length of the program in binary, then convert the length to prefix-free coding. For example, suppose the program has length 9, then we can convert it as follows:!Image 39: {\displaystyle 9\mapsto 1001\mapsto 11-00-00-11-\color {red}{01}}where we double each digit, then add a termination code. The prefix-free universal Turing machine can then read in any program for the other machine as follows:![Image 40: {\displaystyle [{\text{code for simulating the other machine}}][{\text{coded length of the program}}][{\text{the program}}]}](https://wikimedia.org/api/rest_v1/media/math/render/svg/1a7fb908ae6de37e990e880d0a297db3601251fb)The first part programs the machine to simulate the other machine, and is a constant overhead !Image 41: {\displaystyle O(1)}. The second part has length !Image 42: {\displaystyle \leq 2\log _{2}C(x)+3}. The third part has length !Image 43: {\displaystyle C(x)}.

**Theorem**: There exists !Image 44: {\displaystyle c} such that !Image 45: {\displaystyle \forall x,C(x)\leq |x|+c}. More succinctly, !Image 46: {\displaystyle C(x)\leq |x|}. Similarly, !Image 47: {\displaystyle K(x)\leq |x|+2\log _{2}|x|}, and !Image 48: {\displaystyle K(x||x|)\leq |x|}.[_[clarification needed](https://en.wikipedia.org/wiki/Wikipedia:Please\_clarify "Wikipedia:Please clarify")_]

**Proof.** For the plain complexity, just write a program that simply copies the input to the output. For the prefix-free complexity, we need to first describe the length of the string, before writing out the string itself.

**Theorem. (extra information bounds, subadditivity)**

- !Image 49: {\displaystyle K(x|y)\leq K(x)\leq K(x,y)\leq \max(K(x|y)+K(y),K(y|x)+K(x))\leq K(x)+K(y)}

- !Image 50: {\displaystyle K(xy)\leq K(x,y)}

Note that there is no way to compare !Image 51: {\displaystyle K(xy)} and !Image 52: {\displaystyle K(x|y)} or !Image 53: {\displaystyle K(x)} or !Image 54: {\displaystyle K(y|x)} or !Image 55: {\displaystyle K(y)}. There are strings such that the whole string !Image 56: {\displaystyle xy} is easy to describe, but its substrings are very hard to describe.

**Theorem. (symmetry of information)**!Image 57: {\displaystyle K(x,y)=K(x|y,K(y))+K(y)=K(y,x)}.

**Proof.** One side is simple. For the other side with !Image 58: {\displaystyle K(x,y)\geq K(x|y,K(y))+K(y)}, we need to use a counting argument (page 38 [[13]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-14)).

**Theorem. (information non-increase)** For any computable function !Image 59: {\displaystyle f}, we have !Image 60: {\displaystyle K(f(x))\leq K(x)+K(f)}.

**Proof.** Program the Turing machine to read two subsequent programs, one describing the function and one describing the string. Then run both programs on the work tape to produce !Image 61: {\displaystyle f(x)}, and write it out.

Uncomputability of Kolmogorov complexity

[[edit](https://en.wikipedia.org/w/index.php?title=Kolmogorov_complexity&action=edit&section=11 "Edit section: Uncomputability of Kolmogorov complexity")]

#### A naive attempt at a program to compute _K_

[[edit](https://en.wikipedia.org/w/index.php?title=Kolmogorov_complexity&action=edit&section=12 "Edit section: A naive attempt at a program to compute K")]

At first glance it might seem trivial to write a program which can compute _K_(_s_) for any _s_, such as the following:

def kolmogorov_complexity(s: str): for i = 1 to infinity: for each string p of length exactly i if is_valid_program(p) and evaluate(p) == s return i

This program iterates through all possible programs (by iterating through all possible strings and only considering those which are valid programs), starting with the shortest. Each program is executed to find the result produced by that program, comparing it to the input _s_. If the result matches then the length of the program is returned.

However this will not work because some of the programs _p_ tested will not terminate, e.g. if they contain infinite loops. There is no way to avoid all of these programs by testing them in some way before executing them due to the non-computability of the [halting problem](https://en.wikipedia.org/wiki/Halting_problem "Halting problem").

What is more, no program at all can compute the function _K_, be it ever so sophisticated. This is proven in the following.

#### Formal proof of uncomputability of _K_

[[edit](https://en.wikipedia.org/w/index.php?title=Kolmogorov_complexity&action=edit&section=13 "Edit section: Formal proof of uncomputability of K")]

**Theorem**: There exist strings of arbitrarily large Kolmogorov complexity. Formally: for each natural number _n_, there is a string _s_ with _K_(_s_) ≥ _n_.[[note 2]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-15)

**Proof:** Otherwise all of the infinitely many possible finite strings could be generated by the finitely many[[note 3]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-16) programs with a complexity below _n_ bits.

**Theorem**: _K_ is not a [computable function](https://en.wikipedia.org/wiki/Computable_function "Computable function"). In other words, there is no program which takes any string _s_ as input and produces the integer _K_(_s_) as output.

The following [**proof** by contradiction](https://en.wikipedia.org/wiki/Proof_by_contradiction "Proof by contradiction") uses a simple Pascal "Pascal (programming language)")-like language to denote programs; for sake of proof simplicity assume its description (i.e. an interpreter "Interpreter (computing)")) to have a length of 1 400 000 bits. Assume for contradiction there is a program

def kolmogorov_complexity(s: str)

which takes as input a string _s_ and returns _K_(_s_). All programs are of finite length so, for sake of proof simplicity, assume it to be 7 000 000 000 bits. Now, consider the following program of length 1288 bits:

def generate_complex_string(): str for i = 1 to infinity: for each string s of length exactly i if kolmogorov_complexity(s) >= 8000000000 return s

Using `kolmogorov_complexity` as a subroutine, the program tries every string, starting with the shortest, until it returns a string with Kolmogorov complexity at least 8 000 000 000 bits,[[note 4]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-17) i.e. a string that cannot be produced by any program shorter than 8 000 000 000 bits. However, the overall length of the above program that produced _s_ is only 7 001 401 288 bits,[[note 5]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-18) which is a contradiction. (If the code of `KolmogorovComplexity` is shorter, the contradiction remains. If it is longer, the constant used in `GenerateComplexString` can always be changed appropriately.)[[note 6]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-19)

The above proof uses a contradiction similar to that of the [Berry paradox](https://en.wikipedia.org/wiki/Berry_paradox "Berry paradox"): "1 The 2 smallest 3 positive 4 integer 5 that 6 cannot 7 be 8 defined 9 in 10 fewer 11 than 12 twenty 13 English 14 words". It is also possible to show the non-computability of _K_ by reduction from the non-computability of the halting problem _H_, since _K_ and _H_ are [Turing-equivalent](https://en.wikipedia.org/wiki/Turing_degree#Turing_equivalence "Turing degree").[[14]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-20)

There is a corollary, humorously called the "[full employment theorem](https://en.wikipedia.org/wiki/Full_employment_theorem "Full employment theorem")" in the programming language community, stating that there is no perfect size-optimizing compiler.

Chain rule for Kolmogorov complexity

[[edit](https://en.wikipedia.org/w/index.php?title=Kolmogorov_complexity&action=edit&section=14 "Edit section: Chain rule for Kolmogorov complexity")]

The chain rule[[15]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-21) for Kolmogorov complexity states that there exists a constant _c_ such that for all _X_ and _Y_:

!Image 62: {\displaystyle K(X,Y)=K(X)+K(Y|X)+c\cdot max(1,log(K(X,Y)))}.

It states that the shortest program that reproduces _X_ and _Y_ is [no more](https://en.wikipedia.org/wiki/Big-O_notation "Big-O notation") than a logarithmic term larger than a program to reproduce _X_ and a program to reproduce _Y_ given _X_. Using this statement, one can define [an analogue of mutual information for Kolmogorov complexity](https://en.wikipedia.org/wiki/Mutual_information#Absolute_mutual_information "Mutual information").

It is straightforward to compute upper bounds for _K_(_s_)– simply [compress](https://en.wikipedia.org/wiki/Data_compression "Data compression") the string _s_ with some method, implement the corresponding decompressor in the chosen language, concatenate the decompressor to the compressed string, and measure the length of the resulting string– concretely, the size of a [self-extracting archive](https://en.wikipedia.org/wiki/Self-extracting_archive "Self-extracting archive") in the given language.

A string _s_ is compressible by a number _c_ if it has a description whose length does not exceed |_s_| − _c_ bits. This is equivalent to saying that _K_(_s_) ≤ |_s_| − _c_. Otherwise, _s_ is incompressible by _c_. A string incompressible by 1 is said to be simply _incompressible_– by the [pigeonhole principle](https://en.wikipedia.org/wiki/Pigeonhole_principle "Pigeonhole principle"), which applies because every compressed string maps to only one uncompressed string, [incompressible strings](https://en.wikipedia.org/wiki/Incompressible_string "Incompressible string") must exist, since there are 2 _n_ bit strings of length _n_, but only 2 _n_ − 1 shorter strings, that is, strings of length less than _n_, (i.e. with length 0, 1, ..., _n_−1).[[note 7]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-22)

For the same reason, most strings are complex in the sense that they cannot be significantly compressed– their _K_(_s_) is not much smaller than |_s_|, the length of _s_ in bits. To make this precise, fix a value of _n_. There are 2 _n_ bitstrings of length _n_. The uniform "Uniform distribution (discrete)")[probability](https://en.wikipedia.org/wiki/Probability "Probability") distribution on the space of these bitstrings assigns exactly equal weight 2−_n_ to each string of length _n_.

**Theorem**: With the uniform probability distribution on the space of bitstrings of length _n_, the probability that a string is incompressible by _c_ is at least 1 − 2−_c_+1 + 2−_n_.

To prove the theorem, note that the number of descriptions of length not exceeding _n_ − _c_ is given by the geometric series:

1 + 2 + 2 2 + ... + 2 _n_ − _c_ = 2 _n_−_c_+1 − 1.

There remain at least

2 _n_ − 2 _n_−_c_+1 + 1

bitstrings of length _n_ that are incompressible by _c_. To determine the probability, divide by 2 _n_.

Chaitin's incompleteness theorem

[[edit](https://en.wikipedia.org/w/index.php?title=Kolmogorov_complexity&action=edit&section=16 "Edit section: Chaitin's incompleteness theorem")]

![Image 63](https://en.wikipedia.org/wiki/File:Kolmogorov_complexity_and_computable_lower_bounds_svg.svg)

Kolmogorov complexity _K_(_s_), and two computable lower bound functions `prog1(s)`, `prog2(s)`. The horizontal axis ([logarithmic scale](https://en.wikipedia.org/wiki/Logarithmic_scale "Logarithmic scale")) enumerates all strings "String (computer science)")_s_, ordered by length; the vertical axis ([linear scale](https://en.wikipedia.org/wiki/Linear_scale "Linear scale")) measures Kolmogorov complexity in [bits](https://en.wikipedia.org/wiki/Bit "Bit"). Most strings are incompressible, i.e. their Kolmogorov complexity exceeds their length by a constant amount. 9 compressible strings are shown in the picture, appearing as almost vertical slopes. Due to Chaitin's incompleteness theorem (1974), the output of any program computing a lower bound of the Kolmogorov complexity cannot exceed some fixed limit, which is independent of the input string _s_.

By the above theorem (§Compression), most strings are complex in the sense that they cannot be described in any significantly "compressed" way. However, it turns out that the fact that a specific string is complex cannot be formally proven, if the complexity of the string is above a certain threshold. The precise formalization is as follows. First, fix a particular [axiomatic system](https://en.wikipedia.org/wiki/Axiomatic_system "Axiomatic system")**S** for the [natural numbers](https://en.wikipedia.org/wiki/Natural_number "Natural number"). The axiomatic system has to be powerful enough so that, to certain assertions **A** about complexity of strings, one can associate a formula **F****A** in **S**. This association must have the following property:

If **F****A** is provable from the axioms of **S**, then the corresponding assertion **A** must be true. This "formalization" can be achieved based on a [Gödel numbering](https://en.wikipedia.org/wiki/G%C3%B6del_numbering "Gödel numbering").

**Theorem**: There exists a constant _L_ (which only depends on **S** and on the choice of description language) such that there does not exist a string _s_ for which the statement

!Image 64: {\displaystyle K(s)\geq L} (as formalized in **S**)

can be proven within **S**.[[16]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-23)[[17]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-24)

**Proof Idea**: The proof of this result is modeled on a self-referential construction used in [Berry's paradox](https://en.wikipedia.org/wiki/Berry%27s_paradox "Berry's paradox"). We initially obtain a program which enumerates the proofs within **S** and we specify a procedure _P_ which takes as an input an integer _L_ and prints the strings _x_ which are within proofs within **S** of the statement _K_(_x_) ≥ _L_. By then setting _L_ to greater than the length of this procedure _P_, we have that the required length of a program to print _x_ as stated in _K_(_x_) ≥ _L_ as being at least _L_ is then less than the amount _L_ since the string _x_ was printed by the procedure _P_. This is a contradiction. So it is not possible for the proof system **S** to prove _K_(_x_) ≥ _L_ for _L_ arbitrarily large, in particular, for _L_ larger than the length of the procedure _P_, (which is finite).

**Proof**:

We can find an effective enumeration of all the formal proofs in **S** by some procedure

def nth_proof(n: int)

which takes as input _n_ and outputs some proof. This function enumerates all proofs. Some of these are proofs for formulas we do not care about here, since every possible proof in the language of **S** is produced for some _n_. Some of these are complexity formulas of the form _K_(_s_)≥_n_ where _s_ and _n_ are constants in the language of **S**. There is a procedure

def nth_proof_proves_complexity_formula(n: int): bool

which determines whether the _n_ th proof actually proves a complexity formula _K_(_s_)≥_L_. The strings _s_, and the integer _L_ in turn, are computable by procedure:

def string_nth_proof(n: int)

def complexity_lower_bound_nth_proof(n: int): int

Consider the following procedure:

def generate_provably_complex_string(n: int): for i = 1 to infinity: if nth_proof_proves_complexity_formula(i) and complexity_lower_bound_nth_proof(i) ≥ n return string_nth_proof(i)

Given an !Image 65: {\displaystyle n}, this procedure tries every proof until it finds a string and a proof in the formal system **S** of the formula !Image 66: {\displaystyle K(s)\geq L} for some !Image 67: {\displaystyle L\geq n}; if no such proof exists, it loops forever.

Finally, consider the program consisting of all these procedure definitions, and a main call:

generate_provably_complex_string(n₀)

where the constant !Image 68: {\displaystyle n_{0}} will be determined later on. The overall program length can be expressed as !Image 69: {\displaystyle U+log_{2}(n_{0})}, where !Image 70: {\displaystyle U} is some constant and !Image 71: {\displaystyle log_{2}(n_{0})} represents the length of the integer value !Image 72: {\displaystyle n_{0}}, under the reasonable assumption that it is encoded in binary digits. We will choose !Image 73: {\displaystyle n_{0}} to be greater than the program length, that is, such that !Image 74: {\displaystyle n_{0}>U+log_{2}(n_{0})}. This is clearly true for !Image 75: {\displaystyle n_{0}} sufficiently large, because the left hand side grows linearly in !Image 76: {\displaystyle n_{0}} whilst the right hand side grows logarithmically in !Image 77: {\displaystyle n_{0}} up to the fixed constant !Image 78: {\displaystyle U}.

Then no proof of the form "!Image 79: {\displaystyle K(s)\geq L}" with !Image 80: {\displaystyle L\geq n_{0}} can be obtained in **S**, as can be seen by an [indirect argument](https://en.wikipedia.org/wiki/Indirect_argument "Indirect argument"): If `complexity_lower_bound_nth_proof(i)` could return a value !Image 81: {\displaystyle \geq n_{0}}, then the loop inside `generate_provably_complex_string` would eventually terminate, and that procedure would return a string _s_ such that

| | !Image 82: {\displaystyle K(s)} | | | --- | --- | --- | | ≥ | !Image 83: {\displaystyle n_{0}} | by construction of `generate_provably_complex_string` |

| > | !Image 84: {\displaystyle U+log_{2}(n_{0})} | by the choice of !Image 85: {\displaystyle n_{0}} |

| ≥ | !Image 86: {\displaystyle K(s)} | since !Image 87: {\displaystyle s} was described by the program with that length |

This is a contradiction, [Q.E.D.](https://en.wikipedia.org/wiki/Q.E.D. "Q.E.D.")

As a consequence, the above program, with the chosen value of !Image 88: {\displaystyle n_{0}}, must loop forever.

Similar ideas are used to prove the properties of [Chaitin's constant](https://en.wikipedia.org/wiki/Chaitin%27s_constant "Chaitin's constant").

Minimum message length

[[edit](https://en.wikipedia.org/w/index.php?title=Kolmogorov_complexity&action=edit&section=17 "Edit section: Minimum message length")]

The minimum message length principle of statistical and inductive inference and machine learning was developed by C.S. Wallace "Chris Wallace (computer scientist)") and D.M. Boulton in 1968. MML is [Bayesian](https://en.wikipedia.org/wiki/Bayesian_probability "Bayesian probability") (i.e. it incorporates prior beliefs) and information-theoretic. It has the desirable properties of statistical invariance (i.e. the inference transforms with a re-parametrisation, such as from polar coordinates to Cartesian coordinates), statistical consistency (i.e. even for very hard problems, MML will converge to any underlying model) and efficiency (i.e. the MML model will converge to any true underlying model about as quickly as is possible). C.S. Wallace and D.L. Dowe (1999) showed a formal connection between MML and algorithmic information theory (or Kolmogorov complexity).[[18]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-25)

Kolmogorov randomness

[[edit](https://en.wikipedia.org/w/index.php?title=Kolmogorov_complexity&action=edit&section=18 "Edit section: Kolmogorov randomness")]

_Kolmogorov randomness_ defines a string (usually of [bits](https://en.wikipedia.org/wiki/Bit "Bit")) as being [random](https://en.wikipedia.org/wiki/Randomness "Randomness") if the shortest [computer program](https://en.wikipedia.org/wiki/Computer_program "Computer program") that can produce that string is about as long as the string itself. To make this precise, a string !Image 89: {\displaystyle x} of length !Image 90: {\displaystyle n} is called _Kolmogorov random_ if !Image 91: {\displaystyle K(x)\geq n+O(1)} where !Image 92: {\displaystyle K} is the prefix-free Kolmogorov complexity defined above. A random string in this sense is _incompressible_ in that it is impossible to "compress" the string into a program that is shorter than the string itself. There is at least one Kolmogorov random string of each length.[[19]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-26)

This definition can be extended to define a notion of randomness for _infinite_ sequences from a finite alphabet. These [algorithmically random sequences](https://en.wikipedia.org/wiki/Algorithmically_random_sequence "Algorithmically random sequence") can be defined in three equivalent ways. One way uses an effective analogue of [measure theory](https://en.wikipedia.org/wiki/Measure_theory "Measure theory"); another uses effective martingales "Martingale (probability theory)"). The third way defines an infinite sequence to be random if the prefix-free Kolmogorov complexity of its initial segments grows quickly enough— there must be a constant _c_ such that the complexity of an initial segment of length _n_ is always at least _n_−_c_.[[20]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-27)

Relation to entropy

[[edit](https://en.wikipedia.org/w/index.php?title=Kolmogorov_complexity&action=edit&section=19 "Edit section: Relation to entropy")]

For dynamical systems, entropy rate and algorithmic complexity of the trajectories are related by a theorem of Brudno, that the equality !Image 93: {\displaystyle K(x;T)=h(T)} holds for almost all !Image 94: {\displaystyle x}.[[21]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-28)

It can be shown[[22]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-29) that for the output of [Markov information sources](https://en.wikipedia.org/wiki/Markov_information_source "Markov information source"), Kolmogorov complexity is related to the entropy "Entropy (information theory)") of the information source. More precisely, the Kolmogorov complexity of the output of a Markov information source, normalized by the length of the output, converges almost surely (as the length of the output goes to infinity) to the entropy "Entropy (information theory)") of the source.

**Theorem.** (Theorem 14.2.5 [[23]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-:1-30)) The conditional Kolmogorov complexity of a binary string !Image 95: {\displaystyle x_{1:n}} satisfies![Image 96: {\displaystyle {\frac {1}{n}}K(x_{1:n}|n)\leq H_{b}\left({\frac {1}{n}}\sum _{i}x_{i}\right)+{\frac {\log n}{2n}}+O(1/n)}](https://wikimedia.org/api/rest_v1/media/math/render/svg/400d28435b9fc24158b61cd6c0b625b591488044)where !Image 97: {\displaystyle H_{b}} is the [binary entropy function](https://en.wikipedia.org/wiki/Binary_entropy_function "Binary entropy function") (not to be confused with the entropy rate).

The Kolmogorov complexity function is equivalent to deciding the halting problem.

If we have a halting oracle, then the Kolmogorov complexity of a string can be computed by simply trying every halting program, in lexicographic order, until one of them outputs the string.

The other direction is much more involved.[[24]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-31)[[25]](https://en.wikipedia.org/wiki/Kolmogorov_complexity#cite_note-32) It shows that given a Kolmogorov complexity function, we can construct a function !Image 98: {\displaystyle p}, such that !Image 99: {\displaystyle p(n)\geq BB(n)} for all large !Image 100: {\displaystyle n}, where !Image 101: {\displaystyle BB} is the [Busy Beaver](https://en.wikipedia.org/wiki/Busy_beaver "Busy beaver") shift function (also denoted as !Image 102: {\displaystyle S(n)}). By modifying the function at lower values of !Image 103: {\displaystyle n} we get an upper bound on !Image 104: {\displaystyle BB}, which solves the halting problem.

Consider this program !Image 105: {\textstyle p_{K}}, which takes input as !Image 106: {\textstyle n}, and uses !Image 107: {\textstyle K}.

We prove by contradiction that !Image 108: {\textstyle p_{K}(n)\geq BB(n)} for all large !Image 109: {\textstyle n}.

Let !Image 110: {\textstyle p_{n}} be a Busy Beaver of length !Image 111: {\displaystyle n}. Consider this (prefix-free) program, which takes no input:

Let the string output by the program be ![Image 112: {\textstyle x