Let M be a reference machine which corresponds to a universal computer with a prefix-free code. In a prefix-free code, no code word is a prefix of another; such a code is also called self-delimiting, a property shared by most reasonable computer programming languages. Ray Solomonoff asked for the probability that an output string x is generated by M, considering the whole space of possible programs. By giving each program bitstring p an *a priori* probability of 2^{−|p|}, we can ensure that the space of programs meets the probability axioms by the extended *Kraft inequality*: an instantaneous code (prefix code, tree code) with code word lengths l_{1}, …, l_{N} exists if and only if

∑_{i=1}^{N} 2^{-l_{i}} ≤ 1
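The inequality can be checked mechanically. Below is a small Python sketch; the code words are an illustrative example, not from the text:

```python
# A quick mechanical check of the Kraft inequality on a small prefix-free
# code (the code words below are an illustrative example, not from the text).
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

def is_prefix_free(words):
    """True if no code word is a proper prefix of another."""
    return not any(u != v and v.startswith(u) for u in words for v in words)

kraft_sum = sum(2 ** -len(w) for w in code.values())

assert is_prefix_free(list(code.values()))
assert kraft_sum <= 1  # the Kraft inequality holds
print(kraft_sum)  # 1.0 — this particular code is complete
```

A complete code like this one saturates the bound at exactly 1; dropping any code word would leave the sum strictly below 1.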

In other words, we imagine that we toss a fair coin to generate each bit of a random program. This probability model of programs entails the following probability mass function (p.m.f.) for strings x ∈ {0, 1}∗:

P_{M}(x) = ∑_{M(p)=x*} 2^{-|p|} —– (1)

which is the probability that a random program will output a string that begins with x (the notation x* denotes x followed by any continuation). P_{M}(x) is called the algorithmic probability of x, as it defines the probability of a string in terms of the programs that generate it.
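As an illustration only, consider a miniature stand-in for M (an assumed toy machine, not Solomonoff's reference machine): a program is a bitstring in which "1b" outputs the bit b and continues, and "0" halts, so valid programs are prefix-free by construction. Summing 2^{-|p|} over all programs whose output begins with x approximates (1):

```python
# A toy stand-in for the reference machine M (an illustrative assumption,
# not Solomonoff's construction): "1b" outputs bit b and continues, "0" halts.
from itertools import product

def run(program):
    """Return the output of a complete program, or None if the bitstring
    is not one whole self-delimiting program."""
    out, i = [], 0
    while i < len(program):
        if program[i] == "0":
            return "".join(out) if i == len(program) - 1 else None
        if i + 1 == len(program):
            return None  # a "1" with no data bit after it
        out.append(program[i + 1])
        i += 2
    return None  # ran out of bits before the halting "0"

def algorithmic_probability(x, max_len=13):
    """Approximate P_M(x): sum 2^{-|p|} over every program p (up to
    max_len bits) whose output begins with x."""
    total = 0.0
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            y = run("".join(bits))
            if y is not None and y.startswith(x):
                total += 2.0 ** -n
    return total

print(algorithmic_probability("1"))  # 0.24609375, approaching the exact 4^{-|x|} = 0.25
```

For this toy machine the exact value works out to 4^{-|x|} in the limit; the truncated enumeration converges to it from below as max_len grows.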

Using this probability model of bitstrings, one can make predictions. Intuitively, it is impossible to imagine intelligence in the absence of any prediction ability: purely random behavior is decisively non-intelligent. Since P_{M} is a universal probability model, it can be used as the basis of universal prediction, and thus intelligence. Perhaps Solomonoff's most significant contributions were in the field of AI, as he envisioned a machine that can learn anything from scratch.

His main proposal for machine learning is inductive inference (*Part 1*, *Part 2*), for a variety of problems such as **sequence prediction, set induction, operator induction and grammar induction**. Without much loss of generality, we can discuss sequence prediction on bitstrings. Assume that there is a computable p.m.f. of bitstrings P_{1}. Given a bitstring x drawn from P_{1}, we can define the conditional probability of the next bit simply by normalizing. Algorithmically, we would have to approximate (1) by finding short programs that generate x (the shortest of which is the most probable). In more general induction, we run all models in parallel, quantifying fit-to-data, weighed by the algorithmic probability of the model, to find the best models and construct distributions; the common point is determining good models with high *a priori* probability. Finding the shortest program in general is undecidable; however, **Levin search** can be used for this purpose. There are two important results about Solomonoff induction that we shall mention here. First, Solomonoff induction converges very rapidly to the real probability distribution. The convergence theorem shows that the expected total square error is related only to the algorithmic complexity of P_{1}, which is independent of x. The following **bound** is discussed at length with a concise **proof**:

E_{P} [∑_{m=1}^{n} (P(a_{m+1} = 1|a_{1}a_{2}…a_{m}) − P_{1}(a_{m+1} = 1|a_{1}a_{2}…a_{m}))^{2}] ≤ −(1/2) ln P(P_{1}) —– (2)
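The mixture view of induction described above can be made concrete with a toy sketch: a few hand-picked Bernoulli models stand in for "all computable models", and the made-up complexity values k below stand in for their shortest-program lengths (both are assumptions for illustration, not part of Solomonoff's construction):

```python
# A minimal sketch of mixture-based induction under assumed ingredients:
# three Bernoulli coins stand in for "all programs", and the complexities
# k below are invented stand-ins for shortest-program lengths.
models = [
    {"theta": 0.5, "k": 1},  # "fair coin" — assumed simplest, prior 2^-1
    {"theta": 0.9, "k": 3},  # heavily biased coin, prior 2^-3
    {"theta": 0.1, "k": 3},  # biased the other way, prior 2^-3
]

def predict_next_one(bits):
    """P(next bit = 1 | bits): weigh each model's fit-to-data (likelihood)
    by its a priori probability 2^{-k}, then normalize."""
    weights = []
    for m in models:
        likelihood = 1.0
        for b in bits:
            likelihood *= m["theta"] if b == "1" else 1 - m["theta"]
        weights.append(2.0 ** -m["k"] * likelihood)
    total = sum(weights)
    return sum(w * m["theta"] for w, m in zip(weights, models)) / total

print(predict_next_one(""))        # 0.5 before seeing any data
print(predict_next_one("1111111"))  # ~0.875: the biased coin dominates after seven 1s
```

With no data the prediction is the prior mixture; after a run of ones, the high-likelihood biased model takes over despite its smaller *a priori* weight, which is exactly the "fit-to-data weighed by algorithmic probability" trade-off.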

This bound characterizes the divergence of the Algorithmic Probability (ALP) solution from the real probability distribution P_{1}. P(P_{1}) is the *a priori* probability of the p.m.f. P_{1} according to our universal distribution P_{M}. On the right-hand side of (2), −ln P(P_{1}) is roughly k ln 2, where k is the Kolmogorov complexity of P_{1} (the length of the shortest program that defines it); thus the total expected error is bounded by a constant, which guarantees that the error decreases very rapidly as the number of examples increases. In *algorithmic information theory*, the Kolmogorov complexity of an object, such as a piece of text, is the length of the shortest computer program that produces the object as output. It is a measure of the computational resources needed to specify the object, and is also known as descriptive complexity, Kolmogorov–Chaitin complexity, algorithmic entropy, or program-size complexity. Secondly, there is an optimal search algorithm to approximate Solomonoff induction, which adopts **Levin's universal search method** to **solve the problem of universal induction**. The universal search procedure time-shares all candidate programs according to their *a priori* probability, with a clever watch-dog policy to avoid the practical impact of the undecidability of the halting problem. The search procedure starts with a time limit t = t_{0}; in each iteration it tries every candidate program c with a time limit of t.P(c), and while a solution is not found, it doubles the time limit t. The time t(s)/P(s) for a solution program s taking time t(s) is called the Conceptual Jump Size (CJS), and it is easily shown that Levin search terminates in at most 2·CJS time. To obtain alternative solutions, one may keep running after the first solution is found, as there may be more probable solutions that need more time. The optimal solution is computable only in the limit, which turns out to be a desirable property of Solomonoff induction, as it is complete and uncomputable.
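The doubling procedure above can be sketched in a few lines of Python. This is a toy model, not Levin's actual implementation: the generator-based "programs" and the hand-assigned priors are assumptions for illustration.

```python
# A sketch of Levin-style universal search under assumed ingredients: each
# candidate "program" is a Python generator that yields once per simulated
# step and returns its answer when it halts; priors stand in for 2^{-|p|}.
def levin_search(candidates, is_solution, t0=1.0):
    """candidates: list of (prior, program_factory) pairs. Each pass runs
    candidate c for at most t * prior(c) steps; if no candidate produces a
    solution, the time limit t is doubled (the watch-dog policy)."""
    t = t0
    while True:
        for prior, factory in candidates:
            budget = int(t * prior)
            program, result, steps = factory(), None, 0
            try:
                while steps < budget:
                    next(program)   # one simulated step
                    steps += 1
            except StopIteration as stop:
                result = stop.value  # the program halted within its budget
            if result is not None and is_solution(result):
                return result
        t *= 2  # no solution at this time limit: double it and retry

# Two toy candidates: one halts after 50 steps with a wrong answer, the
# other after 20 steps with the right one.
def make_program(steps_needed, answer):
    def factory():
        def program():
            for _ in range(steps_needed):
                yield
            return answer
        return program()
    return factory

candidates = [(0.5, make_program(50, "wrong")),
              (0.25, make_program(20, "right"))]
print(levin_search(candidates, lambda r: r == "right"))  # -> right
```

Here the "right" program needs t(s)/P(s) = 20/0.25 = 80 steps of shared time, so the doubling schedule finds it once t reaches 128, consistent with the 2·CJS termination bound.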

[…] Solomonoff’s work was seminal in that he has single-handedly axiomatized AI, discovering the minim… […]