Conceptual Jump Size & Solomonoff Induction. Note Quote.

Let M be a reference machine corresponding to a universal computer with a prefix-free code. In a prefix-free code, no codeword is a prefix of another; such codes are also called self-delimiting, and most reasonable computer programming languages are of this kind. Ray Solomonoff asked for the probability that an output string x is generated by M, considering the whole space of possible programs. By giving each program bitstring p an a priori probability of $2^{-|p|}$, we can ensure that the space of programs satisfies the probability axioms, by the extended Kraft inequality: an instantaneous code (prefix code, tree code) with codeword lengths $l_1, \ldots, l_N$ exists if and only if

$$\sum_{i=1}^{N} 2^{-l_i} \leq 1$$
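As a quick illustration, one can check both prefix-freeness and the Kraft sum directly. This is a minimal sketch; the codewords are invented for the example, not taken from the text:

```python
def kraft_sum(codewords):
    """Sum of 2^-l_i over all codeword lengths l_i."""
    return sum(2 ** -len(w) for w in codewords)

def is_prefix_free(codewords):
    """A prefix-free code: no codeword is a prefix of another."""
    return not any(
        a != b and b.startswith(a) for a in codewords for b in codewords
    )

code = ["0", "10", "110", "111"]  # a complete prefix code (hypothetical example)
print(is_prefix_free(code))       # True
print(kraft_sum(code))            # 1.0 -- the inequality holds with equality
```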

In other words, we imagine that we toss a fair coin to generate each bit of a random program. This probability model of programs entails the following probability mass function (p.m.f.) for strings $x \in \{0,1\}^*$:

$$P_M(x) = \sum_{M(p) = x*} 2^{-|p|} \tag{1}$$

which is the probability that a random program outputs a string beginning with x (the notation $M(p) = x*$ means that the output of p starts with x). $P_M(x)$ is called the algorithmic probability of x, as it defines the probability of a string in terms of the programs that produce it.
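Equation (1) cannot be computed exactly, since it ranges over all programs, but it can be approximated from below by enumerating programs up to a length bound. The sketch below assumes a hypothetical toy prefix machine `run_program` (k zeros, a '1', then exactly k payload bits, echoing the payload); it is a stand-in for a real universal machine M, meant only to make the summation in (1) concrete:

```python
from itertools import product

def run_program(p):
    """Hypothetical toy prefix machine (an assumption, not Solomonoff's M):
    a valid program is k zeros, then a '1', then exactly k payload bits,
    and it outputs the payload. Anything else is treated as non-halting."""
    k = p.find("1")  # number of leading zeros
    if k == -1 or len(p) != 2 * k + 1:
        return None
    return p[k + 1:]

def algorithmic_probability(x, max_len=15):
    """Approximate equation (1) from below: sum 2^-|p| over all programs p
    of up to max_len bits whose output begins with x."""
    total = 0.0
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            out = run_program("".join(bits))
            if out is not None and out.startswith(x):
                total += 2.0 ** -n
    return total

print(algorithmic_probability("01"))  # contributions from payloads 01, 010, 011, ...
```

Because the toy program format $0^k 1 d_1 \ldots d_k$ is genuinely prefix-free, the total over all programs respects the Kraft bound, and the returned value is a lower bound on $P_M(x)$ that improves as `max_len` grows.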

Using this probability model of bitstrings, one can make predictions. Intuitively, it is impossible to imagine intelligence in the absence of any prediction ability: purely random behavior is decisively non-intelligent. Since $P_M$ is a universal probability model, it can be used as the basis of universal prediction, and thus of intelligence. Perhaps Solomonoff's most significant contributions were in the field of AI, as he envisioned a machine that can learn anything from scratch.
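Concretely, prediction with $P_M$ amounts to comparing the algorithmic probabilities of the candidate continuations of a string, normalized so they sum to one — exactly the normalization described in the next paragraph. A toy sketch reusing `algorithmic_probability` from above (and thus the same hypothetical machine):

```python
def predict_next_bit(x):
    """P(next bit = 1 | x): normalize the approximate algorithmic
    probabilities of the two possible one-bit continuations."""
    p0 = algorithmic_probability(x + "0")
    p1 = algorithmic_probability(x + "1")
    return p1 / (p0 + p1)

print(predict_next_bit("01"))  # 0.5 under the symmetric toy machine
```

Under the toy machine every continuation is equally likely, so the prediction is uninformative; with a real universal machine, regularities in x would tip the balance toward the continuation generated by shorter programs.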

His main proposal for machine learning is inductive inference (Part 1, Part 2), for a variety of problems such as sequence prediction, set induction, operator induction and grammar induction. Without much loss of generality, we can discuss sequence prediction on bitstrings. Assume that there is a computable p.m.f. $P_1$ over bitstrings. Given a bitstring $x = a_1 a_2 \ldots a_n$ drawn from $P_1$, we can define the conditional probability of the next bit simply by normalizing:

$$P_M(a_{n+1} = 1 \mid x) = \frac{P_M(x1)}{P_M(x0) + P_M(x1)}$$

Algorithmically, we would have to approximate (1) by finding short programs that generate x (the shortest of which is the most probable). In more general induction, we run all candidate models in parallel, quantifying their fit to the data, weighted by the algorithmic probability of each model, to find the best models and construct distributions; the common point is finding good models with high a priori probability. Finding the shortest program is undecidable in general; however, Levin search can be used for this purpose.

There are two important results about Solomonoff induction that we shall mention here. First, Solomonoff induction converges very rapidly to the real probability distribution. The convergence theorem shows that the expected total squared error depends only on the algorithmic complexity of $P_1$, which is independent of x. The following bound is discussed at length with a concise proof:

$$E_{P_1}\!\left[\sum_{m=1}^{n} \left( P_M(a_{m+1} = 1 \mid a_1 a_2 \ldots a_m) - P_1(a_{m+1} = 1 \mid a_1 a_2 \ldots a_m) \right)^{2} \right] \leq -\frac{1}{2} \ln P_M(P_1) \tag{2}$$

This bound characterizes the divergence of the Algorithmic Probability (ALP) solution from the real probability distribution $P_1$. Here $P_M(P_1)$ is the a priori probability of the p.m.f. $P_1$ according to our universal distribution $P_M$. On the right-hand side of (2), $-\ln P_M(P_1)$ is roughly $k \ln 2$, where k is the Kolmogorov complexity of $P_1$ (the length of the shortest program that defines it). The total expected error is thus bounded by a constant independent of the sequence length, which guarantees that the error decreases very rapidly as example size increases: for instance, if $P_1$ can be described in k = 100 bits, the expected total squared error never exceeds $100 \ln 2 / 2 \approx 35$, no matter how many bits are predicted. In algorithmic information theory, the Kolmogorov complexity of an object, such as a piece of text, is the length of the shortest computer program that produces the object as output. It is a measure of the computational resources needed to specify the object, and is also known as descriptive complexity, Kolmogorov–Chaitin complexity, algorithmic entropy, or program-size complexity.

Secondly, there is an optimal search algorithm to approximate Solomonoff induction, which adapts Levin's universal search method to the problem of universal induction. The universal search procedure time-shares all candidate programs according to their a priori probability, with a clever watch-dog policy to avoid the practical impact of the undecidability of the halting problem. The search starts with a time limit $t = t_0$, in each iteration tries every candidate program c with a time limit of $t \cdot P(c)$, and, while a solution is not found, doubles the time limit t. The quantity $t(s)/P(s)$ for a solution program s taking time $t(s)$ is called the Conceptual Jump Size (CJS), and it is easily shown that Levin search terminates in at most $2 \cdot \text{CJS}$ time. To obtain alternative solutions, one may keep running after the first solution is found, as there may be more probable solutions that need more time. The optimal solution is computable only in the limit, which turns out to be a desirable property of Solomonoff induction, as it is complete and uncomputable.
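The doubling schedule is easy to sketch. In the toy below (an illustration under stated assumptions, not Levin's original formulation), each candidate is a function that either returns its output within a given step budget or gives up, standing in for running a program under the watch-dog cut-off; the `steps_needed` values and the candidate list are invented for the example:

```python
def levin_search(candidates, is_solution, t0=1.0):
    """Sketch of Levin's universal search: `candidates` is a list of
    (run, prior) pairs, priors summing to at most 1, where run(budget)
    returns an output within `budget` steps or None (the watch-dog cut-off).
    Each phase allots candidate c a budget of t * P(c), then doubles t."""
    t = t0
    while True:
        for run, prior in candidates:
            out = run(int(t * prior))
            if out is not None and is_solution(out):
                return out, t
        t *= 2.0  # double the phase time limit

def make_candidate(steps_needed, output):
    """Hypothetical program: halts with `output` iff given enough steps."""
    def run(budget):
        return output if budget >= steps_needed else None
    return run

candidates = [
    (make_candidate(40, "wrong"), 0.5),    # probable, but not a solution
    (make_candidate(100, "right"), 0.25),  # the actual solution s
]
out, t = levin_search(candidates, lambda o: o == "right")
print(out, t)  # ('right', 512.0): found once t * 0.25 >= 100
```

Here the solution s needs $t(s) = 100$ steps and has prior $P(s) = 0.25$, so its Conceptual Jump Size is $t(s)/P(s) = 400$; the search succeeds in the first phase with $t \geq 400$, and the total work across all doubled phases stays within the $2 \cdot \text{CJS}$ bound quoted above.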
