AI learns to solve a Rubik’s Cube in 1.2 seconds

DeepCubeA uses a neural network (which apes how the human mind processes information) along with machine learning techniques, in which an AI system learns by detecting patterns and theorizing with little human input. It adopts a reinforcement learning approach, by which it learned “how to solve increasingly difficult states in reverse from the goal state without any specific domain knowledge.”

Endgadget

ABSTRACT

 

Recently, Approximate Policy Iteration (API) algorithms have achieved super- human proficiency in two-player zero-sum games such as Go, Chess, and Shogi without human data. These API algorithms iterate between two policies: a slow policy (tree search), and a fast policy (a neural network). In these two-player games, a reward is always received at the end of the game. However, the Rubik’s Cube has only a single solved state, and episodes are not guaranteed to terminate. This poses a major problem for these API algorithms since they rely on the reward received at the end of the game. We introduce Autodidactic Iteration: an API algorithm that overcomes the problem of sparse rewards by training on a distribution of states that allows the reward to propagate from the goal state to states farther away. Autodi- dactic Iteration is able to learn how to solve the Rubik’s Cube without relying on human data. Our algorithm is able to solve 100% of randomly scrambled cubes while achieving a median solve length of 30 moves — less than or equal to solvers that employ human domain knowledge.

 

SOLVING THE RUBIK’S CUBE WITH APPROXIMATE POLICY ITERATION

US Stock Market Interaction Network as Learned by the Boltzmann Machine

Untitled

Price formation on a financial market is a complex problem: It reflects opinion of investors about true value of the asset in question, policies of the producers, external regulation and many other factors. Given the big number of factors influencing price, many of which unknown to us, describing price formation essentially requires probabilistic approaches. In the last decades, synergy of methods from various scientific areas has opened new horizons in understanding the mechanisms that underlie related problems. One of the popular approaches is to consider a financial market as a complex system, where not only a great number of constituents plays crucial role but also non-trivial interaction properties between them. For example, related interdisciplinary studies of complex financial systems have revealed their enhanced sensitivity to fluctuations and external factors near critical events with overall change of internal structure. This can be complemented by the research devoted to equilibrium and non-equilibrium phase transitions.

In general, statistical modeling of the state space of a complex system requires writing down the probability distribution over this space using real data. In a simple version of modeling, the probability of an observable configuration (state of a system) described by a vector of variables s can be given in the exponential form

p(s) = Z−1 exp {−βH(s)} —– (1)

where H is the Hamiltonian of a system, β is inverse temperature (further β ≡ 1 is assumed) and Z is a statistical sum. Physical meaning of the model’s components depends on the context and, for instance, in the case of financial systems, s can represent a vector of stock returns and H can be interpreted as the inverse utility function. Generally, H has parameters defined by its series expansion in s. Basing on the maximum entropy principle, expansion up to the quadratic terms is usually used, leading to the pairwise interaction models. In the equilibrium case, the Hamiltonian has form

H(s) = −hTs − sTJs —– (2)

where h is a vector of size N of external fields and J is a symmetric N × N matrix of couplings (T denotes transpose). The energy-based models represented by (1) play essential role not only in statistical physics but also in neuroscience (models of neural networks) and machine learning (generative models, also known as Boltzmann machines). Given topological similarities between neural and financial networks, these systems can be considered as examples of complex adaptive systems, which are characterized by the adaptation ability to changing environment, trying to stay in equilibrium with it. From this point of view, market structural properties, e.g. clustering and networks, play important role for modeling of the distribution of stock prices. Adaptation (or learning) in these systems implies change of the parameters of H as financial and economic systems evolve. Using statistical inference for the model’s parameters, the main goal is to have a model capable of reproducing the same statistical observables given time series for a particular historical period. In the pairwise case, the objective is to have

⟨sidata = ⟨simodel —– (3a)

⟨sisjdata = ⟨sisjmodel —– (3b)

where angular brackets denote statistical averaging over time. Having specified general mathematical model, one can also discuss similarities between financial and infinite- range magnetic systems in terms of phenomena related, e.g. extensivity, order parameters and phase transitions, etc. These features can be captured even in the simplified case, when si is a binary variable taking only two discrete values. Effect of the mapping to a binarized system, when the values si = +1 and si = −1 correspond to profit and loss respectively. In this case, diagonal elements of the coupling matrix, Jii, are zero because s2i = 1 terms do not contribute to the Hamiltonian….

US stock market interaction network as learned by the Boltzmann Machine

Music Composition Using Long Short-Term Memory (LSTM) Recurrent Neural Networks

LSTM

The most straight-forward way to compose music with an Recurrent Neural Network (RNN) is to use the network as single-step predictor. The network learns to predict notes at time t + 1 using notes at time t as inputs. After learning has been stopped the network can be seeded with initial input values – perhaps from training data – and can then generate novel compositions by using its own outputs to generate subsequent inputs. This note-by-note approach was first examined by Todd.

A feed-forward network would have no chance of composing music in this fashion. Lacking the ability to store any information about the past, such a network would be unable to keep track of where it is in a song. In principle an RNN does not suffer from this limitation. With recurrent connections it can use hidden layer activations as memory and thus is capable of exhibiting (seemingly arbitrary) temporal dynamics. In practice, however, RNNs do not perform very well at this task. As Mozer aptly wrote about his attempts to compose music with RNNs,

While the local contours made sense, the pieces were not musically coherent, lacking thematic structure and having minimal phrase structure and rhythmic organization.

The reason for this failure is likely linked to the problem of vanishing gradients (Hochreiter et al.) in RNNs. In gradient methods such as Back-Propagation Through Time (BPTT) (Williams and Zipser) and Real-Time Recurrent Learning (RTRL) error flow either vanishes quickly or explodes exponentially, making it impossible for the networks to deal correctly with long-term dependencies. In the case of music, long-term dependencies are at the heart of what defines a particular style, with events spanning several notes or even many bars contributing to the formation of metrical and phrasal structure. The clearest example of these dependencies are chord changes. In a musical form like early rock-and-roll music for example, the same chord can be held for four bars or more. Even if melodies are constrained to contain notes no shorter than an eighth note, a network must regularly and reliably bridge time spans of 32 events or more.

The most relevant previous research is that of Mozer, who did note-by-note composition of single-voice melodies accompanied by chords. In the “CONCERT” model, Mozer used sophisticated RNN procedures including BPTT, log-likelihood objective functions and probabilistic interpretation of the output values. In addition to these neural network methods, Mozer employed a psychologically-realistic distributed input encoding (Shepard) that gave the network an inductive bias towards chromatically and harmonically related notes. He used a second encoding method to generate distributed representations of chords.

Untitled

 

A BPTT-trained RNN does a poor job of learning long-term dependencies. To offset this, Mozer used a distributed encoding of duration that allowed him to process a note of any duration in a single network timestep. By representing in a single timestep a note rather than a slice of time, the number of time steps to be bridged by the network in learning global structure is greatly reduced. For example, to allow sixteenth notes in a network which encodes slices of time directly requires that a whole note span at minimum 16 time steps. Though networks regularly outperformed third-order transition table approaches, they failed in all cases to find global structure. In analyzing this performance Mozer suggests that, for the note-by-note method to work it is necessary that the network can induce structure at multiple levels. A First Look at Music Composition using LSTM Recurrent Neural Networks

Purely Random Correlations of the Matrix, or Studying Noise in Neural Networks

2-Figure1-1

Expressed in the most general form, in essentially all the cases of practical interest, the n × n matrices W used to describe the complex system are by construction designed as

W = XYT —– (1)

where X and Y denote the rectangular n × m matrices. Such, for instance, are the correlation matrices whose standard form corresponds to Y = X. In this case one thinks of n observations or cases, each represented by a m dimensional row vector xi (yi), (i = 1, …, n), and typically m is larger than n. In the limit of purely random correlations the matrix W is then said to be a Wishart matrix. The resulting density ρW(λ) of eigenvalues is here known analytically, with the limits (λmin ≤ λ ≤ λmax) prescribed by

λmaxmin = 1+1/Q±2 1/Q and Q = m/n ≥ 1.

The variance of the elements of xi is here assumed unity.

The more general case, of X and Y different, results in asymmetric correlation matrices with complex eigenvalues λ. In this more general case a limiting distribution corresponding to purely random correlations seems not to be yet known analytically as a function of m/n. It indicates however that in the case of no correlations, quite generically, one may expect a largely uniform distribution of λ bound in an ellipse on the complex plane.

Further examples of matrices of similar structure, of great interest from the point of view of complexity, include the Hamiltonian matrices of strongly interacting quantum many body systems such as atomic nuclei. This holds true on the level of bound states where the problem is described by the Hermitian matrices, as well as for excitations embedded in the continuum. This later case can be formulated in terms of an open quantum system, which is represented by a complex non-Hermitian Hamiltonian matrix. Several neural network models also belong to this category of matrix structure. In this domain the reference is provided by the Gaussian (orthogonal, unitary, symplectic) ensembles of random matrices with the semi-circle law for the eigenvalue distribution. For the irreversible processes there exists their complex version with a special case, the so-called scattering ensemble, which accounts for S-matrix unitarity.

As it has already been expressed above, several variants of ensembles of the random matrices provide an appropriate and natural reference for quantifying various characteristics of complexity. The bulk of such characteristics is expected to be consistent with Random Matrix Theory (RMT), and in fact there exists strong evidence that it is. Once this is established, even more interesting are however deviations, especially those signaling emergence of synchronous or coherent patterns, i.e., the effects connected with the reduction of dimensionality. In the matrix terminology such patterns can thus be associated with a significantly reduced rank k (thus k ≪ n) of a leading component of W. A satisfactory structure of the matrix that would allow some coexistence of chaos or noise and of collectivity thus reads:

W = Wr + Wc —– (2)

Of course, in the absence of Wr, the second term (Wc) of W generates k nonzero eigenvalues, and all the remaining ones (n − k) constitute the zero modes. When Wr enters as a noise (random like matrix) correction, a trace of the above effect is expected to remain, i.e., k large eigenvalues and the bulk composed of n − k small eigenvalues whose distribution and fluctuations are consistent with an appropriate version of random matrix ensemble. One likely mechanism that may lead to such a segregation of eigenspectra is that m in eq. (1) is significantly smaller than n, or that the number of large components makes it effectively small on the level of large entries w of W. Such an effective reduction of m (M = meff) is then expressed by the following distribution P(w) of the large off-diagonal matrix elements in the case they are still generated by the random like processes

P(w) = (|w|(M-1)/2K(M-1)/2(|w|))/(2(M-1)/2Γ(M/2)√π) —– (3)

where K stands for the modified Bessel function. Asymptotically, for large w, this leads to P(w) ∼ e(−|w|) |w|M/2−1, and thus reflects an enhanced probability of appearence of a few large off-diagonal matrix elements as compared to a Gaussian distribution. As consistent with the central limit theorem the distribution quickly converges to a Gaussian with increasing M.

Based on several examples of natural complex dynamical systems, like the strongly interacting Fermi systems, the human brain and the financial markets, one could systematize evidence that such effects are indeed common to all the phenomena that intuitively can be qualified as complex.

Distributed Representation Revisited

Figure-132-The-distributed-representation-of-language-meaning-in-neural-networks

If the conventional symbolic model mandates a creation of theory that is sought to address the issues pertaining to the problem, this mandatory theory construction is bypassed in case of distributed representational systems, since the latter is characterized by a large number of interactions occurring in a nonlinear fashion. No such attempts at theoretical construction are to be made in distributed representational systems for fear of high end abstraction, thereby sucking off the nutrient that is the hallmark of the model. Distributed representation is likely to encounter onerous issues if the size of the network inflates, but the issue is addressed through what is commonly known as redundancy technique, whereby, a simultaneous encoding of information generated by numerous interactions take place, thus ameliorating the adequacy of presenting the information to the network. In the words of Paul Cilliers, this is an important point, for,

the network used for the model of a complex system will have to have the same level of complexity as the system itself….However, if the system is truly complex, a network of equal complexity may be the simplest adequate model of such a system, which means that it would be just as difficult to analyze as the system itself.

Following, he also presents a caveat,

This has serious methodological implications for the scientists working with complex systems. A model which reduces the complexity may be easier to implement, and may even provide a number of economical descriptions of the system, but the price paid for this should be considered carefully.

One of the outstanding qualities of distributed representational systems is their adaptability. Adaptability, in the sense of reusing the network to be applicable to other problems to offer solutions. Exactly, what this connotes is, the learning process the network has undergone for a problem ‘A’, could be shared for problem ‘B’, since many of the input neurons are bounded by information learned through ‘A’ that could be applicable to ‘B’. In other words, the weights are the dictators for solving or resolving issues, no matter, when and for which problem the learning took place. There is a slight hitch here, and that being this quality of generalizing solutions could suffer, if the level of abstraction starts to shoot up. This itself could be arrested, if in the initial stages, the right kind of framework is decided upon, thus obscuring the hitch to almost non-affective and non-existence impacting factor. The very notion of weights is considered here by Sterelny as a problematic, and he takes it to attack distributed representation in general and connectionsim as a whole in particular. In an analogically witty paragraph, Sterelny says,

There is no distinction drawable, even in principle, between functional and non- functional connections. A positive linkage between two nodes in a distributed network might mean a constitutive link (eg. Catlike, in a network for tiger); a nomic one (carnivore, in the same network), or a merely associative one (in my case, a particular football team that play in black and orange.

It should be noted that this criticism on weights is derived, since for Sterelny, relationship between distributed representations and the micro-features that compose them is deeply problematic. If such is the criticism, then no doubt, Sterelny still seems to be ensconced within the conventional semantic/symbolic model. And since, all weights can take part in information processing, there is some sort of a democratic liberty that is accorded to the weights within a distributed representation, and hence any talk of constitutive, nomic, or even for that matter associative is mere humbug. Even if there is a disagreement prevailing that a large pattern of weights are not convincing enough for an explanation, as they tend to complicate matters, the distributed representational systems work consistently enough as compared to an alternative system that offers explanation through reasoning, and thereby, it is quite foolhardy to jettison the distributed representation by the sheer force of criticism. If the neural network can be adapted to produce the correct answer for a number of training cases that is large compared with the size of the network, it can be trusted to respond correctly to the previously unseen cases provided they are drawn from the same population using the same distribution as the training cases, thus undermining the commonly held idea that explanations are the necessary feature of the trustworthy systems (Baum and Haussler). Another objection that distributed representation faces is that, if representations are distributed, then the probability of two representations of the same thing as different from one another cannot be ruled out. So, one of them is the true representation, while the other is only an approximation of the representation.(1) This is a criticism of merit and is attributed to Fodor, in his influential book titled Psychosemantics.(2) For, if there is only one representation, Fodor would not shy from saying that this is the yucky solution, folks project believe in. But, since connectionism believes in the plausibility of indeterminate representations, the question of flexibility scores well and high over the conventional semantic/symbolic models, and is it not common sense to encounter flexibility in daily lives? The other response to this objection comes from post-structuralist theories (Baudrillard is quite important here. See the first footnote below). The objection of true representation, and which is a copy of the true representation meets its pharmacy in post-structuralism, where meaning is constituted by synchronic as well as diachronic contextualities, and thereby supplementing the distributed representation with a no-need-for concept and context, as they are inherent in the idea of such a representation itself. Sterelny, still seems to ride on his obstinacy, and in a vitriolic tone poses his demand to know as to why distributed representation should be regarded as states of the system at all. Moreover, he says,

It is not clear that a distributed representation is a representation for the connectionist system at all…given that the influence of node on node is local, given that there is no processor that looks at groups of nodes as a whole, it seems that seeing a distributed representation in a network is just an outsider’s perspective on the system.

This is moving around in circles, if nothing more. Or maybe, he was anticipating what G. F. Marcus would write and echo to some extent in his book The Algebraic Mind. In the words of Marcus,

…I agree with Stemberger(3) that connectionism can make a valuable contribution to cognitive science. The only place, we differ is that, first, he thinks that the contribution will be made by providing a way of eliminating symbols, whereas I think that connectionism will make its greatest contribution by accepting the importance of symbols, seeking ways of supplementing symbolic theories and seeking ways of explaining how symbols could be implemented in the brain. Second, Stemberger feels that symbols may play no role in cognition; I think that they do.

Whatever Sterelny claims, after most of the claims and counter-claims that have been taken into account, the only conclusion for the time being is that distributive representation has been undermined, his adamant position to be notwithstanding.

(1) This notion finds its parallel in Baudrillard’s Simulation. And subsequently, the notion would be invoked in studying the parallel nature. Of special interest is the order of simulacra in the period of post-modernity, where the simulacrum precedes the original, and the distinction between reality and representation vanishes. There is only the simulacrum and the originality becomes a totally meaningless concept.

(2) This book is known for putting folk psychology firmly on the theoretical ground by rejecting any external, holist and existential threat to its position.

(3) Joseph Paul Stemberger is a professor in the Department of Linguistics at The University of British Columbia in Vancouver, British Columbia, Canada, with primary interests in phonology, morphology, and their interactions. My theoretical orientations are towards Optimality Theory, employing our own version of the theory, and towards connectionist models.

 

Binary, Ternary Connect, Neural N/W Deep Learning & Eliminating Multiplications in Forward and Backward Pass

Consider a neural network layer with N input and M output units. The forward computation is y = h(W x + b) where W and b are weights and biases, respectively, h is the activation function, and x and y are the layer’s inputs and outputs. If we choose ReLU, or Rectified Linear Unit/Ramp Function as h, there will be no multiplications in computing the activation function, thus all multiplications reside in the matrix product W x. For each input vector x, N M floating point multiplications are needed.

relufamily

Binary connect eliminates these multiplications by stochastically sampling weights to be −1 or 1. Full resolution weights w ̄ are kept in memory as reference, and each time when y is needed, we sample a stochastic weight matrix W according to w ̄. For each element of the sampled matrix W, the probability of getting a 1 is proportional to how “close” its corresponding entry in w ̄ is to 1. i.e.,

P(Wij = 1) = (w ̄ij+ 1)/2;

P(Wij = −1) = 1 − P(Wij = 1)

It is necessary to add some edge constraints to w ̄. To ensure that P(Wij = 1) lies in a reasonable range, values in w ̄ are forced to be a real value in the interval [-1, 1]. If during the updates any of its value grows beyond that interval, we set it to be its corresponding edge values −1 or 1. That way floating point multiplications become sign changes.

A remaining question concerns the use of multiplications in the random number generator involved in the sampling process. Sampling an integer has to be faster than multiplication for the algorithm to be worth it.

Moving on from binary to ternary connect, whereas in the former weights are allowed to be −1 or 1, in a trained neural network, it is common to observe that many learned weights are zero or close to zero. Although the stochastic sampling process would allow the mean value of sampled weights to be zero, this suggests that it may be beneficial to explicitly allow weights to be zero.

To allow weights to be zero, split the interval of [-1, 1], within which the full resolution weight value w ̄ lies, into two sub-intervals: [−1, 0] and (0, 1]. If a weight value w ̄ij drops into one of them, we sample w ̄ij to be the two edge values of that interval,

according to their distance from w ̄ij , i.e., if w ̄ij > 0:

P(Wij =1)= w ̄ij; P(Wij = 0) = 1−w ̄ij

and if

w ̄ij <=0:

P(Wij = −1) = −w ̄ij; P(Wij = 0) = 1 + w ̄ij

Like binary connect, ternary connect also eliminates all multiplications in the forward pass.

We move from the forward to the backward pass. Suppose the i-th layer of the network has N input and M output units, and consider an error signal δ propagating downward from its output. The updates for weights and biases would be the outer product of the layer’s input and the error signal:

∆W = ηδ◦h′ (W x + b) x

∆b = ηδ◦h (W x + b)

where η is the learning rate, and x the input to the layer. While propagating through the layers, the error signal δ needs to be updated, too. Its update taking into account the next layer below takes the form:

δ = WTδ◦h′ (W x + b)

Three terms appear repeatedly in the above three equations, viz. δ, h (W x + b) and x. The latter two terms introduce matrix outer products. To eliminate multiplications, one can quantize one of them to be an integer power of 2, so that multiplications involving that term become binary shifts. The expression h′ (W x + b) contains down flowing gradients, which are largely determined by the cost function and network parameters, thus it is hard to bound its values. However, bounding the values is essential for quantization because we need to supply a fixed number of bits for each sampled value, and if that value varies too much, we will need too many bits for the exponent. This, in turn, will result in the need for more bits to store the sampled value and unnecessarily increase the required amount of computation.

While h′ (W x + b) is not a good choice for quantization, x is a better choice, because it is the hidden representation at each layer, and we know roughly the distribution of each layer’s activation.

The approach is therefore to eliminate multiplications in

∆W = ηδ◦h′ (W x + b) x

by quantizing each entry in x to an integer power of 2. That way the outer product in

∆W = ηδ◦h′ (W x + b) x becomes a series of bit shifts. Experimentally, it is discovered that allowing a maximum of 3 to 4 bits of shift is sufficient to make the network work well. This means that 3 bits are already enough to quantize x. As the float 32 format has 24 bits of mantissa, shifting (to the left or right) by 3 to 4 bits is completely tolerable. This approach is referred to as “quantized back propagation”.

If we choose ReLU as the activation function and use binary (ternary) connect to sample W, computing the term h’ (W x + b) involves no multiplications at all. In addition, quantized back propagation eliminates the multiplications in the outer product in

∆W = ηδ◦h′ (W x + b) xT.

The only place where multiplications remain is the element-wise product. From

∆W = ηδ◦h′ (W x + b) xT, ∆b = ηδ◦h (W x + b), and  δ = WTδ◦h′ (W x + b), one can see that 6 × M multiplications are needed for all computations. Like in the forward pass, most of the multiplications are used in the weight updates. Compared with standard back propagation, which would need 2MN + 6M multiplications, the amount of multiplications left is negligible in quantized back propagation.

Reality as Contingently Generating the Actual

f1-large

If reality could be copied, the mapping or simulation of a neural network digitally is possible. This scheme has some nagging problems. Chief among them being the simulated nature of neural network, as the possibility of reality vis-a-vis natural neural network is susceptible to mismatch thus leading to what could be termed non-reductionist essentialism. The other option that could aid better apprehensibility of hyperrealism and simulation in terms of neural networks is from Bhaskar Roy’s idea of critical realism. This aspect differs significantly from Baudrillard’s in that the latter takes reality as potentially open to copying, whereas the former delves into reality as a generative mechanism. According to Bhaskar’s take, reality is not something that can be copied, but something that contingently generates the actual.