The most straightforward way to compose music with a Recurrent Neural Network (RNN) is to use the network as a single-step predictor. The network learns to predict notes at time t + 1 using notes at time t as inputs. Once training has stopped, the network can be seeded with initial input values, perhaps taken from the training data, and can then generate novel compositions by feeding its own outputs back in as subsequent inputs. This note-by-note approach was first examined by Todd.
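The seed-then-feed-back loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure of any cited author: the network is a tiny untrained Elman-style RNN with random weights, notes are one-hot indices over a hypothetical 12-note vocabulary, and the next note is chosen greedily. All names (`rnn_step`, `generate`, `NUM_NOTES`, `HIDDEN_SIZE`) are invented for the example.

```python
import math
import random

NUM_NOTES = 12    # hypothetical vocabulary: one output unit per pitch class
HIDDEN_SIZE = 8   # hypothetical hidden layer size


def make_weights(rows, cols, rng):
    """Random weight matrix; a trained network would supply learned values."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def rnn_step(note, hidden, w_in, w_rec, w_out):
    """Consume the note at time t and return (scores for t + 1, new hidden state)."""
    new_hidden = []
    for j in range(len(hidden)):
        # The input is one-hot, so its contribution is a single weight lookup.
        total = w_in[j][note]
        total += sum(w_rec[j][k] * hidden[k] for k in range(len(hidden)))
        new_hidden.append(math.tanh(total))
    scores = [
        sum(w_out[n][j] * new_hidden[j] for j in range(len(new_hidden)))
        for n in range(len(w_out))
    ]
    return scores, new_hidden


def generate(seed_notes, length, w_in, w_rec, w_out):
    """Prime the network with seed notes, then feed its own outputs back in."""
    if not seed_notes:
        raise ValueError("at least one seed note is required")
    hidden = [0.0] * len(w_rec)
    for note in seed_notes:
        scores, hidden = rnn_step(note, hidden, w_in, w_rec, w_out)
    melody = list(seed_notes)
    for _ in range(length):
        # Greedy choice: the highest-scoring note becomes the next input.
        next_note = max(range(len(scores)), key=scores.__getitem__)
        melody.append(next_note)
        scores, hidden = rnn_step(next_note, hidden, w_in, w_rec, w_out)
    return melody


rng = random.Random(0)
w_in = make_weights(HIDDEN_SIZE, NUM_NOTES, rng)
w_rec = make_weights(HIDDEN_SIZE, HIDDEN_SIZE, rng)
w_out = make_weights(NUM_NOTES, HIDDEN_SIZE, rng)
melody = generate([0, 4, 7], 16, w_in, w_rec, w_out)
```

Because the weights here are random, the output is musically meaningless; the point is only the control flow, in which the hidden state carries over between steps and each predicted note becomes the next input.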
A feed-forward network would have no chance of composing music in this fashion. Lacking the ability to store any information about the past, such a network would be unable to keep track of where it is in a song. In principle an RNN does not suffer from this limitation. With recurrent connections it can use hidden layer activations as memory and thus is capable of exhibiting (seemingly arbitrary) temporal dynamics. In practice, however, RNNs do not perform very well at this task. As Mozer aptly wrote about his attempts to compose music with RNNs,
While the local contours made sense, the pieces were not musically coherent, lacking thematic structure and having minimal phrase structure and rhythmic organization.
The reason for this failure is likely linked to the problem of vanishing gradients (Hochreiter et al.) in RNNs. In gradient methods such as Back-Propagation Through Time (BPTT) (Williams and Zipser) and Real-Time Recurrent Learning (RTRL), error flow either vanishes quickly or explodes exponentially, making it impossible for the networks to deal correctly with long-term dependencies. In the case of music, long-term dependencies are at the heart of what defines a particular style, with events spanning several notes or even many bars contributing to the formation of metrical and phrasal structure. Chord changes are the clearest example of these dependencies. In a musical form like early rock-and-roll, for example, the same chord can be held for four bars or more. Even if melodies are constrained to contain notes no shorter than an eighth note, a network must regularly and reliably bridge time spans of 32 events or more.
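To give a feel for the scale of the problem, the following sketch tracks how an error signal is scaled when it is propagated back through 32 steps of a one-unit recurrence. It is a toy illustration, not an analysis from the cited work: the scalar recurrence h_t = f(w h_{t-1}), the weights 0.5 and 1.5, and the function name `gradient_through_time` are all assumptions made for the example. The figure of 32 events is obtained as four bars of eight eighth notes each, which assumes 4/4 time.

```python
import math


def gradient_through_time(weight, steps, h0=0.5, linear=False):
    """Return d h_T / d h_0 for the scalar recurrence h_t = f(weight * h_{t-1}).

    f is tanh by default, or the identity when linear=True.
    """
    h = h0
    grad = 1.0
    for _ in range(steps):
        pre = weight * h
        if linear:
            h = pre
            local = weight                  # f(x) = x, so f'(x) = 1
        else:
            h = math.tanh(pre)
            local = weight * (1.0 - h * h)  # tanh'(x) = 1 - tanh(x)^2
        # Chain rule: the total gradient is the product of per-step factors.
        grad *= local
    return grad


EVENTS_PER_BAR = 8   # eighth notes per bar, assuming 4/4 time
BARS_PER_CHORD = 4   # a chord held for four bars
span = EVENTS_PER_BAR * BARS_PER_CHORD  # 32 events to bridge

vanishing = gradient_through_time(0.5, span)               # every factor is at most 0.5
exploding = gradient_through_time(1.5, span, linear=True)  # every factor is 1.5
print(f"span={span} vanishing={vanishing:.3e} exploding={exploding:.3e}")
```

With per-step factors below one the product shrinks geometrically, and with factors above one it grows geometrically, so over a 32-event span the error signal is either negligible or enormous in this toy setting.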
The most relevant previous research is that of Mozer, who performed note-by-note composition of single-voice melodies accompanied by chords. In the “CONCERT” model, Mozer used sophisticated RNN procedures including BPTT, log-likelihood objective functions and probabilistic interpretation of the output values. In addition to these neural network methods, Mozer employed a psychologically realistic distributed input encoding (Shepard) that gave the network an inductive bias towards chromatically and harmonically related notes. He used a second encoding method to generate distributed representations of chords.
A BPTT-trained RNN does a poor job of learning long-term dependencies. To offset this, Mozer used a distributed encoding of duration that allowed him to process a note of any duration in a single network time step. By representing a note, rather than a slice of time, in a single time step, the number of time steps to be bridged by the network in learning global structure is greatly reduced. For example, to allow sixteenth notes in a network which encodes slices of time directly requires that a whole note span at minimum 16 time steps. Though Mozer's networks regularly outperformed third-order transition table approaches, they failed in all cases to find global structure. In analyzing this performance, Mozer suggests that for the note-by-note method to work, the network must be able to induce structure at multiple levels.
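The difference in sequence length between the two representations discussed above can be made concrete with a small counting example. The sketch only counts network time steps; it does not reproduce Mozer's distributed duration encoding. The melody, the pitch names, and the helper names `to_time_slices` and `note_steps` are hypothetical.

```python
SLICES_PER_WHOLE_NOTE = 16  # sixteenth-note resolution

# Hypothetical melody: (pitch name, duration in sixteenth notes).
melody = [("C4", 16), ("E4", 8), ("G4", 4), ("C5", 4)]


def to_time_slices(notes):
    """Time-slice encoding: repeat each pitch once per sixteenth-note slice."""
    slices = []
    for pitch, duration in notes:
        slices.extend([pitch] * duration)
    return slices


def note_steps(notes):
    """Note-based encoding: one time step per note, whatever its duration."""
    return len(notes)


whole_note_steps = len(to_time_slices([("C4", SLICES_PER_WHOLE_NOTE)]))  # 16
slice_steps = len(to_time_slices(melody))  # 16 + 8 + 4 + 4 = 32
event_steps = note_steps(melody)           # 4
print(whole_note_steps, slice_steps, event_steps)
```

For this four-note melody the time-slice representation requires 32 steps while the note-based representation requires 4, which illustrates why a per-note encoding shortens the spans over which global structure must be learned.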