MidiNet
A Convolutional Generative Adversarial Network for Symbolic-domain Music Generation
By the time MidiNet was published (03/2017), there were already many deep-learning-based music generation models, including WaveNet and MelodyRNN. The majority of these efforts used RNNs and their variants; WaveNet was the only major model built on CNNs. One advantage of CNNs over RNNs is that they are faster to train and more easily parallelizable.
Another major deep learning breakthrough of that period was the GAN (Generative Adversarial Network). GANs provide the kind of creativity needed for music generation. MidiNet also uses a GAN architecture with a CNN-based generator and discriminator. In MidiNet, the generator transforms random noise into a 2-D score-like representation that "appears" to come from real MIDI, while the discriminator takes this 2-D score-like representation and predicts whether it is real or generated.
The authors propose a novel conditional mechanism that uses music from the previous bars to condition the generation of the present bar. A separate trainable CNN, the Conditioner, feeds information about the previous bars into intermediate layers of the generator CNN. This gives the model a way to use past information without the recurrent units used in RNNs.
In general, the Conditioner can encode any prior knowledge that people have about the music to be generated.
The idea of feature matching was proposed by Salimans et al. (2016). Instead of optimizing only against the output layer of D, the generator G is trained to match the intermediate-layer activations of D on real versus generated data; that is, we minimize the difference between the features D extracts from real samples and from generated samples.
Feature matching, together with the Conditioner CNN, gives us some manual control over the generation: for example, we can make the generated music follow a chord progression or a priming melody.
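As a rough sketch (my own illustration, not the authors' code), the feature-matching term can be computed from the discriminator's intermediate activations on real and generated batches:

```python
import tensorflow as tf

# Illustrative sketch: feature matching compares the mean
# intermediate-layer activations of D on real vs. generated data.
def feature_matching_loss(d_features_real, d_features_fake):
    mean_real = tf.reduce_mean(d_features_real, axis=0)  # average over the batch
    mean_fake = tf.reduce_mean(d_features_fake, axis=0)
    return tf.reduce_sum(tf.square(mean_real - mean_fake))
```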
In their GitHub repository they reuse (at least part of) the DCGAN source code, so the score-like MIDI representation is treated much like an image.
For a standard GAN, the loss function is the one below, a minimax game played between the discriminator and the generator by alternately optimizing each of them.
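This is the standard minimax objective from the original GAN formulation:

$$
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
$$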
In addition to this typical setting, the model adds regularization terms that push the generated data (including some intermediate-layer activations of D) to stay close to the real data, so the generator loss also includes:
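Concretely, following the paper, the two regularizers are (with f the first convolution layer of D, and λ1, λ2 hyper-parameters, as noted further below):

$$
\lambda_1 \left\lVert \mathbb{E}\,x - \mathbb{E}\,G(z) \right\rVert_2^2 \;+\; \lambda_2 \left\lVert \mathbb{E}\,f(x) - \mathbb{E}\,f(G(z)) \right\rVert_2^2
$$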
GAN-like models often encode prior knowledge as a vector or matrix that is added to some intermediate layer of the generator G and the discriminator D.
The paper illustrates how to use tensor broadcasting to combine tensors (layers and prior knowledge) of different shapes.
For instance, suppose we have additional information of shape n that we want to add to an intermediate layer of shape a-by-b. We can duplicate each value of the shape-n vector a·b times to create a tensor of shape a-by-b-by-n, and then concatenate this tensor with the intermediate layer along the channel dimension. This is called the 1-D conditions case.
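A minimal sketch of this broadcasting (my own illustration, assuming channel-last tensors; the function name is hypothetical):

```python
import tensorflow as tf

def concat_1d_condition(layer, cond):
    """Broadcast a (batch, n) condition vector over a (batch, a, b, c)
    feature map and concatenate it on the channel axis."""
    a, b = layer.shape[1], layer.shape[2]
    cond = tf.reshape(cond, [-1, 1, 1, cond.shape[-1]])  # (batch, 1, 1, n)
    cond = tf.tile(cond, [1, a, b, 1])                   # (batch, a, b, n)
    return tf.concat([layer, cond], axis=-1)             # (batch, a, b, c + n)
```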
If we use the previous bar (or bars) as the condition, the condition has the h-by-w shape of our symbolic input (see the Symbolic Representation for Convolution section). This is the 2-D conditions case, and we need a way to map an h-by-w matrix onto an a-by-b intermediate layer, so a trainable Conditioner CNN does the trick. The conditioner and generator CNNs use exactly the same filter shapes in their convolution layers, so that the outputs of their convolution layers have "compatible" shapes.
The trick is to feed information about the previous bar into every layer of the generator network. We first run a feed-forward pass starting from prev_x to create the tensors h0_prev through h3_prev. We then start from a random vector z and first incorporate the prior knowledge y (equivalently y_b). A transposed convolution over concat(z, y) creates an intermediate generator layer whose first two dimensions match those of h0_prev, the corresponding intermediate layer of the conditioner (the "mirror image" of the generator). This makes it possible to concatenate the two tensors along the channel dimension. We repeat these steps for every layer, which achieves the effect described in the Conditioner section above.
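A condensed sketch of that flow, written in modern TensorFlow/Keras for readability (my own simplification, not the official implementation; the filter counts, layer sizes, and the 13-dimensional chord vector are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

BATCH, W, H = 32, 16, 128                      # 16 time steps, 128 MIDI pitches
prev_x = tf.random.uniform([BATCH, W, H, 1])   # previous bar, piano-roll style
z = tf.random.normal([BATCH, 100])             # random noise vector
y = tf.random.uniform([BATCH, 13])             # 1-D condition, e.g. a chord vector

# --- Conditioner CNN: downsample the previous bar step by step -------------
h0_prev = layers.Conv2D(16, (1, H), strides=(1, 1), padding="valid")(prev_x)   # (B,16,1,16)
h1_prev = layers.Conv2D(16, (2, 1), strides=(2, 1), padding="valid")(h0_prev)  # (B, 8,1,16)
h2_prev = layers.Conv2D(16, (2, 1), strides=(2, 1), padding="valid")(h1_prev)  # (B, 4,1,16)
h3_prev = layers.Conv2D(16, (2, 1), strides=(2, 1), padding="valid")(h2_prev)  # (B, 2,1,16)

def tile_y(t, y):
    """Broadcast the 1-D condition y over the spatial dims of t (1-D conditions)."""
    yb = tf.reshape(y, [-1, 1, 1, y.shape[-1]])
    return tf.concat([t, tf.tile(yb, [1, t.shape[1], t.shape[2], 1])], axis=-1)

def concat_cond(t, cond):
    """Concatenate a 2-D condition (conditioner output) on the channel axis."""
    return tf.concat([t, cond], axis=-1)

# --- Generator: upsample z, injecting y and the conditioner outputs --------
h = layers.Dense(2 * 1 * 128, activation="relu")(tf.concat([z, y], axis=-1))
h = tf.reshape(h, [-1, 2, 1, 128])
h = concat_cond(tile_y(h, y), h3_prev)
h = layers.Conv2DTranspose(64, (2, 1), strides=(2, 1), padding="valid")(h)     # (B, 4,1,64)
h = concat_cond(tile_y(h, y), h2_prev)
h = layers.Conv2DTranspose(64, (2, 1), strides=(2, 1), padding="valid")(h)     # (B, 8,1,64)
h = concat_cond(tile_y(h, y), h1_prev)
h = layers.Conv2DTranspose(64, (2, 1), strides=(2, 1), padding="valid")(h)     # (B,16,1,64)
h = concat_cond(tile_y(h, y), h0_prev)
fake_x = layers.Conv2DTranspose(1, (1, H), strides=(1, 1), padding="valid",
                                activation="sigmoid")(h)                        # (B,16,128,1)
```

The point of the mirror-image filter shapes is visible here: each h*_prev has the same spatial dimensions as the generator layer it is concatenated with, so the only thing that grows is the channel dimension.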
Yang, Li-Chia, Szu-Yu Chou, and Yi-Hsuan Yang. "MidiNet: A convolutional generative adversarial network for symbolic-domain music generation." arXiv preprint arXiv:1703.10847 (2017).
Salimans, Tim, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. "Improved techniques for training GANs." In Advances in Neural Information Processing Systems, pp. 2234-2242. 2016.
The input MIDI files are split into bars. For each bar, we use an h-by-w matrix, where h is the number of MIDI notes that we consider. In their implementation this value is set to 128, which represents all the notes from C0 to G10. However, for training they shift all the melodies into two octaves, C4 to B5 (although the representation still keeps all 128 notes). They claim that doing this makes it easier to detect mode collapse, by checking whether any generated note falls outside C4 to B5. They also did not add an extra dimension for silence, as their training data does not contain any.
w is the number of time steps in a bar. In the implementation it is set to 16, which means the most granular note is a sixteenth note. If there is a pause, the representation simply extends the previous note.
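As a toy illustration (my own sketch, not the authors' preprocessing code), a bar could be rendered into such an h-by-w matrix like this:

```python
import numpy as np

H_NOTES, W_STEPS = 128, 16   # pitches x sixteenth-note steps per bar

def bar_to_matrix(notes):
    """notes: list of (midi_pitch, start_step, end_step) within one bar.
    Returns an h-by-w binary piano-roll matrix for the bar."""
    bar = np.zeros((H_NOTES, W_STEPS), dtype=np.float32)
    for pitch, start, end in notes:
        bar[pitch, start:end] = 1.0
    return bar

# Example: C4 (MIDI 60) for half a bar, then E4 (MIDI 64) for the rest.
example = bar_to_matrix([(60, 0, 8), (64, 8, 16)])
```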
In the regularization terms of the loss, f is the first convolution layer of the discriminator, and λ1 and λ2 are hyper-parameters. By increasing them, we can make the generated music closer to existing music.
The official code provided by the authors can be found on GitHub. The implementation is written in TensorFlow. The code follows the structure of a DCGAN implementation, with the discriminator and generator modified to take the conditioning into consideration.