WaveNet
A Generative Model for Raw Audio
WaveNet is a neural network architecture published by DeepMind in September 2016. The work is inspired by generative models such as PixelCNN and PixelRNN for generating images and text.
The model mainly tackles the task of text-to-speech (TTS), which takes text as input and generates human-like speech. The novelty of WaveNet is that it generates the raw speech signal with a parametric model, in contrast to the then state-of-the-art concatenative TTS, where a very large database of short speech fragments recorded from a single speaker is recombined to form complete utterances.
Such a network architecture can also be applied to music generation, which is the main focus of this book. It can generate both unconditional and conditional music. Of particular interest are conditional music models, which can generate music given a set of tags specifying e.g. genre or instruments.
The main ingredient of WaveNet is the causal convolution (see Figure 1). The word "causal" is used in the sense that the prediction $p(x_{t+1} \mid x_1, \ldots, x_t)$ emitted by the model at timestep $t$ cannot depend on any of the future timesteps $x_{t+1}, x_{t+2}, \ldots, x_T$.
Because models with causal convolutions do not have recurrent connections, they are typically faster to train than RNNs, especially when applied to very long sequences.
There is a concept in neural networks called the receptive field: specific to each node, it is the extent of the input that this node can reach. Take the image below as an example: the top-right output node can reach 5 elements in the input. For modeling high-frequency raw audio (44.1 kHz means 44.1k samples per second) we want a large receptive field to cover a long enough history.
One of the problems with causal convolutions is that they require many layers, or large filters, to increase the receptive field. To solve this, WaveNet uses dilated convolution layers. A dilated convolution is a convolution where the filter is applied over an area larger than its length by skipping input values with a certain step. It is equivalent to a convolution with a larger filter derived from the original filter by dilating it with zeros, but is significantly more efficient.
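To make this concrete, here is a minimal sketch (my own, not from the original paper or the ibab code) of a stack of dilated causal convolutions using tf.keras; the layer sizes and dilations are illustrative only.

```python
import tensorflow as tf

def dilated_causal_stack(x, filters=32, kernel_size=2, dilations=(1, 2, 4, 8)):
    """x has shape (batch, time, channels); returns the same time length."""
    for d in dilations:
        # padding='causal' left-pads so output[t] only depends on input[<= t];
        # dilation_rate=d skips d - 1 samples between the two filter taps.
        x = tf.keras.layers.Conv1D(filters, kernel_size,
                                   dilation_rate=d, padding='causal')(x)
    return x

# Receptive field = 1 + sum((kernel_size - 1) * d) = 1 + 1 + 2 + 4 + 8 = 16
# samples for the settings above; doubling the dilation per layer grows it
# exponentially with depth.
```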
Normally one would apply ReLU as the activation function to the output of a convolution layer.
As shown in the figure below, suppose there are $k$ layers, each building on top of the previous one; in Figure 2 there are 5 layers. For each layer, the input is the output of the previous layer, and a dilated convolution is applied to it to produce $z$. The result then splits into two paths. On one path, a 1x1 convolution is applied to $z$ and an element-wise sum with the layer input produces the output of this layer, which is also the input of the next layer (the residual connection). On the other path, a different 1x1 convolution is applied to $z$; these outputs are summed across all layers (the skip connections) and then passed through ReLU -> 1x1 convolution -> ReLU -> 1x1 convolution -> softmax, as shown in the right part of Figure 2, to form the final output.
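A minimal sketch of one such layer, using tf.keras with illustrative layer sizes (this is my paraphrase, not the original code; the real model uses the gated activation described later instead of a plain tanh):

```python
import tensorflow as tf

def residual_block(x, dilation, filters=32, kernel_size=2):
    """One layer as described above; returns (next_layer_input, skip_output).
    Assumes x already has `filters` channels so the residual sum is valid."""
    # Dilated causal convolution produces z (plain tanh keeps the sketch short).
    z = tf.keras.layers.Conv1D(filters, kernel_size, dilation_rate=dilation,
                               padding='causal', activation='tanh')(x)
    # Path 1: 1x1 convolution, then element-wise sum with the layer input
    # (the residual connection) -> input of the next layer.
    residual = tf.keras.layers.Conv1D(filters, 1)(z) + x
    # Path 2: a different 1x1 convolution gives the skip contribution that
    # is summed across all layers before the output head.
    skip = tf.keras.layers.Conv1D(filters, 1)(z)
    return residual, skip
```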
WaveNet can model additional input besides the raw audio. Given an extra input $\mathbf{h}$, the conditional distribution now becomes $p(\mathbf{x} \mid \mathbf{h}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1}, \mathbf{h})$.
There are two types of conditioning, global and local, which are described below.
Compared to the original paper, the small differences are:
Local Conditioning is not implemented
Separate 1x1 convolution layers for the skip connection and the output
There are also some tricks:
Silence cropping on the audio files
Multi-threading on I/O
Smart way to do dilated causal convolution.
One-hot encoding on the sample input
Next, an optional one-hot encoding is applied to each quantized value in [0, 256), so that the input is reshaped to (batch_size, ?, 256) for the network.
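As a small illustration (assuming the quantized samples are already integers in [0, 256)), the one-hot step can be done with tf.one_hot:

```python
import tensorflow as tf

# Toy example: quantized samples of shape (batch_size, length, 1) become
# (batch_size, length, 256) after one-hot encoding.
quantized = tf.constant([[[3], [250], [128]]], dtype=tf.int32)      # (1, 3, 1)
one_hot = tf.one_hot(tf.squeeze(quantized, axis=-1), depth=256)     # (1, 3, 256)
```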
Let's ignore the global conditioning and focus on the core part of WaveNet.
This implementation follows the network defined in Figure 3. Let's start with line 14: the output is appended to outputs and current_layer is reassigned to the value returned by self._create_dilation_layer. The output here, which corresponds to the skip connection in Figure 3, is summed with the others at line 34 and passed through ReLU -> 1x1 conv -> ReLU -> 1x1 conv.
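The overall shape of that loop, paraphrased (this is not the exact ibab code; it reuses the residual_block sketch from the Figure 2 discussion and adds an assumed initial causal convolution to project the one-hot input to the residual channel width):

```python
import tensorflow as tf

def build_wavenet_outputs(input_batch, dilations=(1, 2, 4, 8),
                          filters=32, classes=256):
    """input_batch: (batch_size, time, 256) one-hot audio (a sketch only)."""
    current = tf.keras.layers.Conv1D(filters, 2, padding='causal')(input_batch)
    skips = []
    for dilation in dilations:
        # Each layer returns the input of the next layer and a skip output.
        current, skip = residual_block(current, dilation, filters=filters)
        skips.append(skip)
    total = tf.add_n(skips)                                     # sum of skip connections
    h = tf.keras.layers.Conv1D(filters, 1)(tf.nn.relu(total))   # ReLU -> 1x1 conv
    logits = tf.keras.layers.Conv1D(classes, 1)(tf.nn.relu(h))  # ReLU -> 1x1 conv
    return tf.nn.softmax(logits)                                # 256-way softmax
```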
Let's look further at how self._create_dilation_layer works. This is the left part of Figure 3.
Note that this function has two outputs: one is the skip_output, which will be summed across the different layers at the end; the other is the dense output (with input_batch added as the residual), which is the input of the next layer. Because the dense and skip 1x1 convolution filters are set to a fixed number of output channels, every layer produces outputs of the same dimensions, so they can be added up.
The magic part of this function is causal_conv. To understand what it does, let's look into the code
We can see that causal_conv is composed of three parts: time_to_batch, tf.nn.conv1d, and batch_to_time. Let me explain them one by one.
batch_to_time is the inverse of time_to_batch.
The function first calls time_to_batch to prepare the input for convolution, then calls tf.nn.conv1d to perform the convolution, and finally calls batch_to_time to bring the first dimension back to batch_size.
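To see why this reshaping trick works, here is a small NumPy illustration of the idea (not the original helpers; the real time_to_batch also pads when the length is not a multiple of the dilation):

```python
import numpy as np

# With dilation d, samples that are d steps apart in time end up next to
# each other, so an ordinary convolution over the reshaped tensor acts
# like a dilated convolution over the original one.
batch, time, channels, dilation = 1, 8, 1, 4
x = np.arange(time).reshape(batch, time, channels)        # samples 0..7

# (batch, time, ch) -> (batch, time/d, d, ch) -> (d*batch, time/d, ch)
t2b = x.reshape(batch, time // dilation, dilation, channels) \
       .transpose(2, 0, 1, 3) \
       .reshape(dilation * batch, time // dilation, channels)
print(t2b[..., 0])
# [[0 4]      row 0 holds samples 0 and 4 (4 steps apart), row 1 holds 1 and 5, ...
#  [1 5]
#  [2 6]
#  [3 7]]

# batch_to_time is the inverse reshape, restoring (batch, time, ch).
b2t = t2b.reshape(dilation, batch, time // dilation, channels) \
         .transpose(1, 2, 0, 3) \
         .reshape(batch, time, channels)
assert (b2t == x).all()
```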
About conv1d:
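For reference, a minimal tf.nn.conv1d call with the shapes involved (illustrative sizes, assuming stride 1 and 'VALID' padding):

```python
import tensorflow as tf

# tf.nn.conv1d expects input of shape (batch, width, in_channels) and a
# filter of shape (filter_width, in_channels, out_channels).
x = tf.random.normal([1, 16, 32])          # (batch, time, channels)
filt = tf.random.normal([2, 32, 64])       # (filter_width, in, out)
y = tf.nn.conv1d(x, filt, stride=1, padding='VALID')   # -> shape (1, 15, 64)
```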
Van Den Oord, Aäron, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. "WaveNet: A generative model for raw audio." In SSW, p. 125. 2016.
Borovykh, Anastasia, Sander Bohte, and Cornelis W. Oosterlee. "Conditional time series forecasting with convolutional neural networks." arXiv preprint arXiv:1703.04691 (2017).
In the paper, the dilation is doubled for every layer up to a limit and the pattern is then repeated, e.g. 1, 2, 4, ..., 512, 1, 2, 4, ..., 512, ... Each 1, 2, 4, ..., 512 block has a receptive field of size 1024 and is a much more efficient (non-linear) counterpart of a single 1x1024 convolution. Stacking these blocks further enlarges the receptive field.
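The receptive-field arithmetic behind the 1024 figure (filter width 2 assumed, as in the paper's dilated layers):

```python
# Receptive field of one 1, 2, 4, ..., 512 block with filter width 2:
# each layer adds (filter_width - 1) * dilation extra samples of history.
dilations = [2 ** i for i in range(10)]        # 1, 2, 4, ..., 512
receptive_field = 1 + sum(dilations)           # = 1 + 1023 = 1024
# Stacking k such blocks gives roughly k * 1023 + 1 samples of history.
```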
WaveNet uses a softmax to model the conditional distributions $p(x_t \mid x_1, \ldots, x_{t-1})$ over the individual audio samples. The reason for using a categorical distribution even for continuous audio data is that it is more flexible and can more easily model arbitrary distributions because it makes no assumptions about their shape.
Typically raw audio is stored as a 16-bit integer per timestep, which would mean a softmax layer with 65,536 classes. To make this tractable, the authors apply a µ-law companding transformation to encode the raw audio, $f(x_t) = \operatorname{sign}(x_t)\,\frac{\ln(1 + \mu \lvert x_t \rvert)}{\ln(1 + \mu)}$ with $\mu = 255$ and $-1 < x_t < 1$, followed by quantization to 256 values. This reduces the number of classes to predict to 256.
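A small sketch of the µ-law encode/decode pair implied by the formula above (my own rounding choices, not necessarily identical to the ibab helpers):

```python
import numpy as np

def mu_law_encode(audio, channels=256):
    """audio: float array in [-1, 1] -> int array in [0, channels)."""
    mu = channels - 1
    compressed = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int32)

def mu_law_decode(quantized, channels=256):
    """Inverse: int array in [0, channels) -> float array in [-1, 1]."""
    mu = channels - 1
    signal = 2 * (quantized.astype(np.float32) / mu) - 1
    return np.sign(signal) * ((1 + mu) ** np.abs(signal) - 1) / mu

x = np.linspace(-1, 1, 5)
print(mu_law_encode(x))   # [  0  16 128 239 255]
```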
The standard choice would be $z = \operatorname{ReLU}(W_k * x)$, where $W_k$ denotes the filter, $k$ is the layer index and $*$ is the convolution operator.
In WaveNet, inspired by the gated PixelCNN, the authors suggest using a gated activation unit instead of ReLU, as this yields significantly better results for modeling audio. It is denoted by $z = \tanh(W_{f,k} * x) \odot \sigma(W_{g,k} * x)$.
Here $\sigma(\cdot)$ is the sigmoid function acting as a gate ($f$ and $g$ denote filter and gate), and $\odot$ is the element-wise multiplication operator.
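A sketch of this gated activation unit with tf.keras (illustrative layer sizes, not the original code):

```python
import tensorflow as tf

def gated_activation(x, filters=32, kernel_size=2, dilation=1):
    """z = tanh(W_f * x) ⊙ sigmoid(W_g * x) with two dilated causal convolutions."""
    filt = tf.keras.layers.Conv1D(filters, kernel_size, dilation_rate=dilation,
                                  padding='causal')(x)
    gate = tf.keras.layers.Conv1D(filters, kernel_size, dilation_rate=dilation,
                                  padding='causal')(x)
    return tf.tanh(filt) * tf.sigmoid(gate)   # element-wise multiplication
```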
The idea of using residual learning comes from ResNet. Residual learning tackles the degradation problem: with increasing network depth, accuracy saturates and then degrades rapidly. Surprisingly, this is not caused by overfitting, since the training error of the deeper network also gets worse. To illustrate, consider a shallow network and a deeper counterpart built on top of it: the training error of the deep network should never be higher than that of the shallow one, because we could always make the added layers identity mappings and reproduce the shallow result. That this fails in practice suggests that standard back-propagation has a hard time learning the identity mapping.
So in ResNet the model works like this: suppose the desired mapping is $\mathcal{H}(x)$. Instead of optimizing $\mathcal{H}(x)$ directly, we let the stacked non-linear layers fit the residual $\mathcal{F}(x) := \mathcal{H}(x) - x$, so the block outputs $\mathcal{F}(x) + x$. The ResNet paper hypothesizes that it is easier to optimize this residual mapping than the original one; in the extreme case where the identity mapping is optimal, pushing the residual to zero is easier than fitting an identity with a stack of non-linear layers.
Global conditioning: $h$ remains constant for all timesteps $t$, e.g. a speaker embedding in a TTS model. The activation now becomes $z = \tanh(W_{f,k} * x + V_{f,k}^{T} h) \odot \sigma(W_{g,k} * x + V_{g,k}^{T} h)$. Note that the conditioning term is a matrix multiplication rather than a convolution, because $h$ is constant over time. (A sketch covering both conditioning types follows this list.)
Local conditioning: $h_t$ is not constant across timesteps; it is a second time series, e.g. linguistic features in a TTS model, which usually has a lower sampling frequency than the audio. In the paper the authors first use a transposed convolution (sometimes also called deconvolution) to upsample it to the same resolution as the audio, giving a new time series $y = f(h)$. The activation now becomes $z = \tanh(W_{f,k} * x + V_{f,k} * y) \odot \sigma(W_{g,k} * x + V_{g,k} * y)$, where $V_{f,k} * y$ is a 1x1 convolution on the upsampled $y$.
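A combined sketch of how the two conditioning variants could enter the gate (my own illustration with tf.keras; the layer sizes, the upsampling factor of 4, and the helper name are assumptions, not the paper's or ibab's exact choices):

```python
import tensorflow as tf

def conditioned_gate(x, h_global=None, h_local=None, filters=32, dilation=1):
    """x: (batch, time, channels); h_global: (batch, embed_dim);
    h_local: (batch, time // 4, dim), assumed to upsample exactly to `time`."""
    filt = tf.keras.layers.Conv1D(filters, 2, dilation_rate=dilation,
                                  padding='causal')(x)
    gate = tf.keras.layers.Conv1D(filters, 2, dilation_rate=dilation,
                                  padding='causal')(x)
    if h_global is not None:
        # V^T h: a matrix multiply whose result is broadcast over time.
        filt += tf.keras.layers.Dense(filters)(h_global)[:, None, :]
        gate += tf.keras.layers.Dense(filters)(h_global)[:, None, :]
    if h_local is not None:
        # Upsample to the audio resolution with a transposed convolution,
        # then mix in with 1x1 convolutions.
        y = tf.keras.layers.Conv1DTranspose(filters, kernel_size=4, strides=4)(h_local)
        filt += tf.keras.layers.Conv1D(filters, 1)(y)
        gate += tf.keras.layers.Conv1D(filters, 1)(y)
    return tf.tanh(filt) * tf.sigmoid(gate)
```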
Here I use a TensorFlow implementation from ibab; the code can be found in his GitHub repository. This implementation is relatively complete and well tested. In the following sections I will highlight some key parts of the code. The model is trained on the VCTK corpus, a speech dataset recorded from 109 native English speakers. The global conditioning for this dataset is the speaker identity.
For a WAV file with a 44.1 kHz sample rate, which is 44.1k samples per second, a 70-second file has 44,100 × 70 = 3,087,000 sample points. Librosa reads each point as a real value between -1 and 1. The first thing we do after reading the audio samples is apply µ-law encoding, which turns each real value into an integer in [0, 256) (the number of quantization channels is configurable). The audio now has shape (batch_size, ?, 1), where ? is the length of the audio and 1 is the channel dimension.
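Putting the preprocessing steps together, a sketch of the load-and-quantize path (the file name is illustrative and mu_law_encode refers to the sketch given earlier):

```python
import librosa
import numpy as np

# Load the waveform, µ-law quantize it, and add the batch and channel
# dimensions expected by the network.
audio, sr = librosa.load('example.wav', sr=44100, mono=True)  # floats in [-1, 1]
quantized = mu_law_encode(audio, channels=256)                # ints in [0, 256)
batch = quantized.reshape(1, -1, 1)                           # (batch_size, length, 1)
print(sr, audio.shape, batch.shape)
```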
Ibab TensorFlow implementation: