Single Voice Training and Synthesizing using WaveNet

Sylvia Plath Generated Waveform

Using WaveNet, a deep neural network for generating raw audio, I synthesized a ten-second clip of Sylvia Plath’s voice. The network was trained on audio alone, with no text sequences to condition on, so the generated speech is gibberish:

Dataset

The network was trained on more than 1,000 audio clips cut from 80 minutes of poetry read by Sylvia Plath. Here is an example of one of the clips:

To create the clips, I used Audacity’s “Sound Finder” to split the roughly 30-minute MP3 files into shorter clips:

I then listened to the clips smaller than 30 KB and deleted any that were silent.
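Finding the small clips can be scripted. The following is a minimal sketch, not the script used for this project: the function name `small_clips`, the `*.wav` pattern, and the reading of "30k" as 30 × 1024 bytes are all assumptions. It only lists candidates; deciding which ones are really silent still means listening to them.

```python
from pathlib import Path

# Assumption: "30k" is read as 30 * 1024 bytes.
SIZE_THRESHOLD_BYTES = 30 * 1024


def small_clips(clip_dir, threshold=SIZE_THRESHOLD_BYTES, pattern="*.wav"):
    """Return paths of clips smaller than `threshold` bytes, smallest first.

    `pattern` is a hypothetical default; change it to match the format
    the clips were exported in.
    """
    candidates = [
        path
        for path in Path(clip_dir).glob(pattern)
        if path.is_file() and path.stat().st_size < threshold
    ]
    return sorted(candidates, key=lambda path: path.stat().st_size)


# Example usage (the "clips" directory name is hypothetical):
# for path in small_clips("clips"):
#     print(path.name, path.stat().st_size)
```

Sorting by size puts the shortest clips first, which are the most likely to be silence or a stray breath.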

Training

The batch size was set to 2 to fit on a 2 GB GPU. The loss was very noisy, but it was still decreasing after 500k steps:

Loss chart after 529000 steps of training
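With a batch size this small, individual loss values jump around a lot, so the trend is easier to see after smoothing. The following is a minimal sketch of one common approach, an exponential moving average with bias correction. It is not part of the training code, and the function name `smooth` and the default `weight` of 0.9 are assumptions chosen for illustration.

```python
def smooth(values, weight=0.9):
    """Exponentially smooth a sequence of loss values.

    `weight` must satisfy 0 <= weight < 1; higher values smooth more.
    Dividing by (1 - weight ** step) corrects the bias toward zero that
    the running average would otherwise have in the first few steps.
    """
    smoothed = []
    running = 0.0
    for step, value in enumerate(values, start=1):
        running = weight * running + (1.0 - weight) * value
        smoothed.append(running / (1.0 - weight ** step))
    return smoothed


# Example usage with made-up loss values:
# print(smooth([3.1, 2.4, 2.9, 2.2, 2.6, 2.0]))
```

A noisy curve that keeps falling once smoothed suggests the model is still learning, even when neighbouring steps look erratic.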

The Generated Waveform

The Code

GitHub Repository

References:

WaveNet: A Generative Model for Raw Audio (Blog Post) [deepmind.org]
WaveNet: A Generative Model for Raw Audio (Paper) [arxiv.org]
Fast Wavenet Generation Algorithm (Paper) [arxiv.org]