Single Voice Training and Synthesizing using WaveNet

Submitted by hollygrimm on Mon, 04/02/2018 - 14:49
Sylvia Plath Generated Waveform

Using WaveNet, a deep neural network, I was able to synthesize a ten-second clip of Sylvia Plath's voice. Because WaveNet was trained on the audio alone, without accompanying text sequences, the generated speech is gibberish:
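WaveNet does not regress raw amplitudes directly; per the original paper, it classifies each next sample into one of 256 mu-law-companded bins. As a minimal sketch (function names are mine), the companding and its inverse in NumPy:

```python
import numpy as np

def mu_law_encode(audio, quantization_channels=256):
    """Quantize audio in [-1, 1] to discrete mu-law bins, as in the WaveNet paper."""
    mu = quantization_channels - 1
    # Non-linear companding gives more resolution near zero amplitude
    magnitude = np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    signal = np.sign(audio) * magnitude
    # Map [-1, 1] onto integer bins [0, mu]
    return ((signal + 1) / 2 * mu + 0.5).astype(np.int32)

def mu_law_decode(output, quantization_channels=256):
    """Invert the quantization back to a waveform in [-1, 1]."""
    mu = quantization_channels - 1
    signal = 2 * (output.astype(np.float64) / mu) - 1
    return np.sign(signal) * np.expm1(np.abs(signal) * np.log1p(mu)) / mu
```

A round trip through encode and decode reproduces the waveform to within the quantization error, which is what makes the 256-way classification a workable stand-in for raw 16-bit audio.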



The network was trained on 1000+ audio clips from 80 minutes of poetry spoken by Sylvia Plath. Here is an example of one of the clips:


To create the clips, I used Audacity to break up the ~30-minute MP3 files into smaller clips with the "Sound Finder" tool:

I then listened to the clips under 30 KB in file size and deleted any that were silent.
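That manual pass could also be scripted. A rough sketch, assuming the clips are 16-bit mono WAVs; the RMS threshold here is illustrative, not a value used in this project:

```python
import os
import wave

import numpy as np

def is_silent(path, rms_threshold=100):
    """True if a 16-bit WAV clip's RMS amplitude falls below the threshold."""
    with wave.open(path, "rb") as wav:
        frames = wav.readframes(wav.getnframes())
    samples = np.frombuffer(frames, dtype=np.int16).astype(np.float64)
    if samples.size == 0:
        return True
    return np.sqrt(np.mean(samples ** 2)) < rms_threshold

def find_deletable_clips(directory, max_bytes=30_000):
    """List clips under the size cutoff that are effectively silent."""
    deletable = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".wav"):
            continue
        path = os.path.join(directory, name)
        if os.path.getsize(path) < max_bytes and is_silent(path):
            deletable.append(path)
    return deletable
```

Listing candidates first and deleting afterwards keeps a human in the loop, which matters when a quiet-but-real clip could slip under the threshold.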



The batch size was set to 2 to fit a 2 GB GPU. The loss curve was noisy, but it continued to decrease even after 500k steps:

[Figure: training loss chart]
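How far back the network can "hear" when predicting a sample is set by its dilation schedule, not by how long it trains. A sketch of the receptive-field arithmetic for filter width 2; the doubling schedule below (1 through 512, repeated five times) is a common WaveNet configuration, assumed here rather than taken from this run:

```python
def receptive_field(dilations, filter_width=2):
    """Past samples visible to one prediction in a stack of dilated
    causal convolutions (counting the initial causal layer)."""
    return (filter_width - 1) * sum(dilations) + filter_width

# Doubling schedule 1, 2, 4, ..., 512, repeated five times
dilations = [2 ** i for i in range(10)] * 5
context = receptive_field(dilations)  # 5117 samples, about 0.32 s at 16 kHz
```

At 16 kHz that is roughly a third of a second of context per prediction, which is enough to capture phoneme-level structure but not sentence-level meaning.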


The Generated Waveform
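Generation is autoregressive: the network emits a 256-way distribution for the next sample, one value is drawn, and it is fed back in as context, which is why synthesizing even ten seconds of audio is slow. A schematic loop with a stand-in model (the real network would replace `dummy_model`):

```python
import numpy as np

rng = np.random.default_rng(0)

def dummy_model(context):
    """Stand-in for the trained network: returns a probability
    distribution over the 256 mu-law bins for the next sample."""
    logits = rng.standard_normal(256)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def generate(model, n_samples, receptive_field=5117):
    """Draw samples one at a time, feeding each back in as context."""
    samples = [128]  # seed with the mu-law zero bin
    for _ in range(n_samples):
        probs = model(samples[-receptive_field:])  # only the receptive field matters
        samples.append(int(rng.choice(256, p=probs)))
    return np.array(samples[1:], dtype=np.int32)
```

Each of the 160,000 samples in a ten-second 16 kHz clip requires a full forward pass in this naive loop; the Fast Wavenet algorithm referenced at the end of this post speeds it up by caching intermediate activations.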


The Code

GitHub Repository



References

  1. WaveNet: A Generative Model for Raw Audio (Blog Post)
  2. WaveNet: A Generative Model for Raw Audio (Paper)
  3. Fast Wavenet Generation Algorithm (Paper)