Single Voice Training and Synthesizing using WaveNet

Submitted by hollygrimm on Mon, 04/02/2018 - 14:49
Sylvia Plath Generated Waveform

Using WaveNet, a deep neural network, I synthesized a ten-second clip of Sylvia Plath's voice. Because the network was trained without accompanying text sequences, the generated speech is gibberish:

[Audio: generated clip]

Dataset

The network was trained on 1000+ audio clips from 80 minutes of poetry spoken by Sylvia Plath. Here is an example of one of the clips:

[Audio: sample training clip]

To create the clips, I used Audacity's "Sound Finder" to break the ~30-minute MP3 files into smaller clips.

I then listened to the clips under 30 KB in file size and deleted any that were silent.
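This review pass can be partly automated. Here is a minimal sketch (my own, not part of the original workflow) that flags clips under 30 KB whose peak amplitude falls below a threshold, assuming the clips are 16-bit PCM WAV files in a hypothetical `clips/` directory; the size limit is from the post, while the amplitude threshold is an assumption:

```python
import array
import os
import wave

SIZE_LIMIT = 30_000   # bytes; only small clips were reviewed (from the post)
SILENCE_PEAK = 500    # peak-amplitude threshold for 16-bit audio (assumption)

def is_silent(path, peak_threshold=SILENCE_PEAK):
    """Return True if the clip's loudest sample is below the threshold."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2  # expect 16-bit PCM
        samples = array.array("h", w.readframes(w.getnframes()))
    return max((abs(s) for s in samples), default=0) < peak_threshold

def find_silent_clips(directory):
    """Yield small clips that look silent, so they can be reviewed and deleted."""
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".wav"):
            continue
        path = os.path.join(directory, name)
        if os.path.getsize(path) < SIZE_LIMIT and is_silent(path):
            yield path
```

A peak check is crude but cheap; an RMS or dBFS measure would be a natural refinement if quiet speech were being flagged by mistake.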


Training

The batch size was set to 2 to fit a 2 GB GPU. The loss curve was noisy, but it continued to decrease past 500k steps:

[Image: loss chart]
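For context on what that loss measures: WaveNet (per the paper, reference 2) models raw audio as a 256-way classification over μ-law-quantized sample values, so training minimizes cross-entropy over those bins. A minimal sketch of the companding step:

```python
import numpy as np

def mu_law_encode(audio, quantization_channels=256):
    """Map float audio in [-1, 1] to integer bins via mu-law companding."""
    mu = quantization_channels - 1
    # f(x) = sign(x) * ln(1 + mu*|x|) / ln(1 + mu), from the WaveNet paper
    compressed = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    # Shift from [-1, 1] to integer bins [0, mu]
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int32)

def mu_law_decode(bins, quantization_channels=256):
    """Invert the companding to recover approximate float audio."""
    mu = quantization_channels - 1
    signal = 2 * (bins.astype(np.float64) / mu) - 1
    return np.sign(signal) * ((1 + mu) ** np.abs(signal) - 1) / mu
```

The companding spends more of the 256 levels near zero, where speech carries most of its detail, which is why 8-bit quantization sounds acceptable here.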


The Generated Waveform

[Audio: generated waveform]
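Synthesis is autoregressive: each new sample is drawn from the network's softmax over the 256 μ-law bins and fed back in as input, which is why generating even ten seconds is slow (the Fast WaveNet paper, reference 3, speeds this up by caching intermediate activations). A sketch of the sampling loop, with a stand-in `predict` function rather than the trained network:

```python
import numpy as np

def generate(predict, seed, num_samples, rng=None):
    """Autoregressively draw mu-law bins, feeding each sample back as input.

    `predict` maps the waveform-so-far to a probability vector over the
    quantization bins; here it is a stand-in for the trained WaveNet.
    """
    rng = np.random.default_rng() if rng is None else rng
    waveform = list(seed)
    for _ in range(num_samples):
        probs = predict(np.array(waveform))
        sample = rng.choice(len(probs), p=probs)
        waveform.append(int(sample))
    return np.array(waveform)

# Stand-in "model" for the demo: a uniform distribution over 256 bins (assumption).
uniform = lambda waveform: np.full(256, 1 / 256)
```

With the real network, `predict` would run a forward pass conditioned on the receptive field of past samples, and the resulting bins would be μ-law decoded back to audio.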

The Code

GitHub Repository


References:

  1. WaveNet: A Generative Model for Raw Audio (Blog Post) [deepmind.org]
  2. WaveNet: A Generative Model for Raw Audio (Paper) [arxiv.org]
  3. Fast Wavenet Generation Algorithm (Paper) [arxiv.org]