Bad Speech Synthesis Made Simple

Speech synthesis is a core component of many modern products - turn-by-turn navigation, personal assistants/search like Siri and Alexa, and even the voices of AI/robots/people in computer games and movies are (in some cases) generated by speech synthesis systems. But how are these systems built? How can we make computers reply to our requests (and demands) using speech?

Even after years of studying signal processing, machine learning, and computer programming, I still had not found an answer to this question. After searching online for a bit, I stumbled across Festival, a tool for building speech synthesis systems. Exploring their online demo was a lot of fun and gave a nice sense of what is possible and available in the academic community (at least that I could find!).

After using and reading about the "old method" KAL voice, which is built with diphone/unit selection synthesis, I decided to make my own diphone-based unit selection synthesizer based on a few papers: Diphone Collection and Synthesis, Lenzo and Black; Issues in Building General Letter to Sound Rules, Black, Lenzo, Pagel; and Letter to Sound Rules For Accented Lexicon Compression, Pagel, Lenzo, Black. This seemed to be the simplest possible approach, and though the results are not amazing, they are good enough to tell that the system works - which is pretty amazing given its relative simplicity. You can also see an entertaining word and partial-word level approach that seems similar to what is described here at .

The papers linked above are quite readable, so be sure to check them out - my own code can be found here, and I will discuss some of the key pieces below. Here is a demo of the result on some classic Megadeth lyrics.

[SoundCloud audio embed: synthesized demo]

Broad Strokes For Bored Folks

There are a few key pieces to diphone unit selection synthesis, or at least the approach I took.

  1. Text to phoneme alignment
  2. Phoneme to diphone creation
  3. Audio lookup and stitching

I will cover each of these steps in the following text, but first let's define some reference terms and discuss their meaning.

What is a phoneme?

A phoneme is considered an atomic unit of speech audio. Depending on which phoneset is used, there can be anywhere from 30 to 50+ phonemes, not counting inflection, accent, and emotional variations. Recording all phonemes, or in our case diphone pairs, should make it possible (in theory, at least) to generate every word, if we can manage to go effectively from text to phonemes and do good audio stitching. This approach can work quite well for English speech synthesis, but I think it is more difficult for other languages, especially ones with non-Romance/Latin origins.

Two phoneset lists can be seen in the code directory, in the files cmu_phones.list and festival_phones.list. Greater detail about phones and how they are defined can be found in the papers linked above, but for now just consider these to be atoms of speech sound - we will stitch combinations of these together in order to make sound for a given text.

Why not call two phonemes a biphone?

I had this exact same question - according to my friend João Felipe Santos, it has to do with the origin of the word "phoneme". The term comes from the study of phonetics and phonology, which was originally a Greek term (and field of study). Given that the Greek prefix for 2 is di-, rather than the Latin bi-, we end up with the term "diphone" (ref).

When All The Planets Align

To begin our adventure, we can imagine wanting a tool that takes normal, human understandable words and sentences, and converts them into a series of sounds (phonemes) that would make sense. This task is complicated, and has many possible approaches. However, one of the most straightforward approaches is to use something called a pronunciation dictionary to look up which phonemes you need to pronounce each word.
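As a tiny sketch of that lookup (with a hypothetical two-word dictionary standing in for the full CMU file):

```python
# hypothetical mini dictionary in the spirit of the CMU pronunciation dictionary
pronunciations = {
    "HELLO": ["HH", "AH", "L", "OW"],
    "WORLD": ["W", "ER", "L", "D"],
}

def text_to_phones(text):
    # look up each word and concatenate the phoneme sequences
    phones = []
    for word in text.upper().split():
        phones.extend(pronunciations[word])
    return phones
```

Real text needs far more than this, of course - number expansion, punctuation handling, and fallbacks for out-of-dictionary words.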

There are a number of pronunciation dictionaries, but one of the biggest/most complete dictionaries I have found is used by CMU Sphinx, an open source speech recognition toolkit. Sidenote - if you are interested in recognition, I have made an "all-in-one" wrapper to a light version of Sphinx, called pocketsphinx. You can find that code here.

Continuing on our path for synthesis, we now have a human crafted lookup table to go from words to sounds. Unfortunately, we almost immediately run into a common problem in sequential modeling - the length of the text and the length of the pronunciation/phoneme sequence are not the same!

This normally requires somewhat complicated machine learning techniques in order to learn a valid alignment and predict it. However, in the papers I listed above (such as Letter to Sound Rules For Accented Lexicon Compression, Pagel, Lenzo, Black) the authors describe a fairly straightforward dynamic programming algorithm for inserting a new character ("_") into the predicted sequence, which makes the mapping from text to extended phones 1:1. This makes the machine learning portion of this system much simpler.
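To see what "1:1" means concretely, here is the aligned BEATNIK entry used later in this post - after alignment, each character maps to exactly one (possibly "_") extended phone:

```python
# aligned entry for BEATNIK; the "_" was inserted by the alignment step
word = "BEATNIK"
phones = ['B', 'IY', '_', 'T', 'N', 'IH', 'K']

# the mapping is now 1:1 - one extended phone per character
assert len(word) == len(phones)
pairs = list(zip(word, phones))
```

Without the inserted "_", BEATNIK has 7 characters but only 6 phones, and no per-character mapping exists.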

For example, in the original CMU dict we have lines such as:

BEATNIK  B IY T N IH K
After running the t2p script and generating alignments for these sequences, we see word and phoneme sequence pairs like this:

BEATNIK ['B', 'IY', '_', 'T', 'N', 'IH', 'K']
This aligned dictionary file is committed in the repo as cmudict.0.7a_SPHINX_40.align, which was generated from cmudict.0.7a_SPHINX_40. You can also try your own custom dictionary - I included copies of the relevant t2p scripts as well. Processing the alignments for this dictionary took about 10 minutes on my machine (Macbook Air 2011).

Creature Features

After processing the pronunciation dictionary to get alignments, we have about 138,000 aligned sequence pairs. The original paper trained a decision tree for letter to sound prediction using 5-character windows (n-gram character windows, $n = 5$). This results in a feature window like the following:

Original string and phones:

BEATNIK ['B', 'IY', '_', 'T', 'N', 'IH', 'K']

After padding with 2 "-" characters (floor(n / 2)) to accommodate the beginning and end windows:

--BEATNIK-- ['B', 'IY', '_', 'T', 'N', 'IH', 'K']
After feature extraction, we get the resulting features. The target phone to predict is the last entry, with a readable version in parentheses (formatted as window > target) for each $n = 5$ character window:

['-', '-', 'B', 'E', 'A', 'B'] (--BEA > B)

['-', 'B', 'E', 'A', 'T', 'IY'] (-BEAT > IY)

['B', 'E', 'A', 'T', 'N', '_'] (BEATN > _)

['E', 'A', 'T', 'N', 'I', 'T'] (EATNI > T)

['A', 'T', 'N', 'I', 'K', 'N'] (ATNIK > N)

['T', 'N', 'I', 'K', '-', 'IH'] (TNIK- > IH)

['N', 'I', 'K', '-', '-', 'K'] (NIK-- > K)

In general this gives us a nice, simple feature set for training a classifier modeling $p(y_i \mid x_{i-n/2} \ldots x_{i+n/2})$.
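The windowing described above can be sketched in a few lines (my own condensed version, not the exact code in the repo):

```python
def extract_features(word, phones, n=5):
    # pad with floor(n/2) "-" characters so edge windows are well defined
    pad = n // 2
    padded = "-" * pad + word + "-" * pad
    # one n-character window per (extended) phone, target phone last
    return [list(padded[i:i + n]) + [target]
            for i, target in enumerate(phones)]
```

Running this on the BEATNIK example reproduces the feature rows shown above.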

After reading the Pixel RNN paper and reviewing the Deep Learning book (Figure 10.4 and section 10.2.1), I have become interested in trying to train sequential decision models in parallel, using teacher forced (ground truth) data as a proxy for the sequential predictions. This means you can train in parallel, but apply recursively at test time to still get a sequential decision. A fully bidirectional approach to this can be seen in Natural Language Processing From Scratch, Collobert, Weston, Bottou, Karlen, Kavukcuoglu, Kuksa, but for now we will use a "left-to-right" approach, since I find it easier to understand and work with.

Given this new fascination, I added one extra wrinkle using "Markov features" (for lack of a better name) to make features modeling $p(y_i \mid x_{i-n/2} \ldots x_{i+n/2}, y_{i-1})$. Due to the added $y_{i-1}$ feature, the phone sequence to predict was padded in front with one "SIL" phoneme at the very start, allowing the decision tree to "learn to start". After making all the ngram + Markov features, there end up being ~900,000 samples generated from the original 138,000 aligned sequences. After this processing, the final features looked like the example below (remember, the last entry is the target to predict). This is our training data for the classification algorithm which will map text to phonemes.

Padded string and padded phones:

--BEATNIK-- ['SIL', 'B', 'IY', '_', 'T', 'N', 'IH', 'K']

Resulting features (target variable is the last entry):

['SIL', '-', '-', 'B', 'E', 'A', 'B']
['B', '-', 'B', 'E', 'A', 'T', 'IY']
['IY', 'B', 'E', 'A', 'T', 'N', '_']
['_', 'E', 'A', 'T', 'N', 'I', 'T']
['T', 'A', 'T', 'N', 'I', 'K', 'N']
['N', 'T', 'N', 'I', 'K', '-', 'IH']
['IH', 'N', 'I', 'K', '-', '-', 'K']

These Markov features seemed to make a decent improvement for 1,000 and 10,000 samples in training, but it is unclear to me if that holds as more data is added, or if the Markov feature is simply acting as a bigger effective character window (since $y_{i-1}$ is based on one extra character backward in time). Either way, it is something I hope to explore further another time.
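For comparison with the plain n-gram features, the Markov variant only adds the previous phone to the front of each window - again a sketch, not the repo's exact code:

```python
def extract_markov_features(word, phones, n=5):
    # same n-gram character window as before, plus the previous phone
    # y_{i-1}, with "SIL" standing in for y_{-1} at the very start
    pad = n // 2
    padded = "-" * pad + word + "-" * pad
    previous = ["SIL"] + phones[:-1]
    return [[previous[i]] + list(padded[i:i + n]) + [target]
            for i, target in enumerate(phones)]
```

On BEATNIK this reproduces the feature rows listed above, including the leading 'SIL'.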

Decisions, Decisions, Decisions

I will not review the decision tree learning algorithm in depth, but most of my implementation is derived from Kevin Davenport and Patrick L. Le, who in turn derived theirs from Programming Collective Intelligence, Toby Segaran. The primary reason I chose a pure Python approach, rather than using the (much superior) implementation in scikit-learn, is so that I could serialize the tree once and save it in this repo, without worrying about possible API changes in future versions of sklearn. I was also curious how to serialize Python objects in pure json, so this was a nice experiment for that, although jsonpickle seems like a more robust and general solution.
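As a taste of why pure json is enough here: a tree built from nested plain dicts round-trips through the json module with no custom machinery (the node layout below is illustrative, not the repo's actual schema):

```python
import json

# a tiny decision "tree" as nested plain dicts - json serializes this directly
tree = {
    "column": 2,
    "value": "A",
    "true_branch": {"label": "IY"},
    "false_branch": {"label": "_"},
}

restored = json.loads(json.dumps(tree))
```

As long as every node is built from dicts, lists, strings, and numbers, no custom encoder is needed.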

At a high level, decision tree learning tries to split the data along every column (in my implementation, by trying each possible split). The resulting two datasets are tested for "purity", measured by the entropy of the labels in each half - the goal is to find splits where all the labels in one of the halves are the same. Choosing the candidate split which is most pure, we recursively apply the same logic in each half until we reach splits with entropy 0, meaning all the labels in that split are the same - sometimes phrased as "the leaf is pure". This means that as long as there is a valid 1:1 mapping in the feature set (the same set of features doesn't map to 2 different labels), a deep enough decision tree will be perfect on the training set.
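A minimal sketch of that entropy-based split search (slow, but faithful to the description above; the repo's implementation is organized differently):

```python
from collections import Counter
import math

def entropy(labels):
    # Shannon entropy of the label distribution; 0.0 when all labels agree
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def best_split(rows):
    # rows are feature lists with the target label in the last position;
    # try every (column, value) split and keep the one with the lowest
    # size-weighted entropy of the two halves
    best = (None, None, float("inf"))
    for col in range(len(rows[0]) - 1):
        for value in {r[col] for r in rows}:
            left = [r[-1] for r in rows if r[col] == value]
            right = [r[-1] for r in rows if r[col] != value]
            if not left or not right:
                continue
            score = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(rows)
            if score < best[2]:
                best = (col, value, score)
    return best
```

Recursing on each half until `entropy` hits 0 gives the "pure leaf" behavior described above.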

Generally this may seem like a bad thing, since it means decision trees are easy to overfit (leading to algorithms like random forests, boosted trees, and so on which try to regularize the power of the tree) but in this context it is actually great - it means that we can put words we want to pronounce correctly in the training set, and the decision tree should be able to perfectly pronounce them. This makes the system easy to modify and extend, not to mention the fact that decision tree rules are easy to visualize (see the previous links) and potentially modify by hand using expert analysis.

Using my own implementation of the decision tree means the algorithm does not scale particularly well (though good implementations are quite scalable). We are only able to train on around 10,000 examples out of the potential 900,000 available, and I am sure adding more training data would greatly boost performance. On the other hand, if the results were too good I couldn't title this article "Bad Speech Synthesis".

Diphone Fanatics

Since the actual waveforms we want to look up in the provided database are diphone based, not phoneme based, we need to stitch the phoneme sequence into a valid set of diphones. This is probably the most custom code in the whole system, so I will walk through it in a bit of detail.

Starting at line 13, we read in the aligned dictionary and split it into text sequences and phone sequences. Note that we also skip lines for alternative pronunciations, denoted blah(2), blah(3), and so on.
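The alternative-skipping logic amounts to something like this (a sketch, not the repo's exact parsing code):

```python
def parse_dict_line(line):
    # dictionary lines look like "WORD PH1 PH2 ..."; alternative
    # pronunciations are marked WORD(2), WORD(3), ... and are skipped
    word, *phones = line.split()
    if word.endswith(")"):
        return None
    return word, phones
```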

Next in lines 23 to 40 we read in the two phone lists for Sphinx and Festival, stripping them down to the common subset between each list, placing the results in sorted order with "SIL"(Sphinx) and "pau" (Festival) at the end.

After this we read in the diphone lookup list from kaldiph.est - this is the mapping from Festival diphones to actual wavfiles, along with start, middle, and end times for the diphone in question. This will be important when finally stitching the sound together.

The next step is to construct lookups for cmu2radio (since we are using CMU Sphinx and Festival's radio phoneset) as well as the reverse (radio2cmu). Once this is done we construct real pairs (pau-pau, t-t, and so on) as well as "fake pairs" - these are described in more detail in the papers, but basically some sequences of "phone1, _, phone2" can also be looked up directly and have a unique wavfile. I will describe how this is handled later - just know that "fake" is referring to things with "_" in them.
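In sketch form, with a hypothetical handful of phones standing in for the full lists:

```python
# hypothetical subset of the CMU <-> Festival "radio" phone mapping
cmu2radio = {"AA": "aa", "IY": "iy", "T": "t", "SIL": "pau"}
radio2cmu = {v: k for k, v in cmu2radio.items()}

# real diphone pairs are ordered pairs of radio phones ("pau-pau", "t-t", ...)
radio_phones = sorted(cmu2radio.values())
real_pairs = ["%s-%s" % (a, b) for a in radio_phones for b in radio_phones]
```

The "fake pairs" for "phone1, _, phone2" sequences would be built on top of a structure like this, keyed by the "_"-bearing triples.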

Once we have all these helper lookups and mappings, it is time to start constructing diphones. Broadly, there are a few steps in the process, which can be seen in the synthesize function. We step through the list one element at a time, looking ahead to the next element for some rules to try. Note that this logic is handcrafted by me, and there could be one or many bugs in the logic - or even a misunderstanding of how the "_" character should really function. One of the biggest questions I have is if this "backward continuation" in 1a makes sense. To track which real_match phones are not in the database, I also added a print statement of "No match found? p1 p2" which can be ignored in general.

  1. Is there a "_" character in the first position?

    a. If so, try to continue the second phoneme backwards (real_match(p2, p2))

    b. Take a step forward and continue the logic

  2. Is there a "_" character in the second position?

    a. Try to make a fake phone with the third phone (if possible), and check for a match

    b. If the fake phone didn't work, try to treat the second phone as a continuation (real_match(p1, p1))

    c. Take a step forward and continue the logic

  3. If there are no "_" characters at all

    a. Try a real match (real_match(p1, p2))

    b. Take a step forward and continue the logic
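Stripped of all the "_" special cases, the heart of this process is just pairing consecutive phones, padded with silence at both ends - a deliberately naive sketch:

```python
def to_diphones(phones, sil="pau"):
    # naive pairing of consecutive phones with silence padding at the ends;
    # this ignores all of the "_" handling described in steps 1 and 2
    padded = [sil] + phones + [sil]
    return ["%s-%s" % (a, b) for a, b in zip(padded, padded[1:])]
```

The real logic differs exactly where "_" appears, which is what the rule list above handles.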

For each of these steps, if we find a match we add it to a list of wavfile chunks that will be stitched together. Once all the wav chunks have been gathered, we perform "overlap-add" processing (line 225), summing the overlapping waveforms in the time domain (taken from the Wikipedia article on overlap-add).

This is the simplest way to combine all the chunks into one continuous sound, but much better results can be obtained by performing overlap add in the frequency domain, performing frequency smoothing/interpolation, WSOLA (matching the overlap of pieces in time based on autocorrelation), using fancier representations such as vocoder parameters (LPC, STRAIGHT, WORLD), or tons of other tricks. This is just the minimal way to get something working - whole books can be written just on the signal processing used to combine individual sounds into continuous, smooth sounds. But once again if we did it too well, this post couldn't be "Bad Speech Synthesis".
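For reference, a minimal time-domain overlap-add might look like the following, assuming a fixed overlap in samples (the real code works from the per-diphone start/middle/end times instead):

```python
import numpy as np

def overlap_add(chunks, overlap):
    # sum chunks in the time domain, each shifted so that
    # neighboring chunks share `overlap` samples
    total = sum(len(c) for c in chunks) - overlap * (len(chunks) - 1)
    out = np.zeros(total)
    pos = 0
    for c in chunks:
        out[pos:pos + len(c)] += c
        pos += len(c) - overlap
    return out
```

Adding a window (Hann, triangular) over each chunk before summing is the usual next refinement, so overlapping regions cross-fade instead of simply doubling in amplitude.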

Putting The Pieces Together Again

Finally we have all the parts to make bad speech. Because of the limited training, the model pronounces some words wildly incorrectly, and it has a lot of trouble with common short words (ME, THE, YOU). I hard-coded a few quality-of-life replacements into the input processing, such as line 301, but generally you can rewrite words in a more "pronounceable" way to get different sounds. I also added extra emphasis on the very first and last syllables in line 124 by doubling the first and last phones; this generally improved sound quality to my ears, but is not really justifiable otherwise.

The final result will look something like this:

python congratulations this is the final message

Predicted phones: ['K', 'OW', '_', 'NG', 'R', 'AE', 'CH', 'UW', 'L', 'EY', '_', 'SH', 'AH', 'N', 'Z']
No match found? ng ng
Text: THIS
Predicted phones: ['_', 'TH', 'IH', 'S']
Text: IS
Predicted phones: ['IH', 'S']
Replacing THE -> THEEE
Predicted phones: ['_', 'TH', 'IY', 'IY', '_']
Predicted phones: ['F', 'IH', 'N', 'AH', 'L']
Predicted phones: ['M', 'EH', '_', 'S', 'IH', 'JH', '_']
[SoundCloud audio embed: synthesized demo]

Into The Black Horizon

Some drawbacks to the type of system described above can be seen in this lecture. In general there is a sliding scale between composability and tractable recording size - even word level recordings would require ~100k recorded words to capture the most common English words, not to mention nouns, slang, words derived from other languages (foods, local sayings), and so on. However, the longer the recorded sequences become, the fewer stitching artifacts exist, and the more you are able to capture "natural" sound or avoid unnatural dynamics at the stitch boundaries.

Adding prosody features, emotional inflections, and other text preprocessing can be a huge boost, as well as performing true structured prediction on the phonemes (or even diphones) using something like a conditional random field (CRF), hidden Markov model (HMM), or recurrent neural network, instead of the cheap Markov features I used here. Audio representations for more natural splicing of sound, as well as introducing more special case diphones (or even expanding to triphones) are another way to improve these results. There are a huge number of ways to improve these types of systems, and it is a lot of fun to explore if you don't mind hearing weird robot voices once in a while.

