The Rise of the Talking Machines: A Journey Through the History of Text to Speech Technology — Part 1

Tharuka KasthuriArachchi · Published in Analytics Vidhya · Mar 8, 2023

Image source: https://voiceley.com/

As technology advances, so too does the way we experience literature. Long gone are the days when the only way to enjoy a good book was to curl up with a physical copy and read it from cover to cover. Now, thanks to text-to-speech technology, we have a new way to immerse ourselves in the written word. Text-to-speech narration allows us to listen to our favorite books, articles, and even blog posts, transforming the reading experience and opening up new avenues for literary exploration. In this blog, we’ll dive deeper into the world of text-to-speech literature, exploring its history, its impact, and its potential for the future.

Exploring the Diverse Applications of Text-to-Speech Synthesis (TTS) Technology

Text-to-Speech Synthesis (TTS) technology converts written text into spoken words; in some systems this still involves recording vocal segments in advance and reproducing them for playback. With significant advancements in synthetic speech quality, TTS has expanded its application areas beyond the traditional use cases. From speech-enabled websites to assistive communication devices for speech-disabled individuals (commonly known as AAC devices) and digital talking books, a wide range of TTS-related solutions are now readily available in the market.

Text-to-Speech (TTS) tools are particularly beneficial for individuals who are visually impaired. Vision impairment can be one of the most challenging disabilities to manage, as it involves the loss of the sense of sight. For those with vision loss, even everyday tasks such as using the internet can be a struggle. The lack of tools that enable spontaneous engagement can lead to social isolation, especially within the online community. Fortunately, technological advancements have made it possible for the blind to access the internet through Text-to-Speech programs, helping to bridge the gap between the visually impaired and the rest of society.

In the classroom, students with reading difficulties can face challenges when it comes to print resources such as books and handouts. These challenges arise due to their difficulty in reading and comprehending words on a page. However, the combination of Text-to-Speech (TTS) technology with digital text has been shown to effectively eliminate these obstacles. TTS also offers a multimodal reading experience for students, allowing them to see and hear text while reading. Research has demonstrated that this approach enhances word recognition, increases reading comprehension, and helps students to pay attention and retain information. Additionally, TTS technology has been found to improve students’ ability to persevere through reading tasks and enables them to focus on understanding the material rather than struggling to sound out words.

One of the most intriguing applications of Text-to-Speech (TTS) technology is in the realm of digital storytelling. Traditional human storytellers utilize their voices in numerous ways to captivate their audience, from impersonating character voices to using sound effects and applying prosody to convey emotions and create a mesmerizing listening experience. In digital storytelling, a software tool reads the text aloud on a digital screen, bringing the story to life. This technology is versatile and can cover a wide range of topics, such as explaining a concept, sharing personal experiences, recounting historical events, or delivering a children’s story to a young audience. Ideally, digital storytelling software should provide an engaging experience that rivals that of a human storyteller. To achieve this goal, an automated Text-to-Speech synthesis system needs to offer a more expressive and engaging speaking style.

The Evolution of Text-to-Speech Technology: From Its Origins to Modern Applications

Efforts to create machines capable of mimicking human speech date back to the 12th century. The first computer-based voice synthesis system was introduced in the 1960s, and since the early 1990s, various computer operating systems have included voice synthesizers. Initially, computer-based speech synthesis relied on techniques such as articulatory synthesis, formant synthesis, and concatenative synthesis. However, with advancements in statistical machine learning, the approach of statistical parametric speech synthesis (SPSS) emerged, which involves predicting parameters such as spectrum, fundamental frequency, and duration for voice synthesis. In recent years, neural network-based speech synthesis has gained popularity and has become the most common method due to significant improvements in voice quality (Tan et al., 2021; Wikipedia, 2022).

Articulatory Synthesis

Articulatory synthesis is a voice synthesis method that mimics the movements of human articulators, including the lips, tongue, glottis, and vocal tract, to generate speech (Palo, 2006; Coker et al., 1976). Because this technique closely resembles how humans produce speech, it has the potential to be the most effective voice synthesis method (Whalen, 2003). However, implementing articulatory synthesis in practice is a challenging task due to the complex modeling of articulator behavior, and obtaining data for simulation is difficult. As a result, the speech quality produced by articulatory synthesis is often inferior to that of more recent voice synthesis techniques (Tan et al., 2021).

Formant Synthesis

Formant synthesis (Seeviour, Holmes and Judd, 1976) relies on a set of linguistic rules that govern a simplified source-filter model to produce speech that closely matches the formant structure and spectral characteristics of human speech. An additive synthesis module and an acoustic model with adjustable parameters, such as voicing, noise level, and fundamental frequency, are employed to generate speech. Formant synthesis is capable of producing high-quality speech without requiring an extensive human speech corpus or significant computational resources. However, the synthesized speech often sounds artificial and contains artifacts, and defining the synthesis rules is a challenging task in general (Tan et al., 2021). Formant synthesis has two structural forms: cascade and parallel. Combining these two architectures can improve performance, as suggested by Lukose et al. in 2017.
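To make the source-filter idea concrete, here is a minimal Python sketch of a cascade formant synthesizer, assuming NumPy and SciPy are available; the formant frequencies and bandwidths are rough illustrative values for a neutral vowel, not figures taken from the cited papers.

```python
# Minimal cascade formant synthesis sketch (illustrative only).
import numpy as np
from scipy.signal import lfilter

def resonator(freq, bandwidth, fs):
    """Second-order IIR resonator (one formant) as (b, a) filter coefficients."""
    r = np.exp(-np.pi * bandwidth / fs)
    theta = 2 * np.pi * freq / fs
    a = [1.0, -2 * r * np.cos(theta), r ** 2]
    b = [1.0 - 2 * r * np.cos(theta) + r ** 2]   # gain chosen so low-frequency gain is ~1
    return b, a

def synthesize_vowel(f0=120, formants=((730, 90), (1090, 110), (2440, 170)),
                     duration=0.5, fs=16000):
    n = int(duration * fs)
    # Voiced excitation: a simple impulse train at the fundamental frequency
    excitation = np.zeros(n)
    excitation[::int(fs / f0)] = 1.0
    signal = excitation
    # Cascade structure: pass the excitation through each formant resonator in series
    # (a parallel structure would instead sum the resonator outputs)
    for freq, bw in formants:
        b, a = resonator(freq, bw, fs)
        signal = lfilter(b, a, signal)
    return signal / np.max(np.abs(signal))

audio = synthesize_vowel()   # roughly 0.5 s of an /a/-like vowel
```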

Concatenative Synthesis

Concatenative synthesis generates new words and sentences by piecing together speech fragments stored in a database. These fragments are created by breaking down sentences recorded by voice actors into words, syllables, demi-syllables, phonemes, and diphones (Lukose et al., 2017). To create speech, this method searches for speech units that match the input text fragment and then concatenates these units to produce a speech waveform. Concatenative synthesis is typically used in systems where only a few items are spoken, as it provides the most natural and intelligible output that closely resembles the original voice actor. However, concatenative text-to-speech (TTS) requires a vast recording database to cover all possible combinations of speech units for spoken words. Additionally, the generated voice may sound less natural and emotional, as concatenation can cause a loss of smoothness in stress, emotion, prosody, and other areas (Tan et al., 2021).
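As a rough illustration of the concatenation step, the following Python sketch joins pre-recorded unit waveforms with a short crossfade at each boundary; `unit_db` is a hypothetical dictionary mapping unit names to NumPy arrays, not part of any real TTS library.

```python
# Minimal concatenative synthesis sketch: look up stored unit waveforms and join them.
import numpy as np

def concatenate_units(unit_sequence, unit_db, fs=16000, crossfade_ms=5):
    """Join stored unit waveforms, with a short crossfade to soften the joins."""
    fade = int(fs * crossfade_ms / 1000)
    ramp = np.linspace(0.0, 1.0, fade)
    out = unit_db[unit_sequence[0]].copy()
    for unit in unit_sequence[1:]:
        nxt = unit_db[unit].copy()
        # Overlap-add the boundary region to reduce audible discontinuities
        out[-fade:] = out[-fade:] * (1 - ramp) + nxt[:fade] * ramp
        out = np.concatenate([out, nxt[fade:]])
    return out

# Hypothetical usage, assuming unit_db holds one recorded waveform per word:
# speech = concatenate_units(["the", "weather", "is", "fine"], unit_db)
```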

However, during the 1990s, a new type of concatenative synthesis called “unit selection” (Black and Campbell, 1995) gained attention due to ongoing difficulties in eliminating artifacts caused by prosodic changes and spectral interpolation at unit joins. In unit selection speech synthesis, each unit in the synthesis inventory is selected from a large-scale corpus according to specified cost functions (typically a target cost and a concatenation cost) and joined with minimal signal processing. When a sufficiently large corpus is available, this method produces speech with fewer artifacts. However, because the synthetic speech’s voice identity heavily relies on that of the speech corpus, unit selection lacks flexibility in generating other types of timbres or speaking styles.
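A simplified sketch of how unit selection can be framed as a dynamic-programming search over target and concatenation costs is shown below; the cost functions and candidate lists here are placeholders, and real systems use far richer costs computed over very large corpora.

```python
# Sketch of unit selection as a Viterbi-style dynamic-programming search.
def select_units(targets, candidates, target_cost, join_cost):
    """
    targets     : list of desired unit specifications (one per position)
    candidates  : list of candidate-unit lists, one list per position
    target_cost : f(target_spec, candidate)    -> float
    join_cost   : f(prev_candidate, candidate) -> float
    Returns the candidate sequence with minimal total cost.
    """
    # best[i][j] = (cumulative cost, back-pointer) for candidate j at position i
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            tc = target_cost(targets[i], c)
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, c) + tc, k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost, back))
        best.append(row)
    # Trace back the lowest-cost path
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))
```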

Statistical Parametric Speech Synthesis

Statistical parametric speech synthesis (SPSS) (Zen, Tokuda and Black, 2009) has emerged as a competitive alternative to concatenative approaches. Instead of relying on pre-recorded speech fragments, SPSS produces speech using statistical models of speech, overcoming the main disadvantage of concatenative systems, which is their lack of flexibility. These statistical models capture how speech evolves over time in the context of a given input text and are trained using machine learning techniques on voice corpora (Aalto University, 2020). The basic idea is to generate acoustic parameters that produce speech, and then retrieve speech from the generated acoustic parameters, instead of concatenating waveforms to produce speech (Morise, Yokomori and Ozawa, 2016). SPSS consists of three main components: a text analysis module, a feature generation module (acoustic model), and a waveform generation module, as shown in the figure below (Tan et al., 2021) (Zen, Tokuda and Black, 2009).

Image by author: A schematic view of an SPSS system
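The following Python skeleton mirrors that three-stage pipeline; the function names are placeholders standing in for the modules described below, not a real library API.

```python
# Skeleton of the three-stage SPSS pipeline (placeholders only).
def text_analysis(text):
    """Extract linguistic features (phonemes, POS, prosody, ...) from raw text."""
    ...

def acoustic_model(linguistic_features):
    """Predict acoustic parameters (e.g. spectrum, F0, duration) with a statistical model."""
    ...

def vocoder_synthesis(acoustic_features):
    """Reconstruct a speech waveform from the predicted acoustic parameters."""
    ...

def spss_tts(text):
    return vocoder_synthesis(acoustic_model(text_analysis(text)))
```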

Text analysis module

This module is responsible for extracting linguistic features such as phonemes, part-of-speech tags, and durations at various granularities via text processing techniques (Tokuda et al., 2013), such as the following (a toy sketch of two of these steps appears after the list):

  • Text normalization — converting written text into spoken word forms (e.g., 1990 → nineteen ninety) (Sproat and Jaitly, 2016; Lan, Shunan and Kangping, 2020; Sproat et al., 2001),
  • Word segmentation — detecting word boundaries in raw text (Nianwen, 2003),
  • Part-of-speech (POS) tagging — detecting the POS of each word, such as noun, verb, or preposition, within the text (Schlunz, 2010; Sun and Bellegarda, 2011),
  • Prosody prediction — predicting the rhythm, stress, intonation, loudness, and pitch of speech from the text (Chu and Qian, 2001), and
  • Grapheme-to-phoneme conversion — converting characters (graphemes) into pronunciations (phonemes).
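The toy Python sketch below illustrates two of these steps: text normalization (expanding a year such as 1990 into words) and grapheme-to-phoneme conversion via a tiny hand-made lexicon. Real systems rely on much richer rules or learned models (Sproat and Jaitly, 2016), and the lexicon here is purely hypothetical.

```python
# Toy text normalization and dictionary-based G2P (illustrative only).
import re

ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty",
        6: "sixty", 7: "seventy", 8: "eighty", 9: "ninety"}

def two_digits(n):
    if n < 10:
        return ONES[n]
    if n < 20:
        return ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
                "sixteen", "seventeen", "eighteen", "nineteen"][n - 10]
    return TENS[n // 10] + ("" if n % 10 == 0 else " " + ONES[n % 10])

def normalize_year(match):
    year = match.group(0)
    return two_digits(int(year[:2])) + " " + two_digits(int(year[2:]))

def normalize(text):
    # Expand four-digit years: "1990" -> "nineteen ninety"
    return re.sub(r"\b(1[1-9])\d\d\b", normalize_year, text)

# Hypothetical mini pronunciation lexicon for grapheme-to-phoneme lookup
LEXICON = {"nineteen": "N AY N T IY N", "ninety": "N AY N T IY"}

def g2p(words):
    # Dictionary lookup; real systems back off to learned models for unseen words
    return [LEXICON.get(w, "<unk>") for w in words.split()]

print(normalize("in 1990"))        # -> "in nineteen ninety"
print(g2p(normalize("1990")))      # -> ['N AY N T IY N', 'N AY N T IY']
```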

Feature generation module (parameter prediction module/acoustic models)

SPSS systems are typically composed of models that utilize parameters to describe speech. The parameters are based on statistical measurements such as means and variances of probability density functions, which are used to represent the distribution of parameter values observed in training data. To create these models, linguistic and acoustic features (such as fundamental frequency, spectrum or cepstrum) are paired together and used as training data. These acoustic features are obtained through vocoder analysis of speech. According to Tan et al. (2021), these acoustic models must meet several requirements that make their construction challenging:

  1. Taking additional contextual information as input;
  2. Modeling the correlation between output frames;
  3. Overcoming over-smoothed predictions, since the mapping between linguistic and acoustic features is one-to-many.

Hidden Markov Models (HMM) have been widely used and shown to be highly effective in generating speech parameters. While other generative models can also be used, HMM has been demonstrated to produce the best results in the past, as stated by Zen, Tokuda, and Black in 2009. For instance, Yoshimura et al. (1999) used HMM to generate speech parameters, while Tokuda et al. (2013) proposed a methodology that employs spectral parameter vectors such as mel-cepstral coefficients (MCC) and F0 for the observation vectors of HMM. Compared to previous TTS techniques, speech synthesis using HMMs offers greater flexibility in terms of changing speaker identities, speaking styles, and emotions, as reported by Tan et al. in 2021. This approach has been extensively researched since the mid-1990s.
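As a very loose illustration of modeling acoustic parameters with an HMM, the sketch below fits a Gaussian HMM to a matrix of pre-computed acoustic frames and samples a new trajectory from it. It assumes the third-party `hmmlearn` package and a hypothetical feature file, and it omits the context-dependent models, decision-tree clustering, and parameter generation algorithm that a real HMM-based system such as HTS uses.

```python
# Toy illustration of HMM-based acoustic parameter modeling (not a real TTS system).
import numpy as np
from hmmlearn import hmm

# Hypothetical training data: acoustic frames, shape (n_frames, n_features),
# e.g. mel-cepstral coefficients plus log-F0 extracted by a vocoder.
acoustic_frames = np.load("phone_frames.npy")

model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=50)
model.fit(acoustic_frames)            # learn per-state means and variances

generated, states = model.sample(200) # draw a 200-frame parameter trajectory
# `generated` would then be passed to a vocoder to reconstruct a waveform.
```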

HMM-based SPSS suffers from a major drawback, which is the insufficient quality of synthesized speech. This is mainly attributed to the low accuracy of acoustic models and predicted acoustic features that lack detail due to being over-smoothed, as well as poor vocoding techniques. The accuracy of the model significantly affects the quality of the synthesized speech since speech characteristics are generated from its acoustic models. To address this issue, researchers have explored several advanced acoustic models and training frameworks to improve the accuracy of modeling. These include trended HMMs (Dinesh and Sridharan 2001), stochastic Markov graphs (Eichner et al. 2001), hidden semi-Markov models (HSMMs) (Zen et al. 2004), trajectory HMMs (Zen, Tokuda and Kitamura 2004), minimum generation error (MGE) criterion (Wu and Wang 2006), and variational Bayesian approach (Hashimoto et al. 2009).

In acoustic modeling for SPSS, phonetic and linguistic contextual factors are crucial. However, with approximately 50 types of context in a typical system, effectively modeling these complex context dependencies is challenging. In HMM-based SPSS, the standard technique for handling contexts is to use a separate HMM for each context combination, known as a context-dependent HMM. However, due to limited training data, successfully estimating all context-dependent HMMs may not be feasible. To overcome this, top-down decision tree-based context clustering is commonly utilized, as reported by Odell in 1995. While the decision tree approach is effective, it tends to overfit when dealing with complex context dependencies. To address this issue, deep neural networks (DNNs) were adopted, as reported by Zen, Senior, and Schuster in 2013. This approach outperformed HMM-based SPSS systems with similar numbers of parameters, but the computational cost of DNN-based systems was higher than that of HMM-based systems. A DNN employs a layered hierarchical framework to transform linguistic text input into its final speech output, mimicking human speech creation. DNNs can model complex mapping functions and effectively represent high-dimensional and correlated features.
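A minimal sketch of such a feed-forward DNN acoustic model, written here in PyTorch with illustrative feature dimensions rather than any published configuration, might look as follows: a per-frame regression from a linguistic feature vector to an acoustic feature vector.

```python
# Sketch of a feed-forward DNN acoustic model (illustrative dimensions).
import torch
import torch.nn as nn

LINGUISTIC_DIM = 300   # e.g. one-hot phone/context answers plus positional features
ACOUSTIC_DIM = 187     # e.g. mel-cepstrum + log-F0 + aperiodicity (placeholder sizes)

dnn_acoustic_model = nn.Sequential(
    nn.Linear(LINGUISTIC_DIM, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, ACOUSTIC_DIM),   # linear output layer for regression
)

loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(dnn_acoustic_model.parameters(), lr=1e-4)

def train_step(linguistic_batch, acoustic_batch):
    """One gradient step on a (frames, features) mini-batch."""
    optimizer.zero_grad()
    pred = dnn_acoustic_model(linguistic_batch)
    loss = loss_fn(pred, acoustic_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```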

Despite the advantages of DNN-based modeling, its inherently feed-forward structure makes it difficult to incorporate the long-duration contextual effects of a speech utterance. To address this, dynamic features must be combined with their static counterparts to construct smooth speech parameter trajectories. In 2014, Fan et al. proposed the use of Recurrent Neural Networks (RNNs), particularly with bidirectional Long Short-Term Memory (LSTM) layers, as a generation model for TTS synthesis. RNNs can capture sequential information from anywhere in the feature sequence, making them well suited to incorporating long-duration context. The hybrid BLSTM-RNN and DNN system proposed by Fan et al. outperformed both HMM and DNN baselines at capturing deep information in a text.
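A corresponding bidirectional-LSTM acoustic model can be sketched in PyTorch as below, again with placeholder dimensions rather than the exact configuration used by Fan et al.; it predicts one acoustic frame per input linguistic frame while seeing context in both directions.

```python
# Sketch of a bidirectional-LSTM acoustic model (illustrative dimensions).
import torch
import torch.nn as nn

class BLSTMAcousticModel(nn.Module):
    def __init__(self, linguistic_dim=300, acoustic_dim=187, hidden=512):
        super().__init__()
        self.blstm = nn.LSTM(linguistic_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, acoustic_dim)   # 2x for both directions

    def forward(self, linguistic_seq):            # (batch, frames, linguistic_dim)
        outputs, _ = self.blstm(linguistic_seq)   # (batch, frames, 2*hidden)
        return self.proj(outputs)                 # (batch, frames, acoustic_dim)
```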

Waveform generation module (vocoder analysis/synthesis module)

This module synthesizes the speech waveform from the predicted acoustic features. STRAIGHT (Kawahara, 2006) and WORLD (Morise, Yokomori and Ozawa, 2016) are popular vocoders in SPSS systems.
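As an illustration of vocoder analysis and resynthesis with WORLD, the sketch below assumes the third-party `pyworld` and `soundfile` packages and a hypothetical input recording; in an SPSS system the analysed parameters would be replaced by those predicted by the acoustic model before resynthesis.

```python
# WORLD vocoder analysis/synthesis round trip (assumes pyworld and soundfile).
import numpy as np
import soundfile as sf
import pyworld as pw

x, fs = sf.read("speech.wav")            # hypothetical mono recording
x = x.astype(np.float64)

f0, spectrogram, aperiodicity = pw.wav2world(x, fs)    # analysis into acoustic features
y = pw.synthesize(f0, spectrogram, aperiodicity, fs)   # resynthesis from those features

sf.write("resynthesized.wav", y, fs)
```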

Statistical parametric speech synthesis has a broad range of advantages over concatenative speech synthesis. Some of them are:

1. Easily modifiable voice characteristics: by changing the model parameters in SPSS, we can readily alter voice features, speaking styles, and emotions;

2. Emotional speech can be synthesized from limited speech corpora: around 10 minutes of a phonetically balanced single-speaker database is enough to build an HMM-based speech synthesis system (Zen, Tokuda and Black, 2009);

3. Support for multiple languages, because only the contextual factors used differ from language to language (Zen, Tokuda and Black, 2009).

However, due to distortions such as muffled, buzzing, or noisy audio, the speech produced by SPSS is less intelligible and still sounds robotic.

Conclusion

In this part of the blog I have discussed the various techniques used in speech synthesis, including articulatory synthesis, formant synthesis, concatenative synthesis, and statistical parametric speech synthesis (SPSS). Articulatory synthesis closely mimics human speech production but is difficult to implement, while formant synthesis uses linguistic rules to produce high-quality speech that can nevertheless sound artificial. Concatenative synthesis pieces together speech fragments to produce natural and intelligible output, while SPSS uses statistical models to generate speech and is more flexible than concatenative synthesis. In conclusion, each technique has its own advantages and disadvantages, and researchers are constantly striving to improve speech synthesis technology to create more natural and human-like speech.

As the technology continues to advance, we can expect to see even more impressive TTS models that can generate speech with greater naturalness and variability. In part 2 of this article, we will explore some of the latest developments in neural network-based TTS models and how they are pushing the boundaries of what is possible with synthetic speech. Stay tuned for more exciting insights into this rapidly evolving field.

References

WIKIPEDIA, 2022. Speech Synthesis. [online]. Available from: en.wikipedia.org/wiki/Speech_synthesis [Accessed 07 July 2022]

TAN, X. et al., 2021. A Survey on Neural Speech Synthesis. arXiv preprint arXiv:2106.15561.

COKER, C.H. et al., 1976. A Model of Articulatory Dynamics and Control. Proceedings of the IEEE, 64(4), pp. 452–460.

WHALEN, D.H., 2003. Articulatory synthesis: Advances and prospects. Proc. 15th International Congress of Phonetic Sciences (ICPhS’03), 1, pp. 175–177.

SEEVIOUR, P., HOLMES, J. and JUDD, M., 1976. Automatic generation of control signals for a parallel formant speech synthesizer. ICASSP ’76 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1, pp. 690–693.

LUKOSE et al., 2017. Text to speech synthesizer-formant synthesis. 2017 International Conference on Nascent Technologies in Engineering (ICNTE), 1, pp. 1–4.

BLACK, A.W. and CAMPBELL, N., 1995. Optimising selection of units from speech databases for concatenative synthesis. Eurospeech, 1.

ZEN, H., TOKUDA, K. and BLACK, A.W., 2009. Statistical parametric speech synthesis. Speech Communication, 51(11), pp. 1039–1064.

AALTO UNIVERSITY, 2020. Statistical parametric speech synthesis. [online]. Available from: https://wiki.aalto.fi/display/ITSP/Statistical+parametric+speech+synthesis [Accessed 14 July 2022].

MORISE, M., YOKOMORI, F. and OZAWA, K., 2016. WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications. IEICE Transactions on Information and Systems, E99.D(7), pp. 1877–1884.

TOKUDA, K. et al., 2013. Speech Synthesis Based on Hidden Markov Models. Proceedings of the IEEE, 101(5), pp. 1234–1252.

SPROAT, R. and JAITLY, N., 2016. RNN Approaches to Text Normalization: A Challenge. ArXiv, abs/1611.00068.

LAN, H., SHUNAN, Z. and KANGPING, W., 2020. A Text Normalization Method for Speech Synthesis Based on Local Attention Mechanism. IEEE Access, 8, pp. 36202–36209.

SPROAT, R. et al., 2001. Normalization of Non-Standard Words. Computer Speech & Language, 15(3), pp. 287–333.

NIANWEN, X., 2003. Chinese Word Segmentation as Character Tagging. International Journal of Computational Linguistics and Chinese Language Processing, 8(1), pp. 29–48.

SCHLUNZ, G.I., 2010. The Effects of Part–of–Speech Tagging on Text–to–Speech Synthesis for Resource–Scarce Languages. Master’s thesis, North-West University, Potchefstroom Campus.

SUN, M. and BELLEGARDA, J.R., 2011. Improved POS Tagging for Text-to-Speech Synthesis. 2011 IEEE International Conference on Acoustics, 1, pp. 5384–5387.

CHU, M. and QIAN, Y., 2001. Locating Boundaries for Prosodic Constituents in Unrestricted Mandarin Texts. Int. J. Comput. Linguistics Chin. Lang. Process, 6.

KAWAHARA, H., 2006. STRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds. Acoustical Science and Technology, 27, pp. 349–353.

YOSHIMURA, T. et al., 1999. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. Eurospeech ,5, pp. 2347–2350.

DINESH, J. and SRIDHARAN, S., 2001. Trainable speech synthesis with trended hidden Markov models. 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), 2, pp. 833–836.

EICHNER, M. et al., 2001. Speech synthesis using stochastic Markov graphs. 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), 2, pp. 829–832.
