The soft palate, hard palate, larynx, and tongue are just a few of the organs humans manipulate to craft sound. Because each of these organs shapes sound in its own distinct way, imitating them well enough to create a convincing artificial human voice is a difficult task. In researching different methods of speech synthesis, humans have created vocal pianos, talking computers, and artificial intelligence, each with its own strengths and drawbacks. While artificial voices' potential effects on humanity, their ethical implications, and their future direction are all important to consider, looking back on the developments that brought us here and listening to these simulated voices gives us a new perspective. With that in mind, let's start at the beginning.
Vocal synthesis has existed for hundreds of years, starting with the physician Christian Kratzenstein's vowel organ. Created in 1780, this machine used air to vibrate reeds, modeling the mechanics of the human mouth to produce vowel sounds. While the organ's sound may seem uncanny to some, it was the first step in the long journey toward a mechanical replacement for speech. This invention was followed by several other wind-operated machines that produced a wider range of sounds. The most notable of these was Joseph Faber's Euphonia of 1845, a machine capable of mimicking most sounds in English, French, and German. The Euphonia was exhibited as a performance piece: it could sing, in an admittedly uncanny manner, while offering an accurate and impressive showcase of speech synthesis's potential.
In the realm of electronics, the first breakthrough was Bell Telephone Laboratories' Voice Operating Demonstrator, or Voder for short, developed by Homer Dudley and demonstrated publicly in 1939. Of the machines so far, the Voder was operated most like the Euphonia, letting a trained user control its speech from a keyboard. Instead of reeds and bellows, however, it shaped an electronic buzz (for voiced sounds) or hiss (for unvoiced sounds) through a bank of band-pass filters, giving the operator control over vowel and consonant sounds and their inflection. The Voder provided a more accurate model of speech than the Euphonia, but it was difficult both to control and to understand.
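To make that filter bank concrete, here is a minimal sketch in Python (using NumPy and SciPy) of the source-filter idea the Voder mechanized: a buzz source for voiced sounds, or a hiss source for unvoiced ones, is shaped by band-pass filters centered on formant frequencies. The formant values and bandwidths below are rough illustrative numbers for an /a/-like vowel, not the Voder's actual circuit parameters.

```python
import numpy as np
from scipy.signal import butter, lfilter

SR = 16000  # sample rate in Hz


def buzz(f0, duration):
    """Voiced source: a sawtooth pulse train at pitch f0 (the Voder's 'buzz')."""
    t = np.arange(int(SR * duration)) / SR
    return 2.0 * (t * f0 % 1.0) - 1.0


def hiss(duration):
    """Unvoiced source: white noise (the Voder's 'hiss')."""
    return np.random.uniform(-1.0, 1.0, int(SR * duration))


def resonator(signal, center, bandwidth):
    """One element of the filter bank: a band-pass filter around a formant."""
    nyquist = SR / 2
    low = max(center - bandwidth / 2, 50) / nyquist
    high = min(center + bandwidth / 2, nyquist - 100) / nyquist
    b, a = butter(2, [low, high], btype="band")
    return lfilter(b, a, signal)


# Illustrative formant (center, bandwidth) pairs for an /a/-like vowel, in Hz.
FORMANTS = [(800, 200), (1200, 250), (2500, 300)]

source = buzz(f0=120, duration=0.5)                    # half a second of 120 Hz buzz
vowel = sum(resonator(source, f, bw) for f, bw in FORMANTS)
vowel = vowel / np.max(np.abs(vowel))                  # normalize to [-1, 1]
```

Writing `vowel` out as audio (for instance with `scipy.io.wavfile.write`) lets you hear the flat buzz take on a vowel-like color; the Voder's operators had to make these kinds of adjustments by hand, in real time, which is part of why the machine was so hard to play.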
The next leap in creating speech came from a digital synthesizer demonstrated at Bell Labs on an IBM 7094 Data Processing System in 1961. This computer became the first to sing, performing the song "Daisy Bell" in a demonstration programmed by John Kelly, Carol Lochbaum, and Max Mathews and shocking the world with this new digital evolution. The performance has since become a culturally significant proof of evolving technology, inspiring a scene in the movie 2001: A Space Odyssey.
Jumping ahead to 2004, speech synthesis found a new commercially successful form in virtual singer personalities. The studio Zero-G released two "Virtual Soul Vocalists" for purchase and private use: synthesized singing talents named Leon and Lola, built on the Vocaloid engine developed by the Japanese company Yamaha. Using these synthesizers, producers could create music with virtual voices, allowing for the mass production of music by the same "singer." While these first two talents never achieved widespread popularity, their proof of concept spurred a wave of Vocaloid idols from Japanese studios over the following years. Most of these synthesized singers use "voice banks," libraries of thousands of sounds recorded by a voice provider, allowing for an expansive range of speaking and singing. The most famous Vocaloid singer, Hatsune Miku, released by Crypton Future Media in 2007, is voiced by Saki Fujita and is credited with singing over 100,000 songs. Additionally, as new voice samples are published for use, the vocals of these singers continue to improve.
This model of reusing and combining recorded audio clips is also how many text-to-speech (TTS) programs have worked. The first iteration of Apple's famous Siri, launched in 2011, was built from recordings of voice actress Susan Bennett, concatenating her audio clips into accurate, understandable sentences. Artificial speech also famously gave physicist Stephen Hawking his voice: while his illness eventually prevented him from speaking, the movements he could still control were used to feed text into a synthesizer that produced the voice he became known for. In this way, artificial voice technology has been key in unlocking independence for patients who have lost major functions of their bodies.
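As a toy illustration of this concatenative approach (a sketch of the general idea, not Siri's or any Vocaloid's actual implementation), the Python snippet below splices clips from a hypothetical voice bank, crossfading at each boundary so the seams are less audible. The `voice_bank` entries here are silent placeholder arrays; a real system would store thousands of recorded phonetic units rather than whole words.

```python
import numpy as np

SR = 16000  # sample rate in Hz

# Toy "voice bank": in a real system these would be recordings from a voice
# provider; here they are silent placeholder clips half a second long.
voice_bank = {
    "hello": np.zeros(SR // 2),
    "world": np.zeros(SR // 2),
}


def crossfade_concat(clips, fade=0.02):
    """Join clips end to end, blending each boundary over a short crossfade
    so the splice is less audible: the core trick of concatenative synthesis."""
    n = int(SR * fade)
    ramp = np.linspace(0.0, 1.0, n)
    out = clips[0]
    for clip in clips[1:]:
        blended = out[-n:] * (1.0 - ramp) + clip[:n] * ramp  # overlap region
        out = np.concatenate([out[:-n], blended, clip[n:]])
    return out


def speak(text):
    """Look up each word in the voice bank and splice the clips together."""
    clips = [voice_bank[word] for word in text.lower().split()]
    return crossfade_concat(clips)


audio = speak("hello world")
```

The limitation of this model is visible right in the crossfade: when two clips were recorded with different pitch or emphasis, no amount of blending hides the seam, which is exactly the unnaturalness the machine-learning systems described next try to eliminate.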
However, current speech synthesis has mostly moved past this clip-splicing model. New developments in machine learning generate speech directly, largely overcoming the unnatural sound produced when unrelated clips are strung together into sentences. This has also led to the controversial practice of training models on recordings of a particular person's voice, enabling impersonations of that person through "deep fakes." The technology has inspired a new phenomenon of ironic impersonations of famous people, including past and current presidents Donald Trump, Joe Biden, and Barack Obama, who star in fake videos arguing over ridiculous topics such as video games and snack foods.
The emergence of these new modes of expression through synthesized human speech is something never seen before. The technology enables deceit and dishonesty, but it also provides a voice, and an opportunity for self-expression, to those who are unable to speak. It is still too young for us to know its true nature, so whether it will be used to build up or break down society remains unknown. Imminent societal collapse notwithstanding, I recommend everyone listen to the IBM 7094's performance of "Daisy Bell"; it shows how far we've really come.