We can all recall the dull, monotonous machine voices we had to listen to when text-to-speech (TTS) technology and artificial intelligence (AI) were new. Who hasn’t heard a machine voice so boring that it could put us to sleep if we listened long enough? Luckily for us, with the dawn of neural TTS these dull machine voices are becoming a thing of the past. These improvements have also allowed companies like Vozy to offer high-quality products, including virtual assistants that can speak to us in numerous Spanish accents.
Improvements Made to Text-to-Speech Technology
Understandably, one of the first aspects of text-to-speech and AI technology that companies sought to refine was the sound of the machines’ voices. Many users found these voices unpleasant and tiring. For this reason, technology companies began replacing their standard TTS models with neural TTS models, which produce machine voices that are more natural-sounding and enjoyable to listen to. In listening studies, participants have compared recordings produced with neural text-to-speech against recordings made with human actors. The majority of participants said that they enjoyed the neural text-to-speech recordings as much as the recordings made with human actors.
The Process of Neural Text to Speech
These more realistic-sounding voices were achieved by changing the process that computers use to turn text into speech, and Vozy, as a leading technology company, has thrown its hat into the neural TTS ring. Generally speaking, the process of turning text into sound has been simplified. The previous standard models split text into smaller units and pieced together a separate audio recording for each unit according to the units before and after it. These models required a huge dataset of recordings for each unit to represent every transition correctly, which made them long and complicated. In contrast, the process with neural text-to-speech is more concise: the text is fed into the system, sent to an acoustic generator, then passed to a vocoder, and finally the sound is produced. An additional benefit of this streamlined method is that the computer can interact with the user in almost real time.
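The neural steps described above can be sketched in a few lines of Python. This is only a toy illustration of the data flow, not Vozy’s implementation: the names `acoustic_generator`, `vocoder`, and `synthesize`, the sample rate, and the frame length are all assumptions, and simple arithmetic stands in for the trained neural networks that a real system would use at each stage.

```python
import math
from typing import List

SAMPLE_RATE = 16_000   # samples per second (assumed value)
FRAME_SAMPLES = 800    # audio samples produced per acoustic frame (assumed, 50 ms)


def acoustic_generator(text: str) -> List[float]:
    """Stand-in for the acoustic model: emit one 'frame' per character.

    A real model predicts spectral features from text. Here each frame is
    just a pitch in Hz derived from the character code, so the flow is visible.
    """
    return [200.0 + (ord(ch) % 32) * 10.0 for ch in text if not ch.isspace()]


def vocoder(frames: List[float]) -> List[float]:
    """Stand-in for the vocoder: turn each frame into a short sine burst."""
    samples: List[float] = []
    for pitch_hz in frames:
        for n in range(FRAME_SAMPLES):
            samples.append(math.sin(2.0 * math.pi * pitch_hz * n / SAMPLE_RATE))
    return samples


def synthesize(text: str) -> List[float]:
    """Text in, audio samples out: text -> acoustic generator -> vocoder."""
    return vocoder(acoustic_generator(text))


audio = synthesize("hola")
print(len(audio) / SAMPLE_RATE, "seconds of audio")
```

The point of the sketch is the shape of the pipeline: two stages chained together, with no per-unit database of recordings to search and stitch.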
One exciting aspect of neural text-to-speech and AI technology is the possibility of teaching machines to adapt to new speaking styles more quickly than a human could learn them. With the neural model, it takes a machine a few hours to learn a new speaking style, while a human actor would need significantly longer to learn the same style.
What Vozy is Doing with Neural Text to Speech Technology
Naturally, these improvements in text-to-speech technology have led to the development of machines that can speak in different accents. This technology has previously been available in languages such as English. However, Vozy is the only company in Latin America that offers neural text-to-speech technology in Spanish. We currently offer this technology in more than eight Spanish accents, including Colombian, Mexican, Argentine, Chilean, Peruvian, Puerto Rican, and Venezuelan. As everyone knows, the way words are pronounced in the same language can vary depending on where the speaker is from.
Here you can listen to an NPS demo with an Argentine accent.
For example, imagine the difference between speaking to a person in Boston and speaking to a person in Los Angeles. If we asked the person in Los Angeles, “Where did you park your car?”, he or she might reply, “I parked my car in the parking lot.” If we asked the person from Boston the same question, he or she might reply, phonetically, “I pahked my cah in the pahking lot.” If both of these individuals had a device enabled with neural text-to-speech, their devices would speak to them in their respective accents. The same is true for businesses that use machines enabled in other languages, such as Spanish. For example, with neural network technology, it is now possible for a Mexican business to have a virtual assistant handle phone calls with its customers in a Mexican accent. The same applies to a Colombian, Argentine, or any other Spanish-speaking company.
Here you can listen to an NPS demo with a Colombian accent.
Vozy’s Neural TTS Process
The process that we use to provide our customers with a virtual assistant that speaks in different Spanish accents is similar to the process that big companies have used. Machine learning is used to convert text, encoded as a string of characters, into a sequence of cepstrum coefficients. A neural vocoder then converts the cepstrum coefficients into a continuous audio signal. Vozy’s AI neural TTS technology levels the playing field for Spanish-speaking people in Latin America. Beginning with our virtual assistants, there is a plethora of ways that this technology can make our customers’ technological experience more enjoyable. With the help of Vozy’s technology, Spanish-speaking companies can now take advantage of these improvements in text-to-speech models while using machines that speak in their accents.
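For readers unfamiliar with cepstrum coefficients, here is a minimal sketch of what they are. It computes the plain real cepstrum of one audio frame, which is the inverse Fourier transform of the log magnitude spectrum. Note the assumptions: this runs in the analysis direction (audio to coefficients), whereas a TTS model predicts the coefficients from text; the helper names `dft` and `real_cepstrum`, the 64-sample frame, and the choice of 13 coefficients are illustrative only; and production systems often use a mel-scaled variant, so this should not be read as the exact representation Vozy uses.

```python
import cmath
import math
from typing import List, Sequence


def dft(values: Sequence[complex], inverse: bool = False) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform (fine for short demo frames)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = [
        sum(v * cmath.exp(sign * 2j * math.pi * k * m / n) for k, v in enumerate(values))
        for m in range(n)
    ]
    # The inverse transform carries the 1/n normalisation.
    return [x / n for x in out] if inverse else out


def real_cepstrum(frame: Sequence[float], n_coeffs: int = 13) -> List[float]:
    """Return the first n_coeffs real-cepstrum coefficients of one frame."""
    spectrum = dft(frame)
    # A small offset avoids log(0) on bins with no energy.
    log_magnitude = [math.log(abs(x) + 1e-10) for x in spectrum]
    cepstrum = dft(log_magnitude, inverse=True)
    return [c.real for c in cepstrum[:n_coeffs]]


# Synthetic test tone (4 cycles across a 64-sample frame), not real speech.
frame = [math.sin(2 * math.pi * 4 * k / 64) for k in range(64)]
coeffs = real_cepstrum(frame)
print(len(coeffs), "coefficients for this frame")
```

The low-order coefficients summarize the broad shape of the spectrum in a handful of numbers per frame, which is what makes a sequence of them a compact target for a model to predict and for a vocoder to turn back into audio.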
Here you can listen to an NPS demo with a Mexican accent.
The creation of neural models has ushered in a new age of text-to-speech technology. The voices produced by machines powered by neural TTS are now so lifelike that we would sometimes prefer to listen to them rather than to actual people. Vozy is at the forefront of this new digital era and is committed to bringing customers all over Latin America high-quality products powered by neural TTS that allow them to engage with their clientele in their specific Spanish accents.