Microsoft's new text-to-speech AI model can listen to a voice for just a few seconds and then mimic it, right down to its emotional tone and room acoustics. The company's latest research in this area centers on a model known as VALL-E.
It's the latest of many AI algorithms that can use a recorded voice to construct words and sentences the person never spoke – and it's remarkable for how little audio it needs to generate an entire voice. While there are already multiple services that can create copies of your voice, they usually demand substantial input: 2017's Lyrebird algorithm from the University of Montreal, for example, needed a full minute of speech to analyze, whereas VALL-E needs just a three-second audio snippet.
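To put that three-second requirement in perspective, here is a minimal sketch of how one might trim an enrollment prompt of that length from a longer recording. The file names are hypothetical, and soundfile is just one common audio I/O library:

```python
# Illustrative only: trim a three-second enrollment prompt from a longer
# recording. File names are hypothetical; soundfile is a common audio
# I/O library (pip install soundfile).
import soundfile as sf

def extract_prompt(src: str, dst: str, seconds: float = 3.0) -> None:
    """Save the first `seconds` of `src` to `dst` for use as a voice prompt."""
    audio, sample_rate = sf.read(src)
    sf.write(dst, audio[: int(seconds * sample_rate)], sample_rate)

extract_prompt("speaker_recording.wav", "prompt_3s.wav")
```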
How Was This Achieved?
The researchers trained VALL-E on 60,000 hours of English-language speech from more than 7,000 speakers in Meta's Libri-Light audio library. Some of the results are tinny, machine-like samples, while others are surprisingly realistic.
To mimic a voice, the target speaker's voice must be a close match to voices in the training data; the AI can then draw on that training to make the target voice read aloud any desired text. The model reproduces not only the pitch, huskiness, and texture of the voice, but also the speaker's emotional tone and the acoustics of the room. This means that if the sample recording contains a disturbance, such as background noise or echo, VALL-E will carry that disturbance over into the result.
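For a mental model of that workflow, the sketch below shows what zero-shot inference with a VALL-E-style system conceptually looks like. Microsoft has not released its code, so the ZeroShotTTS class and its synthesize method are hypothetical stand-ins; the point is the interface, where a short acoustic prompt plus the target text is all the conditioning the model receives:

```python
# Conceptual sketch only: ZeroShotTTS is a hypothetical stand-in for a
# VALL-E-style model; Microsoft has not released its code or API.
import numpy as np
import soundfile as sf

class ZeroShotTTS:
    """Pretend model, assumed pretrained on ~60,000 hours of speech."""

    def synthesize(self, text: str, prompt: np.ndarray, rate: int) -> np.ndarray:
        # A real model would condition on the three-second prompt and carry
        # its timbre, emotion, and room acoustics into the generated speech.
        # Here we return silence of a plausible length as a placeholder.
        return np.zeros(rate * max(1, len(text) // 10))

prompt, rate = sf.read("prompt_3s.wav")  # the three-second enrollment clip
model = ZeroShotTTS()
speech = model.synthesize("Words the speaker never actually said.", prompt, rate)
sf.write("cloned_voice.wav", speech, rate)
```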
Microsoft claims its model can simulate someone's voice from just a three-second audio sample, matching both the timbre and the emotional tone of the speaker – even the acoustics of the room. It could one day power customized or high-end text-to-speech applications but, as with deepfakes, it comes with risks of misuse.
Still, VALL-E is far from perfect. A few technical limitations remain, such as words that occasionally get duplicated or come out garbled. And while 60,000 hours of training audio might sound like a lot, the data is still insufficiently diverse, especially when it comes to different accents and tones.
Here Come the Risks
Microsoft isn't making the code open source, most likely because of the inherent risks – though this may be only for the time being. If (or, more realistically, when) such AI becomes accessible to anyone, it will raise serious social, security, and ethical concerns, not least because of the scams it could enable.
Because VALL-E can synthesize speech that preserves a speaker's identity, the model carries potential for misuse, such as spoofing voice-identification systems or impersonating a specific speaker.
The experiments were conducted under the assumption that the user agrees to be the target speaker in speech synthesis. As the researchers themselves point out, if the model is generalized to unknown speakers in the real world, it should be accompanied by a protocol ensuring that the speaker approves the use of their voice, along with a model for detecting synthesized speech.
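To make that recommendation concrete, here is a minimal sketch of such a safeguard, assuming a hypothetical consent registry and a trivial tag standing in for a real audio watermark and detection model; none of this is Microsoft's actual implementation:

```python
# Hypothetical safeguard along the lines the researchers suggest: refuse to
# synthesize without recorded consent, and tag output so a detector can
# recognize it. The registry and "watermark" here are trivial placeholders.
CONSENT_REGISTRY = {"speaker_042"}   # speakers who approved voice cloning
SYNTHETIC_TAG = b"SYNTH"             # stand-in for a robust audio watermark

def synthesize(speaker_id: str, text: str) -> bytes:
    """Placeholder for the actual text-to-speech model call."""
    return f"<audio of {speaker_id} saying {text!r}>".encode()

def synthesize_with_safeguards(speaker_id: str, text: str) -> bytes:
    if speaker_id not in CONSENT_REGISTRY:
        raise PermissionError(f"no recorded consent for {speaker_id}")
    # Tag the output so downstream detection models can flag it as generated.
    return SYNTHETIC_TAG + synthesize(speaker_id, text)

def is_synthetic(audio: bytes) -> bool:
    """Stand-in for a synthesized-speech detection model."""
    return audio.startswith(SYNTHETIC_TAG)

clip = synthesize_with_safeguards("speaker_042", "Hello there.")
assert is_synthetic(clip)
```

In practice, a simple byte tag would be trivial to strip; real systems would rely on imperceptible audio watermarks or trained detectors, but the consent-first structure would be the same.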