A team of neuroscientists at the University of California, San Francisco used recorded brain signals from patients with epilepsy to program a computer to mimic natural speech, a breakthrough that could one day have a profound effect on the ability of certain patients to communicate. The results were published in the journal Nature.
Technology that translates brain activity into speech would be transformative for people who are unable to communicate as a result of neurological impairment.
Decoding speech from neural activity is challenging because speaking requires very precise and rapid control of the articulators of the vocal tract.
Edward Chang of the University of California at San Francisco and his colleagues designed a neural decoder that uses sound representations encoded in brain activity to synthesize audible speech.
"Speech is an incredible form of communication that has evolved over thousands of years to be very effective. Many of us take it for granted how easy it is to talk, which is why losing that ability can be so devastating, "Chang said.
"For the first time, our study demonstrates that we can generate complete sentences based on an individual's brain activity."
The research builds on a recent study in which the team first described how speech centers in the human brain choreograph the movements of the lips, jaw, tongue, and other components of the vocal tract to produce fluent speech.
From this work, the researchers realized that earlier attempts to decode speech directly from brain activity may have had limited success because these brain regions do not directly represent the acoustic properties of speech sounds, but rather the instructions needed to coordinate the movements of the mouth and throat during speech.
"The relationship between vocal tract movements and the speech sounds that are produced is complicated," said co-author Dr. Gopala Anumanchipalli, a speech scientist at the University of California, San Francisco.
"We reasoned that if these speech centers in the brain are encoding movements instead of sounds, we should try to do the same in the decoding of these signals."
In the study, the neuroscientists asked five volunteers being treated at the UCSF Epilepsy Center – patients with intact speech who had electrodes temporarily implanted in their brains to map the source of their seizures in preparation for neurosurgery – to read several hundred sentences aloud while the researchers recorded activity from a brain region known to be involved in producing language.
Based on the audio recordings of the participants' voices, the researchers used linguistic principles to reverse-engineer the vocal tract movements needed to produce those sounds: pressing the lips together, tightening the vocal cords, pushing the tip of the tongue against the roof of the mouth, then relaxing it, and so on.
This detailed mapping of sound to anatomy allowed the authors to create a realistic virtual vocal tract for each participant that could be controlled by their brain activity.
This system comprised two neural-network machine learning algorithms: a decoder that transforms patterns of brain activity produced during speech into movements of the virtual vocal tract, and a synthesizer that converts these vocal tract movements into a synthetic approximation of the participant's voice.
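The two-stage idea can be illustrated with a minimal sketch. This is not the study's actual model (the authors used recurrent neural networks trained on ECoG recordings); the dimensions, random linear stages, and function names below are hypothetical stand-ins that only show the pipeline's shape: neural activity is first decoded into vocal tract kinematics, and only then into acoustic features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes -- stand-ins, not the study's real dimensions.
N_NEURAL = 256    # electrode-derived neural features per time step
N_ARTIC = 33      # articulatory kinematic features (vocal tract movements)
N_ACOUSTIC = 32   # acoustic features driving the speech synthesizer

def make_stage(n_in, n_out):
    """Stand-in for a trained network stage: a fixed random linear map."""
    W = rng.standard_normal((n_in, n_out)) * 0.01
    return lambda x: np.tanh(x @ W)

# Stage 1: brain activity -> movements of the virtual vocal tract.
neural_to_articulation = make_stage(N_NEURAL, N_ARTIC)

# Stage 2: vocal tract movements -> acoustics of the synthetic voice.
articulation_to_acoustics = make_stage(N_ARTIC, N_ACOUSTIC)

def decode(neural_activity):
    """Two-stage pipeline: decode kinematics first, then synthesize sound."""
    kinematics = neural_to_articulation(neural_activity)
    acoustics = articulation_to_acoustics(kinematics)
    return kinematics, acoustics

# 100 time steps of simulated neural recordings.
neural = rng.standard_normal((100, N_NEURAL))
kin, ac = decode(neural)
print(kin.shape, ac.shape)  # -> (100, 33) (100, 32)
```

The key design point the article describes is the explicit articulatory intermediate: rather than mapping brain activity straight to sound, the decoder passes through the vocal tract representation that the brain itself appears to encode.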
The speech produced by these algorithms was significantly better than synthetic speech decoded directly from the participants' brain activity without the intermediate simulation of the speakers' vocal tracts.
The algorithms produced sentences that were understandable to hundreds of human listeners in crowdsourcing transcription tests conducted on the Amazon Mechanical Turk platform.
As with natural speech, transcribers were more successful when given shorter lists of words to choose from, much as caregivers are primed for the kinds of phrases or requests a patient is likely to make.
Transcribers accurately identified 69% of the synthesized words when given lists of 25 alternatives to choose from, and transcribed 43% of the sentences with perfect accuracy.
With a more challenging set of 50 words to choose from, the transcribers' overall accuracy fell to 47%, although they still transcribed 21% of the synthesized sentences perfectly.
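The two metrics reported above can be made concrete with a small sketch. The sentences and transcriptions below are made up for illustration, and the position-by-position comparison is a simplification: the study's listeners actually chose each word from a closed list of 25 or 50 alternatives.

```python
def score_transcriptions(references, transcriptions):
    """Return (word-level accuracy, fraction of perfectly
    transcribed sentences) for paired reference/hypothesis sentences."""
    total_words = correct_words = perfect_sentences = 0
    for ref, hyp in zip(references, transcriptions):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Count words matched at the same position in the sentence.
        correct_words += sum(r == h for r, h in zip(ref_words, hyp_words))
        total_words += len(ref_words)
        perfect_sentences += (ref_words == hyp_words)
    return correct_words / total_words, perfect_sentences / len(references)

# Invented example data, not the study's stimuli.
refs = ["the ship drifted into the harbor", "mum put the kettle on"]
hyps = ["the ship drifted into the harbor", "mum put the cattle on"]
word_acc, sentence_acc = score_transcriptions(refs, hyps)
print(word_acc, sentence_acc)  # 10 of 11 words; 1 of 2 sentences perfect
```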
"We still have ways to perfectly mimic spoken language," said co-author Josh Chartier, a bioengineering student at the University of California at San Francisco.
"We are very good at synthesizing slower speech sounds like & # 39; sh & # 39; and as well as maintaining the rhythms and intonations of speech and the genre and the identity of the speaker, but some of the more abrupt sounds like 'e' and '# 39; it's a little confusing.
"Still, the accuracy levels we produce here would be an incredible improvement in real-time communication compared to what's currently available."
The researchers are currently experimenting with higher-density electrode arrays and more advanced machine learning algorithms, which they hope will further enhance the synthesized speech.
The next big test for the technology is to determine whether someone who cannot speak could learn to use the system without being able to train it on their own voice, and whether it would generalize to anything they want to say.
Gopala K. Anumanchipalli et al. 2019. Speech synthesis from neural decoding of spoken sentences. Nature 568: 493-498; doi: 10.1038/s41586-019-1119-1