AI synthesizers may "idealize" speech, stripping away biological variability to create a signal that is 20% easier for the human brain to decode. Credit: Neuroscience News

AI Voices Outperform Human Speech in Noisy Environments

Summary: In a surprising twist for acoustic science, researchers have discovered that AI-generated voice clones are significantly easier to understand than actual human speakers.

The study found that despite being trained on as little as 10 seconds of audio, these synthetic facsimiles were up to 20% more intelligible than their human originals when heard in background noise.

Key Findings

  • 20% More Intelligible: Voice clones aren’t just “good enough”; in background noise they were reliably easier to understand than the human voices they were cloned from.
  • Efficiency: Unlike traditional synthetic voices, clones can be created with 10 seconds of data, making them highly scalable for telecommunications and accessibility tools.
  • Clinical Value: This research suggests that AI-enhanced speech could be a game-changer for individuals with hearing impairments or those using assistive listening devices.

Source: AIP

Synthetic voices are increasingly a part of our lives, from digital assistants like Siri and Alexa to automated telemarketers and answering machines. With the expansion of generative AI, a new type of synthetic voice has been developed: voice clones, which can recreate a facsimile of a person’s voice from only a few seconds of recorded speech.

In JASA, published on behalf of the Acoustical Society of America by AIP Publishing, a pair of researchers from University College London and the University of Roehampton evaluated the intelligibility of human voices and their voice clones. They found that the clones are easier to understand than their human originals in noisy environments.

Voice clones differ from traditional synthetic voices in the amount of sampling they require. Synthetic voices like Siri’s require a voice actor to spend hours in a recording booth. In contrast, a voice clone can be made from as little as 10 seconds of speech, significantly expanding the number of potential voices as well as the number of potential applications.

Researchers Patti Adank and Han Wang specialize in studying human perception of unclear speech and were fascinated by the idea of machine-replicated speech. A key question they were looking to answer was just how easy voice clones are for the average person to understand.

They suspected that voice clones would simply be poor representations of actual human voices and that people would struggle to understand them. What they found could not have been more different.

“I thought initially that voice clones would be less intelligible because they were unfamiliar,” said Adank.

“I found they were up to 20% more intelligible, which was quite shocking. A small part of our paper is talking about that experiment, and then a large part is me and my collaborator frantically trying to find out what it is that makes those voice clones more intelligible.”

The duo initially presented volunteers with human voices and voice clones, asking them to rate their intelligibility. After finding that voice clones were consistently rated easier to understand, they repeated the experiment with elderly volunteers, to determine whether being hard of hearing alters the effect; with American volunteers (the original cohort was British), to judge whether accent plays a role; and with a filter designed to mimic cochlear implants. In every case, voice clones emerged victorious.
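The article does not say how that cochlear-implant filter was built, but such simulations are commonly done with noise vocoding: the signal’s amplitude envelope in each of a handful of frequency bands is used to modulate band-limited noise, discarding fine spectral detail much as an implant does. A minimal sketch of that standard technique (an illustration of the general method, not the authors’ implementation):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(signal, fs, n_bands=8, lo=100.0, hi=7000.0):
    """Crude noise vocoder: each band's amplitude envelope modulates noise."""
    edges = np.geomspace(lo, hi, n_bands + 1)        # log-spaced band edges
    rng = np.random.default_rng(0)
    out = np.zeros(len(signal))
    for low, high in zip(edges[:-1], edges[1:]):
        sos = butter(4, [low, high], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, signal)              # isolate one band
        envelope = np.abs(hilbert(band))             # its amplitude envelope
        carrier = sosfiltfilt(sos, rng.standard_normal(len(signal)))
        out += envelope * carrier                    # envelope-modulated noise
    return out
```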

After examining over 100 acoustic measurements, Adank believes the only way to solve the mystery is to work with collaborators who specialize in text-to-speech systems to adapt an existing open-source cloning system.

“I am now going to try and recreate [the effect] by studying how synthesizers work and how they use digital signal processing to generate those voices, just to get a bit of a handle on this,” said Adank.

Key Questions Answered:

Q: Does this mean we will eventually prefer talking to AI over people?

A: For raw information, like directions or customer support in a loud room, your brain might already prefer the AI. However, human speech carries emotional nuance and “soul” that clones still struggle to replicate perfectly. We may prefer AI for clarity, but humans for connection.

Q: Why does the AI sound clearer to someone with a cochlear implant?

A: Cochlear implants struggle with the “noise” and biological imperfections of human speech. AI voices are digitally precise, providing a cleaner signal that is easier for the implant’s processor to translate into electrical pulses for the brain.

Q: Can this technology be used to help people with speech impediments?

A: Yes. By understanding what makes AI speech so intelligible, we can develop “speech enhancers” that take a human voice in real time and digitally “clean it up” using these discovered acoustic rules, making the speaker easier for others to understand.

Editorial Notes:

  • This article was edited by a Neuroscience News editor.
  • Journal paper reviewed in full.
  • Additional context added by our staff.

About this AI and auditory neuroscience research news

Author: Hannah Daniel
Source: AIP
Contact: Hannah Daniel – AIP
Image: The image is credited to Neuroscience News

Original Research: Open access.
“Voice clones are easier to understand in noise than their human originals: the voice cloning intelligibility benefit” by Patti Adank and Han Wang. JASA
DOI: 10.1121/10.0043094


Abstract

Voice clones are easier to understand in noise than their human originals: the voice cloning intelligibility benefit

Voice cloning technology has developed rapidly and can currently produce high-quality humanlike voices from as little as 10 s of speech. It is unclear whether cloned voices are as intelligible as their human originals.

We compared the intelligibility of ten human voices with their ten voice clones in background noise. Eighty participants listened to 80 sentences (40 human, 40 cloned), presented in four signal-to-noise ratios (+3, 0, −3, and −6 dB) in an online experiment.
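The SNR manipulation is straightforward to reproduce: the masking noise is scaled so that its power sits the target number of decibels below or above the speech power, then the two are mixed. A minimal sketch, with random placeholder arrays standing in for the recorded sentences and noise:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise power ratio equals snr_db, then mix."""
    noise = noise[:len(speech)]          # assumes noise is at least speech length
    p_speech = np.mean(speech ** 2)      # average speech power
    p_noise = np.mean(noise ** 2)        # average noise power
    target_noise_power = p_speech / (10 ** (snr_db / 10))
    return speech + noise * np.sqrt(target_noise_power / p_noise)

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)      # placeholder: 1 s of "speech" at 16 kHz
noise = rng.standard_normal(16000)       # placeholder: background noise
stimuli = {snr: mix_at_snr(speech, noise, snr) for snr in (3, 0, -3, -6)}
```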

Cloned voices were up to 13.4% more intelligible than their human counterparts across all noise levels. Principal component analysis with linear discriminant analysis classified human and cloned voices correctly in 79.4% of cases based on an extensive set of acoustic measurements, confirming systematic acoustic differences between the two voice types.
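The abstract doesn’t report the pipeline’s settings (number of components, validation scheme), so the sketch below only illustrates the general approach with placeholder data: compress a large set of acoustic measures with PCA, then classify voice type with LDA.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 100))   # placeholder: 100 acoustic measures per item
y = rng.integers(0, 2, 80)           # placeholder labels: 0 = human, 1 = clone

clf = make_pipeline(
    StandardScaler(),                # put all acoustic measures on one scale
    PCA(n_components=10),            # compress correlated measures
    LinearDiscriminantAnalysis(),    # separate human vs. clone in PCA space
)
print(cross_val_score(clf, X, y, cv=5).mean())   # cross-validated accuracy
```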

Human listeners identified human voices with 70.4% accuracy. Elastic net regression analyses indicated that intelligibility in cloned voices was driven mainly by pitch and harmonic measures, whereas formant- and vowel-space measures were more influential for human voices.
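Elastic net regression fits this kind of question because it copes with many correlated predictors and shrinks uninformative coefficients to zero, leaving a short list of acoustic measures that predict intelligibility. A hypothetical sketch with placeholder data standing in for the study’s measurements:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.standard_normal((40, 100)))
y = rng.standard_normal(40)          # placeholder intelligibility scores

# l1_ratio balances lasso (sparsity) against ridge (shrinkage);
# cross-validation selects the overall regularization strength.
model = ElasticNetCV(l1_ratio=0.5, cv=5).fit(X, y)
kept = np.flatnonzero(model.coef_)   # predictors the model did not zero out
print(f"{kept.size} acoustic measures retained")
```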

Our findings have implications for applications of voice cloning, including voice restoration, speech synthesis for non-verbal individuals, and accessibility for people with hearing loss.
