Summary: For decades, neuroscientists have wondered how humans can isolate a single conversation in a loud room, a phenomenon known as the “cocktail party problem.” MIT researchers have now provided a computational explanation.
In the study, researchers used a modified neural network to show that a simple “multiplicative gain” (amplifying neurons tuned to a target’s pitch or location) is enough to explain selective attention. The model doesn’t just identify sounds; it reproduces human errors and spatial quirks, suggesting that the brain’s “volume knob” for specific vocal features is the key to focus.
Key Facts
- The Multiplicative Gain: When you focus on a voice, neurons tuned to that voice’s specific features (like pitch) scale their activity upward, effectively “multiplying” the signal.
- Feature Cues: The model uses a brief “cue” of a voice to determine which neural units to boost. If a voice is low-pitched, units representing low pitch get a large gain, while high-pitch units are attenuated.
- Horizontal vs. Vertical Separation: The MIT team discovered that both humans and the model are much better at isolating voices when they are separated horizontally (left vs. right) than vertically (up vs. down).
- Predictive Errors: The model closely mirrors human behavior, including the struggle to distinguish between two voices of the same gender, which tend to have similar pitches.
- Future Application: This research could inform improvements to cochlear implants, helping users filter out background noise more effectively in crowded environments.
Source: MIT
MIT neuroscientists have figured out how the brain is able to focus on a single voice among a cacophony of many voices, shedding light on a longstanding neuroscientific phenomenon known as the cocktail party problem.
This attentional focus becomes necessary when you’re in any crowded environment, such as a cocktail party, with many conversations going on at once. Somehow, your brain is able to follow the voice of the person you’re talking to, despite all the other voices that you’re hearing in the background.
Using a computational model of the auditory system, the MIT team found that amplifying the activity of the neural processing units that respond to features of a target voice, such as its pitch, allows that voice to be boosted to the forefront of attention.
“That simple motif is enough to cause much of the phenotype of human auditory attention to emerge, and the model ends up reproducing a very wide range of human attentional behaviors for sound,” says Josh McDermott, a professor of brain and cognitive sciences at MIT, a member of MIT’s McGovern Institute for Brain Research and Center for Brains, Minds, and Machines, and the senior author of the study.
The findings are consistent with previous studies showing that when people or animals focus on a specific auditory input, neurons in the auditory cortex that respond to features of the target stimulus amplify their activity. This is the first study to show that this extra boost is enough to explain how the brain solves the cocktail party problem.
Ian Griffith, a graduate student in the Harvard Program in Speech and Hearing Biosciences and Technology, who is advised by McDermott, is the lead author of the paper. MIT graduate student R. Preston Hess is also an author of the paper, which appears today in Nature Human Behaviour.
Modeling attention
Neuroscientists have been studying the phenomenon of selective attention for decades. Many studies in people and animals have shown that when focusing on a particular stimulus like the sound of someone’s voice, neurons that are tuned to features of that voice (for example, high pitch) amplify their activity.
When this amplification occurs, neurons’ firing rates are scaled upward, as though multiplied by a number greater than one. It has been proposed that these “multiplicative gains” allow the brain to focus its attention on certain stimuli. Neurons that aren’t tuned to the target feature exhibit a corresponding reduction in activity.
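To make the scaling concrete, here is a minimal sketch in Python; the firing rates and gain values are invented for illustration and do not come from the study:

```python
import numpy as np

# Hypothetical baseline firing rates of five units, each tuned to a
# different pitch band, ordered from low to high pitch (spikes/s).
rates = np.array([4.0, 9.0, 12.0, 6.0, 3.0])

# Hypothetical attentional gains when attending to a low-pitched voice:
# low-pitch units are scaled up (>1), high-pitch units attenuated (<1).
gains = np.array([1.8, 1.5, 1.0, 0.7, 0.5])

# Attention multiplies each unit's rate rather than adding to it.
attended_rates = gains * rates
print(attended_rates)  # -> 7.2, 13.5, 12.0, 4.2, 1.5
```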
“The responses of neurons tuned to features that are in the target of attention get scaled up,” Griffith says. “Those effects have been known for a very long time, but what’s been unclear is whether that effect is sufficient to explain what happens when you’re trying to pay attention to a voice or selectively attend to one object.”
This question has remained unanswered because computational models of perception haven’t been able to perform attentional tasks such as picking one voice out of many. Such models can readily perform auditory tasks when there is an unambiguous target sound to identify, but they haven’t been able to perform those tasks when other stimuli are competing for their attention.
“None of our models has had the ability that humans have, to be cued to a particular object or a particular sound and then to base their response on that object or that sound. That’s been a real limitation,” McDermott says.
In this study, the MIT team wanted to see if they could train a model to perform those types of tasks by enabling it to produce neuronal activity boosts like those seen in the human brain.
To do that, they began with a neural network that they and other researchers have used to model audition, and then modified the model to allow each of its stages to implement multiplicative gains. Under this architecture, the activation of processing units within the model can be scaled up or down depending on the specific features they represent, such as pitch.
To train the model, on each trial the researchers first fed it a “cue”: an audio clip of the voice that they wanted the model to pay attention to. The unit activations produced by the cue then determined the multiplicative gains that were applied when the model heard a subsequent stimulus.
“Imagine the cue is an excerpt of a voice that has a low pitch. Then, the units in the model that represent low pitch would get multiplied by a large gain, whereas the units that represent high pitch would get attenuated,” Griffith says.
Then, the model was given clips featuring a mix of voices, including the target voice, and asked to identify the second word said by the target voice. The model activations to this mixture were multiplied by the gains that resulted from the previous cue stimulus. This was expected to cause the target voice to be “amplified” within the model, but it was not clear whether this effect would be enough to yield human-like attentional behavior.
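The cue-then-select logic can be sketched in a few lines. This toy PyTorch snippet only illustrates the procedure described above; the layer, its size, and the mapping from cue activations to gains are assumptions, not the published architecture:

```python
import torch
import torch.nn as nn

# Toy illustration of cue-conditioned multiplicative gains.
# All sizes and the cue-to-gain mapping are invented for this sketch.
class GainModulatedLayer(nn.Module):
    def __init__(self, n_units: int = 64):
        super().__init__()
        self.encode = nn.Linear(n_units, n_units)

    def forward(self, x: torch.Tensor, gains: torch.Tensor) -> torch.Tensor:
        # Each unit's activation is multiplied by its gain, mirroring
        # the multiplicative scaling described in the article.
        return torch.relu(self.encode(x)) * gains

layer = GainModulatedLayer()

# Step 1: the cue clip evokes a pattern of unit activations.
cue_activations = torch.rand(1, 64)

# Step 2: the cue activations set the gains. Units strongly driven by
# the cue get gains above 1; weakly driven units are attenuated.
gains = 0.5 + 1.5 * cue_activations

# Step 3: the same gains multiply the response to the voice mixture,
# boosting target-tuned units relative to distractor-tuned ones.
mixture_activations = torch.rand(1, 64)
selected = layer(mixture_activations, gains)
```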
The researchers found that under a variety of conditions, the model performed very similarly to humans, and it tended to make errors similar to those that humans make. For example, like humans, it sometimes made mistakes when trying to focus on one of two male voices or one of two female voices, which are more likely to have similar pitches.
“We did experiments measuring how well people can select voices across a pretty wide range of conditions, and the model reproduces the pattern of behavior pretty well,” Griffith says.
Effects of location
Previous research has shown that in addition to pitch, spatial location is a key factor that helps people focus on a particular voice or sound. The MIT team found that the model also learned to use spatial location for attentional selection, performing better when the target voice was at a different location from distractor voices.
The researchers then used the model to discover new properties of human spatial attention, testing all possible combinations of target and distractor locations, an undertaking that would be hugely time-consuming with human subjects.
“You can use the model as a way to screen large numbers of conditions to look for interesting patterns, and then once you find something interesting, you can go and do the experiment in humans,” McDermott says.
These experiments revealed that the model was much better at correctly selecting the target voice when the target and distractor were at different locations in the horizontal plane. When the sounds were instead separated in the vertical plane, this task became much more difficult. When the researchers ran a similar experiment with human subjects, they observed the same result.
“That was just one example where we were able to use the model as an engine for discovery, which I think is an exciting application for this kind of model,” McDermott says.
Another application the researchers are pursuing is using this kind of model to simulate listening through a cochlear implant. These studies, they hope, could lead to improvements in cochlear implants that help users focus their attention more successfully in noisy environments.
Funding: The research was funded by the National Institutes of Health.
Key Questions Answered:
Q: Why can you hear one friend’s voice in a noisy room?
A: Your brain is constantly fighting a “signal-to-noise” battle. Every voice hitting your ears competes for the same neural processing units. Without selective attention, your brain would treat the background chatter with the same importance as your friend’s voice. This study shows that you “hear” your friend because your brain is literally multiplying their signal while turning down the “gain” on everyone else.
Q: Does the brain also suppress the voices you are ignoring?
A: Yes. The MIT model confirms that while neurons representing the person you want to hear get a boost, the neurons representing “distractor” voices exhibit a reduction in activity. You aren’t just hearing one person better; you are actively suppressing the rest.
Q: Why are voices easier to separate side by side than one above the other?
A: The study found a “horizontal advantage.” Our brains (and the model) are highly effective at using the tiny time and level differences between sound reaching the left and right ears to separate voices. Voices separated only vertically produce no such interaural differences, leaving only subtler spectral cues, so we are much less efficient at telling them apart.
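For a feel for the numbers, the sketch below estimates the interaural time difference (ITD) using the textbook Woodworth spherical-head approximation. The head radius and speed of sound are standard reference values, not parameters from the paper:

```python
import numpy as np

HEAD_RADIUS = 0.0875    # meters, typical adult head
SPEED_OF_SOUND = 343.0  # meters/second in air

def itd_seconds(azimuth_deg: float) -> float:
    """Approximate ITD for a distant source at the given azimuth
    (Woodworth spherical-head model)."""
    theta = np.radians(azimuth_deg)
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (np.sin(theta) + theta)

print(itd_seconds(0.0) * 1e6)   # ~0 microseconds: straight ahead
print(itd_seconds(90.0) * 1e6)  # ~656 microseconds: directly to one side

# Two voices separated only in elevation share the same azimuth, so
# their ITDs match and this powerful cue is unavailable.
```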
About this auditory neuroscience research news
Author: Sarah McDonnell
Source: MIT
Contact: Sarah McDonnell – MIT
Image: The image is credited to Neuroscience News
Original Research: Open access.
“Optimized feature gains explain and predict successes and failures of human selective listening” by Ian M. Griffith, R. Preston Hess & Josh H. McDermott. Nature Human Behaviour
DOI: 10.1038/s41562-026-02414-7
Abstract
Optimized feature gains explain and predict successes and failures of human selective listening
Attention facilitates communication by enabling selective listening to sound sources of interest. However, little is known about why attentional selection succeeds in some conditions but fails in others.
While neurophysiology implicates multiplicative feature gains in selective attention, it is unclear whether such gains can explain real-world attention-driven behaviour.
Here we optimized an artificial neural network with stimulus-computable feature gains to recognize a cued talker’s speech from binaural audio in “cocktail party” scenarios.
Though not trained to mimic humans, the model produced human-like performance across diverse real-world conditions, exhibiting selection based both on voice qualities and on spatial location as well as selection failures in conditions where humans tended to fail.
It also predicted novel attentional effects that we confirmed in human experiments, and exhibited signatures of “late selection” like those seen in human auditory cortex. The results suggest that human-like attentional strategies naturally arise from the optimization of feature gains for selective listening.

