Summary: When faced with complex choices, people show bursts of exploration before settling into preferred options of higher value.
Source: Penn State
In a world that offers a seemingly unending number of options and opportunities, people may rely on the overall complexity of alternative options to help them make choices in uncertain environments, according to researchers.
In a study recently published in the journal Cognition, the researchers found that when participants faced complex choices, they often showed a burst of exploration before settling into preferred options of higher value. Instead of trying to represent the values of all of the alternatives, adaptive decision-making was supported by selectively maintaining high-value options while forgetting the rest. This strategy may be one way that people can conserve their cognitive resources and solve problems that exceed their working memory capacity.
It might, for example, explain why people have their go-to meals when they visit restaurants, said Michael Hallquist, assistant professor of psychology at Penn State and Institute for CyberScience co-hire.
“There is a set of neural circuits—and cognitive processes that these circuits instantiate—that help you remember the value of different actions, so if you go to a restaurant and try the steak and it was fantastic, the next time you’ll usually remember that,” said Hallquist. “The difficulty, though, is that at any given moment, you’re faced with so many possibilities that you can’t possibly evaluate all of the alternatives in detail. In the decision-making literature, this has been called the exploration-exploitation dilemma. Keeping this in the context of the restaurant example, exploration would be ordering something you haven’t tried before and exploitation would be going back to the steak you know is going to be good. By comparison, if you had previously tried the lasagna and it was unremarkable, would you remember this as clearly as the steak?”
To study the exploration-exploitation dilemma, the researchers recruited 76 participants to complete a timed task that was divided into eight sessions, or runs. Each run consisted of 50 trials. During a trial, a clock hand revolved around an image, for example a face with a happy or unhappy expression, or an abstract image. The subjects could stop the revolving hand and, depending on when they stopped it, received a reward of between 0 and 150 points. To create an uncertain environment, the payoff for choices was inconsistent and varied as a function of time. In some runs, the contingency rewarded subjects who waited, whereas in other runs it rewarded subjects who responded more quickly.
“They didn’t know any of this going into the test, they had to learn it as they went,” said Hallquist. “They’re learning whether they should wait, or whether they should act quickly. It sounds easy, but it can be tricky because the payouts are probabilistic. So, you may choose to respond in two seconds and receive 100 points. And then you may hit that mark again and get no points, so people have to integrate the long-run outcomes.”
Using mathematical models of decision-making, Hallquist and Alexandre Y. Dombrovski, associate professor of psychiatry, University of Pittsburgh, found that subjects’ decisions were consistent with a strategy of selectively maintaining high-value response times. Alternative models that represented the values of all response times, or that promoted or discouraged responses based on uncertainty, were not supported. Altogether, these results suggest that people solve the exploration-exploitation dilemma in part by sampling many different options, then compressing the information that they need to track, according to Hallquist.
“You are assigning and holding onto things that are especially valuable, and you devote cognitive horsepower to representing those things with high fidelity,” said Hallquist. “This helps you solve this really hard problem because you can’t represent all of it, so there has to be some way of compressing this information.”
Better representing real-life decisions
The experiment was designed to better represent decisions in real life, according to Hallquist.
“In most decision-making experiments, you’re only choosing among a few things, but in real life, you’re faced with many, many options,” he said. “We saw a timed task with varying outcomes as a more realistic test of how people make these exploratory versus exploitative choices in a complex environment.”
In the future, the researchers are planning to analyze data from fMRI—functional magnetic resonance imaging—to better understand how the brain represents decision-relevant signals during the tasks.
Brianne Fagan – Penn State
Original Research: Open access
“Selective maintenance of value information helps resolve the exploration/exploitation dilemma”. Michael N. Hallquist, Alexandre Y. Dombrovski.
Selective maintenance of value information helps resolve the exploration/exploitation dilemma
In natural environments with many options of uncertain value, one faces a difficult tradeoff between exploiting familiar, valuable options or searching for better alternatives. Reinforcement learning models of this exploration/exploitation dilemma typically modulate the rate of exploratory choices or preferentially sample uncertain options. The extent to which such models capture human behavior remains unclear, in part because they do not consider the constraints on remembering what is learned.
Using reinforcement-based timing as a motivating example, we show that selectively maintaining high-value actions compresses the amount of information to be tracked in learning, as quantified by Shannon’s entropy. In turn, the information content of the value representation controls the balance between exploration (high entropy) and exploitation (low entropy). Selectively maintaining preferred action values while allowing others to decay renders the choices increasingly exploitative across learning episodes.
To adjudicate among alternative maintenance and sampling strategies, we developed a new reinforcement learning model, StrategiC ExPloration/ExPloitation of Temporal Instrumental Contingencies (SCEPTIC). In computational studies, a resource-rational selective maintenance approach was as successful as more resource-intensive strategies. Furthermore, human behavior was consistent with selective maintenance; information compression was most pronounced in subjects with superior performance and non-verbal intelligence, and in learnable vs. unlearnable contingencies. Cognitively demanding uncertainty-directed exploration recovered a more accurate representation in simulations with no foraging advantage and was strongly unsupported in our human study.
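The link the abstract draws between selective maintenance and Shannon's entropy can be illustrated with a toy sketch. This is not the SCEPTIC model: the four discrete response-time bins, the delta-rule update, the learning rate `alpha`, the `decay` parameter, and the function names are all assumptions chosen for illustration. The sketch repeatedly rewards one option and compares the entropy of the normalized value vector when unchosen values are kept intact (full maintenance) versus allowed to decay (selective maintenance).

```python
import math


def shannon_entropy(values):
    """Shannon entropy (in bits) of a non-negative value vector,
    after normalizing the values so they sum to 1."""
    total = sum(values)
    if total <= 0:
        return 0.0
    probs = [v / total for v in values if v > 0]
    return -sum(p * math.log2(p) for p in probs)


def update_values(values, chosen, reward, alpha=0.2, decay=0.0):
    """Hypothetical update rule: the chosen option moves toward the
    observed reward (delta rule); every other option decays toward zero.
    decay=0.0 corresponds to full maintenance of all values."""
    return [
        v + alpha * (reward - v) if i == chosen else v * (1.0 - decay)
        for i, v in enumerate(values)
    ]


def entropy_trajectory(decay, n_trials=20):
    """Entropy of the value vector across trials when option 0 is
    chosen and rewarded on every trial (an assumed, simplified scenario)."""
    values = [1.0, 1.0, 1.0, 1.0]  # four assumed response-time bins
    trajectory = [shannon_entropy(values)]
    for _ in range(n_trials):
        values = update_values(values, chosen=0, reward=1.0, decay=decay)
        trajectory.append(shannon_entropy(values))
    return trajectory


if __name__ == "__main__":
    full = entropy_trajectory(decay=0.0)
    selective = entropy_trajectory(decay=0.2)
    print(f"full maintenance:      {full[0]:.2f} -> {full[-1]:.2f} bits")
    print(f"selective maintenance: {selective[0]:.2f} -> {selective[-1]:.2f} bits")
```

In this toy setting the entropy stays at its maximum under full maintenance and falls steadily once unchosen values decay, which mirrors the abstract's description of choices becoming increasingly exploitative as low-value options are forgotten. A model fit to real behavior would also need a choice rule and probabilistic, time-varying rewards, which are omitted here.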