Stroop Test Exposes Inherent LLM Flaw

Summary: A new study of large language model (LLM) attention administered the classic psychological “Stroop task” to frontier models, including GPT-5, Claude Opus 4.1, and Gemini 2.5. The models handled short word lists well, but their accuracy at naming ink colors fell sharply as the lists grew longer.

While humans routinely suppress the automatic impulse to read a word and stay accurate across long word lists, transformer-based attention degraded as the lists lengthened, and accuracy on mismatched items fell to near zero when matching and mismatched words were mixed in the same list.

Key Facts

The Machine Attention Audit: Suketu Patel and colleagues set out to explore how transformer-based machine attention differs from human attention. They used the “Stroop task”, a test in which color words are printed in colored ink and participants name the ink color while ignoring the word. It is used clinically to assess executive control, especially the ability to inhibit an automatic response.
The Length-Dependent Performance Crash: When the word and ink color did not match, the LLMs performed well on a list of five words. As the lists grew longer, their accuracy on the task degraded dramatically.
Reported Accuracy by Model: The study reported the following accuracy on mismatched word lists:
- GPT-4o: 91% accuracy at 5 words, 57% at 10 words, and 15% at 40 words.
- Claude 3.5 Sonnet: Stable through 20 words, then 24% accuracy at 40 words.
The Mixed-List Near-Zero Failure: In trials where a single list contained both matching and mismatched words, the LLMs performed even worse. Accuracy on the mismatched items dropped to near 0%.
Newer Models Show the Same Pattern: The problem is not limited to older models. Similar results were found with GPT-5, Claude Opus 4.1, and Gemini 2.5, which struggled to stay focused on naming the color rather than defaulting to word reading.
Biological vs. Synthetic Attention: Both humans and LLMs are better trained on word reading than on color naming. Humans, however, can exert top-down executive control to suppress word reading and stay on task across long lists. According to the authors, the performance collapse of LLMs suggests fundamental limitations compared with biological attention.

Source: PNAS Nexus

Giving AI a classic psychological test reveals an inherent weakness in LLM decision-making abilities.

Suketu Patel and colleagues explored how transformer-based machine attention differs from human attention by testing AI models on the “Stroop task,” in which words for colors are printed in colored ink, and participants are asked to name the ink color of each word while ignoring its meaning.
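For readers who want to picture the setup, the following is a minimal sketch of how a text-only Stroop list might be built. It is an illustration under stated assumptions, not the authors' materials: the color set, the function names `make_stroop_list` and `render`, and the `WORD [in ink]` notation (borrowed from the way the paper's abstract writes its examples) are all hypothetical.

```python
import random

# Assumed color set; the study's actual colors are not given in this article.
COLORS = ["red", "blue", "green", "yellow"]


def make_stroop_list(n_items, condition, seed=0):
    """Build a list of (word, ink) pairs for one Stroop trial.

    condition is "congruent" (word matches ink), "incongruent"
    (word never matches ink), or "mixed" (each item matches with
    probability 0.5).
    """
    if condition not in ("congruent", "incongruent", "mixed"):
        raise ValueError(f"unknown condition: {condition}")
    rng = random.Random(seed)
    items = []
    for _ in range(n_items):
        word = rng.choice(COLORS)
        if condition == "congruent":
            match = True
        elif condition == "incongruent":
            match = False
        else:
            match = rng.random() < 0.5
        # A mismatched ink is drawn from the colors other than the word itself.
        ink = word if match else rng.choice([c for c in COLORS if c != word])
        items.append((word.upper(), ink))
    return items


def render(items):
    """Render items as text, one per line, e.g. 'RED [in blue]'."""
    return "\n".join(f"{word} [in {ink}]" for word, ink in items)
```

Calling `render(make_stroop_list(40, "incongruent"))` would produce a 40-item mismatched list of the kind on which the reported accuracy fell; the correct answer for each line is the ink color, not the word.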

The task is clinically used to assess executive control, especially a person’s ability to inhibit an automatic response. Although humans generally take longer to answer correctly when words and colors are mismatched than when they match, they can still perform stably and with high accuracy even on long word lists.

The authors found that when the word and ink color did not match, LLMs performed well with a list of five words. But as the list of words grew longer, AI performance degraded dramatically. GPT-4o dropped from 91% accuracy at 5 words to 57% accuracy at 10 words and 15% accuracy at 40 words. Claude 3.5 Sonnet was stable through 20 words, but crashed to 24% accuracy at 40 words. In trials with a list of words in both matching and mismatched colors, LLM performance was even worse, dropping to near 0% accuracy for the mismatched items.
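The accuracy figures above are per-condition scores, and a short sketch may help show why a model that defaults to word reading scores near 0% on mismatched items while still looking fine on matching ones. This is a hypothetical scoring routine, not the authors' code; the function name `score` and the `(word, ink)` pair format are assumptions made for illustration.

```python
def score(items, responses):
    """Return accuracy per condition for one Stroop list.

    items is a list of (word, ink) pairs and responses is the list of
    color names given for each item. The correct answer is always the
    ink color. A condition with no items scores None.
    """
    if len(items) != len(responses):
        raise ValueError("items and responses must have the same length")
    totals = {"congruent": 0, "incongruent": 0}
    correct = {"congruent": 0, "incongruent": 0}
    for (word, ink), response in zip(items, responses):
        condition = "congruent" if word.lower() == ink else "incongruent"
        totals[condition] += 1
        if response.strip().lower() == ink:
            correct[condition] += 1
    return {
        c: (correct[c] / totals[c] if totals[c] else None) for c in totals
    }


# A small mixed list: two matching items and two mismatched items.
mixed = [("RED", "red"), ("RED", "blue"), ("GREEN", "yellow"), ("BLUE", "blue")]

# A responder that reads the word instead of naming the ink color.
word_reading = [word.lower() for word, _ in mixed]

# A responder that names the ink color as instructed.
color_naming = [ink for _, ink in mixed]
```

Here `score(mixed, word_reading)` gives 1.0 on the matching items and 0.0 on the mismatched ones, while `score(mixed, color_naming)` gives 1.0 on both. Scoring the two conditions separately is what makes a word-reading default visible in a mixed list.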

Similar results were found with GPT-5, Claude Opus 4.1, and Gemini 2.5. LLMs struggled to stay focused on naming the color rather than defaulting to word reading.

As with humans, LLMs are better trained on word reading than on color naming, yet humans can suppress word reading in long lists and maintain focus on the task at hand. According to the authors, the performance collapse of LLMs suggests fundamental limitations compared with biological attention.

Key Questions Answered:

Q: Why does a simple word-and-color task trip up advanced AI models?

A: Because the Stroop task tests executive control, the ability to intentionally inhibit an automatic response. LLMs are better trained on reading words than on naming colors. When asked to ignore a word’s meaning and report only its ink color, the models follow the instruction on short lists but increasingly default to word reading as the lists grow longer.

Q: How poorly did next-generation systems like GPT-5 and Claude Opus 4.1 perform on longer word lists?

A: The study reports similar results for these models, though the specific figures given are for earlier ones. GPT-4o fell from 91% accuracy at 5 words to 57% at 10 words and 15% at 40 words, and Claude 3.5 Sonnet held steady through 20 words before dropping to 24% at 40 words. In mixed lists of matching and mismatched words, LLM accuracy on the mismatched items dropped to near 0%.

Q: What does this failure tell cognitive neuroscientists about the difference between human and AI attention?

A: According to the authors, it suggests that transformer-based machine attention has fundamental limitations compared with biological attention. Both humans and LLMs are better trained on word reading than on color naming, but humans can maintain top-down control to suppress word reading across long word lists. The LLMs tested did not sustain that control as the lists grew, which suggests that transformer attention struggles to resolve conflict over extended contexts.

Editorial Notes:

This article was edited by a Neuroscience News editor.
Journal paper reviewed in full.
Additional context added by our staff.

About this AI reasoning research news

Author: Jin Fan
Source: PNAS Nexus
Contact: Jin Fan – PNAS Nexus
Image: The image is credited to Neuroscience News

Original Research: Closed access.
“Deficient executive control in transformer attention” by Suketu Chandrakant Patel, Hongbin Wang, and Jin Fan. PNAS Nexus
DOI:10.1093/pnasnexus/pgag149

Abstract

Deficient executive control in transformer attention

Although transformers in large language models (LLMs) effectively implement a self-attention mechanism that has revolutionized natural language processing, they lack an explicit architecture for the executive control of attention found in humans, which is essential for resolving conflicts and selecting relevant information in the presence of competing computations and is critical for adaptive behavior.

To investigate the impact of this limitation in LLMs, we employed the classic color Stroop task, widely regarded as the gold standard, to test the executive control of attention in these models.

Our results revealed a typical conflict effect of underperformance in terms of accuracy in the incongruent condition (e.g. naming the color of the word RED in blue) compared with the congruent condition (e.g. naming the color of the word RED in red), in short word lists, similar to human performance.

However, as the length of the word lists increased, performance on the incongruent condition degraded toward near-total performance collapse, even as accuracy in the congruent condition remained excellent, and word reading (e.g. reading the word RED [in red] or RED [in blue], ignoring the color) was near-perfect.

These findings demonstrate that transformer attention mechanisms are fundamentally limited in their capacity for conflict resolution across extended contexts, and a failure to up-regulate control adaptively under rising interference.

We suggest that incorporating executive control mechanisms akin to those in biological attention is crucial for achieving artificial general intelligence.