Summary: A new study reveals that while Large Language Models (LLMs) can produce outputs that appear highly creative in isolation, they suffer from a fundamental lack of diversity. By testing humans against a broad range of models (including Gemini, GPT, and Llama) on standard creativity tasks—such as finding alternative uses for common objects—the researchers found that AI responses are significantly more similar to one another than human responses.
While an individual AI response might be rated as “more creative” than the average human’s, the collective output of LLMs is remarkably homogeneous. The study suggests that relying on AI for brainstorming or art risks narrowing the scope of human thinking, as these models lack the lived experience and individuality that drive true human innovation.
Key Facts
- The Sameness Paradox: In isolation, a single LLM response often outscores the average human in creativity tests, but across multiple trials, LLMs consistently provide the same “creative” ideas.
- Temperature vs. Logic: Increasing a model’s “temperature” (randomness) makes outputs more diverse, but it quickly leads to “gibberish” that fails to meet task requirements, highlighting a rigid limit to AI “imagination.”
- Cross-Model Consistency: The tendency toward boring, repetitive outputs was found across all major models (Gemini, GPT, Llama), suggesting the issue is inherent to the nature of LLMs rather than any specific brand.
- The “Human” Element: Researchers argue that LLMs may never truly reach human-level creativity because they lack bodies, intentions, and personal experiences—the core ingredients of unique thought.
Source: PNAS Nexus
Can using a large language model (LLM) make a person more creative?
Prior work has shown that using LLMs can make creative outputs more homogeneous, but this homogenization could stem from the specific LLM used or from widespread use of the same model.
Emily Wenger and Yoed N. Kenett asked humans recruited from the Prolific platform and a broad range of LLMs to complete multiple tasks designed to measure different facets of creativity.
For example, one task asked participants to come up with as many uses as possible for an item like a fork or a pair of pants. Another task asked participants to think of 10 nouns that are as different from one another as possible.
Across the board, the authors found that LLM responses were significantly more similar to each other than human responses. In isolation, a single LLM response to a task was typically rated as roughly equally creative or more creative than the average human response.
However, when compared to other outputs from other LLMs—whether Gemini, GPT, or Llama—similar ideas and responses emerged again and again.
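The homogeneity finding can be illustrated with a toy diversity metric. The paper's exact measure is not described here, but a common proxy (an assumption for illustration, not necessarily the authors' method) is the mean pairwise cosine similarity of response embeddings: the higher the average similarity, the more homogeneous the set of responses.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all pairs of vectors;
    a higher value indicates a more homogeneous response set."""
    n = len(vectors)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

# Toy 2-D "embeddings": the LLM-like set clusters tightly,
# the human-like set is spread out.
llm_like = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.1]]
human_like = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]

llm_sim = mean_pairwise_similarity(llm_like)
human_sim = mean_pairwise_similarity(human_like)
```

On these toy vectors, the tightly clustered set scores near 1.0 while the spread-out set scores well below it, mirroring the LLM-versus-human pattern the study reports.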
Increasing model temperature, which controls the level of randomness in model outputs, made responses more variable, but higher temperatures also quickly turned the outputs into gibberish that did not fulfill task requirements.
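The temperature trade-off above follows from how sampling works: logits are divided by the temperature before the softmax, so high temperatures flatten the distribution toward uniform (more variety, including nonsense tokens) while low temperatures sharpen it toward the single most likely token. A minimal sketch with made-up logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then apply softmax.
    Low temperature -> peaked distribution (repetitive outputs);
    high temperature -> flat distribution (diverse but noisy)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
low = softmax_with_temperature(logits, 0.5)   # peaked: top token dominates
high = softmax_with_temperature(logits, 2.0)  # flatter: low-ranked tokens gain mass
```

With temperature 0.5 the top token takes most of the probability mass; at 2.0 the mass spreads toward the unlikely tokens, which is why very high settings start emitting off-task text.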
According to the authors, it is likely the use of LLMs in general, rather than the use of any specific LLM, that causes outputs to be homogeneous.
Whether LLMs can be improved to reach or surpass human creativity is an open question, given that by their nature, they lack bodies, experiences, intentions, individuality, or understanding, some or all of which may be necessary to simulate human creativity.
According to the authors, relying on LLMs for brainstorming, problem solving, or making art risks harming human thinking.
Key Questions Answered:
Q: Is a single AI response more creative than a human's?
A: On paper, sometimes. A single AI response to “how many uses for a fork?” might be clever. But if you ask 100 AIs, they’ll all give you the same clever answers, whereas 100 humans will give you 100 different paths.
Q: Can’t you just turn up the randomness to get more variety?
A: You can increase the “temperature,” but it backfires. As the AI gets more “random” to find new ideas, it loses its grip on logic, eventually producing nonsense that doesn’t actually solve the problem.
Q: Will AI ever match human creativity?
A: It’s an open question. Many researchers believe true creativity requires “lived experience” and “intention”—things a cold, statistical model simply doesn’t have. AI can simulate patterns, but it can’t “experience” an idea.
Editorial Notes:
- This article was edited by a Neuroscience News editor.
- Journal paper reviewed in full.
- Additional context added by our staff.
About this AI and creativity research news
Author: Emily Wenger
Source: PNAS Nexus
Contact: Emily Wenger – PNAS Nexus
Image: The image is credited to Neuroscience News
Original Research: Closed access.
“Large language models are homogeneously creative” by Emily Wenger and Yoed N. Kenett. PNAS Nexus
DOI: 10.1093/pnasnexus/pgag042
Abstract
Large language models are homogeneously creative
Numerous large language models (LLMs) are marketed for use as creativity support tools, despite several studies showing that using an LLM as a creative partner narrows creative outputs.
However, these studies only consider the effects of interacting with a single LLM on specific creativity tasks, begging the question of whether narrowed creativity stems from using a particular LLM—with an arguably limited range of outputs—or from using LLMs in general.
To test this, we elicit creative responses from many humans and LLMs using standardized creativity tasks and compare population-level response diversity. We find that LLM responses mirror other LLM responses far more than humans do other humans, even after controlling for key confounding variables.
This finding adds a new dimension to the ongoing discussion about creativity and LLMs. If today’s LLMs behave similarly, using them as creative partners—regardless of the model used—may drive users toward similar “creative” outputs.


