Summary: A new study comparing human and AI-generated conversations reveals that large language models like ChatGPT and Claude still fail to convincingly mimic natural human dialogue. Researchers found that these systems over-imitate their conversation partners, misuse filler words such as “well” or “like,” and struggle with natural openings and closings.
These quirks, including an “exaggerated alignment” with the conversation partner, give away the systems’ artificial nature even when their grammar and logic are flawless. While AI speech continues to evolve rapidly, researchers say key social subtleties of human interaction may remain out of reach.
Key Facts:
- Exaggerated Imitation: AI models mimic human speech patterns too intensely, a trait humans instinctively recognize as unnatural.
- Filler Word Misuse: Large language models misplace or overuse discourse markers like “so” or “well,” breaking conversational flow.
- Poor Transitions: AI often fails at natural openings and endings, missing the social nuances that frame human dialogue.
Source: NTNU
It is easy to be impressed by artificial intelligence. Many people use large language models such as ChatGPT, Copilot and Perplexity to help solve a variety of tasks or simply for entertainment purposes.
But just how good are these large language models at pretending to be human?
Not very, according to recent research.
“Large language models speak differently than people do,” said Associate Professor Lucas Bietti from the Norwegian University of Science and Technology’s (NTNU) Department of Psychology.
Bietti was one of the authors of a research article recently published in the journal Cognitive Science. The lead author is Eric Mayor from the University of Basel, while the final author is Adrian Bangerter from the University of Neuchâtel.
Tested several models
The large language models the researchers tested were GPT-4 (the model behind ChatGPT), Claude Sonnet 3.5, Vicuna and Wayfarer.
- First, they compared transcripts of phone conversations between humans with conversations simulated by the large language models (a rough, illustrative sketch of such a simulation follows below).
- They then checked whether other people could distinguish the human phone conversations from those generated by the language models.
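As a rough illustration of what simulating such a conversation with a large language model can look like, here is a minimal, hypothetical sketch using the OpenAI Python SDK. The prompt wording is invented for illustration and is not the prompt used in the study.

```python
# Minimal, hypothetical sketch: ask an LLM to simulate a two-speaker phone
# conversation and print the transcript. Assumes the OpenAI Python SDK
# (openai>=1.0) and an OPENAI_API_KEY in the environment; the prompt text
# is illustrative, not the researchers' actual prompt.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Simulate a casual telephone conversation between two strangers, "
    "Speaker A and Speaker B, talking about their hobbies. Write it as a "
    "transcript, one turn per line, prefixed with 'A:' or 'B:'."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": PROMPT}],
)

simulated_transcript = response.choices[0].message.content
print(simulated_transcript)
```

Transcripts generated this way could then be compared, turn by turn, with transcripts of real phone conversations.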
For the most part, people are not fooled – or at least not yet. So what are the language models doing wrong?
Too much imitation
When people talk to each other, there is a certain amount of imitation that goes on. We slightly adapt our words and the conversation according to the other person. However, the imitation is usually quite subtle.
“Large language models are a bit too eager to imitate, and this exaggerated imitation is something that humans can pick up on,” explained Bietti.
This is called ‘exaggerated alignment’.
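The alignment the researchers measured covers conceptual, syntactic and lexical similarity between speakers. As a purely illustrative sketch (not the study’s actual metric), lexical alignment can be approximated as the share of words in a turn that echo the previous turn; the function and the example turns below are hypothetical.

```python
# Toy illustration of lexical alignment: for each turn, the share of its
# words that already appeared in the preceding turn. The study's alignment
# measures (conceptual, syntactic, lexical) are more refined; this only
# makes the idea of "echoing the partner" concrete.
def lexical_alignment(turns):
    """Return per-turn overlap with the preceding turn, each in [0, 1]."""
    scores = []
    for prev, curr in zip(turns, turns[1:]):
        prev_words = set(prev.lower().split())
        curr_words = curr.lower().split()
        if not curr_words:
            scores.append(0.0)
            continue
        shared = sum(1 for word in curr_words if word in prev_words)
        scores.append(shared / len(curr_words))
    return scores

# Hypothetical turns: the second speaker in the LLM-style exchange echoes
# far more of the partner's wording than in the human-style exchange.
human_turns = ["so how was your weekend", "oh pretty good, went hiking with my sister"]
llm_turns = ["so how was your weekend", "so my weekend was good, how was your weekend"]

print(lexical_alignment(human_turns))  # low overlap
print(lexical_alignment(llm_turns))    # high overlap
```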
But that is not all.
Incorrect use of filler words
Movies with bad scripts usually have conversations that sound artificial. In such cases, the scriptwriters have often forgotten that conversations do not consist only of the necessary content words. In real, everyday conversations, most of us include small words called ‘discourse markers’.
These are words like ‘so’, ‘well’, ‘like’ and ‘anyway’.
These words have a social function because they can signal interest, belonging, attitude or meaning to the other person. In addition, they can also be used to structure the conversation.
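To make the idea concrete, here is a rough, hypothetical sketch of how discourse-marker frequency could be tallied in a transcript. The marker list and the whole-word counting rule are simplifications and are not the study’s coordination-marker analysis.

```python
# Rough, hypothetical sketch of tallying discourse markers in a transcript.
# The marker list and the whole-word counting rule are simplifications and
# do not reproduce the study's analysis of coordination markers.
import re
from collections import Counter

DISCOURSE_MARKERS = ["so", "well", "like", "anyway", "you know", "i mean"]

def marker_counts(transcript: str) -> Counter:
    """Count whole-word occurrences of each marker in the transcript."""
    text = transcript.lower()
    counts = Counter()
    for marker in DISCOURSE_MARKERS:
        counts[marker] = len(re.findall(r"\b" + re.escape(marker) + r"\b", text))
    return counts

sample = "Well, so I was thinking, like, maybe we could, you know, grab coffee anyway."
print(marker_counts(sample))  # every marker except 'i mean' appears once
```

Comparing such counts (and where in a turn the markers appear) between human and simulated transcripts is one simple way to see whether a model places these words naturally.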
Large language models are still terrible at using these words.
“The large language models use these small words differently, and often incorrectly,” said Bietti.
This helps to expose them as non-human. But there is more.
Opening and closing features
When you start talking to someone, you probably do not get straight to the point. Instead, you might start by saying ‘hey’ or ‘so, how are you doing?’ or ‘oh, fancy seeing you here’. People tend to engage in small talk before moving on to what they actually want to talk about.
This shift from introduction to business takes place more or less automatically for humans, without being explicitly stated.
“This introduction, and the shift to a new phase of the conversation, are also difficult for large language models to imitate,” said Bietti.
The same applies to the end of the conversation. We usually do not end a conversation abruptly as soon as the information has been conveyed to the other person. Instead, we often end the conversation with phrases like ‘alright, then’, ‘okay’, ‘talk to you later’, or ‘see you soon’.
Large language models do not quite manage that part either.
Better in the future? Probably
Altogether, these features cause so much trouble for the large language models that the conclusion is clear:
“Today’s large language models are not yet able to imitate humans well enough to consistently fool us,” said Bietti.
Developments in this field are now progressing so rapidly that large language models will most likely be able to do this quite soon – at least if we want them to. Or will they?
“Improvements in large language models will most likely manage to narrow the gap between human conversations and artificial ones, but key differences will probably remain,” concluded Bietti.
For the time being, large language models are still not human-like enough to fool us. At least not every time.
Key Questions Answered:
Q: Why do AI-generated conversations still sound unnatural?
A: AI tends to over-imitate and lacks subtle conversational cues—especially in timing, phrasing, and social rhythm—that make human speech flow naturally.
Q: What gives AI speech away as artificial?
A: Incorrect use of filler words, awkward transitions, and overly formal phrasing make even advanced AI sound slightly robotic or scripted.
Q: Will AI ever fully mimic human conversation?
A: Possibly, but researchers suggest key differences in empathy, timing, and social intent may always distinguish humans from machines.
About this AI research news
Author: Nancy Bazilchuk
Source: NTNU
Contact: Nancy Bazilchuk – NTNU
Image: The image is credited to Neuroscience News
Original Research: Open access.
“Can Large Language Models Simulate Spoken Human Conversations?” by Lucas Bietti et al. Cognitive Science
Abstract
Can Large Language Models Simulate Spoken Human Conversations?
Large language models (LLMs) can emulate many aspects of human cognition and have been heralded as a potential paradigm shift.
They are proficient in chat-based conversation, but little is known about their ability to simulate spoken conversation. We investigated whether LLMs can simulate spoken human conversation.
In Study 1, we compared transcripts of human telephone conversations from the Switchboard (SB) corpus to six corpora of transcripts generated by two powerful LLMs, GPT-4 and Claude Sonnet 3.5, and two open-source LLMs, Vicuna and Wayfarer, using different prompts designed to mimic SB participants’ instructions.
We compared LLM and SB conversations in terms of alignment (conceptual, syntactic, and lexical), coordination markers, and coordination of openings and closings.
We also documented qualitative features by which LLM conversations differ from SB conversations.
In Study 2, we assessed whether humans can distinguish transcripts produced by LLMs from those of SB conversations. LLM conversations exhibited exaggerated alignment (and an increase in alignment as conversation unfolded) relative to human conversations, different and often inappropriate use of coordination markers, and were dissimilar to human conversations in openings and closings.
LLM conversations did not consistently pass for SB conversations. Spoken conversations generated by LLMs are both qualitatively and quantitatively different from those of humans.
This issue may evolve with better LLMs and more training on spoken conversation, but may also result from key differences between spoken conversation and chat.