AI-Generated Medical Images Deceive Even Top Radiologists

Summary: A multi-center international study reveals that neither experienced radiologists nor advanced multimodal large language models (LLMs) can reliably distinguish “deepfake” X-rays from authentic ones. The research tested 17 radiologists across six countries using AI-generated images from ChatGPT and RoentGen.

Even when warned that synthetic images were present, radiologists averaged only 75% accuracy in identifying them. The findings expose high-stakes vulnerabilities in healthcare, from fraudulent litigation (fabricated injuries) to cybersecurity threats in which hackers could inject synthetic images into digital medical records to cause clinical chaos.

Key Facts

  • The Deception Rate: When unaware they were looking at fakes, only 41% of radiologists spontaneously noticed anything unusual about the AI-generated images.
  • AI Failing AI: Even GPT-4o—the model used to create the deepfakes—could not accurately detect all of its own fabrications, though it outperformed Google’s Gemini and Meta’s Llama models.
  • The “Too Perfect” Tell: Deepfake X-rays often appear unnaturally symmetrical; bones are overly smooth, spines are “too straight,” and fractures look “unusually clean.”
  • No Experience Shield: A radiologist’s years of experience (ranging from 0 to 40 years) did not correlate with better detection, though musculoskeletal specialists performed slightly better than others.

Source: RSNA

Neither radiologists nor multimodal large language models (LLMs) are able to easily distinguish artificial intelligence (AI)-generated “deepfake” X-ray images from authentic ones, according to a study published today in Radiology, a journal of the Radiological Society of North America (RSNA).

The findings highlight the potential risks associated with AI-generated X-ray images, along with the need for tools and training to protect the integrity of medical images and prepare health care professionals to detect deepfakes.

Researchers found that AI-generated X-rays (right) often feature unnaturally straight spines and overly uniform vessel patterns compared to authentic images. Credit: Neuroscience News

The term “deepfake” refers to a video, photo, image or audio recording that appears real but has been created or manipulated using AI.

“Our study demonstrates that these deepfake X-rays are realistic enough to deceive radiologists, the most highly trained medical image specialists, even when they were aware that AI-generated images were present,” said lead study author Mickael Tordjman, M.D., post-doctoral fellow, Icahn School of Medicine at Mount Sinai, New York.

“This creates a high-stakes vulnerability for fraudulent litigation if, for example, a fabricated fracture could be indistinguishable from a real one. There is also a significant cybersecurity risk if hackers were to gain access to a hospital’s network and inject synthetic images to manipulate patient diagnoses or cause widespread clinical chaos by undermining the fundamental reliability of the digital medical record.”

Seventeen radiologists from 12 different centers in six countries (United States, France, Germany, Turkey, United Kingdom and United Arab Emirates) participated in the retrospective study. Their professional experience ranged from 0 to 40 years. Half of the 264 X-ray images in the study were authentic, and the other half were generated by AI. Radiologists were evaluated on two distinct image sets, with no overlap between the datasets.

The first dataset included real and ChatGPT-generated images of multiple anatomical regions. The second dataset included chest X-ray images—half authentic and the other half created by RoentGen, an open-source generative AI diffusion model developed by Stanford Medicine researchers.

While still unaware of the study’s true purpose, the radiologist readers were asked, after assessing the technical quality of each ChatGPT image, whether they noticed anything unusual; only 41% spontaneously identified AI-generated images. After being informed that the dataset contained synthetic images, the radiologists’ mean accuracy in differentiating the real and synthetic X-rays was 75%.

Individual radiologist performance in accurately detecting the ChatGPT-generated images ranged from 58% to 92%. Similarly, the accuracy of four multimodal LLMs—GPT-4o (OpenAI), GPT-5 (OpenAI), Gemini 2.5 Pro (Google), and Llama 4 Maverick (Meta)—ranged from 57% to 85%. Even GPT-4o, the model used to create the deepfakes, was unable to detect all of them, though it identified considerably more than the Google and Meta models.
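The article does not reproduce the prompts used for the LLM classification task, but the setup can be pictured with a minimal sketch in Python using the OpenAI SDK; the model name, prompt wording, and file name below are illustrative assumptions rather than the study’s protocol.

    import base64
    from openai import OpenAI

    client = OpenAI()  # assumes an OPENAI_API_KEY is configured in the environment

    def classify_radiograph(path: str) -> str:
        # Inline the image as a base64 data URL, the format the chat API accepts for images.
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        response = client.chat.completions.create(
            model="gpt-4o",  # assumed model; the study also tested GPT-5, Gemini 2.5 Pro, and Llama 4 Maverick
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Is this radiograph an authentic clinical image or AI-generated? "
                             "Answer with one word: AUTHENTIC or SYNTHETIC."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        return response.choices[0].message.content.strip()

    print(classify_radiograph("radiograph_001.png"))  # hypothetical file name

Accuracy figures like those above would then come from comparing such per-image answers against the known ground truth for each dataset.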

Radiologist accuracy in detecting the RoentGen synthetic chest X-rays ranged from 62% to 78%, and the LLMs’ accuracy ranged from 52% to 89%.

There was no correlation between a radiologist’s years of experience and their accuracy in detecting synthetic X-ray images. However, musculoskeletal radiologists demonstrated significantly higher accuracy than other radiology subspecialists.

The study identified common features of synthetic X-rays.

“Deepfake medical images often look too perfect,” Dr. Tordjman said. “Bones are overly smooth, spines unnaturally straight, lungs overly symmetrical, blood vessel patterns excessively uniform, and fractures appear unusually clean and consistent, often limited to one side of the bone.” 

Recommended solutions for clearly distinguishing real from fake images and preventing tampering include advanced digital safeguards, such as invisible watermarks that embed ownership or identity data directly into the images, and technologist-linked cryptographic signatures attached automatically when the images are captured.
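As a rough illustration of the second safeguard, the following is a minimal sketch of capture-time signing with the Python cryptography library’s Ed25519 primitives; the file name and key management shown are assumptions for illustration, not a description of any vendor’s or the authors’ implementation.

    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # In practice the key would be tied to the technologist or acquisition device
    # and held in secure hardware; generating it inline is for illustration only.
    private_key = Ed25519PrivateKey.generate()
    public_key = private_key.public_key()

    # Sign the image bytes at acquisition so any later modification,
    # including substitution with an AI-generated file, breaks the signature.
    with open("acquired_xray.dcm", "rb") as f:  # hypothetical file name
        image_bytes = f.read()
    signature = private_key.sign(image_bytes)   # stored alongside the image, e.g. in metadata

    # Anywhere downstream: verification raises InvalidSignature if the bytes changed.
    public_key.verify(signature, image_bytes)
    print("Signature valid: image unchanged since capture.")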

“We are potentially only seeing the tip of the iceberg,” Dr. Tordjman said. “The logical next step in this evolution is AI-generation of synthetic 3D images, such as CT and MRI. Establishing educational datasets and detection tools now is critical.”

The study’s authors have published a curated deepfake dataset with interactive quizzes for educational purposes.

Key Questions Answered:

Q: Why would someone bother making “fake” X-rays?

A: The risks are massive. Fake X-rays could be used for insurance fraud (faking a broken bone for a settlement) or, more dangerously, by hackers to manipulate a patient’s diagnosis, potentially leading to unnecessary surgeries or withheld treatments.

Q: If a radiologist can’t tell it’s fake, does it even matter?

A: Yes, because the fake image doesn’t represent the patient’s actual body. If an AI “hallucinates” a clear lung on a patient who actually has pneumonia, the patient won’t get the life-saving care they need.

Q: How can hospitals protect themselves from these deepfakes?

A: Researchers are calling for “digital watermarks” and cryptographic signatures that are attached to an image the moment a technician takes it, ensuring the file hasn’t been tampered with by AI.

Editorial Notes:

  • This article was edited by a Neuroscience News editor.
  • Journal paper reviewed in full.
  • Additional context added by our staff.

About this AI and radiology research news

Author: Linda Brooks
Source: RSNA
Contact: Linda Brooks – RSNA
Image: The image is credited to Neuroscience News

Original Research: Open access.
“The Rise of Deepfake Medical Imaging: Radiologists’ Diagnostic Accuracy in Detecting ChatGPT-generated Radiographs” by Mickael Tordjman, Murat Yuce, Amine Ammar, Mingqian Huang, Fadila Mihoubi Bouvier, Maxime Lacroix, Anis Meribout, Ian Bolger, Efe Ozkaya, Himanshu Joshi, Amine Geahchan, Rayane El Rahi, Haidara Almansour, Ashwin Singh Parihar, Carolyn Horst, Samet Ozturk, Muhammed Edip Isleyen, Gul Gizem Pamuk, Ahmet Tan Cimilli, Timothy Deyer, Arvin Calinghen, Enora Guillo, Rola Husain, Jean-Denis Laredo, Zahi A. Fayad, Xueyan Mei, and Bachir Taouli. Radiology
DOI: 10.1148/radiol.25209


Abstract

The Rise of Deepfake Medical Imaging: Radiologists’ Diagnostic Accuracy in Detecting ChatGPT-generated Radiographs

Background

Large language models (LLMs) can generate realistic synthetic medical images (deepfakes), which raise concerns about potential misuse.

Purpose

To assess the ability of radiologists and multimodal LLMs to distinguish ChatGPT-generated synthetic radiographs from authentic clinical images.

Materials and Methods

This retrospective diagnostic accuracy study conducted between April and August 2025 included 17 practicing radiologists from six countries with varying experience levels. In phase 1, the radiologists, blinded to the purpose of the study, assessed image quality and provided diagnoses for 154 radiographs from multiple anatomic regions (77 synthetic images generated using ChatGPT [GPT-4o; OpenAI] and 77 authentic images). In phase 2, after being informed of the study’s purpose, the radiologists determined whether randomly presented radiographs were GPT-4o-generated or authentic.

The same classification task was performed by four LLMs: GPT-4o, GPT-5 (OpenAI), Gemini 2.5 Pro (Google), and Llama 4 Maverick (Meta). In phase 3, an additional set of 110 chest radiographs (55 synthetic images generated using RoentGen and 55 authentic images) was analyzed to evaluate the performance of readers and LLMs in distinguishing synthetic versus authentic images. The McNemar test and t test were used for comparisons.
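The abstract names the McNemar test for these paired comparisons; below is a minimal sketch of how such a test might be run on paired per-image correctness data with statsmodels, using made-up example values rather than study data.

    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    # Illustrative paired outcomes: whether each image was classified correctly
    # under the two conditions being compared (1 = correct, 0 = incorrect).
    correct_a = np.array([1, 1, 0, 1, 0, 1, 1, 0])
    correct_b = np.array([1, 0, 0, 1, 1, 1, 0, 0])

    # 2x2 table of agreements and disagreements between the two conditions.
    table = [
        [np.sum((correct_a == 1) & (correct_b == 1)), np.sum((correct_a == 1) & (correct_b == 0))],
        [np.sum((correct_a == 0) & (correct_b == 1)), np.sum((correct_a == 0) & (correct_b == 0))],
    ]

    result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs
    print(result.statistic, result.pvalue)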

Results

Forty-one percent (seven of 17) of purpose-blinded radiologists spontaneously identified artificial intelligence–generated radiographs as being present in the dataset. After being informed that some radiographs were synthetic, there was no evidence of a difference in overall accuracy among all 17 radiologists in distinguishing synthetic images in the GPT-4o dataset (75% [95% CI: 68, 81]) versus in the RoentGen dataset (70% [95% CI: 62, 78]; P = .07).

No tested LLM detected all synthetic radiographs in either dataset; however, GPT-4o-generated radiographs were more accurately differentiated from authentic ones by GPT-4o (accuracy, 85%) and GPT-5 (accuracy, 83%) compared with Llama 4 Maverick (accuracy, 59%) and Gemini 2.5 Pro (accuracy, 56%) (all P < .001). Common features of synthetic radiographs included bilateral symmetry, uniform grain or noise patterns, subtly unnatural soft-tissue textures, and overly smooth bone surfaces.

Conclusion

Synthetic radiographs (deepfakes) generated using an LLM were not easily distinguishable from authentic radiographs by either radiologists or LLMs. Training physicians and LLMs to recognize synthetic images is essential to mitigate risks. To support training, a curated deepfake dataset is available: https://noneedanick.github.io/DeepFakeXRay/.
