Can You Spot the Bot? Study Finds ChatGPT Almost Undetectable in Medical Advice

Summary: A new study suggests that ChatGPT’s healthcare-related responses are hard to distinguish from those provided by human healthcare providers.

The study, involving 392 participants, presented a mix of responses from both ChatGPT and humans, finding participants correctly identified the chatbot and provider responses with similar accuracy.

However, the level of trust varied with the complexity of the health-related task: administrative tasks and preventative care were trusted more than diagnostic and treatment advice.

Key Facts:

In the study, participants correctly identified ChatGPT’s healthcare-related responses 65.5% of the time and human healthcare provider responses 65.1% of the time.
Trust in ChatGPT’s responses overall averaged a 3.4 out of 5 score, with higher trust for logistical questions and preventative care, but less for diagnostic and treatment advice.
The researchers suggest that chatbots could assist in patient-provider communication, particularly with administrative tasks and chronic disease management.

Source: NYU

ChatGPT’s responses to people’s healthcare-related queries are nearly indistinguishable from those provided by humans, a new study from NYU Tandon School of Engineering and Grossman School of Medicine reveals. The finding suggests that chatbots could be effective allies in healthcare providers’ communications with patients.

An NYU research team presented 392 people aged 18 and above with ten patient questions and responses, with half of the responses generated by a human healthcare provider and the other half by ChatGPT.

Participants were asked to identify the source of each response and rate their trust in the ChatGPT responses using a 5-point scale from completely untrustworthy to completely trustworthy.

The study found people have limited ability to distinguish between chatbot and human-generated responses. On average, participants correctly identified chatbot responses 65.5% of the time and provider responses 65.1% of the time, with accuracy on individual questions ranging from 49.0% to 85.7%. Results remained consistent across respondents’ demographic categories.

The study found participants mildly trust chatbots’ responses overall (3.4 average score), with lower trust when the health-related complexity of the task in question was higher.

Logistical questions (e.g. scheduling appointments, insurance questions) had the highest trust rating (3.94 average score), followed by preventative care (e.g. vaccines, cancer screenings, 3.52 average score). Diagnostic and treatment advice had the lowest trust ratings (scores 2.90 and 2.89, respectively).

According to the researchers, the study highlights the possibility that chatbots can assist in patient-provider communication, particularly for administrative tasks and common chronic disease management.

Further research is needed, however, on chatbots taking on more clinical roles. Providers should remain cautious and exercise critical judgment when curating chatbot-generated advice due to the limitations and potential biases of AI models.

About this ChatGPT AI research news

Author: Oded Nov
Source: NYU
Contact: Oded Nov – NYU

Original Research: Open access.
“Putting ChatGPT’s Medical Advice to the (Turing) Test: Survey Study” by Oded Nov et al. JMIR Medical Education

Abstract

Putting ChatGPT’s Medical Advice to the (Turing) Test: Survey Study

Background: Chatbots are being piloted to draft responses to patient questions, but patients’ ability to distinguish between provider and chatbot responses and patients’ trust in chatbots’ functions are not well established.

Objective: This study aimed to assess the feasibility of using ChatGPT (Chat Generative Pre-trained Transformer) or a similar artificial intelligence–based chatbot for patient-provider communication.

Methods: A survey study was conducted in January 2023. Ten representative, nonadministrative patient-provider interactions were extracted from the electronic health record. Patients’ questions were entered into ChatGPT with a request for the chatbot to respond using approximately the same word count as the human provider’s response. In the survey, each patient question was followed by a provider- or ChatGPT-generated response. Participants were informed that 5 responses were provider generated and 5 were chatbot generated. Participants were asked—and incentivized financially—to correctly identify the response source. Participants were also asked about their trust in chatbots’ functions in patient-provider communication, using a Likert scale from 1-5.

Results: A US-representative sample of 430 study participants aged 18 and older was recruited on Prolific, a crowdsourcing platform for academic studies. In all, 426 participants filled out the full survey. After removing participants who spent less than 3 minutes on the survey, 392 respondents remained. Overall, 53.3% (209/392) of respondents analyzed were women, and the average age was 47.1 (range 18-91) years. The correct classification of responses ranged from 49% (192/392) to 85.7% (336/392) for different questions. On average, chatbot responses were identified correctly in 65.5% (1284/1960) of the cases, and human provider responses were identified correctly in 65.1% (1276/1960) of the cases. On average, responses regarding patients’ trust in chatbots’ functions were weakly positive (mean Likert score 3.4 out of 5), with lower trust as the health-related complexity of the task in the questions increased.
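
For readers who want to check the arithmetic, the percentages in the Results paragraph can be recomputed from the counts it reports. The following is a minimal Python sketch, not part of the study: the labels and variable names are illustrative, and the pooled accuracy and the number of respondents excluded for speed are derived here from the reported counts rather than stated in the abstract.

```python
# Counts as reported in the abstract: (numerator, denominator).
# The dictionary labels are illustrative, not the authors' terminology.
reported = {
    "women among analyzed respondents": (209, 392),
    "lowest per-question accuracy": (192, 392),
    "highest per-question accuracy": (336, 392),
    "chatbot responses identified": (1284, 1960),
    "provider responses identified": (1276, 1960),
}


def pct(numerator, denominator):
    """Return a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


percentages = {label: pct(n, d) for label, (n, d) in reported.items()}

# Each of the 392 respondents judged 5 chatbot and 5 provider responses,
# which is where the denominator of 1960 comes from.
judgments_per_source = 392 * 5

# Pooled accuracy across both sources (derived here, not reported in the abstract).
pooled = pct(1284 + 1276, 2 * judgments_per_source)

# Respondents who completed the survey but were dropped for finishing in
# under 3 minutes (derived here from the reported totals).
excluded_for_speed = 426 - 392

for label, value in percentages.items():
    print(f"{label}: {value}%")
print(f"pooled accuracy: {pooled}%")
print(f"excluded for speed: {excluded_for_speed}")
```

Under these assumptions the recomputed values agree with the reported figures, and the pooled identification accuracy across both response types comes out at about 65.3%.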

Conclusions: ChatGPT responses to patient questions were weakly distinguishable from provider responses. Laypeople appear to trust the use of chatbots to answer lower-risk health questions. It is important to continue studying patient-chatbot interaction as chatbots move from administrative to more clinical roles in health care.