Comparative evaluation of the responses from ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.1 chatbots to patient inquiries about endodontic treatment in terms of accuracy, understandability, and readability

DOI:

https://doi.org/10.5577/intdentres.662

Keywords:

Artificial intelligence, large language models (LLMs), ChatGPT, Gemini, DeepSeek, patient questions in endodontics, accuracy, understandability, PEMAT-P, readability

Abstract

Aim: The aim of this study is to compare three large language models (LLMs) (ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.1) in terms of accuracy, understandability, and readability, based on the answers provided to frequently asked endodontic questions.

Methodology: Thirty open-ended frequently asked questions were generated using the AlsoAsked and AnswerThePublic websites. Two experienced endodontists scored the accuracy of the responses on a 5-point Likert scale. The understandability of the responses was analyzed using the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P). Readability was assessed using the Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), Simple Measure of Gobbledygook (SMOG), and Coleman-Liau Index (CLI). Group comparisons were performed using ANOVA or the Kruskal-Wallis test, followed by post-hoc Dunn-Bonferroni tests.
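As an illustration of the five readability indices named above, the following minimal sketch implements their standard published formulas from pre-computed text counts. The abstract does not state which software the authors used, so the function names and the choice to pass raw counts (rather than count syllables automatically) are assumptions made for this example only.

```python
import math


def fres(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease Score: higher values mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def fkgl(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """Gunning Fog Index; complex_words = words with three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


def smog(sentences: int, polysyllables: int) -> float:
    """SMOG grade; polysyllables = count of words with three or more syllables."""
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291


def coleman_liau(letters: int, words: int, sentences: int) -> float:
    """Coleman-Liau Index, based on letters and sentences per 100 words."""
    letters_per_100 = letters / words * 100
    sentences_per_100 = sentences / words * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8
```

Note that FRES runs in the opposite direction to the other four indices: a higher FRES indicates easier text, whereas higher FKGL, GFI, SMOG, and CLI values indicate harder text.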

Results: Inter-rater agreement was excellent (accuracy ICC: 0.908–0.917; reliability ICC: 0.992–0.995; all p < 0.001). A significant difference was found between the models in terms of accuracy (p < 0.001): DeepSeek-V3.1 (4.63 ± 0.81) scored the highest and performed significantly better than ChatGPT-5 (3.93 ± 0.79) and Gemini 2.5 Flash (3.67 ± 0.76). No significant difference was found between ChatGPT-5 and Gemini 2.5 Flash (p > 0.05). The understandability (PEMAT-P) scores were similar (p = 0.683), and all models scored above 70% (ChatGPT-5, 77.46%; Gemini 2.5 Flash, 76.04%; DeepSeek-V3.1, 77.57%). Differences were found in readability metrics: DeepSeek-V3.1 scored higher than ChatGPT-5 in FRES (p = 0.044); Gemini 2.5 Flash scored higher than DeepSeek-V3.1 in FKGL (p = 0.001); in GFI, Gemini 2.5 Flash scored higher than both ChatGPT-5 (p = 0.036) and DeepSeek-V3.1 (p < 0.001); in SMOG, Gemini 2.5 Flash scored higher than DeepSeek-V3.1 (p = 0.003); and in CLI, ChatGPT-5 scored higher than DeepSeek-V3.1 (p = 0.004). No significant correlation was found between readability and understandability (p > 0.05).

Conclusion: DeepSeek-V3.1 outperformed ChatGPT-5 and Gemini 2.5 Flash in terms of accuracy. While all models produced similar scores above the PEMAT-P understandability threshold (70%), there were significant differences in readability metrics; furthermore, no model consistently reached the recommended 6th–8th grade level.
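For readers unfamiliar with how a PEMAT-P percentage is obtained, the sketch below shows the usual scoring rule: each item is rated 1 (agree) or 0 (disagree), items marked not applicable are excluded, and the score is the share of applicable items rated 1. The function names and the example ratings are hypothetical and are not taken from the study's data; the 70% cut-off is the threshold cited in the conclusion.

```python
from typing import Optional, Sequence


def pemat_score(ratings: Sequence[Optional[int]]) -> float:
    """Return the PEMAT percentage for one response.

    ratings: 1 = agree, 0 = disagree, None = not applicable (excluded).
    """
    applicable = [r for r in ratings if r is not None]
    if not applicable:
        raise ValueError("at least one applicable item is required")
    return 100.0 * sum(applicable) / len(applicable)


def meets_threshold(score: float, threshold: float = 70.0) -> bool:
    """Check a score against the 70% understandability threshold."""
    return score >= threshold


# Hypothetical ratings for a single response (not study data).
example_ratings = [1, 1, 0, None, 1]
example_score = pemat_score(example_ratings)
```

With the hypothetical ratings above, three of the four applicable items are rated 1, so the response would score 75% and pass the threshold.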

Published

31.12.2025

How to Cite

Taşyürek M, Adıgüzel Özkan, Gündoğar M, Goncharuk-Khomyn M, Ortaç H. Comparative evaluation of the responses from ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.1 chatbots to patient inquiries about endodontic treatment in terms of accuracy, understandability, and readability. Int Dent Res. 2025;15(3):123-135. doi:10.5577/intdentres.662
