Comparative evaluation of the responses from ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.1 chatbots to patient inquiries about endodontic treatment in terms of accuracy, understandability, and readability

Makbule Taşyürek(1), Özkan Adıgüzel(2), Mustafa Gündoğar(3), Myroslav Goncharuk-Khomyn(4), Hatice Ortaç(5)
(1) Dicle University, Faculty of Dentistry, Department of Endodontics, Diyarbakır, Türkiye,
(2) Dicle University, Faculty of Dentistry, Department of Endodontics, Diyarbakır, Türkiye,
(3) İstanbul Medipol University, Faculty of Dentistry, Department of Endodontics, İstanbul, Türkiye,
(4) Uzhhorod National University, Faculty of Dentistry, Department of Restorative Dentistry, Uzhhorod, Ukraine,
(5) Dicle University, Faculty of Medicine, Department of Biostatistics, Diyarbakır, Türkiye

Abstract

Aim: The aim of this study was to compare three large language models (LLMs), ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.1, in terms of the accuracy, understandability, and readability of their answers to frequently asked endodontic questions.


Methodology: Thirty open-ended frequently asked questions were generated using the AlsoAsked and AnswerThePublic websites. Two experienced endodontists scored the accuracy of the responses on a 5-point Likert scale. The understandability of the responses was analyzed using the Patient Education Materials Assessment Tool for Printed Materials (PEMAT-P). Readability was assessed using the Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), Simple Measure of Gobbledygook (SMOG), and Coleman-Liau Index (CLI). Group comparisons were performed using ANOVA/Kruskal-Wallis tests, followed by post-hoc Dunn-Bonferroni tests.
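The FRES and FKGL metrics used in the methodology are defined by standard published formulas based on words per sentence and syllables per word. The sketch below is a minimal illustration of those two formulas, not the tool the authors used; the syllable counter is a crude vowel-group heuristic, so exact values will differ from dedicated readability software.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: one syllable per contiguous vowel group, minimum one.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # average words per sentence
    spw = syllables / len(words)        # average syllables per word
    return {
        # Flesch Reading Ease Score: higher = easier to read.
        "FRES": 206.835 - 1.015 * wps - 84.6 * spw,
        # Flesch-Kincaid Grade Level: approximate US school grade.
        "FKGL": 0.39 * wps + 11.8 * spw - 15.59,
    }

simple = "The tooth hurts. We can fix it."
complex_ = "Endodontic retreatment necessitates comprehensive radiographic evaluation."
print(readability(simple))
print(readability(complex_))
```

Shorter sentences with fewer syllables per word yield a higher FRES and a lower FKGL, which is why the patient-education literature recommends writing at a 6th-8th grade level.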


Results: Inter-rater agreement was excellent (accuracy ICC: 0.908–0.917; reliability ICC: 0.992–0.995; all p<0.001). A significant difference was found between the models in terms of accuracy (p<0.001): DeepSeek-V3.1 (4.63±0.81) scored the highest and performed significantly better than ChatGPT-5 (3.93±0.79) and Gemini 2.5 Flash (3.67±0.76). No significant difference was found between ChatGPT-5 and Gemini 2.5 Flash (p>0.05). The understandability (PEMAT-P) scores were similar (p=0.683), and all models scored above 70% (ChatGPT-5, 77.46%; Gemini, 76.04%; DeepSeek-V3.1, 77.57%). Differences were found in readability metrics: DeepSeek-V3.1 scored higher than ChatGPT-5 in FRES (p=0.044); Gemini scored higher than DeepSeek-V3.1 in FKGL (p=0.001); in GFI, Gemini 2.5 Flash scored higher than both ChatGPT-5 (p=0.036) and DeepSeek-V3.1 (p<0.001); in SMOG, Gemini outperformed DeepSeek-V3.1 (p=0.003); and in CLI, ChatGPT-5 was higher than DeepSeek-V3.1 (p=0.004). No significant correlation was found between readability and understandability (p>0.05).
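The group comparisons above rely on the Kruskal-Wallis H test, a rank-based alternative to one-way ANOVA for three or more independent groups. The pure-Python sketch below illustrates how the H statistic is computed (midranks for ties, then the rank-sum formula); the three score lists are hypothetical, not the study's data, and the Dunn-Bonferroni post-hoc step is omitted.

```python
def kruskal_wallis_h(*groups):
    """Return the Kruskal-Wallis H statistic for 2+ independent groups."""
    # Pool all observations, remembering which group each came from.
    data = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(data)

    # Assign ranks 1..n, averaging ranks (midranks) over tied values.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j < n and data[j][0] == data[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        i = j

    # Sum the ranks within each group.
    rank_sums = [0.0] * len(groups)
    for (_, gi), r in zip(data, ranks):
        rank_sums[gi] += r

    h = 12.0 / (n * (n + 1)) * sum(
        rs ** 2 / len(g) for rs, g in zip(rank_sums, groups)
    ) - 3 * (n + 1)

    # Correct for ties (no effect when all values are distinct).
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and data[j][0] == data[i][0]:
            j += 1
        t = j - i
        tie_term += t ** 3 - t
        i = j
    return h / (1 - tie_term / (n ** 3 - n)) if tie_term else h

# Hypothetical accuracy-style scores for three chatbots (not the study data).
h = kruskal_wallis_h([1, 2, 3, 4, 5], [11, 12, 13, 14, 15], [21, 22, 23, 24, 25])
print(h)  # 12.5, above the chi-square critical value 5.991 (df=2, alpha=0.05)
```

With k groups, H is compared against a chi-square distribution with k-1 degrees of freedom; a significant H is then followed by pairwise post-hoc tests (here, Dunn with Bonferroni correction) to locate which models differ.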


Conclusion: DeepSeek-V3.1 outperformed ChatGPT-5 and Gemini 2.5 Flash in terms of accuracy. While all models produced similar scores above the PEMAT-P understandability threshold (70%), there were significant differences in readability metrics; furthermore, no model consistently reached the recommended 6th–8th grade level.




How to cite this article:


Taşyürek M, Adıgüzel Ö, Gündoğar M, Goncharuk-Khomyn M, Ortaç H. Comparative evaluation of the responses from ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.1 chatbots to patient inquiries about endodontic treatment in terms of accuracy, understandability, and readability. Int Dent Res. 2025;15(3) (Advanced Online). https://doi.org/10.5577/intdentres.662


Authors

Makbule Taşyürek
makbuletasyurek63@gmail.com (Primary Contact)
Özkan Adıgüzel
Mustafa Gündoğar
Myroslav Goncharuk-Khomyn
Hatice Ortaç

Article Details

How to Cite

Taşyürek, M., Adıgüzel, Özkan, Gündoğar, M., Goncharuk-Khomyn, M., & Ortaç, H. (2025). Comparative evaluation of the responses from ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.1 chatbots to patient inquiries about endodontic treatment in terms of accuracy, understandability, and readability. International Dental Research, 15(Advanced Online). https://doi.org/10.5577/intdentres.662