Comparative evaluation of the responses from ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.1 chatbots to patient inquiries about endodontic treatment in terms of accuracy, understandability, and readability
Abstract
Aim: To compare three large language models (LLMs), ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.1, in terms of the accuracy, understandability, and readability of their answers to frequently asked endodontic questions.
Methodology: Thirty open-ended frequently asked questions were generated using the AlsoAsked and AnswerThePublic websites. Two experienced endodontists scored the accuracy of the responses on a 5-point Likert scale. Understandability was analyzed with the Patient Education Materials Assessment Tool for Printed Materials (PEMAT-P). Readability was assessed using the Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), Simple Measure of Gobbledygook (SMOG), and Coleman-Liau Index (CLI). Group comparisons were performed with one-way ANOVA or Kruskal-Wallis tests, followed by Dunn-Bonferroni post-hoc tests.
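For readers wishing to reproduce the readability scoring, the sketch below computes the five indices used in this study. It is not part of the original protocol: the open-source `textstat` Python package and the sample response text are illustrative assumptions.

```python
# Minimal sketch: the five readability indices used in this study, computed
# with the `textstat` package. The sample text is a placeholder, not an
# actual chatbot response from the study.
import textstat

response = (
    "Root canal treatment removes infected tissue from inside the tooth. "
    "The space is then cleaned, shaped, and filled to prevent reinfection."
)

scores = {
    "FRES": textstat.flesch_reading_ease(response),   # higher = easier to read
    "FKGL": textstat.flesch_kincaid_grade(response),  # US school grade level
    "GFI": textstat.gunning_fog(response),
    "SMOG": textstat.smog_index(response),
    "CLI": textstat.coleman_liau_index(response),
}

for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```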
Results: Inter-rater agreement was excellent (accuracy ICC: 0.908–0.917; reliability ICC: 0.992–0.995; all p<0.001). The models differed significantly in accuracy (p<0.001): DeepSeek-V3.1 (4.63±0.81) scored highest and performed significantly better than ChatGPT-5 (3.93±0.79) and Gemini 2.5 Flash (3.67±0.76); no significant difference was found between ChatGPT-5 and Gemini 2.5 Flash (p>0.05). Understandability (PEMAT-P) scores were similar (p=0.683), and all models scored above 70% (ChatGPT-5, 77.46%; Gemini 2.5 Flash, 76.04%; DeepSeek-V3.1, 77.57%). Readability metrics differed: DeepSeek-V3.1 scored higher than ChatGPT-5 on FRES (p=0.044); Gemini 2.5 Flash scored higher than DeepSeek-V3.1 on FKGL (p=0.001); on GFI, Gemini 2.5 Flash scored higher than both ChatGPT-5 (p=0.036) and DeepSeek-V3.1 (p<0.001); on SMOG, Gemini 2.5 Flash scored higher than DeepSeek-V3.1 (p=0.003); and on CLI, ChatGPT-5 scored higher than DeepSeek-V3.1 (p=0.004). No significant correlation was found between readability and understandability (p>0.05).
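As a hedged illustration of the agreement and group-comparison analyses reported above, the sketch below computes a two-rater ICC and a Kruskal-Wallis test with Dunn-Bonferroni post-hoc comparisons. The DataFrame layout, scores, and analysis settings are assumptions for demonstration, not data from the study.

```python
# Sketch of the reported analyses: inter-rater ICC for accuracy ratings and a
# Kruskal-Wallis omnibus test with Dunn-Bonferroni post-hoc comparisons across
# the three models. All values are hypothetical.
import pandas as pd
import pingouin as pg
from scipy.stats import kruskal
import scikit_posthocs as sp

# Long-format accuracy ratings: one row per (question, rater) pair.
ratings = pd.DataFrame({
    "question": [1, 1, 2, 2, 3, 3],
    "rater":    ["A", "B", "A", "B", "A", "B"],
    "score":    [5, 5, 4, 4, 5, 4],
})
icc = pg.intraclass_corr(data=ratings, targets="question",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC", "pval"]])

# Per-model accuracy scores (hypothetical) for the omnibus and post-hoc tests.
chatgpt, gemini, deepseek = [4, 4, 3], [4, 3, 4], [5, 5, 4]
print(kruskal(chatgpt, gemini, deepseek))
print(sp.posthoc_dunn([chatgpt, gemini, deepseek], p_adjust="bonferroni"))
```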
Conclusion: DeepSeek-V3.1 outperformed ChatGPT-5 and Gemini 2.5 Flash in accuracy. While all models produced similar scores above the PEMAT-P understandability threshold (70%), readability metrics differed significantly, and no model consistently reached the recommended 6th–8th grade reading level.
How to cite this article:
Taşyürek M, Adıgüzel Ö, Gündoğar M, Goncharuk-Khomyn M, Ortaç H. Comparative evaluation of the responses from ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.1 chatbots to patient inquiries about endodontic treatment in terms of accuracy, understandability, and readability. Int Dent Res. 2025;15(3) (Advance Online). https://doi.org/10.5577/intdentres.662
Copyright © 2025 International Dental Research

This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.