Performance of ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.2 large language models in response to questions about ozone in dentistry

Suzan Cangül(1), Özkan Adıgüzel(2), Makbule Taşyürek(3), Tuba Tunç(4), Hatice Ortaç(5)
(1) Dicle University, Faculty of Dentistry, Department of Restorative Dentistry, Diyarbakır, Türkiye,
(2) a) Dicle University, Faculty of Dentistry, Department of Endodontics, Diyarbakır, Türkiye. b) Nelcotech R&D Center, Department of Dentistry, Sheridan, Wyoming, USA,
(3) Dicle University, Faculty of Dentistry, Department of Endodontics, Diyarbakır, Türkiye,
(4) Dicle University, Faculty of Dentistry, Department of Restorative Dentistry, Diyarbakır, Türkiye,
(5) Dicle University, Faculty of Medicine, Department of Biostatistics, Diyarbakır, Türkiye

Abstract

Aim: The aim of this study was to compare and evaluate the accuracy and completeness of responses to open-ended questions about the use of ozone in dentistry that were addressed to three recent large language models (LLMs): ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.2.


Methodology: A total of 25 open-ended questions on ozone in dentistry were posed to three LLMs (ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.2). The chat history was deleted after each response, and a new session was started for each question. All questions were asked only once, on the same day, from the same computer and IP address over the same fixed fiber Internet connection. The questions were submitted in English, and the English responses were recorded. Two restorative dentistry specialists independently evaluated the responses using separate Likert scales for accuracy and completeness. For the statistical analyses, the Intraclass Correlation Coefficient (ICC) was calculated, and the Shapiro–Wilk, Kruskal–Wallis, Dunn–Bonferroni, and Spearman tests were applied.
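The non-parametric pipeline described above (Shapiro–Wilk to check normality, then Kruskal–Wallis across models, then correlation) can be sketched in Python with SciPy. The Likert scores below are illustrative placeholders, not the study's data; the Dunn–Bonferroni post-hoc and ICC steps are noted in comments because they live in separate libraries.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical Likert ratings (accuracy, 1-5) for 25 questions per model.
# These are placeholders for illustration only, not the study's data.
scores = {
    "ChatGPT-5":        rng.integers(3, 6, 25),
    "Gemini 2.5 Flash": rng.integers(2, 5, 25),
    "DeepSeek-V3.2":    rng.integers(3, 6, 25),
}

# Shapiro-Wilk normality check: low p-values motivate non-parametric tests.
for model, s in scores.items():
    w, p = stats.shapiro(s)
    print(f"{model}: Shapiro-Wilk p = {p:.3f}")

# Kruskal-Wallis: do the three models' score distributions differ overall?
h, p_kw = stats.kruskal(*scores.values())
print(f"Kruskal-Wallis H = {h:.2f}, p = {p_kw:.4f}")

# Spearman correlation between hypothetical accuracy and completeness scores.
completeness = rng.integers(1, 4, 25)
rho, p_sp = stats.spearmanr(scores["ChatGPT-5"], completeness)
print(f"Spearman rho = {rho:.2f}, p = {p_sp:.3f}")

# For the remaining steps, a Dunn test with Bonferroni correction is available
# as scikit_posthocs.posthoc_dunn(..., p_adjust="bonferroni"), and the ICC as
# pingouin.intraclass_corr(); both are omitted here to keep the sketch minimal.
```

A significant Kruskal–Wallis result would then be followed by pairwise Dunn–Bonferroni comparisons, mirroring the subgroup analyses reported in the Results.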


Results: Inter-rater agreement was high for the accuracy scores and varied by model for the completeness scores. Significant differences were found between the LLMs in accuracy scores (p < 0.001): the median scores of ChatGPT-5 and DeepSeek-V3.2 were statistically significantly higher than those of Gemini 2.5 Flash (p = 0.002 and p < 0.001, respectively). Significant differences were also observed between the LLMs in completeness scores (p = 0.004); in the subgroup analyses, the mean scores of ChatGPT-5 and DeepSeek-V3.2 were statistically significantly higher than those of Gemini 2.5 Flash (p = 0.006 and p = 0.021, respectively).


Conclusion: The LLMs differed in the accuracy and completeness of their responses to questions about the use of ozone in dentistry. ChatGPT-5 and DeepSeek-V3.2 performed better than Gemini 2.5 Flash. Despite the high accuracy rates, incorrect healthcare information can lead to serious consequences. LLMs should therefore be used under human supervision in clinical applications, and future research should focus on improving the diagnostic reliability of artificial intelligence.

Authors

Suzan Cangül
suzanbali@outlook.com (Primary Contact)
Özkan Adıgüzel
Makbule Taşyürek
Tuba Tunç
Hatice Ortaç
Article Details

How to Cite

1. Cangül S, Adıgüzel Ö, Taşyürek M, Tunç T, Ortaç H. Performance of ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.2 large language models in response to questions about ozone in dentistry. Int Dent Res. 2025;15(3):165-174. doi:10.5577/intdentres.701