Performance of ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.2 large language models in response to questions about ozone in dentistry
Abstract
Aim: To compare and evaluate the accuracy and completeness of the responses of three recent large language models (LLMs), ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.2, to open-ended questions about the use of ozone in dentistry.
Methodology: A total of 25 open-ended questions on ozone in dentistry were posed to three different LLMs (ChatGPT-5, Gemini 2.5 Flash, and DeepSeek-V3.2). The chat history was deleted after each response, and a new session was started for each question. All questions were asked only once, from the same computer and IP address, on the same day, over the same fixed fiber Internet connection. The questions were posed in English, and the English responses were recorded. The responses were evaluated independently by two restorative dentistry specialists using two separate Likert scales for accuracy and completeness. For the statistical analyses, the intraclass correlation coefficient (ICC) was calculated, and the Shapiro–Wilk, Kruskal–Wallis, Dunn–Bonferroni, and Spearman tests were applied.
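A workflow of this kind can be sketched in Python with SciPy. The scores below are simulated, not the study's data, and Bonferroni-corrected Mann–Whitney U tests are used as a stand-in for the Dunn–Bonferroni post-hoc test, which SciPy does not provide directly; this is an illustrative sketch, not the authors' analysis code.

```python
import numpy as np
from scipy import stats

# Hypothetical 5-point Likert accuracy scores for 25 questions per model
# (illustrative only; the study's raw scores are not reproduced here).
rng = np.random.default_rng(0)
chatgpt = rng.integers(4, 6, 25)   # scores of 4-5
gemini = rng.integers(2, 5, 25)    # scores of 2-4
deepseek = rng.integers(4, 6, 25)  # scores of 4-5

# Shapiro-Wilk normality check for each model's score distribution
for name, scores in [("ChatGPT-5", chatgpt),
                     ("Gemini 2.5 Flash", gemini),
                     ("DeepSeek-V3.2", deepseek)]:
    _, p = stats.shapiro(scores)
    print(f"{name}: Shapiro-Wilk p = {p:.3f}")

# Kruskal-Wallis omnibus test across the three models
h, p_kw = stats.kruskal(chatgpt, gemini, deepseek)
print(f"Kruskal-Wallis: H = {h:.2f}, p = {p_kw:.4f}")

# Pairwise comparisons with Bonferroni correction (Mann-Whitney U as a
# stand-in for Dunn's post-hoc test, which is not available in SciPy)
pairs = [("ChatGPT-5 vs Gemini", chatgpt, gemini),
         ("DeepSeek vs Gemini", deepseek, gemini),
         ("ChatGPT-5 vs DeepSeek", chatgpt, deepseek)]
for label, a, b in pairs:
    _, p = stats.mannwhitneyu(a, b)
    print(f"{label}: Bonferroni-corrected p = {min(p * len(pairs), 1.0):.4f}")
```

Dedicated packages (e.g. scikit-posthocs for Dunn's test, pingouin for ICC) would follow the stated methodology more closely, but the SciPy-only version keeps the sketch self-contained.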
Results: Inter-rater agreement was high for the accuracy scores and varied by model for the completeness scores. Significant differences were found between the LLMs in accuracy scores (p < 0.001): the median scores of ChatGPT-5 and DeepSeek-V3.2 were significantly higher than those of Gemini 2.5 Flash (p = 0.002 and p < 0.001, respectively). Significant differences were also observed between the LLMs in completeness scores (p = 0.004); in the subgroup analyses, the mean scores of ChatGPT-5 and DeepSeek-V3.2 were significantly higher than those of Gemini 2.5 Flash (p = 0.006 and p = 0.021, respectively).
Conclusion: The responses to questions about the use of ozone in dentistry differed among the LLMs in both accuracy and completeness, with ChatGPT-5 and DeepSeek-V3.2 performing better than Gemini 2.5 Flash. Despite the high accuracy rates, incorrect healthcare information can lead to serious consequences. LLMs should therefore be used under human supervision in clinical applications, and future research should focus on improving the diagnostic reliability of artificial intelligence.
Copyright © 2025 International Dental Research

This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.