AI Chatbots in Answering Questions Related to Ocular Oncology: A Comparative Study Between DeepSeek v3, ChatGPT-4o, and Gemini 2.0
Abstract
Background: AI chatbots are increasingly used in healthcare for sharing information and supporting clinical decisions. Yet, their reliability in specialized fields like ocular oncology remains unclear. This study compares three AI chatbots—ChatGPT-4o (OpenAI), DeepSeek v3 (DeepSeek), and Gemini 2.0 (Google DeepMind)—on their ability to answer clinically relevant questions about ocular malignancies.
Methods: At a tertiary eye care center in Northern India, five standardized clinical questions covering key ocular oncology conditions were posed to each chatbot. Experts evaluated responses for correctness, completeness, readability, presence of irrelevant content, applicability in Indian healthcare, and reliability. Statistical analysis included Kruskal-Wallis and ANOVA tests.
Results: All chatbots showed similar accuracy (mean correctness 3.4/4), but most responses (4 of 5 per chatbot) lacked completeness. DeepSeek v3 produced the longest and most readable answers, while ChatGPT-4o offered shorter but more reliable responses. Gemini 2.0’s responses varied widely in length and structure. None included irrelevant data. Applicability to Indian clinical practice was limited, with only a few answers suitable for direct use.
Conclusion: AI chatbots can provide factually accurate but often incomplete and insufficiently localized answers in ocular oncology. ChatGPT-4o balanced reliability and conciseness best. However, these models require regional customization and expert oversight before use in sensitive clinical scenarios.
Introduction
Artificial intelligence (AI) has become a significant tool in healthcare, with applications ranging from patient education to clinical decision support. Among AI tools, chatbots like ChatGPT, DeepSeek, and Gemini have gained popularity for answering medical queries. However, the accuracy and clinical value of their responses, especially in niche fields like ocular oncology, remain underexplored.
Patients and medical students increasingly rely on these AI chatbots for information, yet the authenticity and clinical relevance of the data provided are often unverified. This study addresses the gap by assessing how well these chatbots handle complex ocular oncology questions, focusing on correctness, completeness, readability, and real-world applicability.
Materials & Methods
A cross-sectional observational study was performed at a tertiary eye institute in Northern India. Ocular oncology experts formulated five detailed clinical questions involving choroidal melanoma, ocular surface squamous neoplasia, retinoblastoma, sebaceous cell carcinoma, and basal cell carcinoma. Each question included patient demographics, clinical history, and diagnostic information relevant for management planning.
These questions were entered into DeepSeek v3, ChatGPT-4o, and Gemini 2.0. Responses were collected and evaluated independently by two faculty members using a structured proforma. Parameters assessed included:
- Correctness: Accuracy based on current evidence.
- Completeness: Inclusion of all necessary clinical details.
- Readability: Word count, sentence count, and Flesch Reading Ease scores (a brief computation sketch follows this list).
- Irrelevant data: Presence or absence of unrelated content.
- Applicability: Suitability for direct use in Indian healthcare settings.
- Reliability: Overall trustworthiness of the response.
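The readability metrics follow standard formulas. As a minimal illustration only (the study does not specify the tool used; the sketch below assumes the open-source Python package textstat and uses placeholder text rather than an actual chatbot response), word count, sentence count, and the Flesch Reading Ease score can be computed as follows:

```python
# Minimal readability sketch. Assumes the open-source `textstat`
# package (pip install textstat); the study does not name its tool.
import textstat

# Placeholder text only -- NOT an actual chatbot response from the study.
response_text = (
    "Choroidal melanoma is the most common primary intraocular malignancy "
    "in adults. Management options include plaque brachytherapy, proton "
    "beam therapy, and enucleation, depending on tumour size and location."
)

word_count = textstat.lexicon_count(response_text, removepunct=True)
sentence_count = textstat.sentence_count(response_text)
# Flesch Reading Ease = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words);
# higher values indicate easier text.
flesch_ease = textstat.flesch_reading_ease(response_text)

print(f"Words: {word_count}, Sentences: {sentence_count}, "
      f"Flesch Reading Ease: {flesch_ease:.1f}")
```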
Data were analyzed using IBM SPSS with Kruskal-Wallis and ANOVA tests.
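For readers who wish to reproduce the group comparison outside SPSS, the same tests are available in SciPy. The sketch below is illustrative only: the rating lists are hypothetical values on the study's 4-point scale, not the actual study data.

```python
# Illustrative between-chatbot comparison using SciPy
# (the study itself used IBM SPSS).
from scipy import stats

# Hypothetical correctness ratings (out of 4) for five questions -- NOT study data.
chatgpt_scores  = [4, 3, 4, 3, 3]
deepseek_scores = [3, 4, 3, 4, 3]
gemini_scores   = [4, 3, 3, 4, 3]

# Kruskal-Wallis H test: non-parametric comparison across the three chatbots.
h_stat, kw_p = stats.kruskal(chatgpt_scores, deepseek_scores, gemini_scores)

# One-way ANOVA: parametric comparison of group means.
f_stat, anova_p = stats.f_oneway(chatgpt_scores, deepseek_scores, gemini_scores)

print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {kw_p:.3f}")
print(f"One-way ANOVA:  F = {f_stat:.2f}, p = {anova_p:.3f}")
```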
Results
Correctness
All three chatbots scored similarly in accuracy (mean 3.4 out of 4), indicating consistent factual correctness across models.
Completeness
Most responses (80%) omitted key clinical details, such as differential diagnoses or staging, that are vital for treatment planning.
Readability and Length
- DeepSeek v3 produced the longest answers (average 534 words) and the highest Flesch Reading Ease score (38.0), making its text the most readable of the three, though a score of 38 still corresponds to fairly difficult, college-level prose.
- ChatGPT-4o generated the shortest responses (average 287 words) but maintained better clinical focus.
- Gemini 2.0 showed high variability in length and sentence structure, reflecting inconsistency.
Irrelevant Content
No irrelevant information was found in any of the responses, indicating that all three chatbots stayed focused on the clinical questions.
Applicability in Indian Healthcare
Only 2 out of 5 ChatGPT-4o responses and 1 out of 5 from each of the other chatbots were directly usable in Indian clinical settings without modification.
Reliability
- ChatGPT-4o had the highest average reliability score (3.2/4).
- Gemini 2.0 and DeepSeek v3 scored slightly lower (3.0 and 2.8 respectively), with DeepSeek showing more variability.
Discussion
AI chatbots hold promise for medical education and patient awareness but face challenges in specialized fields like ocular oncology. The comparable correctness across models suggests they can provide accurate information, yet the frequent incompleteness raises concerns about their clinical usefulness.
DeepSeek v3’s verbose and readable answers may aid patient understanding but risk including superfluous or non-specific content. ChatGPT-4o offers a more balanced approach, delivering concise and clinically relevant information, which is crucial for healthcare professionals and informed patients.
A major limitation is the lack of regional customization. Indian clinical protocols and resource constraints are rarely reflected, limiting the applicability of chatbot recommendations. This gap underscores the need to integrate localized guidelines and socioeconomic factors into AI training data.
Inconsistencies in response structure, especially with Gemini 2.0, can undermine user trust. Uniform formatting and transparency about data sources would enhance credibility.
Previous studies on AI in oncology and other medical specialties echo these findings, highlighting accuracy without sufficient depth or contextual relevance. The risk of incomplete or outdated guidance underscores the necessity of human oversight.
For AI chatbots to support clinical decisions safely, they require:
- Fine-tuning with domain-specific, peer-reviewed datasets.
- Integration with real-time medical databases and regional guidelines.
- Localization for economic, cultural, and infrastructural realities.
- Human-in-the-loop validation before clinical deployment.
Ongoing development of AI language models, along with new entrants such as Grok 4 and Perplexity, calls for continuous benchmarking to track improvements in reliability and relevance.
Limitations
This study reflects a snapshot based on single-day, single-prompt interactions, which may not capture model updates or variability with different inputs. Follow-up studies should include larger question sets, diverse data formats, and testing under ambiguous or adversarial conditions to better assess chatbot robustness.
Conclusions
The comparative evaluation reveals that while ChatGPT-4o, DeepSeek v3, and Gemini 2.0 can produce accurate ocular oncology information, significant gaps remain in completeness and applicability, especially in Indian clinical contexts.
DeepSeek v3 excels in producing detailed and readable content but lags in reliability. ChatGPT-4o balances accuracy and clinical relevance better but offers briefer responses. Gemini 2.0 shows inconsistent performance.
None of the current models are ready for unsupervised use in high-stakes ocular oncology scenarios. Continued refinement, regional adaptation, and expert supervision are essential before integrating AI chatbots into clinical practice.