AI in Health Care: How Large Language Models Stack Up in Diagnosing, Coding, and Risk Prediction

AI models show high accuracy in generating diagnoses but struggle with medical coding and readmission risk prediction. Combining AI with human oversight is essential for safe healthcare use.

Categorized in: AI News, Healthcare
Published on: Aug 07, 2025

AI and Health Care: Assessing Patient Risks and Medical Coding with Large Language Models

A study published in the Journal of Medical Internet Research evaluated how several large language models (LLMs) perform on essential clinical tasks, including diagnosis generation, medical coding, and hospital readmission risk assessment.

Evaluating Clinical Tasks with AI

The study focused on five LLMs: DeepSeek-R1 and OpenAI-O3 (reasoning models), alongside ChatGPT-4, Gemini-1.5, and LLaMA-3.1 (non-reasoning models). Researchers tested these models using 300 hospital discharge summaries, providing them with structured clinical data such as chief complaints, medical and surgical history, lab results, and imaging reports.

The goal was to see whether these AI systems could accurately handle three tasks (a minimal prompting sketch follows the list):

  • Generate primary diagnoses
  • Predict ICD-9 medical codes
  • Stratify hospital readmission risk
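To make the setup concrete, a workflow of this kind might look like the Python sketch below. This is an illustration only: the prompt wording, field names, and model name are assumptions, not the study's actual protocol.

```python
# Illustrative sketch of prompting an LLM with structured discharge data.
# Prompt wording, field names, and the model name are assumptions,
# not the study's actual protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

summary = {
    "chief_complaint": "Shortness of breath for 3 days",
    "history": "CHF, type 2 diabetes, prior CABG",
    "labs": "BNP 1450 pg/mL, creatinine 1.4 mg/dL",
    "imaging": "Chest X-ray: pulmonary edema",
}

prompt = (
    "Given the structured discharge data below, return:\n"
    "1. The primary diagnosis\n"
    "2. The most likely ICD-9 code\n"
    "3. 30-day readmission risk (low / moderate / high)\n\n"
    + "\n".join(f"{k}: {v}" for k, v in summary.items())
)

response = client.chat.completions.create(
    model="gpt-4",  # placeholder; the study tested five different models
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

In the study, each model's outputs were then scored against the reference diagnoses, codes, and readmission outcomes for the 300 summaries.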

How Did the Models Perform?

Primary Diagnosis Generation

This was the strongest area for LLMs. Among non-reasoning models, LLaMA-3.1 led with an 85% accuracy rate, followed closely by ChatGPT-4 at 84.7%. OpenAI-O3, a reasoning model, outperformed all with a 90% accuracy rate.

ICD-9 Medical Code Prediction

All models struggled here. Accuracy dropped sharply, with LLaMA-3.1 at 42.6%, ChatGPT-4 at 40.6%, and Gemini-1.5 lagging at just 14.6%. The reasoning model OpenAI-O3 scored 45.3%, slightly better but still limited.

Hospital Readmission Risk Stratification

Risk prediction was challenging. Non-reasoning models hovered around 33-41% accuracy. Reasoning models performed better, with DeepSeek-R1 achieving 72.66% and OpenAI-O3 close behind at 70.66%.
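For context on these figures: accuracy here is simply the share of the 300 test cases where a model's output matched the reference label. A minimal sketch, using invented labels rather than the study's data:

```python
# Minimal accuracy calculation over a labeled test set.
# These example labels are invented for illustration; the study
# scored 300 real discharge summaries.
gold  = ["high", "low", "moderate", "high", "low"]
preds = ["high", "low", "high",     "high", "moderate"]

correct = sum(g == p for g, p in zip(gold, preds))
accuracy = correct / len(gold)
print(f"Accuracy: {accuracy:.2%}")  # -> Accuracy: 60.00%
```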

Key Insights for Healthcare Professionals

While LLMs show promising capabilities, especially in diagnosis generation, their current limitations in medical coding and risk prediction mean they cannot replace human expertise. Errors in coding can lead to billing mistakes and flawed analytics, while inaccurate readmission risk assessments may impact patient safety and discharge planning.

Liability and transparency remain concerns when AI systems make mistakes or generate misleading information, raising questions about accountability among developers, clinicians, and healthcare providers.

The Path Forward

Reasoning models generally outperformed non-reasoning ones, offering better interpretability and marginally improved accuracy. Still, reliability issues persist. The study suggests that improving LLM performance requires:

  • Task-specific fine-tuning on clinical datasets
  • Hybrid human-AI workflows to combine strengths (see the sketch after this list)
  • Bias detection and correction mechanisms
  • Continuous monitoring and governance frameworks
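As one illustration of the hybrid human-AI idea above: an LLM could propose a code along with a confidence score, and anything below a threshold is routed to a human coder. The threshold, data structure, and routing logic below are hypothetical, not drawn from the study.

```python
# Hypothetical hybrid coding workflow: the LLM proposes an ICD-9 code
# with a confidence score; low-confidence cases go to a human coder.
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.80  # assumed cutoff; would need tuning in practice

@dataclass
class CodeSuggestion:
    icd9_code: str
    confidence: float  # e.g., derived from model log-probabilities

def route(suggestion: CodeSuggestion) -> str:
    """Accept high-confidence codes; flag the rest for human review."""
    if suggestion.confidence >= REVIEW_THRESHOLD:
        return f"auto-accepted: {suggestion.icd9_code}"
    return f"sent to human coder: {suggestion.icd9_code}"

print(route(CodeSuggestion("428.0", 0.91)))   # auto-accepted: 428.0
print(route(CodeSuggestion("250.00", 0.55)))  # sent to human coder: 250.00
```

A design like this keeps the model's strength (fast first-pass suggestions) while preserving the human accountability the article argues is still necessary.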

Future research will explore repeated trials, larger datasets, and further tuning to boost stability and safety in clinical applications.

Balancing Optimism with Caution

As AI continues to influence healthcare delivery, it’s crucial to integrate these tools responsibly. Combining AI with human oversight can improve clinical workflows, but safeguards must protect patient safety and maintain trust.

Healthcare professionals interested in expanding their knowledge of AI applications in clinical settings can explore specialized training and courses. For practical AI skills and certification options, visit Complete AI Training.

