AI and Health Care: Assessing Patient Risks and Medical Coding with Large Language Models
Evaluating Clinical Tasks with AI
The study focused on five LLMs: DeepSeek-R1 and OpenAI-O3 (reasoning models), alongside ChatGPT-4, Gemini-1.5, and LLaMA-3.1 (non-reasoning models). Researchers tested these models using 300 hospital discharge summaries, providing them with structured clinical data such as chief complaints, medical and surgical history, lab results, and imaging reports.
The goal was to see whether these AI systems could accurately perform three tasks (a rough prompting sketch follows the list):
- Generate primary diagnoses
- Predict ICD-9 medical codes
- Stratify hospital readmission risk
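As a rough illustration of this setup, the sketch below assembles one prompt from structured discharge-summary fields and requests all three outputs. It is a minimal sketch, not the study's actual protocol: the field names, prompt wording, risk labels, and use of the OpenAI Python client are all assumptions.

```python
# Minimal sketch of the evaluation setup, assuming the OpenAI Python
# client and hypothetical field names; the study's actual prompts and
# data schema are not reproduced here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_prompt(summary: dict) -> str:
    """Flatten structured discharge-summary fields into one prompt."""
    return (
        f"Chief complaint: {summary['chief_complaint']}\n"
        f"Medical/surgical history: {summary['history']}\n"
        f"Lab results: {summary['labs']}\n"
        f"Imaging reports: {summary['imaging']}\n\n"
        "1. State the most likely primary diagnosis.\n"
        "2. Predict the primary ICD-9 code.\n"
        "3. Stratify hospital readmission risk as low, medium, or high."
    )

def query_model(summary: dict, model: str = "gpt-4") -> str:
    """Send the prompt to a chat model and return its raw answer."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(summary)}],
    )
    return response.choices[0].message.content
```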
How Did the Models Perform?
Primary Diagnosis Generation
This was the strongest area for LLMs. Among non-reasoning models, LLaMA-3.1 led with an 85% accuracy rate, followed closely by ChatGPT-4 at 84.7%. OpenAI-O3, a reasoning model, outperformed all with a 90% accuracy rate.
ICD-9 Medical Code Prediction
All models struggled here. Accuracy dropped sharply, with LLaMA-3.1 at 42.6%, ChatGPT-4 at 40.6%, and Gemini-1.5 lagging at just 14.6%. The reasoning model OpenAI-O3 scored 45.3%, slightly better but still limited.
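For a concrete sense of how such accuracy figures can be computed, here is one possible scoring scheme. The matching criterion is an assumption: this sketch reports both exact-match accuracy on the full code and a looser match on the three-digit ICD-9 category, since evaluations differ on which they use.

```python
def icd9_category(code: str) -> str:
    """Three-digit ICD-9 category, e.g. '428.0' -> '428'."""
    return code.split(".")[0]

def score_predictions(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """pairs: (predicted_code, gold_code) for each discharge summary."""
    n = len(pairs)
    exact = sum(pred == gold for pred, gold in pairs)
    category = sum(icd9_category(pred) == icd9_category(gold)
                   for pred, gold in pairs)
    return {"exact_accuracy": exact / n, "category_accuracy": category / n}

# Two of three exact matches, all three in the right category:
print(score_predictions([("428.0", "428.0"),
                         ("250.00", "250.00"),
                         ("401.9", "401.1")]))
# -> {'exact_accuracy': 0.666..., 'category_accuracy': 1.0}
```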
Hospital Readmission Risk Stratification
Risk prediction was challenging. Non-reasoning models hovered around 33-41% accuracy. Reasoning models performed better, with DeepSeek-R1 achieving 72.66% and OpenAI-O3 close behind at 70.66%.
Key Insights for Healthcare Professionals
While LLMs show promising capabilities, especially in diagnosis generation, their current limitations in medical coding and risk prediction mean they cannot replace human expertise. Errors in coding can lead to billing mistakes and flawed analytics, while inaccurate readmission risk assessments may impact patient safety and discharge planning.
Liability and transparency remain concerns when AI systems make mistakes or generate misleading information, raising questions about accountability among developers, clinicians, and healthcare providers.
The Path Forward
Reasoning models generally outperformed non-reasoning ones, offering better interpretability, modest gains in diagnosis and coding accuracy, and a substantial advantage in readmission risk stratification. Still, reliability issues persist. The study suggests that improving LLM performance requires:
- Task-specific fine-tuning on clinical datasets
- Hybrid human-AI workflows to combine strengths (see the sketch after this list)
- Bias detection and correction mechanisms
- Continuous monitoring and governance frameworks
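As one way to picture a hybrid human-AI workflow, the sketch below auto-accepts a model output only when a confidence score clears a threshold and otherwise flags it for clinician review. The confidence score and threshold are hypothetical; the study does not prescribe a specific mechanism.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    task: str          # "diagnosis", "icd9", or "readmission_risk"
    value: str
    confidence: float  # hypothetical calibrated score in [0, 1]

def route(pred: Prediction, threshold: float = 0.9) -> str:
    """Auto-accept only high-confidence outputs; otherwise escalate."""
    if pred.confidence >= threshold:
        return f"auto-accept {pred.task}: {pred.value}"
    return f"flag {pred.task} for clinician review: {pred.value}"

print(route(Prediction("icd9", "428.0", 0.62)))               # flagged for review
print(route(Prediction("diagnosis", "heart failure", 0.95)))  # auto-accepted
```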
Future research will explore repeated trials, larger datasets, and further tuning to boost stability and safety in clinical applications.
Balancing Optimism with Caution
As AI continues to influence healthcare delivery, it's crucial to integrate these tools responsibly. Combining AI with human oversight can improve clinical workflows, but safeguards must protect patient safety and maintain trust.
Healthcare professionals interested in expanding their knowledge of AI applications in clinical settings can explore specialized training and courses. For practical AI skills and certification options, visit Complete AI Training.