Explainable AI for Accurate Differentiation of Voice Disorders Through Acoustic Analysis
Deep learning models analyzed voice recordings and classified voice disorders with 99.44% accuracy using explainable AI techniques. The result is a non-invasive, transparent diagnostic aid that supports clinicians.

Differentiability of Voice Disorders through Explainable AI
The voice reflects various health conditions, and detecting disorders early is crucial for effective treatment. Traditional phoniatric exams rely on acoustic analysis of vocal signals, but this requires specialist equipment and expertise. Recent advances in AI offer promising alternatives by analyzing voice recordings to identify pathologies automatically. This article explores a study in which deep learning and explainable AI techniques were applied to classify voice disorders with high accuracy and transparency.
Voice Disorders and Their Categories
Voice disorders arise from anatomical, functional, or paralytic issues affecting voice production. They can be broadly grouped into three categories relevant to this study:
- Hyperkinetic Dysphonia: Characterized by excessive muscular contraction, leading to a strained, labored voice quality. Conditions include vocal cord nodules, polyps, and Reinke’s edema.
- Hypokinetic Dysphonia: Caused by reduced vocal fold closure, resulting in a breathy, weak voice. Includes vocal fold paralysis, glottic insufficiency, and laryngitis.
- Reflux Laryngitis: Inflammation from gastric acid reflux causing chronic hoarseness and other symptoms.
Diagnosis usually involves laryngoscopy, an invasive procedure to inspect vocal fold anatomy. Acoustic analysis offers a non-invasive alternative by measuring voice features from recorded sounds.
Data and Methods
The study used the publicly available VOICED dataset, which contains recordings from 208 adults—150 with voice disorders and 58 healthy controls. Each participant provided a 5-second recording of the vowel /a/, captured with a mobile phone microphone in controlled conditions.
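For readers who want to work with the data directly, the sketch below shows how a single VOICED recording could be loaded with the Python wfdb package. The record name and file layout are assumptions to be checked against the dataset's PhysioNet documentation.

```python
# Minimal sketch of loading one VOICED recording with the wfdb package.
# The record name "voice001" and the local file layout are assumptions; see the
# dataset documentation on PhysioNet for the actual naming convention.
import wfdb

record = wfdb.rdrecord("voice001")   # reads voice001.dat / voice001.hea
signal = record.p_signal[:, 0]       # mono voice signal as a float array
fs = record.fs                       # sampling rate reported in the header

print(f"{len(signal)} samples at {fs} Hz "
      f"({len(signal) / fs:.1f} s of the sustained /a/ vowel)")
```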
Recordings were pre-processed to remove noise using a low-pass FIR filter with a Hanning window. Each 5-second audio sample was split into overlapping 250 ms segments, generating 36 segments per recording. These segments were converted into Mel spectrograms, a time-frequency representation that aligns with human auditory perception.
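The following Python sketch illustrates this pipeline. The filter order, cutoff frequency, segment overlap, and Mel parameters are not specified in the article and are chosen here only for illustration.

```python
# Illustrative pre-processing sketch: low-pass FIR filtering with a Hanning window,
# overlapping 250 ms segments, and one Mel spectrogram per segment.
# All numeric parameters below are assumptions, not the study's exact settings.
import numpy as np
import librosa
from scipy.signal import firwin, filtfilt

def preprocess(signal, fs, cutoff_hz=3000, n_taps=101,
               seg_len_s=0.25, hop_s=0.135, n_mels=64):
    # Low-pass FIR filter designed with a Hanning window
    taps = firwin(n_taps, cutoff_hz, window="hann", fs=fs)
    clean = filtfilt(taps, [1.0], signal)

    # Overlapping 250 ms segments; a ~135 ms hop yields roughly the 36 segments
    # per 5-second recording mentioned above (assumption)
    seg_len, hop = int(seg_len_s * fs), int(hop_s * fs)
    segments = [clean[i:i + seg_len]
                for i in range(0, len(clean) - seg_len + 1, hop)]

    # One Mel spectrogram (in dB) per segment
    return [librosa.power_to_db(
                librosa.feature.melspectrogram(y=seg.astype(np.float32), sr=fs,
                                               n_fft=512, hop_length=128,
                                               n_mels=n_mels))
            for seg in segments]
```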
For classification, transfer learning was applied with three pre-trained convolutional neural networks (CNNs): OpenL3, Yamnet, and VGGish. Models were fine-tuned on the 8-class problem, which includes seven voice disorder categories plus healthy voices. The dataset was split 70/30 for training and testing, with 5-fold cross-validation used to ensure robustness.
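As a rough illustration of the transfer-learning idea, the sketch below extracts OpenL3 embeddings and trains a lightweight classifier head on them. The study fine-tuned the pre-trained CNNs themselves, so this is a simplification; the `recordings` and `labels` variables are hypothetical placeholders, and all hyperparameters are assumptions.

```python
# Simplified transfer-learning sketch: frozen OpenL3 embeddings feed a small
# classifier head, with a 70/30 split and 5-fold cross-validation as in the study.
import numpy as np
import openl3
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

def embed(audio, fs):
    # One 512-dimensional embedding per analysis window, averaged per recording
    emb, _ = openl3.get_audio_embedding(audio, fs, content_type="env",
                                        input_repr="mel128", embedding_size=512)
    return emb.mean(axis=0)

# `recordings` (list of (signal, fs) pairs) and `labels` (8-class targets)
# are placeholders for the prepared VOICED data.
X = np.stack([embed(sig, fs) for sig, fs in recordings])
y = np.array(labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("5-fold CV accuracy:", cross_val_score(clf, X_train, y_train, cv=5).mean())
print("Held-out accuracy:", clf.score(X_test, y_test))
```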
Explainable AI (XAI) for Transparent Diagnosis
To address the black-box nature of deep networks, the study used an explainability technique called Occlusion Sensitivity. This method systematically masks parts of the input spectrogram and measures how the model's confidence changes. By averaging these sensitivity maps across samples, the researchers identified which time-frequency regions the model relied on for classification.
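A minimal, framework-agnostic sketch of occlusion sensitivity is shown below. The patch size, stride, fill value, and the `predict_proba` callable are assumptions rather than the study's exact settings.

```python
# Occlusion sensitivity sketch for a spectrogram classifier. `predict_proba`
# stands in for any model that maps a Mel spectrogram to class probabilities.
import numpy as np

def occlusion_sensitivity(spec, predict_proba, target_class,
                          patch=(8, 8), stride=(4, 4), fill=None):
    fill = spec.min() if fill is None else fill        # value used to mask a patch
    base = predict_proba(spec)[target_class]           # unoccluded confidence
    heat = np.zeros_like(spec, dtype=float)
    count = np.zeros_like(spec, dtype=float)

    # Slide an occluding patch over the spectrogram and record the confidence drop
    for i in range(0, spec.shape[0] - patch[0] + 1, stride[0]):
        for j in range(0, spec.shape[1] - patch[1] + 1, stride[1]):
            occluded = spec.copy()
            occluded[i:i + patch[0], j:j + patch[1]] = fill
            drop = base - predict_proba(occluded)[target_class]
            heat[i:i + patch[0], j:j + patch[1]] += drop
            count[i:i + patch[0], j:j + patch[1]] += 1

    return heat / np.maximum(count, 1)   # average confidence drop per time-frequency bin
```

Averaging the resulting maps over all test samples of a class yields the class-level relevance regions described above.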
This approach introduces the concept of differentiability, describing how distinct the features of different voice disorders are from the model’s perspective. Understanding these discriminative features aids clinicians in trusting AI decisions and may reveal new acoustic biomarkers for various pathologies.
Results
The OpenL3 model achieved the highest accuracy of 99.44% across all eight classes. While some classes like Glottic Insufficiency had slightly lower precision (~98.2%), overall performance remained excellent. Yamnet and VGGish also performed well but with marginally lower accuracy.
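Per-class figures like these can be obtained from held-out predictions with scikit-learn's classification report, continuing the simplified embedding sketch above.

```python
# Continuing the earlier sketch: per-class precision and recall on the held-out set.
from sklearn.metrics import classification_report

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, digits=3))
```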
Explainability maps highlighted that the model classifies voices based on the presence or absence of specific frequency patterns and vocal intensities within short 250 ms windows. These insights confirm that the model leverages physiologically relevant features rather than arbitrary cues.
Implications and Future Directions
This work demonstrates that combining transfer learning with explainability methods can produce highly accurate and interpretable voice disorder classifiers. Such tools can support clinicians by offering rapid, non-invasive screening, especially useful in telemedicine or resource-limited settings.
Although AI-based diagnosis does not replace specialist consultation or laryngoscopy, it provides valuable decision support and verification. Moreover, voice analysis as a biomarker extends beyond voice disorders; similar techniques could aid in detecting diseases like Parkinson’s or type 2 diabetes from vocal patterns.
Access to Data and Code
The VOICED dataset is publicly available for research purposes on PhysioNet. The related source code for generating Mel spectrograms, transfer learning models, and explainability maps can be found on Zenodo.