AI-Driven Models Deliver Unprecedented Accuracy in Predicting Data Storage System Performance
Researchers developed AI models that predict data storage system performance from real-world measurements, with errors of 4–10% for IOPS and 3–16% for latency. This supports efficient data center management and reduces the need for costly physical tests.

AI Enables Precise Modelling of Data Storage System Performance
27 Jun 2025
Data storage systems are critical for safeguarding and quickly accessing vast amounts of information. These systems consist of multiple components such as controllers, HDDs, SSDs, and cache memory that coordinate to deliver efficient operation.
To optimize performance, it's essential to accurately predict system behavior under varying workloads. Researchers at the HSE Faculty of Computer Science have developed a novel approach using generative machine learning models to forecast key performance metrics like input/output operations per second (IOPS) and latency with high precision.
Two-Stage Modelling Approach
The modelling process involves two main steps. First, performance data is collected by testing the system under different loads and configurations. Second, this dataset is used to train two specialized generative models:
- CatBoost regression model: Excels with tabular data and predicts average performance values and deviations accurately.
- Normalizing flow model: Generates a full distribution of possible outcomes, capturing uncertainties and variability in the data.
This combination provides a comprehensive understanding of how the system performs across scenarios.
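To make the difference between a point summary and a full distribution concrete, here is a minimal sketch of the simplest possible normalizing flow: a single affine transform of standard Gaussian noise, fitted by maximum likelihood. It is an illustration under stated assumptions, not the researchers' model. A practical flow stacks many learned invertible transforms and conditions them on the workload and configuration, whereas this one is unconditional and reduces to a plain Gaussian. The IOPS values are synthetic, and the function names are hypothetical.

```python
import math
import random
import statistics


def fit_affine_flow(samples):
    """Fit x = mu + sigma * z, with z ~ N(0, 1), by maximum likelihood."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)  # the ML estimate is the population std
    return mu, sigma


def log_prob(x, mu, sigma):
    """Log-density of x from the change-of-variables formula."""
    z = (x - mu) / sigma                                # inverse transform
    log_base = -0.5 * (z * z + math.log(2 * math.pi))   # log N(z; 0, 1)
    return log_base - math.log(sigma)                   # minus log |dx/dz|


def sample(mu, sigma, n, rng):
    """Generate n outcomes by pushing base noise through the transform."""
    return [mu + sigma * rng.gauss(0.0, 1.0) for _ in range(n)]


# Synthetic IOPS measurements for one fixed workload (assumed values).
rng = random.Random(0)
measured_iops = [rng.gauss(12000.0, 400.0) for _ in range(500)]

mu, sigma = fit_affine_flow(measured_iops)
generated = sample(mu, sigma, 1000, rng)

print(f"fitted mean={mu:.0f}, std={sigma:.0f}")
print(f"generated mean={statistics.fmean(generated):.0f}")
```

The fitted mean and deviation are the kind of summary a regression model reports, while `sample` and `log_prob` show what a generative model adds: new plausible outcomes and a likelihood for any observed value.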
Advantages of the Approach
Many existing methods require detailed knowledge of system internals, which is often restricted due to manufacturer confidentiality. This method instead trains models directly on real-world measurement data, so the same approach can be applied to any data storage system. The study itself used 300,000 performance measurements.
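The predictions were validated against Little's law from queuing theory, which states that the average number of requests in a system equals the arrival rate multiplied by the average time each request spends in the system. The predictions were consistent with this law. Prediction errors were low, ranging from 4–10% for IOPS and 3–16% for latency, and the correlation between predicted and observed values reached 0.99.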
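A Little's law check of this kind can be sketched in a few lines. The article does not describe how the researchers set up their validation, so the following is only an assumed form: it treats the configured number of outstanding requests as the expected value of IOPS multiplied by mean latency, and reports the relative gap. All numbers and function names are hypothetical.

```python
def implied_queue_depth(iops, latency_ms):
    """Little's law, L = lambda * W: average number of requests in flight."""
    return iops * (latency_ms / 1000.0)


def little_law_gap(iops, latency_ms, queue_depth):
    """Relative gap between implied and configured in-flight requests."""
    implied = implied_queue_depth(iops, latency_ms)
    return abs(implied - queue_depth) / queue_depth


# Hypothetical predictions:
# (predicted IOPS, predicted mean latency in ms, configured outstanding requests)
predictions = [
    (16000.0, 2.0, 32),
    (8000.0, 1.0, 8),
    (20000.0, 3.5, 64),
]

for iops, latency_ms, depth in predictions:
    implied = implied_queue_depth(iops, latency_ms)
    gap = little_law_gap(iops, latency_ms, depth)
    print(f"depth={depth:3d} implied={implied:6.1f} gap={gap:.1%}")
```

A small gap means the predicted IOPS and latency are mutually consistent for that load, which is a useful sanity check because it needs no additional measurements.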
Practical Implications
This approach supports efficient data center management by enabling:
- Reliable prediction of system behavior under varying loads
- Early identification of potential performance bottlenecks
- Optimization of power consumption
- Reduction in costly physical experiments for performance evaluation
The researchers have made their experimental code and performance measurements publicly available, promoting transparency and further development.
This work was conducted as part of the Mirror Laboratories project at HSE University, focusing on improving data center and storage efficiency using artificial intelligence.