Pure Storage’s VP on Why Data Quality and Engineering Are Critical for AI Success

AI success depends on quality data, not just hardware. Continuous data engineering ensures accurate, relevant training sets for effective AI workloads.

Published on: Jun 25, 2025

Interview: Pure Storage on the AI Data Challenge Beyond Hardware

Successfully running artificial intelligence (AI) workloads involves more than just adding compute and storage power. While having enough processing capacity and fast data supply is essential, the true foundation lies in the quality of data used for AI training. Par Botes, vice-president of AI infrastructure at Pure Storage, highlights this critical point during a recent discussion at the company's Accelerate event in Las Vegas.

Botes stresses that enterprises must capture, organise, prepare, and align data effectively. Often, data is incomplete or unsuitable for the specific problems AI is meant to solve. We explore with him the role of data engineering, data management, and data lakehouses in ensuring datasets truly meet AI needs.

Key Storage Challenges in AI

Botes points out that creating AI systems capable of solving real problems depends heavily on how well data is organised and prepared. It’s not just about raw hardware; it’s about ensuring data flows efficiently to GPUs, which demand extremely high bandwidth to operate at full capacity.

“The hardest operational challenge is feeding GPUs data fast enough,” he explains. While high-end solutions are emerging, many enterprises face unfamiliar systems and skills gaps when adopting AI workloads.
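
For illustration only, here is a minimal sketch of one common way to keep GPUs fed on the training side: a PyTorch DataLoader with worker processes, pinned memory, and prefetching. The dataset class, batch size, and worker count are assumptions for the example, not anything Botes or Pure Storage prescribes.

```python
# Hypothetical sketch: keeping a GPU fed with a prefetching data pipeline.
# Dataset class, batch size and worker count are illustrative assumptions.
import torch
from torch.utils.data import Dataset, DataLoader

class TrainingShardDataset(Dataset):
    """Toy dataset standing in for training shards on shared storage."""
    def __init__(self, num_samples: int = 10_000):
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # In practice this would read a sample from fast shared storage.
        return torch.randn(1024), torch.randint(0, 10, (1,)).item()

loader = DataLoader(
    TrainingShardDataset(),
    batch_size=256,
    num_workers=8,          # parallel readers so the GPU never waits on I/O
    pin_memory=True,        # page-locked host memory for faster host-to-device copies
    prefetch_factor=4,      # each worker keeps 4 batches queued ahead
    persistent_workers=True,
)

device = "cuda" if torch.cuda.is_available() else "cpu"
for features, labels in loader:
    features = features.to(device, non_blocking=True)
    # ... forward/backward pass would go here ...
    break
```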

Why Are These Challenges Hard?

Feeding data to GPUs is just one part of the puzzle. Equally important is the task of preparing data: knowing where relevant data resides, assessing its accuracy and completeness, and tracking its lineage. This includes understanding exactly which datasets train which models and ensuring no critical data is missing.

“It’s not a scientific problem but an operational one,” Botes says. Many companies lack experience in these data practices, making it difficult to maintain data integrity in AI projects.

Does This Vary by Customer or Workload?

Yes. Some organisations have the expertise to know they possess the necessary data, while others may be uncertain. Botes shares his experience from autonomous vehicle projects, where performance issues revealed gaps in the training data for specific road conditions.

He stresses the need for principled methods to reason about data completeness and coverage, a practice not yet widespread outside top AI training companies.
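
As a hypothetical illustration of that kind of reasoning about coverage (not a method Botes describes), the sketch below tallies training samples per condition and flags categories that fall below a chosen threshold, using pandas. The column name and threshold are assumptions.

```python
# Hypothetical coverage audit: flag under-represented conditions in a
# training set. The "road_condition" column and the threshold are
# illustrative assumptions.
import pandas as pd

MIN_SAMPLES_PER_CONDITION = 5_000  # assumed coverage target

# Stand-in for metadata describing each training sample.
samples = pd.DataFrame({
    "sample_id": range(6),
    "road_condition": ["dry", "dry", "rain", "snow", "dry", "rain"],
})

counts = samples["road_condition"].value_counts()
gaps = counts[counts < MIN_SAMPLES_PER_CONDITION]

if gaps.empty:
    print("All conditions meet the coverage target.")
else:
    print("Under-represented conditions:")
    print(gaps.to_string())
```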

How Can Customers Address These Issues?

Botes recommends focusing on data engineering processes. This involves partnering with companies that specialise in data lakehouses — systems that ingest, clean, and prepare incoming data for AI training. By applying a disciplined approach to data engineering, organisations can ensure data is ready for AI workloads.

What Does Data Engineering Entail?

At its core, data engineering involves accessing datasets from various corporate databases and structured systems, ingesting them into a common format like a lakehouse, and transforming this data to create accurate, relevant training sets. It has become a distinct field requiring specialised skills.
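
A minimal sketch of that flow, purely for illustration: pull a table out of a relational database, apply a simple cleaning transform, and land it in a columnar format that lakehouse tooling can ingest. The connection string, table, and column names are assumptions.

```python
# Hypothetical ingest-and-transform step: database table -> cleaned Parquet
# file ready for a lakehouse. Connection string, table and column names are
# illustrative assumptions.
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@warehouse.example.com/sales")

# Extract: read the source table from a corporate database.
raw = pd.read_sql(
    "SELECT order_id, customer_id, amount, created_at FROM orders", engine
)

# Transform: drop incomplete rows and normalise types for training use.
clean = (
    raw.dropna(subset=["customer_id", "amount"])
       .assign(created_at=lambda df: pd.to_datetime(df["created_at"], utc=True))
)

# Load: write a columnar file that lakehouse tooling can pick up.
Path("landing").mkdir(exist_ok=True)
clean.to_parquet("landing/orders.parquet", index=False)
```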

Supporting Data Lakehouses with Storage

Cloud providers often offer data lakehouse solutions, while on-premises setups rely on system integrators and vendors. Pure Storage works with these partners to deliver integrated solutions combining lakehouse platforms with fast, reliable storage infrastructure. The goal is to connect these storage systems smoothly to AI training environments.

Is Data Engineering a One-Time Task?

Not at all. Data engineering and storage are closely linked and form an ongoing cycle. As AI systems consume data through methods like retrieval-augmented generation (RAG) or fine-tuning, new data must be continuously recorded, transformed, and incorporated back into the system. Models evolve in tandem with the data, making this a continuous process.

Botes highlights the importance of tracking data lineage—understanding where data originates and how it’s used. This supports quality assurance and future training cycles, creating an “AI flywheel” where data ingestion and computation are in constant rotation.
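
One simple, hypothetical way to start recording that lineage is a manifest written alongside each training run, listing the dataset files used, their content hashes, and the model version they produced. The file paths and field names below are assumptions for the sketch.

```python
# Hypothetical lineage manifest: record which dataset files (pinned by
# content hash) fed a given training run. Paths and field names are
# illustrative assumptions.
import hashlib
import json
import time
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Content hash so the manifest pins the exact bytes used for training."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(dataset_files: list[Path], model_version: str, out: Path) -> None:
    manifest = {
        "model_version": model_version,
        "created_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "inputs": [
            {"path": str(p), "sha256": sha256_of(p)} for p in dataset_files
        ],
    }
    out.write_text(json.dumps(manifest, indent=2))

# Example call for a hypothetical fine-tuning run:
# write_manifest([Path("landing/orders.parquet")], "support-bot-v3", Path("lineage/run-001.json"))
```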

Additional Advice for Customers

  • Know your data: Understand what your data represents and identify any gaps. AI can fill these gaps, but incorrect filling leads to hallucinations—false or misleading outputs.
  • Start simple: Even if you only use basic cloud AI services, begin by logging inputs and outputs; this forms the foundation of effective data management (see the sketch after this list).
  • Build data management early: Organising data from the start prepares your organisation for more advanced AI projects down the line.
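
The sketch below illustrates that "start simple" advice: a thin wrapper that appends every prompt and response to a JSONL log. The call_model function is a hypothetical stand-in for whichever cloud AI service you use, and the log path is an assumption.

```python
# Hypothetical input/output logging: append every prompt and response to a
# JSONL file so later data engineering has something to work with.
# call_model() is a stand-in for a real cloud AI service client.
import json
import time
from pathlib import Path

LOG_PATH = Path("logs/ai_io.jsonl")  # assumed log location

def call_model(prompt: str) -> str:
    """Placeholder for a real model/API call."""
    return f"(model reply to: {prompt})"

def logged_call(prompt: str) -> str:
    response = call_model(prompt)
    LOG_PATH.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "prompt": prompt,
        "response": response,
    }
    with LOG_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return response

print(logged_call("Summarise last quarter's support tickets."))
```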

Botes’ insights underline that practical, continuous data engineering and management are just as vital as hardware in achieving AI success.

For those looking to build skills in AI data handling and infrastructure, exploring dedicated training can be a strong starting point. Check out Complete AI Training's courses by skill to deepen your understanding.

