Metadata management is now the make-or-break factor for AI at scale
AI isn't stalling because of models. It's stalling because leaders can't see, trust or reuse their data at pace. In a discussion at the Future of Data Platforms Summit, Frederic Van Haren, CTO and founder of HighFens, put it simply: without usable metadata, scaling AI increases cost and complexity faster than it increases value.
Teams are collecting more data than they can store and process. Relevance is contextual. What's "good" for one model can ruin another. Without lineage, usage and intent captured as metadata, teams guess. Guessing slows iteration and drives waste.
Why this matters to management
- Cost control: Storage, egress and GPU time spike when you can't find or trust the right data.
- Speed: Engineers spend cycles searching, reconciling and reprocessing instead of shipping features.
- Risk: Missing lineage and ownership make compliance, audits and incident response painful.
- Outcome quality: Data that improves one use case can degrade another; metadata clarifies fit-for-purpose.
Formats help. Metadata wins.
Best-of-breed stacks and open formats reduce friction. But format conversion still burns time: at petabyte scale, many teams spend close to half their engineering effort converting data from one format to another. That work doesn't compound value.
As training and inference split across on-prem, cloud and edge (often on Nvidia-powered platforms), metadata becomes the only consistent layer. It ties together where data is created, how it's processed and where it's reused over time. Formats move bytes. Metadata moves decisions.
Executive 90-day playbook
- Stand up the metadata layer: Pick a catalog and lineage backbone. Publish a business glossary. Require owners for critical datasets.
- Instrument pipelines: Capture lineage, schema changes, usage telemetry and model-data relationships for top AI workloads.
- Define "fit-for-purpose" quality: Replace vague "good data" with use-case criteria: freshness, coverage, bias thresholds and SLAs.
- Make costs visible: Tag datasets and jobs with cost centers. Track storage, egress and GPU hours per project and per model version.
- Set lifecycle policies: Tier, archive or drop low-value data. Don't pay to store what you won't process.
- Create a small "metadata ops" squad: Data engineering, MLOps and governance working from one backlog and shared KPIs.
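The first playbook steps amount to making ownership, cost tags and fit-for-purpose criteria mandatory fields on every critical dataset. A minimal sketch of what such a catalog entry might look like (the field names here are illustrative, not a standard schema):

```python
from dataclasses import dataclass, field

# A minimal catalog entry capturing the playbook's requirements:
# a named owner, a cost center for chargeback, an explicit quality SLA,
# and lineage in both directions. Field names are illustrative.
@dataclass
class DatasetEntry:
    name: str
    owner: str                      # accountable team (required, never blank)
    cost_center: str                # ties storage/egress/GPU spend to a budget
    freshness_sla_hours: int        # one concrete "fit-for-purpose" criterion
    upstream: list = field(default_factory=list)   # lineage: producing jobs/datasets
    consumers: list = field(default_factory=list)  # models and jobs that read it

entry = DatasetEntry(
    name="clickstream_daily",
    owner="growth-data-eng",
    cost_center="ml-platform",
    freshness_sla_hours=24,
    upstream=["raw_events"],
    consumers=["ranking_model_v3"],
)
```

The point is not the data structure but the contract: if a dataset cannot be registered with an owner and a cost center, it does not enter the catalog.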
What to measure each quarter
- Time-to-discover and approve datasets for a new model (days to hours).
- % of critical datasets with owner, SLA and complete lineage.
- Model retrain cycle time and success rate (fewer blocked runs due to data issues).
- Cost per successful training run and per 1,000 inferences.
- Reprocessing hours per TB and number of duplicate datasets eliminated.
- Volume of data archived or dropped vs. newly added (net useful growth).
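Two of these KPIs fall straight out of job telemetry once runs are tagged. A sketch, assuming your pipeline exports per-run records with a status and a cost (the record fields are assumptions, not a standard schema):

```python
# Hypothetical per-run records exported by pipeline telemetry.
runs = [
    {"status": "success", "cost_usd": 1200.0},
    {"status": "failed_data", "cost_usd": 300.0},  # run blocked by a data issue
    {"status": "success", "cost_usd": 1500.0},
]

successes = [r for r in runs if r["status"] == "success"]
total_cost = sum(r["cost_usd"] for r in runs)

# Cost per successful training run: failed runs still burned budget,
# so their cost is charged against the successes.
cost_per_successful_run = total_cost / len(successes)

# Success rate: directly exposes runs blocked by data issues.
success_rate = len(successes) / len(runs)

print(cost_per_successful_run)      # 1500.0
print(round(success_rate, 2))       # 0.67
```

Charging failed runs against successful ones is deliberate: it makes data-quality failures visible as a cost line, not just an engineering annoyance.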
Common pitfalls to avoid
- Buying tools before defining ownership, policies and KPIs.
- Treating metadata as a one-time project instead of a product with a roadmap.
- Ignoring inference: tracing only training pipelines leaves a blind spot in production.
- Copying data across clouds to "make it easy" instead of cataloging and virtualizing access.
- Over-focusing on format conversion instead of reducing it with standards and contracts.
Practical tech notes from the field
Think in two layers: metadata (what it is, where it is, who uses it, why it exists) and the actual bits and bytes. Expect increasing focus on the first layer through 2026. As your footprint spreads, with training on-prem and inference in the cloud or at the edge, keep metadata unified so teams share context, not assumptions.
Recommended starting points
- Adopt an open lineage standard to make cross-tool visibility possible. See OpenLineage.
- For edge and cloud inference planning, review platform guidance from Nvidia TensorRT.
- Level up skills across roles with focused training paths: AI courses by job function.
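To make the OpenLineage recommendation concrete: the standard defines run events as plain JSON, so cross-tool visibility starts with emitting a well-formed event per job run. A minimal sketch built with the standard library only; the namespaces, job names and backend endpoint mentioned in the comments are made up for illustration:

```python
import json
import uuid
from datetime import datetime, timezone

# A minimal OpenLineage-style RunEvent as plain JSON. Top-level field
# names follow the OpenLineage event model (eventType, eventTime, run,
# job, inputs, outputs); the values here are hypothetical.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "producer": "https://example.com/metadata-ops",  # hypothetical producer URI
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "ml-platform", "name": "train_ranking_model"},
    "inputs": [{"namespace": "warehouse", "name": "clickstream_daily"}],
    "outputs": [{"namespace": "models", "name": "ranking_model_v3"}],
}

# POST this payload to your lineage backend's collection endpoint.
payload = json.dumps(event)
```

Emitting one such event at the start and end of every training and inference job is what turns "complete lineage" from an aspiration into a queryable graph.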
The takeaway
AI scale is a metadata problem disguised as a compute problem. The teams that win will treat metadata as a first-class product, wire it into daily work and fund it with the same seriousness as their models and GPUs. Do that, and you convert data growth into business outcomes without setting fire to your budget.