Table representation learning: solving the structured data paradox
Most organisations store years of structured data in relational databases and spreadsheets. It's tidy and searchable, yet the majority sits unused. As researcher Madelon Hulsebos puts it, "we don't know what we don't know." Her work at the Centrum Wiskunde & Informatica (CWI) shows why we miss insights, and how to fix it.
Hulsebos leads the Table Representation Learning Lab at CWI with a team of PhD students, postdocs and master's students. After seeing data scientists repeatedly clean, link and rework the same tables, she built methods that let AI understand what tables mean, not just how they're labelled.
The core issue: structure isn't the problem, schema diversity is
Structured data was supposed to be "easy." It isn't. Each system uses different column names, units, and conventions. Keyword search, SQL snippets and pattern matching break down when context matters.
Table representation learning addresses that gap. Instead of matching strings, models learn the semantics of columns and tables: what "city", "place" and "municipality" actually refer to in context, and generalise across systems to find what's relevant.
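To make that concrete, here is a minimal sketch of semantic column matching with a general-purpose sentence-embedding model. The model choice, column names and threshold-free matching are illustrative assumptions, not the lab's actual setup.

```python
# Minimal sketch: match columns by meaning rather than by name.
# Assumes the sentence-transformers package; the model choice is illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Column names from two systems that never agreed on conventions.
system_a = ["city", "annual_revenue_eur", "customer_since"]
system_b = ["municipality", "yearly turnover (EUR)", "client start date"]

emb_a = model.encode(system_a, normalize_embeddings=True)
emb_b = model.encode(system_b, normalize_embeddings=True)

# Cosine similarity; embeddings are normalised, so a dot product suffices.
scores = emb_a @ emb_b.T
for i, col in enumerate(system_a):
    j = int(np.argmax(scores[i]))
    print(f"{col!r} best matches {system_b[j]!r} (score {scores[i, j]:.2f})")
```

String matching would score "city" against "municipality" at zero; the embeddings pick up the shared meaning instead.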
From information retrieval to insight retrieval
Finding the right table is step one. Turning that into an answer usually requires joins, transformations, validation and feature engineering. Hulsebos calls this shift "insight retrieval": automatically assembling the data needed to answer a question, and explaining why the answer is credible.
Full automation isn't the goal. People need transparency and a clear path to audit the steps taken. Explanations, iteration and provenance matter as much as accuracy.
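What a "clear path to audit" might carry with it: the sketch below shows a hypothetical provenance record attached to every answer. Every field name here is an assumption for illustration, not a published schema.

```python
# Hypothetical provenance record for an automatically assembled answer.
# Field names are illustrative; the point is that every step is auditable.
from dataclasses import dataclass, field


@dataclass
class AnswerProvenance:
    question: str                      # the plain-language question asked
    tables_used: list[str]             # source tables retrieved
    join_plan: str                     # how the tables were combined
    transformations: list[str]         # cleaning/derivation steps applied
    policies_checked: list[str] = field(default_factory=list)
    caveats: list[str] = field(default_factory=list)


record = AnswerProvenance(
    question="Which regions grew fastest last quarter?",
    tables_used=["sales.orders_2024", "crm.regions"],
    join_plan="orders_2024.region_id = regions.id",
    transformations=["normalised currency to EUR", "dropped test accounts"],
    policies_checked=["no row-level customer PII in output"],
)
```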
DataLibra (2024-2029): making structured data as searchable as the web
With support including an NWO AiNed Fellowship Grant, Hulsebos launched DataLibra to deliver both research and usable tools. The aim: let anyone ask a plain-language question and retrieve the right data across systems-without knowing SQL or the underlying schema.
This directly tackles the 80/20 split in data science: automate the 80% (data cleaning, validation, linking and transformation) so experts can focus on modelling, ethics and decision-making. It also broadens access, giving non-specialists a way to get answers without waiting on dashboards or bespoke pipelines.
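For a flavour of what automating that 80% means in code, here is a minimal profiling-and-cleaning sketch in pandas. The checks are deliberately simple examples, not a production pipeline.

```python
# Minimal sketch: automate routine profiling and cleaning with pandas.
# The checks are illustrative examples of the "80%" work, not a full pipeline.
import pandas as pd

df = pd.DataFrame({
    "city": ["Amsterdam", "amsterdam ", None, "Utrecht"],
    "revenue_eur": ["1200", "980", "n/a", "1500"],
})

# Profile each column: missing values and distinct counts.
report = {
    col: {
        "missing": int(df[col].isna().sum()),
        "unique": int(df[col].nunique(dropna=True)),
    }
    for col in df.columns
}

# Standard cleaning steps: trim whitespace, unify case, coerce numerics.
df["city"] = df["city"].str.strip().str.title()
df["revenue_eur"] = pd.to_numeric(df["revenue_eur"], errors="coerce")

print(report)
print(df)
```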
Why most current "AI for data" features fall short
Many tools promise natural-language-to-SQL and autonomous analysis. Benchmarks tell a different story. Demos often look good, but success rates on real tasks can be near zero. The missing ingredients are reliability, context, and explanations you can trust.
LLMs will always give an answer. The real test is whether the system can show its reasoning, cite sources, and align with policy and business rules.
Case study: sensitive data detection for the United Nations
The UN's Centre for Humanitarian Data supports crisis response by sharing datasets via the Humanitarian Data Exchange. The risk: "sensitive" isn't just personal data. In a conflict zone, precise hospital coordinates can endanger lives. What's sensitive depends on place, time and use.
Hulsebos and master's student Liang Telkamp developed two mechanisms. First, context-aware reasoning to cut false positives: a company address might be public and safe, while similar-looking data elsewhere is risky. Second, a "retrieve then detect" approach that pulls relevant, up-to-date policies and applies them to each dataset.
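The general shape of "retrieve then detect" might look like the sketch below: embed policy passages, retrieve the ones most relevant to a dataset, then have a model judge against them. The policies are invented, the final LLM judgement is left as a placeholder, and none of this is the UN system's actual code.

```python
# Sketch of "retrieve then detect": fetch the policy passages most relevant
# to a dataset's description, then judge sensitivity against those passages.
# The policies are invented and llm_judge is a placeholder, not the UN system.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

policies = [
    "Exact facility coordinates in conflict zones must not be published.",
    "Names and phone numbers of beneficiaries are personal data.",
    "Aggregated regional statistics may be shared openly.",
]
policy_emb = model.encode(policies, normalize_embeddings=True)

def retrieve_then_detect(dataset_description: str, top_k: int = 2) -> list[str]:
    """Return the policy passages most relevant to this dataset."""
    q = model.encode([dataset_description], normalize_embeddings=True)
    ranked = np.argsort(-(q @ policy_emb.T)[0])
    relevant = [policies[i] for i in ranked[:top_k]]
    # In a full system, an LLM would now judge the dataset against each
    # retrieved passage and explain its verdict, e.g.:
    # verdict = llm_judge(dataset_description, relevant)  # placeholder
    return relevant

print(retrieve_then_detect("Hospital locations (lat/lon) in an active conflict area"))
```

Because the policies are retrieved rather than hard-coded, updating a protocol document immediately changes what the detector checks.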
The results helped UN Quality Assessment Officers: the system extracts the right rules from long protocols and explains why a dataset should be restricted. It showed strong performance on personal information and meaningful gains on situational sensitivity. Telkamp's contributions earned the Amsterdam AI Thesis Award, and the work is now being integrated at the UN.
Why this matters for every research and science organisation
Two issues limit insight flow today. First, people only ask questions about data they already know exists. Second, the people who know the data rarely face the business or research question directly. That disconnect stalls progress.
Plain-language access to relational data changes that. If a sales lead, lab manager or program officer can ask a question and get a traceable answer in minutes, "speed to insight" improves across the board.
Tools on the way: open-source building blocks
Hulsebos's group is releasing open-source tools in the coming months. One PhD project focuses on automated dataset retrieval and natural-language-to-SQL generation. The goal is practical utility, not just papers.
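For orientation, the common NL-to-SQL pattern looks roughly like the sketch below: give the model the schema as context, validate, then execute the generated query. The generate_sql function is a placeholder for an LLM call; this is not the group's forthcoming tool.

```python
# Minimal sketch of the common NL-to-SQL pattern: give the model the schema
# as context, then run the generated query. generate_sql is a placeholder
# for an LLM call; this is not the lab's forthcoming tool.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, city TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "Amsterdam", 120.0), (2, "Utrecht", 80.0)])

schema = "orders(id INTEGER, city TEXT, amount REAL)"
question = "What is the total order amount per city?"

def generate_sql(question: str, schema: str) -> str:
    # Placeholder: a real system would prompt an LLM with the schema and
    # question, then validate the SQL before running it.
    return "SELECT city, SUM(amount) FROM orders GROUP BY city"

sql = generate_sql(question, schema)
for row in conn.execute(sql):
    print(row)
```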
She also developed DataScout at UC Berkeley: task-based dataset search with LLMs. In user studies, it outperformed keyword search on traditional data platforms and cut the time to find training data for ML models from weeks to days.
How to operationalise table representation learning
- Start with high-value questions. Inventory the recurring questions teams ask and the sources they repeatedly pull from.
- Create semantic signals. Profile columns, infer types, units, ranges and entities. Store embeddings for columns/tables to enable semantic matching.
- Build "retrieve then answer." Retrieve candidate tables, join plans and related policies; then produce the answer with an explanation and lineage.
- Enforce policy at query time. Dynamically apply sensitivity rules by location, time and context; don't hard-code them.
- Log everything. Keep query plans, joins, filters and prompts. Make it easy to audit and reproduce answers.
- Evaluate like a scientist. Use held-out questions, perturbations and unit tests for joins and type inference. Track precision/recall, not just demos.
- Keep a human in the loop. Allow review of joins, transformations and final answers before decisions depend on them.
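A sketch tying several of these steps together: retrieve candidate tables, enforce policy at query time and log every step. Every name, catalogue entry and rule here is illustrative; real retrieval would use the semantic matching shown earlier.

```python
# Sketch combining the steps above: retrieve candidates, enforce policy at
# query time, and log everything. All names and rules here are illustrative.
import json
import time

# Policy rules evaluated per table at query time, not baked into the data.
POLICIES = {
    "restrict_coordinates": lambda table: "lat" in table["columns"],
}

CATALOG = [
    {"name": "hospitals", "columns": ["name", "lat", "lon"],
     "desc": "facility locations"},
    {"name": "regions", "columns": ["region", "population"],
     "desc": "regional statistics"},
]

def answer(question: str) -> dict:
    audit = {"question": question, "ts": time.time(), "steps": []}
    # 1. Retrieve candidate tables (keyword overlap stands in for the
    #    semantic matching sketched earlier).
    words = question.lower().split()
    candidates = [t for t in CATALOG if any(w in t["desc"] for w in words)]
    audit["steps"].append({"retrieved": [t["name"] for t in candidates]})
    # 2. Enforce policy at query time rather than hard-coding restrictions.
    allowed = [t for t in candidates
               if not any(rule(t) for rule in POLICIES.values())]
    audit["steps"].append({"passed_policy": [t["name"] for t in allowed]})
    # 3. Log everything so the answer can be audited and reproduced.
    print(json.dumps(audit, indent=2))
    return {"tables": allowed, "audit": audit}

answer("regional statistics on population")
```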
What success looks like
- Query time in minutes, not weeks. Answers have citations, join logic and policy checks attached.
- Lower false positives on sensitive data through context-aware detection.
- Less dependency on ad-hoc dashboards and one-off SQL. More direct access for domain experts.
- Data scientists spending more time on modelling and ethics, less on cleaning and rework.
The bigger picture
LLMs are trained on scraped public data. That makes careful handling of sensitive datasets non-negotiable: once something enters training corpora, you can't easily remove it. Context-first sensitivity checks and policy-aware retrieval help prevent bad exposure and enable safe sharing.
The promise of table representation learning is simple: connect questions to the right data, then to the right insight, with clear reasoning in between. That's how we start using the vast majority of data we currently ignore.
Want to skill up for applied AI in data analysis? Explore focused training for researchers and data teams: AI Certification for Data Analysis.