MIT dataset of one million charts helps smaller AI models outperform larger commercial rivals

MIT and MIT-IBM researchers released ChartNet, a dataset of over one million charts built to train AI models on reading business and scientific figures. Smaller models trained on it outperformed larger commercial alternatives on chart tasks.

Published on: Jun 03, 2026
MIT researchers release ChartNet dataset to improve AI chart interpretation

Researchers from MIT and the MIT-IBM Computing Research Lab developed ChartNet, a training dataset of more than one million charts designed to teach vision-language models how to interpret business and scientific figures. The dataset includes synthetic charts, their underlying code, text descriptions, numerical tables, and question-and-answer pairs.

The work addresses a real performance gap. Current generative AI models, including large language models, often struggle with chart interpretation because the task requires integrating visual, numerical, and linguistic understanding at once. Even companies deploying state-of-the-art models still receive inaccurate or incomplete information from charts.

Smaller models now outperform commercial alternatives

When trained on ChartNet, smaller open-source models consistently outperformed much larger commercial models on chart reconstruction, data extraction, summarization, and question-answering tasks. This matters for smaller firms with limited budgets that cannot afford expensive proprietary systems.

The dataset includes a selection of charts annotated by human experts, allowing practitioners to fine-tune models for specific applications. An automated quality check verifies that the code behind each synthetic chart executes and that the synthetic data are accurate.
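The article does not describe how the quality check works. As a rough illustration of the "executable" half only, the sketch below runs a piece of generated code in a fresh Python process and keeps it only if it exits cleanly. The function name `is_executable`, the timeout, and the two sample snippets are assumptions made for this example; they are not part of ChartNet, and checking accuracy would need a separate step.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def is_executable(source: str, timeout_s: float = 10.0) -> bool:
    """Return True if `source` runs to completion in a fresh Python process."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "chart.py"
        script.write_text(source, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,          # any files the script writes stay in the temp dir
                capture_output=True,  # keep the script's output off our console
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0


# Hypothetical generated snippets: one valid, one with a syntax error.
good = "values = [1, 2, 3]\nprint(sum(values))\n"
bad = "values = [1, 2, 3\nprint(sum(values))\n"  # unclosed bracket

samples = {"good": good, "bad": bad}
kept = {name: src for name, src in samples.items() if is_executable(src)}
print(sorted(kept))
```

Running each sample in its own process means a crash or an endless loop in one generated script cannot take down the filter itself.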

How the dataset was built

Researchers created a two-step pipeline to generate synthetic data. First, their system translates existing chart images into code. Then it iteratively modifies that code to change chart type, data values, topic, colors, and other attributes.

"We can start from a single chart and come up with hundreds of augmentations of it," explained Jovana Kondic, the MIT graduate student who led the work. "This is how we were able to build a dataset with more than a million diverse images."
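To make the second step of the pipeline concrete, here is a minimal sketch of how one chart's code can fan out into many variants by changing chart type, color, topic, and data values. Everything in it is assumed for illustration: the dictionary-based spec, the attribute lists, and the helper names `augment` and `to_code` are hypothetical, and a fixed grid of combinations stands in for the iterative, model-driven edits the researchers describe.

```python
import itertools
import random

# Hypothetical stand-in for the code recovered from one chart image.
BASE_SPEC = {
    "chart_type": "line",
    "topic": "Quarterly revenue",
    "color": "tab:blue",
    "values": [12.0, 15.5, 14.2, 18.9],
}

# Assumed attribute pools; the real pipeline varies many more attributes.
CHART_TYPES = ["line", "bar", "scatter"]
COLORS = ["tab:blue", "tab:orange", "tab:green"]
TOPICS = ["Quarterly revenue", "Monthly rainfall", "Reaction yield"]


def perturb_values(values, rng, scale=0.2):
    """Return a copy of `values` with each entry jittered by up to +/- scale."""
    return [round(v * (1 + rng.uniform(-scale, scale)), 2) for v in values]


def augment(spec, seed=0):
    """Yield one variant of `spec` per combination of chart type, color, and topic."""
    rng = random.Random(seed)
    for chart_type, color, topic in itertools.product(CHART_TYPES, COLORS, TOPICS):
        variant = dict(spec, chart_type=chart_type, color=color, topic=topic)
        variant["values"] = perturb_values(spec["values"], rng)
        yield variant


def to_code(spec):
    """Render a spec as matplotlib source text; nothing is executed here."""
    call = {"line": "plot", "bar": "bar", "scatter": "scatter"}[spec["chart_type"]]
    xs = list(range(len(spec["values"])))
    return (
        "import matplotlib.pyplot as plt\n"
        f"plt.{call}({xs}, {spec['values']}, color={spec['color']!r})\n"
        f"plt.title({spec['topic']!r})\n"
        "plt.savefig('chart.png')\n"
    )


variants = list(augment(BASE_SPEC))
print(len(variants))  # 27 = 3 chart types x 3 colors x 3 topics
print(to_code(variants[0]))
```

Even this toy grid turns one chart into 27, and adding a few more attribute pools multiplies the count quickly, which is consistent with the "hundreds of augmentations" from a single starting chart that Kondic describes.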

The lack of high-quality training data has been a major bottleneck. Unlike humans, vision-language models need to see thousands of examples during training to reliably recognize a line chart or other basic chart types.

Practical applications in finance and science

Data analysis workflows in finance, business reporting, and scientific research all depend on accurate chart interpretation. If models can reliably extract trend descriptions and other information from charts, they can accelerate downstream decision-making.

The researchers plan to expand ChartNet with more complex chart types and will incorporate feedback from the research community. The dataset is open-source and available for training new models.

The research will be presented at the IEEE Computer Vision and Pattern Recognition Conference. The work was funded in part by the MIT-IBM Computing Research Lab.

