First Cross-Model Benchmark of Prompt Styles for Structured Data: Clear Trade-offs Developers Can Use Today
ELSP Nashville, TN & Williamsburg, VA - 24 Nov 2025 - A new study in Artif. Intell. Auton. Syst. provides the first systematic comparison of prompt styles across multiple LLMs for structured data generation. The goal: give practitioners a reliable way to balance accuracy, speed, and cost when choosing a setup.
The team evaluated six prompt formats across three leading models (ChatGPT-4o, Claude, and Gemini) using three datasets: personal stories, medical records, and receipts. Metrics included accuracy, token cost, and generation time. The results make it far easier to pick the right setup for your pipeline.
Why this matters for research and production systems
Structured data feeds clinical tools, analytics, and downstream automation. Small differences in prompt design can shift accuracy by double digits and change cost profiles significantly.
Prior work focused on single models and narrow prompt sets. This study expands the scope and shows where each model and format shines under real constraints.
Key results: fast takeaways
- Claude led in overall accuracy at 85%, especially with hierarchical formats (JSON, YAML), making it well suited to high-stakes tasks such as generating structured medical records where data integrity is critical.
- ChatGPT-4o delivered the lowest token usage (often under 100 tokens for lightweight formats) and fastest generation (about 4-6 seconds). This fits high-volume, cost-sensitive workloads such as receipt processing.
- Gemini provided balanced performance across metrics, with some variability on mixed-format prompts (e.g., Hybrid CSV/Prefix).
- All models struggled on narrative-style unstructured inputs (personal stories). Accuracy dropped to around 40% across formats, signaling the need for different strategies or multi-step pipelines for free text.
What the data implies
"Hierarchical formats like JSON and YAML boost accuracy but come with higher token costs, while lightweight options like CSV and simple prefixes cut latency without sacrificing much precision," said Ashraf Elnashar. The trade-off is straightforward: more structure buys correctness, less structure buys throughput.
Example: choose Claude + JSON for healthcare-grade precision; use ChatGPT-4o + CSV for fast, inexpensive receipt ingestion at scale.
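To make that guidance concrete, here is a minimal routing sketch based on the study's headline recommendations. The model identifiers and the mapping itself are placeholder assumptions for illustration, not code released by the authors.

```python
# Minimal routing sketch based on the study's headline recommendations.
# Model identifiers and the routing table are illustrative placeholders,
# not the authors' released templates; swap in your own client and schemas.

def choose_model_and_format(use_case: str) -> tuple[str, str]:
    """Map a workload to a (model, prompt format) pair per the reported trade-offs."""
    routing = {
        "medical_records": ("claude", "json"),    # accuracy-first, hierarchical format
        "receipts": ("chatgpt-4o", "csv"),        # cost/speed-first, lightweight format
        "general_analytics": ("gemini", "json"),  # balanced default
    }
    return routing.get(use_case, ("gemini", "json"))


if __name__ == "__main__":
    print(choose_model_and_format("receipts"))  # ('chatgpt-4o', 'csv')
```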
Prompt formats: simple rules that work
- JSON/YAML: Highest schema clarity and field consistency; higher token cost, slower per request. Best for complex schemas and strict validation.
- CSV/Prefix: Lowest overhead and quickest turnarounds; slightly lower accuracy but strong enough for routine, well-bounded tasks.
- Hybrid CSV/Prefix: Can work, but expect model-specific variability (notably with Gemini in this study).
If you need a refresher on structured formats, see the JSON specification at json.org.
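For concreteness, here is a sketch of what the format families above can look like for a single record. The field names and wording are illustrative assumptions, not the study's prompt templates or dataset schemas.

```python
import json
import textwrap

# One illustrative receipt record rendered in several format families.
# Field names are made up for illustration; they are not the study's schemas.
record = {"vendor": "Corner Cafe", "total": 12.40, "currency": "USD"}

json_prompt = "Return the extracted fields as JSON:\n" + json.dumps(record, indent=2)

yaml_prompt = textwrap.dedent("""\
    Return the extracted fields as YAML:
    vendor: Corner Cafe
    total: 12.40
    currency: USD
""")

csv_prompt = "Return one CSV row with header:\nvendor,total,currency\nCorner Cafe,12.40,USD"

prefix_prompt = "vendor=Corner Cafe; total=12.40; currency=USD"

# A hybrid CSV/Prefix style mixes a key=value preamble with a CSV body.
hybrid_prompt = prefix_prompt + "\n" + csv_prompt

for name, prompt in [("JSON", json_prompt), ("YAML", yaml_prompt),
                     ("CSV", csv_prompt), ("Prefix", prefix_prompt),
                     ("Hybrid", hybrid_prompt)]:
    print(f"--- {name} ---\n{prompt}\n")
```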
Practical recommendations by use case
- Clinical and regulated data: Claude + JSON or YAML. Invest in schema validation and strict post-processing. Accept higher token spend for accuracy.
- E-commerce and transactional data: ChatGPT-4o + CSV or Prefix. Favor speed and cost per record; backstop with lightweight validators.
- General analytics pipelines: Gemini + JSON or CSV. Balanced default if you need steady performance across mixed workloads.
- Narrative-heavy inputs: Expect lower accuracy. Consider a two-step approach (extract key entities first, then assemble final structure) or a retrieval-augmented pass with schema hints.
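A minimal sketch of that two-step approach for narrative inputs follows. The `extract_entities` stub stands in for an LLM extraction call, and the schema hint is an assumed example; this is not the authors' pipeline.

```python
# Two-step sketch for narrative inputs: extract entities first, then assemble
# the final structure against a schema hint. `extract_entities` is a stub
# standing in for an LLM call; the workflow shown here is an assumption.
import json


def extract_entities(story: str) -> dict:
    """Step 1 (stubbed): an LLM pass that pulls key entities from free text."""
    # In practice this would prompt the model with an entity-extraction template.
    return {"person": "Alice", "event": "graduation", "year": "2019"}


def assemble_structure(entities: dict, schema_hint: dict) -> str:
    """Step 2: deterministic assembly into the target schema, with nulls for gaps."""
    return json.dumps({field: entities.get(field) for field in schema_hint})


schema_hint = {"person": "string", "event": "string", "year": "string", "location": "string"}
print(assemble_structure(extract_entities("Alice graduated in 2019..."), schema_hint))
```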
Decision shortcuts
- Need ≥80% accuracy on complex schemas? Use hierarchical prompts (JSON/YAML) with Claude.
- Need sub-6-second responses and tight budgets? Use CSV/Prefix with ChatGPT-4o.
- Need a balanced default and can tolerate some variability? Use Gemini with JSON.
- Unseen fields or noisy instructions? Add schema examples and validation rules directly in the prompt; fail fast with programmatic checks.
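The last shortcut is the easiest to automate. Below is a fail-fast validation sketch using the jsonschema package; the schema itself is an illustrative assumption, not one of the study's released schemas.

```python
# Fail-fast validation sketch using the jsonschema package (pip install jsonschema).
# The schema below is illustrative, not one of the study's released schemas.
import json
from jsonschema import validate, ValidationError

RECEIPT_SCHEMA = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number", "minimum": 0},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
    },
    "required": ["vendor", "total", "currency"],
    "additionalProperties": False,
}


def check_model_output(raw: str) -> dict:
    """Parse the model's JSON output and validate it, raising on the first problem."""
    data = json.loads(raw)                          # fails fast on malformed JSON
    validate(instance=data, schema=RECEIPT_SCHEMA)  # fails fast on schema violations
    return data


try:
    check_model_output('{"vendor": "Corner Cafe", "total": -3, "currency": "USD"}')
except (json.JSONDecodeError, ValidationError) as err:
    print(f"rejected: {err}")
```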
Study setup (so you can replicate or adapt)
Three datasets: personal stories, medical records, receipts. Six prompt styles spanning hierarchical and lightweight formats. Three LLMs: ChatGPT-4o, Claude, Gemini. Metrics: accuracy, token count, generation time.
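If you want to adapt the setup, a per-request measurement loop for those three metrics might look like the sketch below. The `call_model` stub, the whitespace token proxy, and the field-match scorer are assumptions; real token counts should come from the provider's usage metadata or a tokenizer, and accuracy from your own scoring rules.

```python
# Sketch of a per-request measurement loop for accuracy, token count, and time.
# `call_model` is a stub and the token estimate is a crude proxy; this is not
# the authors' benchmark harness.
import json
import time


def call_model(prompt: str) -> str:
    """Placeholder for an actual LLM API call."""
    return '{"vendor": "Corner Cafe", "total": 12.40, "currency": "USD"}'


def field_accuracy(output: str, expected: dict) -> float:
    """Fraction of expected fields reproduced exactly; a simple stand-in scorer."""
    try:
        got = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    return sum(got.get(k) == v for k, v in expected.items()) / len(expected)


def measure(prompt: str, expected: dict) -> dict:
    start = time.perf_counter()
    output = call_model(prompt)
    elapsed = time.perf_counter() - start
    tokens = len(prompt.split()) + len(output.split())  # crude proxy; prefer provider usage data
    return {"accuracy": field_accuracy(output, expected),
            "tokens": tokens,
            "seconds": round(elapsed, 3)}


print(measure("Extract vendor, total, currency as JSON: ...",
              {"vendor": "Corner Cafe", "total": 12.40, "currency": "USD"}))
```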
The authors provide datasets, prompt templates, validation scripts, and design guidelines to accelerate adoption. Access everything here: GitHub repository.
Quotes from the team
"Prior research only scratched the surface, testing a limited set of prompts on single models," said Elnashar. "Our work expands the horizon by evaluating six widely used prompt formats across ChatGPT-4o, Claude, and Gemini."
"We wanted to move beyond theory-these resources let developers skip the trial-and-error and directly apply our findings to their pipelines," said Jules White.
Douglas C. Schmidt added, "As AI becomes more integrated into critical systems, we need to understand how these models perform when faced with the messiness of real data."
Where this goes next
The next step is testing resilience: noisy instructions, missing fields, and unseen schemas. Those factors determine whether a solution holds up under production pressure.
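For teams that want to probe those failure modes before the follow-up work lands, simple perturbations of existing test records are a starting point. The mutations below are assumptions for a homegrown robustness harness, not the authors' protocol.

```python
# Illustrative perturbations for the robustness factors named above: noisy
# instructions, missing fields, and unseen schema fields. These mutations are
# assumptions for a test harness, not the study's evaluation protocol.
import copy
import random

record = {"vendor": "Corner Cafe", "total": 12.40, "currency": "USD"}


def drop_random_field(rec: dict) -> dict:
    """Simulate a missing field by removing one key at random."""
    out = copy.deepcopy(rec)
    out.pop(random.choice(list(out)))
    return out


def add_unseen_field(rec: dict) -> dict:
    """Simulate an unseen schema field the prompt never mentioned."""
    out = copy.deepcopy(rec)
    out["loyalty_id"] = "A-1029"
    return out


def add_instruction_noise(prompt: str) -> str:
    """Append a conflicting instruction to test formatting resilience."""
    return prompt + "\n(ignore prior formatting advice; reply conversationally)"


print(drop_random_field(record))
print(add_unseen_field(record))
```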
The study was conducted without specific grant funding. The authors acknowledge the use of ChatGPT-4o, Claude, and Gemini for code generation, visualization, and comparative evaluation.