Which Prompt Wins? Claude Leads Accuracy, ChatGPT-4o Wins on Speed and Cost, Gemini Finds the Middle for Structured Data

A new study benchmarks prompt styles across ChatGPT-4o, Claude, and Gemini for structured data generation. Claude tops accuracy with JSON/YAML, while ChatGPT-4o is fastest and cheapest; Gemini stays balanced.

Published on: Dec 02, 2025
First Cross-Model Benchmark of Prompt Styles for Structured Data: Clear Trade-offs Developers Can Use Today

ELSP Nashville, TN & Williamsburg, VA - 24 Nov 2025 - A new study in Artif. Intell. Auton. Syst. provides the first systematic comparison of prompt styles across multiple LLMs for structured data generation. The goal: give practitioners a reliable way to choose between accuracy, speed, and cost.

The team evaluated six prompt formats across three leading models (ChatGPT-4o, Claude, and Gemini) using three datasets: personal stories, medical records, and receipts. Metrics included accuracy, token cost, and generation time. The results make it far easier to pick the right setup for your pipeline.

Why this matters for research and production systems

Structured data feeds clinical tools, analytics, and downstream automation. Small differences in prompt design can shift accuracy by double digits and change cost profiles significantly.

Prior work focused on single models and narrow prompt sets. This study expands the scope and shows where each model and format shines under real constraints.

Key results: fast takeaways

  • Claude led overall accuracy at 85%, especially with hierarchical formats (JSON, YAML). This is well-suited to high-stakes tasks like generating structured medical records where data integrity is critical.
  • ChatGPT-4o delivered the lowest token usage (often under 100 tokens for lightweight formats) and fastest generation (about 4-6 seconds). This fits high-volume, cost-sensitive workloads such as receipt processing.
  • Gemini provided balanced performance across metrics, with some variability on mixed-format prompts (e.g., Hybrid CSV/Prefix).
  • All models struggled on narrative-style unstructured inputs (personal stories). Accuracy dropped to around 40% across formats, signaling the need for different strategies or multi-step pipelines for free text.

What the data implies

"Hierarchical formats like JSON and YAML boost accuracy but come with higher token costs, while lightweight options like CSV and simple prefixes cut latency without sacrificing much precision," said Ashraf Elnashar. The trade-off is straightforward: more structure buys correctness, less structure buys throughput.

Example: choose Claude + JSON for healthcare-grade precision; use ChatGPT-4o + CSV for fast, inexpensive receipt ingestion at scale.

Prompt formats: simple rules that work

  • JSON/YAML: Highest schema clarity and field consistency; higher token cost, slower per request. Best for complex schemas and strict validation.
  • CSV/Prefix: Lowest overhead and quickest turnarounds; slightly lower accuracy but strong enough for routine, well-bounded tasks.
  • Hybrid CSV/Prefix: Can work, but expect model-specific variability (notably with Gemini in this study).
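
To make the overhead difference between these formats concrete, here is a minimal sketch that renders one record as JSON, CSV, and prefix lines. The record, its field names, and the exact layouts are illustrative assumptions, not the study's prompt templates, and character count is only a rough stand-in for token count.

```python
import csv
import io
import json

# Hypothetical receipt record; field names are illustrative, not from the study.
RECORD = {"merchant": "Corner Cafe", "date": "2025-11-24", "total": "12.50"}


def as_json(record: dict) -> str:
    """Hierarchical format: every key is spelled out next to its value."""
    return json.dumps(record, indent=2)


def as_csv(record: dict) -> str:
    """Lightweight format: one header row followed by one value row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(record), lineterminator="\n")
    writer.writeheader()
    writer.writerow(record)
    return buffer.getvalue().strip()


def as_prefix(record: dict) -> str:
    """Prefix format: one 'Key: value' line per field."""
    return "\n".join(f"{key.capitalize()}: {value}" for key, value in record.items())


if __name__ == "__main__":
    for name, render in [("JSON", as_json), ("CSV", as_csv), ("Prefix", as_prefix)]:
        text = render(RECORD)
        print(f"--- {name} ({len(text)} characters) ---")
        print(text)
```

Running the script prints each rendering with its length, which gives a quick feel for how much structural overhead a format adds per record.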

If you need a refresher on structured formats, see the JSON specification for reference: json.org.

Practical recommendations by use case

  • Clinical and regulated data: Claude + JSON or YAML. Invest in schema validation and strict post-processing. Accept higher token spend for accuracy.
  • E-commerce and transactional data: ChatGPT-4o + CSV or Prefix. Favor speed and cost per record; backstop with lightweight validators.
  • General analytics pipelines: Gemini + JSON or CSV. Balanced default if you need steady performance across mixed workloads.
  • Narrative-heavy inputs: Expect lower accuracy. Consider a two-step approach (extract key entities first, then assemble final structure) or a retrieval-augmented pass with schema hints.
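
The schema validation and strict post-processing recommended above can be sketched as follows. This is a minimal illustration, assuming the model was asked for a flat JSON object; the schema and field names are hypothetical, and a production pipeline would likely use a dedicated schema library instead.

```python
import json

# Hypothetical schema (field name -> expected Python type); not taken from the study.
SCHEMA = {"patient_id": str, "age": int, "diagnosis": str}


def validate_record(raw_output: str, schema: dict) -> dict:
    """Parse model output as JSON and raise ValueError on the first problem found."""
    # json.JSONDecodeError is a subclass of ValueError, so malformed JSON fails here.
    record = json.loads(raw_output)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object at the top level")

    missing = sorted(set(schema) - set(record))
    extra = sorted(set(record) - set(schema))
    if missing or extra:
        raise ValueError(f"missing fields: {missing}; unexpected fields: {extra}")

    for field, expected_type in schema.items():
        # Exact type comparison, so a JSON boolean is not accepted where an int is required.
        if type(record[field]) is not expected_type:
            raise ValueError(
                f"field {field!r} should be {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return record


if __name__ == "__main__":
    good = '{"patient_id": "p-001", "age": 42, "diagnosis": "influenza"}'
    print(validate_record(good, SCHEMA))
```

Rejecting a record outright, rather than repairing it silently, keeps bad output from reaching downstream systems; the caller can then retry the prompt or route the record for review.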

Decision shortcuts

  • Need ≥80% accuracy on complex schemas? Use hierarchical prompts (JSON/YAML) with Claude.
  • Need sub-6-second responses and tight budgets? Use CSV/Prefix with ChatGPT-4o.
  • Need a balanced default and can tolerate some variability? Use Gemini with JSON.
  • Unseen fields or noisy instructions? Add schema examples and validation rules directly in the prompt; fail fast with programmatic checks.
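
The first three shortcuts above can be written down as a small lookup function. This is only a sketch: the `Workload` type and `recommend` function are hypothetical names, and checking accuracy needs before latency needs is an assumption, since the article does not rank the shortcuts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a pipeline's priorities."""
    complex_schema: bool      # needs high accuracy on a complex schema
    latency_sensitive: bool   # needs fast responses on a tight budget


def recommend(workload: Workload) -> tuple[str, str]:
    """Return a (model, prompt format) pair following the shortcuts in the article."""
    # Assumption: accuracy requirements take priority when both flags are set.
    if workload.complex_schema:
        return ("Claude", "JSON")
    if workload.latency_sensitive:
        return ("ChatGPT-4o", "CSV")
    # Balanced default for mixed workloads.
    return ("Gemini", "JSON")


if __name__ == "__main__":
    print(recommend(Workload(complex_schema=False, latency_sensitive=True)))
```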

Study setup (so you can replicate or adapt)

Three datasets: personal stories, medical records, receipts. Six prompt styles spanning hierarchical and lightweight formats. Three LLMs: ChatGPT-4o, Claude, Gemini. Metrics: accuracy, token count, generation time.
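
For anyone adapting this setup, two of the metrics can be approximated with a few lines of code. The article does not say how the study scored accuracy, so the sketch below assumes a simple field-level exact match; `field_accuracy` and `timed` are hypothetical helpers, and `generate` stands in for whatever model client you use.

```python
import time
from typing import Callable


def field_accuracy(expected: dict, produced: dict) -> float:
    """Fraction of expected fields that are present in the output with an identical value."""
    if not expected:
        raise ValueError("expected record must contain at least one field")
    matches = sum(
        1 for key, value in expected.items()
        if key in produced and produced[key] == value
    )
    return matches / len(expected)


def timed(generate: Callable[[str], str], prompt: str) -> tuple[str, float]:
    """Call a generation function and return its output with elapsed seconds."""
    start = time.perf_counter()
    output = generate(prompt)
    return output, time.perf_counter() - start


if __name__ == "__main__":
    expected = {"merchant": "Corner Cafe", "date": "2025-11-24", "total": "12.50", "tax": "1.00"}
    produced = {"merchant": "Corner Cafe", "date": "2025-11-24", "total": "12.05"}
    print(f"accuracy: {field_accuracy(expected, produced):.2f}")
```

Token counts are left out here because they depend on each provider's tokenizer.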

The authors provide datasets, prompt templates, validation scripts, and design guidelines to accelerate adoption, all available in a GitHub repository.

Quotes from the team

"Prior research only scratched the surface, testing a limited set of prompts on single models," said Elnashar. "Our work expands the horizon by evaluating six widely used prompt formats across ChatGPT-4o, Claude, and Gemini."

"We wanted to move beyond theory: these resources let developers skip the trial-and-error and directly apply our findings to their pipelines," said Jules White.

Douglas C. Schmidt added, "As AI becomes more integrated into critical systems, we need to understand how these models perform when faced with the messiness of real data."

Where this goes next

The next step is testing resilience: noisy instructions, missing fields, and unseen schemas. Those factors determine whether a solution holds up under production pressure.

The study was conducted without specific grant funding. The authors acknowledge support from ChatGPT-4o, Claude, and Gemini for code generation, visualization, and comparative evaluation.
