First Cross-Model Benchmark of Prompt Styles for Structured Data: Clear Trade-offs Developers Can Use Today
ELSP Nashville, TN & Williamsburg, VA - 24 Nov 2025 - A new study in Artif. Intell. Auton. Syst. provides the first systematic comparison of prompt styles across multiple LLMs for structured data generation. The goal: give practitioners a reliable way to balance accuracy, speed, and cost when choosing a setup.
The team evaluated six prompt formats across three leading models (ChatGPT-4o, Claude, and Gemini) using three datasets: personal stories, medical records, and receipts. Metrics included accuracy, token cost, and generation time. The results make it far easier to pick the right setup for your pipeline.
Why this matters for research and production systems
Structured data feeds clinical tools, analytics, and downstream automation. Small differences in prompt design can shift accuracy by double digits and change cost profiles significantly.
Prior work focused on single models and narrow prompt sets. This study expands the scope and shows where each model and format shines under real constraints.
Key results: fast takeaways
- Claude led in overall accuracy at 85%, especially with hierarchical formats (JSON, YAML), making it well suited to high-stakes tasks such as generating structured medical records where data integrity is critical.
- ChatGPT-4o delivered the lowest token usage (often under 100 tokens for lightweight formats) and fastest generation (about 4-6 seconds). This fits high-volume, cost-sensitive workloads such as receipt processing.
- Gemini provided balanced performance across metrics, with some variability on mixed-format prompts (e.g., Hybrid CSV/Prefix).
- All models struggled on narrative-style unstructured inputs (personal stories). Accuracy dropped to around 40% across formats, signaling the need for different strategies or multi-step pipelines for free text.
What the data implies
"Hierarchical formats like JSON and YAML boost accuracy but come with higher token costs, while lightweight options like CSV and simple prefixes cut latency without sacrificing much precision," said Ashraf Elnashar. The trade-off is straightforward: more structure buys correctness, less structure buys throughput.
Example: choose Claude + JSON for healthcare-grade precision; use ChatGPT-4o + CSV for fast, inexpensive receipt ingestion at scale.
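To make that guidance concrete, here is a minimal routing sketch based on the study's headline recommendations. The model identifiers and the mapping itself are placeholder assumptions for illustration, not code released by the authors.

```python
# Minimal routing sketch based on the study's headline recommendations.
# Model identifiers and the routing table are illustrative placeholders,
# not the authors' released templates; swap in your own client and schemas.

def choose_model_and_format(use_case: str) -> tuple[str, str]:
    """Map a workload to a (model, prompt format) pair per the reported trade-offs."""
    routing = {
        "medical_records": ("claude", "json"),    # accuracy-first, hierarchical format
        "receipts": ("chatgpt-4o", "csv"),        # cost/speed-first, lightweight format
        "general_analytics": ("gemini", "json"),  # balanced default
    }
    return routing.get(use_case, ("gemini", "json"))


if __name__ == "__main__":
    print(choose_model_and_format("receipts"))  # ('chatgpt-4o', 'csv')
```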
Prompt formats: simple rules that work
- JSON/YAML: Highest schema clarity and field consistency; higher token cost, slower per request. Best for complex schemas and strict validation.
- CSV/Prefix: Lowest overhead and quickest turnarounds; slightly lower accuracy but strong enough for routine, well-bounded tasks.
- Hybrid CSV/Prefix: Can work, but expect model-specific variability (notably with Gemini in this study).
If you need a refresher on structured formats, see the JSON specification at json.org.
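For concreteness, here is a sketch of what the format families above can look like for a single record. The field names and wording are illustrative assumptions, not the study's prompt templates or dataset schemas.

```python
import json
import textwrap

# One illustrative receipt record rendered in several format families.
# Field names are made up for illustration; they are not the study's schemas.
record = {"vendor": "Corner Cafe", "total": 12.40, "currency": "USD"}

json_prompt = "Return the extracted fields as JSON:\n" + json.dumps(record, indent=2)

yaml_prompt = textwrap.dedent("""\
    Return the extracted fields as YAML:
    vendor: Corner Cafe
    total: 12.40
    currency: USD
""")

csv_prompt = "Return one CSV row with header:\nvendor,total,currency\nCorner Cafe,12.40,USD"

prefix_prompt = "vendor=Corner Cafe; total=12.40; currency=USD"

# A hybrid CSV/Prefix style mixes a key=value preamble with a CSV body.
hybrid_prompt = prefix_prompt + "\n" + csv_prompt

for name, prompt in [("JSON", json_prompt), ("YAML", yaml_prompt),
                     ("CSV", csv_prompt), ("Prefix", prefix_prompt),
                     ("Hybrid", hybrid_prompt)]:
    print(f"--- {name} ---\n{prompt}\n")
```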
Practical recommendations by use case
- Clinical and regulated data: Claude + JSON or YAML. Invest in schema validation and strict post-processing. Accept higher token spend for accuracy.
- E-commerce and transactional data: ChatGPT-4o + CSV or Prefix. Favor speed and cost per record; backstop with lightweight validators.
- General analytics pipelines: Gemini + JSON or CSV. Balanced default if you need steady performance across mixed workloads.
- Narrative-heavy inputs: Expect lower accuracy. Consider a two-step approach (extract key entities first, then assemble final structure) or a retrieval-augmented pass with schema hints.
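A minimal sketch of that two-step approach for narrative inputs follows. The `extract_entities` stub stands in for an LLM extraction call, and the schema hint is an assumed example; this is not the authors' pipeline.

```python
# Two-step sketch for narrative inputs: extract entities first, then assemble
# the final structure against a schema hint. `extract_entities` is a stub
# standing in for an LLM call; the workflow shown here is an assumption.
import json


def extract_entities(story: str) -> dict:
    """Step 1 (stubbed): an LLM pass that pulls key entities from free text."""
    # In practice this would prompt the model with an entity-extraction template.
    return {"person": "Alice", "event": "graduation", "year": "2019"}


def assemble_structure(entities: dict, schema_hint: dict) -> str:
    """Step 2: deterministic assembly into the target schema, with nulls for gaps."""
    return json.dumps({field: entities.get(field) for field in schema_hint})


schema_hint = {"person": "string", "event": "string", "year": "string", "location": "string"}
print(assemble_structure(extract_entities("Alice graduated in 2019..."), schema_hint))
```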
Decision shortcuts
- Need ≥80% accuracy on complex schemas? Use hierarchical prompts (JSON/YAML) with Claude.
- Need sub-6-second responses and tight budgets? Use CSV/Prefix with ChatGPT-4o.
- Need a balanced default and can tolerate some variability? Use Gemini with JSON.
- Unseen fields or noisy instructions? Add schema examples and validation rules directly in the prompt; fail fast with programmatic checks.
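The last shortcut is the easiest to automate. Below is a fail-fast validation sketch using the jsonschema package; the schema itself is an illustrative assumption, not one of the study's released schemas.

```python
# Fail-fast validation sketch using the jsonschema package (pip install jsonschema).
# The schema below is illustrative, not one of the study's released schemas.
import json
from jsonschema import validate, ValidationError

RECEIPT_SCHEMA = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number", "minimum": 0},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
    },
    "required": ["vendor", "total", "currency"],
    "additionalProperties": False,
}


def check_model_output(raw: str) -> dict:
    """Parse the model's JSON output and validate it, raising on the first problem."""
    data = json.loads(raw)                          # fails fast on malformed JSON
    validate(instance=data, schema=RECEIPT_SCHEMA)  # fails fast on schema violations
    return data


try:
    check_model_output('{"vendor": "Corner Cafe", "total": -3, "currency": "USD"}')
except (json.JSONDecodeError, ValidationError) as err:
    print(f"rejected: {err}")
```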
Study setup (so you can replicate or adapt)
Three datasets: personal stories, medical records, receipts. Six prompt styles spanning hierarchical and lightweight formats. Three LLMs: ChatGPT-4o, Claude, Gemini. Metrics: accuracy, token count, generation time.
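If you want to adapt the setup, a per-request measurement loop for those three metrics might look like the sketch below. The `call_model` stub, the whitespace token proxy, and the field-match scorer are assumptions; real token counts should come from the provider's usage metadata or a tokenizer, and accuracy from your own scoring rules.

```python
# Sketch of a per-request measurement loop for accuracy, token count, and time.
# `call_model` is a stub and the token estimate is a crude proxy; this is not
# the authors' benchmark harness.
import json
import time


def call_model(prompt: str) -> str:
    """Placeholder for an actual LLM API call."""
    return '{"vendor": "Corner Cafe", "total": 12.40, "currency": "USD"}'


def field_accuracy(output: str, expected: dict) -> float:
    """Fraction of expected fields reproduced exactly; a simple stand-in scorer."""
    try:
        got = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    return sum(got.get(k) == v for k, v in expected.items()) / len(expected)


def measure(prompt: str, expected: dict) -> dict:
    start = time.perf_counter()
    output = call_model(prompt)
    elapsed = time.perf_counter() - start
    tokens = len(prompt.split()) + len(output.split())  # crude proxy; prefer provider usage data
    return {"accuracy": field_accuracy(output, expected),
            "tokens": tokens,
            "seconds": round(elapsed, 3)}


print(measure("Extract vendor, total, currency as JSON: ...",
              {"vendor": "Corner Cafe", "total": 12.40, "currency": "USD"}))
```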
The authors provide datasets, prompt templates, validation scripts, and design guidelines to accelerate adoption. Access everything here: GitHub repository.
Quotes from the team
"Prior research only scratched the surface, testing a limited set of prompts on single models," said Elnashar. "Our work expands the horizon by evaluating six widely used prompt formats across ChatGPT-4o, Claude, and Gemini."
"We wanted to move beyond theory-these resources let developers skip the trial-and-error and directly apply our findings to their pipelines," said Jules White.
Douglas C. Schmidt added, "As AI becomes more integrated into critical systems, we need to understand how these models perform when faced with the messiness of real data."
Where this goes next
The next step is testing resilience: noisy instructions, missing fields, and unseen schemas. Those factors determine whether a solution holds up under production pressure.
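For teams that want to probe those failure modes before the follow-up work lands, simple perturbations of existing test records are a starting point. The mutations below are assumptions for a homegrown robustness harness, not the authors' protocol.

```python
# Illustrative perturbations for the robustness factors named above: noisy
# instructions, missing fields, and unseen schema fields. These mutations are
# assumptions for a test harness, not the study's evaluation protocol.
import copy
import random

record = {"vendor": "Corner Cafe", "total": 12.40, "currency": "USD"}


def drop_random_field(rec: dict) -> dict:
    """Simulate a missing field by removing one key at random."""
    out = copy.deepcopy(rec)
    out.pop(random.choice(list(out)))
    return out


def add_unseen_field(rec: dict) -> dict:
    """Simulate an unseen schema field the prompt never mentioned."""
    out = copy.deepcopy(rec)
    out["loyalty_id"] = "A-1029"
    return out


def add_instruction_noise(prompt: str) -> str:
    """Append a conflicting instruction to test formatting resilience."""
    return prompt + "\n(ignore prior formatting advice; reply conversationally)"


print(drop_random_field(record))
print(add_unseen_field(record))
```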
The study was conducted without specific grant funding. The authors acknowledge the use of ChatGPT-4o, Claude, and Gemini for code generation, visualization, and comparative evaluation.