APIEval-20

APIEval-20: a black-box benchmark for API-testing agents. From a JSON schema and one sample payload, agents generate test suites that are executed against live reference APIs with seeded bugs. Objective scoring measures bug detection, API coverage, and efficiency.

About APIEval-20

APIEval-20 is an open benchmark for evaluating AI agents that generate tests for APIs. It uses a black-box setup where each agent receives only a JSON schema and a sample payload, then produces a test suite that is executed against live reference APIs containing planted bugs.
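
The review doesn't show the exact task format, but a minimal sketch of the black-box setup might look like the following. The schema fields, endpoint URL, and test are all illustrative assumptions, not the benchmark's actual artifacts:

```python
import requests

# Illustrative inputs an agent might receive (all names hypothetical):
# a JSON schema for the resource plus a single sample payload.
SCHEMA = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "email": {"type": "string", "format": "email"},
        "age": {"type": "integer", "minimum": 0},
    },
    "required": ["id", "email"],
}
SAMPLE = {"id": 1, "email": "user@example.com", "age": 30}

BASE_URL = "http://localhost:8000/users"  # hypothetical reference API


def test_rejects_negative_age():
    """One boundary test an agent might emit: the schema says
    age >= 0, so a negative value should get a 4xx response."""
    payload = dict(SAMPLE, age=-1)
    resp = requests.post(BASE_URL, json=payload)
    assert 400 <= resp.status_code < 500, resp.status_code
```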

Review

APIEval-20 aims to provide an objective, executable way to measure how well AI agents find real API bugs. Scoring is based on concrete outcomes: whether a generated test actually detects a planted bug. The benchmark covers a range of API behaviors, including authentication, pagination, error handling, schema constraints, and multi-step flows.
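
The benchmark's actual harness isn't shown in this review; the sketch below illustrates one plausible outcome-based scoring rule, assuming a bug counts as detected when the suite passes against a clean reference (no false alarms) but fails against the seeded-bug variant. All names here are hypothetical:

```python
from typing import Callable, Iterable

Test = Callable[[str], None]  # a test takes a base URL and raises on failure


def suite_passes(tests: Iterable[Test], base_url: str) -> bool:
    """True if every test in the suite passes against base_url."""
    for test in tests:
        try:
            test(base_url)
        except AssertionError:
            return False
    return True


def bug_detected(tests: list[Test], clean_url: str, buggy_url: str) -> bool:
    """Assumed detection rule: pass on the clean reference,
    fail on the variant with the seeded bug."""
    return suite_passes(tests, clean_url) and not suite_passes(tests, buggy_url)
```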

Key Features

  • Black-box evaluation: agents receive only the schema and one sample payload, reflecting limited-information scenarios.
  • Executable scoring: tests are run against live reference APIs and bug detection is determined automatically (no subjective judging).
  • Broad task taxonomy: includes auth, errors, pagination, schema constraints, field relationships, and multi-step flows.
  • Open and reproducible: dataset and benchmark are available on Hugging Face for inspection and independent runs.
  • Planned leaderboard and breakdowns: the project intends to publish per-bug-class breakdowns that show strengths and gaps across agents.

Pricing and Value

APIEval-20 is listed as free and is openly hosted on Hugging Face, making it accessible for teams and researchers to run locally or integrate into CI. Its main value is in providing an objective, repeatable benchmark for comparing API-testing agents and tracking improvements in bug detection, coverage, and efficiency without needing access to source code or full documentation.
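
Since the dataset is hosted on Hugging Face, pulling it locally should be a one-liner with the `datasets` library. The review doesn't give the exact dataset ID, so the path below is a placeholder:

```python
from datasets import load_dataset

# Placeholder dataset ID; substitute the real Hugging Face
# path for APIEval-20 before running.
tasks = load_dataset("your-org/APIEval-20")
print(tasks)
```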

Pros

  • Objective, deterministic scoring reduces ambiguity in comparisons between agents.
  • Realistic black-box setup matches many practical testing scenarios where source access is unavailable.
  • Open dataset and executable setup allow teams to validate and independently reproduce reported results.
  • Taxonomy that includes multi-step flows and field relationships helps reveal qualitative differences between agents.

Cons

  • Current v1 scoring is unweighted: all caught bugs count the same, which may underrepresent practical severity differences (a severity-weighted wrapper is sketched after this list).
  • Some usability gaps have been noted (billing/usage reporting and a UI that could be streamlined), which may affect onboarding for new users.
  • Because tests run against reference APIs with planted bugs, additional custom scenarios may be needed to reflect specific production environments.
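
Teams that care about severity can layer their own weighting over the raw per-bug results. A minimal sketch, assuming hypothetical severity labels that are not part of the benchmark itself:

```python
# Hypothetical severity labels and weights layered over APIEval-20's
# unweighted per-bug results; none of this is defined by the benchmark.
WEIGHTS = {"critical": 3.0, "major": 2.0, "minor": 1.0}


def weighted_score(detected: dict[str, bool], severity: dict[str, str]) -> float:
    """Fraction of total severity weight covered by detected bugs.
    `detected` maps bug IDs to hit/miss; `severity` maps bug IDs to labels."""
    total = sum(WEIGHTS[severity[b]] for b in detected)
    caught = sum(WEIGHTS[severity[b]] for b, hit in detected.items() if hit)
    return caught / total if total else 0.0
```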

APIEval-20 is best suited for teams building or evaluating AI-driven API testing agents, QA researchers, and organizations that need an objective comparison of bug detection and coverage. For groups looking for severity-weighted scoring or a turnkey commercial test management product, additional tooling or reporting will likely be needed alongside this benchmark.


