Blackpearl Group released GTM-Bench, a new benchmark that tests how well AI systems perform real sales prospecting tasks, and the first results carry a stark warning: four of six leading general-purpose AI agents produced net negative scores, generating more bad leads than good ones. The benchmark evaluates models from OpenAI, Anthropic, Google, and DeepSeek, and it is designed to measure whether AI creates genuine commercial value rather than flooding CRM systems with low-quality records.
Built from 59,881 real-world prospecting queries, GTM-Bench covers 72 tasks across 11 task types and 15 market categories. It measures what Blackpearl calls buyer and seller coherence: an AI's ability to understand a seller's offering, identify likely buyers, and return prospect records that are both relevant and backed by evidence.
Volume does not equal value
The scoring system is deliberately simple: a good lead earns +1, a bad lead costs -1. This design reflects the real commercial damage of poor prospecting: wasted sales hours, drained budgets, and CRM databases cluttered with unqualified contacts.
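The per-lead rule can be illustrated with a minimal sketch. The function name and the two example agents are hypothetical, not part of GTM-Bench's published code, and the sketch covers only the +1/-1 rule described above. The reported scores are fractional, so the benchmark presumably applies further aggregation that is not described here.

```python
from typing import Iterable


def net_score(lead_labels: Iterable[bool]) -> int:
    """Sum +1 for each good lead and -1 for each bad lead."""
    return sum(1 if is_good else -1 for is_good in lead_labels)


# Hypothetical agents: True marks a good lead, False a bad one.
selective = [True] * 40 + [False] * 10       # few leads, mostly good
high_volume = [True] * 200 + [False] * 900   # many leads, mostly bad

print(net_score(selective))    # 30
print(net_score(high_volume))  # -700
```

Under this rule the high-volume agent finds five times as many good leads as the selective one, yet it ends up with a net negative score because its bad leads outnumber its good ones.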
Nick Lissette, Blackpearl's Chief Executive Officer, said the benchmark was built to shift attention from activity to outcomes. "The AI industry has become obsessed with output. It has spent far less time measuring outcomes. Poor-quality agentic AI doesn't simply fail to find opportunities - it empowers agents to consume budgets, waste sales hours, pollute CRM systems and send organisations chasing customers who were never likely to buy. Put bluntly, the research shows that bad AI may be worse than no AI at all."
In one extreme case, a single AI agent returned 6,342 prospect records for one task. The benchmark's analysis of 432 agent traces found that stronger systems cast a wide net but then narrowed results using evidence, while weaker systems simply returned large volumes with little filtering.
No model dominates every sales environment
Blackpearl's own Pearl Engine RTSA, purpose-built for go-to-market work, recorded a net score of +26,615.6. GPT-5.5 scored +4,040.9 when given access to Blackpearl's proprietary data and +1,015.4 using only public web evidence. But the results were not uniform: GPT-5.5 outperformed the RTSA system in several market categories, including healthcare, recruiting, industrial, and real estate, while Blackpearl's system was weaker in public sector and sustainability tasks.
That variation suggests sales teams cannot assume one AI model will perform best across every segment. The findings point to the need for task-specific testing rather than reliance on general productivity claims.
Proprietary data and vertical design multiply results
The benchmark also tested the impact of proprietary data. GPT-5.5's score improved nearly fourfold when it could draw on Blackpearl's internal go-to-market data instead of relying solely on public web sources. However, that gain was still far below the RTSA system's result in the same environment, indicating that data alone does not close the gap.
"If you add great data to foundational models, you get results that are four times better. But then if you go further and put go-to-market vertical AI on top of that you get a further six times better results. When you combine the two, the results are twenty six times better," Lissette said.
Lissette pointed to a broader pattern of vertical AI systems: specialized agents built on top of foundation models for specific industries, similar to Harvey for legal or Cursor for coding. "Blackpearl is doing the same for go-to-market AI," he said.
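The multipliers quoted above can be checked against the net scores reported in this article. The short calculation below assumes that Lissette's figures correspond to simple ratios of those three scores. That mapping is an inference from the numbers, not something the source states, and the variable names are illustrative.

```python
# Net scores as reported in the article.
gpt_public_web = 1015.4      # GPT-5.5, public web evidence only
gpt_with_data = 4040.9       # GPT-5.5 with Blackpearl's proprietary data
pearl_engine_rtsa = 26615.6  # Blackpearl's Pearl Engine RTSA

# Assumed interpretation: each quoted multiplier is a ratio of two scores.
data_gain = gpt_with_data / gpt_public_web          # effect of proprietary data
vertical_gain = pearl_engine_rtsa / gpt_with_data   # effect of vertical AI on top
combined_gain = pearl_engine_rtsa / gpt_public_web  # both together

print(f"data: {data_gain:.2f}x")          # data: 3.98x
print(f"vertical: {vertical_gain:.2f}x")  # vertical: 6.59x
print(f"combined: {combined_gain:.2f}x")  # combined: 26.21x
```

On this reading, the rounded "four times" and "six times" multiply to 24 rather than 26 only because of rounding: the unrounded ratios of roughly 3.98 and 6.59 multiply to about 26.2.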
Full transparency in methodology
Blackpearl has made the benchmark's methodology, code, tasks, and results fully public. Max Polaczuk, Vice President of AI at Blackpearl, addressed the potential conflict of interest in the company both developing the benchmark and scoring strongly on it. "Our answer is transparency. Every task, every line of evaluation code and every run artifact is public. Anyone can re-run the experiments and challenge the findings. We hope people do, because that's how benchmarks improve."
Why this matters for sales professionals
Sales leaders evaluating AI prospecting tools need to look past flashy output numbers and measure net lead quality. A system that floods a CRM with thousands of unqualified records can drain more resources than it saves. For sales teams already evaluating AI sales tools, this benchmark provides a concrete way to test whether a system adds commercial value, not just activity. Testing models against specific sales tasks, and combining proprietary data with purpose-built vertical AI, can produce results that are an order of magnitude better than generic AI alone. The benchmark's public data offers a starting point for those who want to run their own comparisons.