Gallagher Re warned on 12 June 2026 that current artificial intelligence evaluation methods are unfit for underwriting and that pricing AI risk accurately will require a shift toward failure-focused testing. Without this change, insurers risk pricing uncertainty rather than actual risk, which could inflate premiums and stall market development.
The limits of current benchmarks
In a new report titled "Anthropic's Fourth Way: Why Restricted AI Models Are a Challenge for Insurers," the global reinsurance firm argues that standard benchmarks focus on capability rather than failure. These standardized tests measure performance under controlled conditions, leaving blind spots for ambiguous, real-world inputs. A model scoring highly on fixed tasks can still hallucinate or make inconsistent decisions in deployment.
Ed Pocock, global head of cyber security at Gallagher Re, emphasized the gap between test scores and underwriting needs. "They indicate what a model can do under controlled [conditions], but insurers are concerned with how models fail, how often they fail, and whether those failures could be correlated across a portfolio," Pocock said. This evaluation gap directly affects any insurer weighing AI exposure, including captives considering how to underwrite or retain risks from internal AI deployment.
The threat of concentration risk
The report highlights benchmark contamination as a growing problem, where models are increasingly shaped by the very tests used to evaluate them. This dynamic inflates published scores and reduces their value as a guide to real-world reliability. Furthermore, efforts to reduce failure rates and boost test performance can increase model homogeneity. "This risks erasing useful differentiation between systems and increasing concentration risk," Pocock said.
Concentration risk becomes acute when widely shared foundation models fail. If multiple insureds rely on the same underlying technology, a single flaw could trigger correlated losses across an entire portfolio. The reinsurance market can actively influence which models are deployed through underwriting requirements, pricing signals, and coverage design. Professionals seeking to understand these dynamics can explore broader trends in AI for Insurance to see how underwriting standards adapt to new technologies.
The challenge of restricted models
Gallagher Re also identified restricted-distribution AI as a new, fourth category of frontier model, joining open source, open weight, and proprietary systems. The firm pointed to Anthropic's Mythos model, released under Project Glasswing, which is available only to a vetted group of partners. While the UK AI Security Institute has analyzed Mythos, Gallagher Re argues that insurers need access to independent, third-party evaluations to price risk accurately.
"If a model cannot be independently evaluated, it cannot be meaningfully priced," Pocock said. "Insurers could end up loading for uncertainty rather than reflecting actual risk. That raises costs for everyone and slows the market's development." The firm calls for evaluation methods that test AI systems as they operate, using real-world inputs under adversarial conditions over time. Organizations tracking these shifts in risk management should monitor developments in AI for Finance, where similar demands for transparent, auditable systems are driving market standards.
Why this matters for insurance professionals
Underwriters and risk managers must demand evaluation metrics that measure hallucination rates, decision consistency, and correlated failure potential. Relying on vendor-provided benchmark scores leaves portfolios exposed to hidden, systemic vulnerabilities. As Pocock noted, "Better evaluation gives the market the tools to reward transparency and robustness. Without it, we risk defaulting to scale and brand as proxies for safety, which could amplify the concentration risks we'll need to manage."