Tencent’s ArtifactsBench sets new standard for evaluating creative AI models

Tencent’s ArtifactsBench evaluates AI-generated code on functionality, user experience, and visual quality across 1,800+ creative tasks. Its judgments agree with human rankings 94.4% of the time, a marked improvement over earlier code benchmarks.

Published on: Jul 10, 2025

Tencent Launches ArtifactsBench to Improve Testing of Creative AI Models

Ever asked an AI to create a simple webpage or chart only to get something that technically works but feels off? Buttons in odd places, clashing colors, or clunky animations are common issues. This points to a key challenge: how do we train AI to develop good taste and deliver better user experiences?

Traditional AI code tests focus on whether the code runs correctly, ignoring how it looks and feels. That means code could be functional but still provide a poor interactive experience. Tencent’s new benchmark, ArtifactsBench, aims to change this by evaluating AI-generated code with an eye for visual quality and user interaction.

How ArtifactsBench Works

ArtifactsBench tests AI on over 1,800 creative tasks, including building data visualizations, web apps, and interactive mini-games. The AI generates code, which is then automatically run in a secure sandbox. The system captures screenshots over time to observe animations, button clicks, and dynamic feedback.
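For readers who want a concrete picture of that capture step, here is a minimal sketch of what it could look like. It is not Tencent’s actual harness: it assumes the model’s output is a self-contained HTML artifact and uses Playwright as the headless, sandboxed browser, both of which are illustrative choices.

```python
# Illustrative sketch (not ArtifactsBench's real pipeline): render a model-generated,
# self-contained HTML artifact in a headless browser and capture screenshots at a few
# points in time, so animations and dynamic state changes are visible to the judge.
from pathlib import Path
from playwright.sync_api import sync_playwright

def capture_artifact(html: str, out_dir: str, waits_ms=(0, 1000, 2000)) -> list[str]:
    """Render `html` and save one screenshot after each wait interval."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    shots = []
    with sync_playwright() as p:
        browser = p.chromium.launch()  # headless by default
        page = browser.new_page(viewport={"width": 1280, "height": 720})
        page.set_content(html, wait_until="load")
        for i, wait in enumerate(waits_ms):
            page.wait_for_timeout(wait)          # let animations or fetches play out
            path = out / f"frame_{i}.png"
            page.screenshot(path=str(path))      # timed snapshot for the judge
            shots.append(str(path))
        browser.close()
    return shots
```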

Next, all this information—the original task, the AI’s code, and the screenshots—is passed to a Multimodal Large Language Model (MLLM). This model acts as a judge, scoring the AI's work based on a detailed checklist covering:

  • Functionality
  • User experience
  • Aesthetic quality

This approach ensures evaluation is consistent, thorough, and aligned with real user needs.
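The judging step can likewise be pictured as a single multimodal call. The sketch below is illustrative only: the rubric dimensions come from the checklist above, but the prompt wording, 0–10 scale, model name, and the OpenAI-compatible client are assumptions, not the benchmark’s actual implementation.

```python
# Illustrative sketch of an MLLM-as-judge call (not the benchmark's real prompt or API):
# pass the task, the generated code, and the captured screenshots to a multimodal model
# and ask for a score per checklist dimension.
import base64
import json
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible multimodal endpoint could stand in here

RUBRIC = ["functionality", "user_experience", "aesthetic_quality"]  # from the checklist above

def judge_artifact(task: str, code: str, screenshot_paths: list[str],
                   model: str = "gpt-4o") -> dict:
    """Ask the judge model to score the artifact 0-10 on each rubric dimension."""
    images = []
    for path in screenshot_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        images.append({"type": "image_url",
                       "image_url": {"url": f"data:image/png;base64,{b64}"}})

    prompt = (
        f"Task: {task}\n\nGenerated code:\n{code}\n\n"
        f"Using the attached screenshots, score the result 0-10 on each of {RUBRIC}. "
        "Reply with a JSON object mapping each dimension to its score."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": [{"type": "text", "text": prompt}, *images]}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```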

Does the AI Judge Have Good Taste?

ArtifactsBench’s ratings were compared against WebDev Arena, a platform where humans vote on the best AI-generated creations. The two agreed 94.4% of the time, a major improvement over older automated benchmarks, which reached only about 69.4% consistency with human judgments. Professional developers also agreed with the benchmark’s verdicts more than 90% of the time.

Insights from Testing Top AI Models

Tencent tested more than 30 leading AI models. Google’s Gemini-2.5-Pro and Anthropic’s Claude 4.0-Sonnet topped the leaderboard. A more surprising finding was that generalist AI models often outperformed specialized ones.

For example, Qwen-2.5-Instruct, a general-purpose model, scored higher than Qwen-2.5-coder (code-specific) and Qwen2.5-VL (vision-specialized). The reason is that creating high-quality visual applications requires more than coding or visual skills alone—it demands strong reasoning, precise instruction following, and a natural sense of design.

Why Creatives Should Care

AI is becoming an increasingly valuable tool for designers, developers, and creatives who want to prototype faster or generate ideas. But usability and visual appeal matter just as much as functionality.

By providing a benchmark that reflects these priorities, Tencent’s ArtifactsBench helps push AI development toward creating outputs that users actually want to engage with.

For creatives looking to stay ahead in AI-driven design and development, exploring platforms and training on how these models work can be beneficial. Resources like Complete AI Training’s courses for creative professionals offer practical insights on working effectively with AI tools.

