Tencent Launches ArtifactsBench to Improve Testing of Creative AI Models
Ever asked an AI to create a simple webpage or chart, only to get something that technically works but feels off? Buttons in odd places, clashing colors, and clunky animations are common complaints. This points to a key challenge: how do we train AI to develop good taste and deliver better user experiences?
Traditional AI coding benchmarks check whether code runs correctly and ignore how the result looks and feels, so code can pass while still delivering a poor interactive experience. Tencent’s new benchmark, ArtifactsBench, aims to change this by evaluating AI-generated code for visual quality and user interaction as well.
How ArtifactsBench Works
ArtifactsBench tests models on more than 1,800 creative tasks, including building data visualizations, web apps, and interactive mini-games. The model generates code, which is automatically run in a secure sandbox. The system then captures a series of screenshots over time to record animations, button-click responses, and other dynamic feedback.
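The internals of Tencent’s sandbox aren’t spelled out here, but the render-and-capture step is easy to picture. The sketch below assumes the generated artifact is a single self-contained HTML file and uses Playwright as a stand-in headless sandbox; the file names, viewport size, and timing are illustrative, not ArtifactsBench’s actual setup.

```python
# Minimal sketch: render an AI-generated HTML artifact in a headless
# browser and capture timed screenshots of its changing UI state.
# Playwright here is an assumption, not ArtifactsBench's actual sandbox.
from pathlib import Path

from playwright.sync_api import sync_playwright


def capture_states(html_path: str, out_dir: str,
                   shots: int = 3, interval_s: float = 1.0) -> list[str]:
    """Open the artifact, then screenshot it several times over time."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths: list[str] = []
    with sync_playwright() as pw:
        browser = pw.chromium.launch()  # isolated headless browser
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto(Path(html_path).resolve().as_uri())
        for i in range(shots):
            # Let animations and interactions play out between frames.
            page.wait_for_timeout(interval_s * 1000)
            shot = out / f"state_{i}.png"
            page.screenshot(path=str(shot))
            paths.append(str(shot))
        browser.close()
    return paths
```

Capturing several frames rather than one matters: in a single screenshot, an animation that never starts looks identical to one that finished perfectly.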
Next, the original task, the AI’s code, and the captured screenshots are all passed to a Multimodal Large Language Model (MLLM). Acting as a judge, the MLLM scores the work against a fine-grained checklist covering:
- Functionality
- User experience
- Aesthetic quality
This approach ensures evaluation is consistent, thorough, and aligned with real user needs.
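To make the judging step concrete, here is a rough sketch of how the task, code, and screenshots might be packed into a single request for an MLLM judge. It assumes an OpenAI-style multimodal chat payload; the prompt wording and the `CHECKLIST` constant are illustrative, not the benchmark’s actual rubric.

```python
# Sketch of assembling a judge request: task + code + screenshots in
# one multimodal message. The payload shape follows the common
# OpenAI-style chat format; the rubric text is an assumption.
import base64
from pathlib import Path

CHECKLIST = ["functionality", "user experience", "aesthetic quality"]


def build_judge_messages(task: str, code: str,
                         screenshot_paths: list[str]) -> list[dict]:
    """Pack everything the judge needs into a single user message."""
    rubric = ", ".join(CHECKLIST)
    content = [{
        "type": "text",
        "text": (f"Task: {task}\n\nGenerated code:\n{code}\n\n"
                 f"Score each criterion ({rubric}) from 1 to 10 "
                 f"and justify each score briefly."),
    }]
    for p in screenshot_paths:  # attach each captured UI state as an image
        b64 = base64.b64encode(Path(p).read_bytes()).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    return [{"role": "user", "content": content}]
```

Because the judge sees the rendered states alongside the source, it can penalize code that compiles cleanly but looks or behaves poorly, which is exactly the gap ArtifactsBench is built to close.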
Does the AI Judge Have Good Taste?
ArtifactsBench’s ratings were compared against WebDev Arena, a platform where humans vote on the best AI-generated creations. The two agreed 94.4% of the time, a major improvement over older automated benchmarks, which reached only about 69.4% consistency with human rankings. Professional developers also agreed with the benchmark’s judgments more than 90% of the time.
Insights from Testing Top AI Models
Tencent tested more than 30 leading AI models. Google’s Gemini-2.5-Pro and Anthropic’s Claude 4.0-Sonnet topped the leaderboard. More surprising was that generalist models often outperformed specialized ones.
For example, Qwen-2.5-Instruct, a general-purpose model, scored higher than both Qwen-2.5-Coder (code-specialized) and Qwen2.5-VL (vision-specialized). The likely reason: building high-quality visual applications takes more than coding or visual skill alone; it demands strong reasoning, precise instruction following, and a feel for design.
Why Creatives Should Care
AI is becoming an increasingly valuable tool for designers, developers, and creatives who want to prototype faster or generate ideas. But usability and visual appeal matter just as much as functionality.
By providing a benchmark that reflects these priorities, Tencent’s ArtifactsBench helps push AI development toward creating outputs that users actually want to engage with.
For creatives looking to stay ahead in AI-driven design and development, it helps to understand how these models work. Resources like Complete AI Training’s courses for creative professionals offer practical guidance on working effectively with AI tools.