Why AIs That Ace Exams Still Struggle to Beat Pokémon

Watching AIs bumble through Pokémon on Twitch is a live audit of planning, memory, and tools. The wins are real, but the stalls reveal where agents still trip over tiny details.

Published on: Jan 14, 2026

AI vs. Pokémon: What Twitch Streams Reveal About Real-World Capability

Right now on Twitch, you can watch three flagship systems (GPT 5.2, Claude Opus 4.5, and Gemini 3 Pro) try to beat classic Pokémon games. By human standards, they're slow, overconfident, and frequently lost. That's exactly why these streams are useful. They expose how these models plan, adapt, and execute over long horizons, far better than glossy benchmark charts ever could.

It's not just the model; it's the control stack around it

Two streams can look similar while measuring very different things. Some setups add a tool layer: a "wrapper" that translates visuals into text, routes to puzzle-solvers, or adds custom helpers. That gives the model a huge leg up on tasks it's naturally weak at, like intricate visual reasoning.

Gemini's setup adds more of these helpers, which is why it cleared games that stopped earlier models. Claude's run is closer to bare-bones: fewer tools, more raw model capability on display. The difference matters because it mirrors how we use AI at work: most assistants rely on a stack of tools to browse, code, or call APIs.
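To make that concrete, here is a minimal sketch of what such a wrapper can look like. The helper names (describe_screen, solve_puzzle, call_model) are hypothetical stand-ins, not any stream's actual harness; the point is that the model only ever sees text, and known puzzles can bypass it entirely.

```python
# A "wrapper" layer in miniature: the model never sees raw pixels.
# describe_screen, solve_puzzle, and call_model are hypothetical stand-ins.

def describe_screen(screenshot: bytes) -> str:
    """Hypothetical vision helper: turn pixels into a text description."""
    return "Outside the Viridian gym; a small tree blocks the path east."

def solve_puzzle(description: str) -> str | None:
    """Hypothetical domain helper: return a scripted move for known puzzles."""
    if "boulder" in description.lower():
        return "PUSH_BOULDER_LEFT"
    return None

def call_model(prompt: str) -> str:
    """Stand-in for the LLM call; a real setup would hit an API here."""
    return "PRESS_A"

def next_action(screenshot: bytes) -> str:
    description = describe_screen(screenshot)   # vision -> text translation
    scripted = solve_puzzle(description)        # shortcut for known puzzles
    if scripted is not None:
        return scripted                         # the tool did the heavy lifting
    return call_model(f"Game state: {description}\nChoose one button press.")

print(next_action(b"fake screenshot bytes"))
```

The more work functions like these do, the less the run tells you about the model itself.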

Why six-year-olds breeze through what LLMs bungle

Pokémon is turn-based. No twitch reflexes. The loop is simple: the system gets a screenshot and a short instruction, "thinks," then outputs an action like "press A." One step at a time. Claude Opus 4.5 has logged 500+ hours and around 170,000 steps.

Here's the catch: each step is a new instance that only reads notes left by the previous one, like an amnesiac working from Post-its. These systems know a ton about Pokémon from training data. They just struggle to execute a plan across thousands of steps without forgetting, looping, or mis-seeing a tiny but critical detail.
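A rough sketch of that loop, under the assumption that the only state carried between steps is a notes file; the function and file names here are illustrative, not the actual setup:

```python
# Sketch of the stateless step loop: each "step" is a fresh call that only
# knows what the previous step wrote to a notes file.

from pathlib import Path

NOTES = Path("agent_notes.txt")

def call_model(prompt: str) -> str:
    # Hypothetical stub so the sketch runs; a real agent calls an API here.
    return "PRESS_A | Heading north toward the gym; the tree still blocks it."

def run_step(screenshot: bytes, step_number: int) -> str:
    notes = NOTES.read_text() if NOTES.exists() else ""
    prompt = (
        f"Step {step_number}. Notes from earlier steps:\n{notes}\n"
        "Current screen: <image omitted in this sketch>\n"
        "Reply with one button press and an updated note."
    )
    reply = call_model(prompt)
    action, _, new_note = reply.partition("|")
    NOTES.write_text(new_note.strip())      # the only memory that survives
    return action.strip()

for step in range(3):
    print(run_step(b"fake screenshot", step))
```

If a step writes a bad note, every later step inherits the mistake, which is exactly how multi-hour stalls happen.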

If you build or evaluate agents, that should ring loud alarms. The gap isn't knowledge; it's stable execution over long timelines.

Progress is real, and specific

Claude Opus 4.5 is better at leaving itself useful notes and interpreting what it sees, so it gets further before stalling. Gemini 3 Pro beat Pokémon Blue and then cleared Pokémon Crystal without losing a battle, something Gemini 2.5 Pro couldn't do.

There are still comic failures. Claude reportedly spent four days circling a gym because it didn't realize it needed to cut down a tree to enter. That's not trivia; it's a weak link in perception and planning that will also surface in enterprise workflows if you don't design around it.

Tool stacks turn models into workers

Give a model the right control stack and it can build software, run checks, and manage processes. Claude Code, a layer that lets Claude write and run its own code, was dropped into RollerCoaster Tycoon and reportedly managed a theme park effectively. That looks a lot like autonomous back-office work: multi-step tasks, analytics, decisions, loops.

Expect this pattern to spread. AI with tool access can handle large chunks of knowledge work (software, accounting, legal analysis, design systems) while still struggling with tasks that demand tight real-time reaction.

Humanness, for better and worse

These runs surface familiar quirks. Under pressure (like almost fainting in battle), reasoning quality drops. That mirrors how humans think under stress, and it matters when deploying agents into high-stakes workflows.

There's also a spark of personality. After beating Pokémon Blue, Gemini 3 Pro left a note to "go home and talk to Mom one last time." Unnecessary, but oddly fitting: an echo of how we close loops.

What this means for builders, researchers, and tech leaders

  • Judge the stack, not just the model. Ask what tools, translations, and shortcuts are doing the heavy lifting.
  • Design for long horizons. Memory, state, and retry logic matter more than clever prompts.
  • Instrument everything. Log step-level decisions, errors, and loops so you can patch systemic failures fast (see the logging sketch after this list).
  • Treat perception as a first-class problem. Small misreads create multi-hour stalls.
  • Stress-test under pressure. Reasoning often degrades when the "stakes" feel high, even in games.
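On the instrumentation point, here is a minimal step-level logging sketch with a naive loop check. The field names and the repetition threshold are arbitrary choices for illustration, not a standard schema.

```python
# Step-level logging with a crude "am I looping?" flag.
# Fields and the threshold of 3 repeats are illustrative choices.

import json
import time
from collections import deque

class StepLogger:
    def __init__(self, path: str = "agent_steps.jsonl", window: int = 20):
        self.path = path
        self.recent = deque(maxlen=window)   # short history for loop detection

    def log(self, step: int, observation: str, action: str, error: str | None = None) -> dict:
        record = {
            "ts": time.time(),
            "step": step,
            "observation": observation,
            "action": action,
            "error": error,
            # Flag if this exact observation/action pair has recurred recently.
            "possible_loop": self.recent.count((observation, action)) >= 3,
        }
        self.recent.append((observation, action))
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
        return record

logger = StepLogger()
for step in range(5):
    # Repeating the same observation/action pair eventually trips the loop flag.
    print(logger.log(step, "outside gym, tree blocking door", "PRESS_UP"))
```

Even a log this simple would surface a four-day gym-circling stall within minutes.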

Why Pokémon is a clean testbed

It's culturally familiar, turn-based, and slow enough to analyze decision traces step by step. That makes it a practical benchmark for agency, not just knowledge recall. If you want a deeper background on how these models reason, start with the basics of a large language model, then look at how tool use changes behavior.

For context on the game itself, see Pokémon Red and Blue. It highlights how "simple" mechanics still demand precise multi-step planning.

If you're upskilling your team

Focus training on orchestration: memory design, tool routing, evaluation harnesses (without overfitting), and safety. Those are the levers that convert raw model IQ into consistent output.

If you want structured paths by role or skill, explore curated options here: AI courses by job.

Bottom line

Watching AI fumble through Pokémon is more than entertainment. It's a live audit of where agentic systems shine, where they stall, and what it takes to turn raw intelligence into reliable work. Study the failures. That's where the leverage is.

