Why AI flubs four-digit multiplication, and how a simple training tweak fixes it

Transformers stay below 1% accuracy on 4-digit multiplication even with more layers. Train with intermediate steps, or add a tiny auxiliary loss, and they reach 99-100% by learning to store and retrieve intermediate values.

Published on: Dec 30, 2025

Why Transformers Still Miss Four-Digit Multiplication, and How to Fix It

Large language models write code and pass reasoning benchmarks. Yet give them 4-digit multiplication and, without the right training signal, they miss almost every time. A new arXiv preprint from collaborators at the University of Chicago, MIT, Harvard, Waterloo, and Google DeepMind reverse-engineers the failure and shows simple ways to reach near-perfect accuracy.

The punchline: standard fine-tuning plateaus under 1% accuracy across 2-12 layers. A targeted training objective or an implicit chain-of-thought approach unlocks 99-100% on the same task.

The core issue: long-range dependencies

Multi-digit multiplication needs memory. You compute partial products, carry digits, and reuse intermediate values several steps later. Off-the-shelf transformers don't reliably store and retrieve those intermediate states during generation.
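To make the memory demand concrete, here is a minimal Python sketch of the schoolbook algorithm (the function name and trace format are illustrative, not taken from the paper). Note how many intermediate values exist at once and how much later they are consumed; these are the quantities a transformer must hold in its hidden states while generating the answer digit by digit.

```python
def schoolbook_multiply(a: int, b: int):
    """Schoolbook multiplication, tracking the intermediate state
    (partial products, running sums, carries) that a model would have
    to remember across many generation steps."""
    a_digits = [int(d) for d in str(a)][::-1]  # least-significant digit first
    b_digits = [int(d) for d in str(b)][::-1]

    running = [0] * (len(a_digits) + len(b_digits))  # running column sums
    trace = []                                       # intermediate snapshots

    for i, da in enumerate(a_digits):
        for j, db in enumerate(b_digits):
            running[i + j] += da * db   # partial product for this digit pair
        trace.append(list(running))     # snapshot of the running sums

    carry = 0
    for k in range(len(running)):       # propagate carries column by column
        total = running[k] + carry
        running[k], carry = total % 10, total // 10

    result = int("".join(str(d) for d in reversed(running)))
    return result, trace


result, trace = schoolbook_multiply(4738, 9215)
assert result == 4738 * 9215
for snapshot in trace:
    print(snapshot)  # values computed early but needed several steps later
```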

Under standard training, models settle into local optima: pattern-matching shortcuts that score well on easy cases but break when carry propagation and multi-step interaction matter. More layers and more data don't push them out of that trap.

What failed under standard fine-tuning

  • Across 2 to 12 layers, accuracy on 4-digit × 4-digit stayed below 1%.
  • Probing couldn't decode meaningful intermediate values (e.g., running sums) from hidden states.
  • Attention showed no stable routing to store and later retrieve partial products.

What worked: Implicit Chain of Thought (ICoT)

ICoT reaches 100% accuracy by training with intermediate reasoning steps early, then gradually removing them so the model internalizes the process; a schematic example of this schedule follows the list below. Two key differences emerged under analysis:

  • Decodable memory: Probes could read intermediate values (running sums, carries) from hidden states, confirming the model learned to track long-range dependencies.
  • Structured attention over time: Early layers compute digit-pair products and store them at consistent positions; later layers fetch exactly what's needed per output digit.
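To illustrate the schedule mentioned above, here is a schematic of how ICoT-style training data might be staged. The string format, function name, and removal schedule are invented for illustration; the paper's setup will differ in its details, but the idea is the same: spell out partial products early, then drop them so the model has to compute them internally.

```python
def make_icot_example(a: int, b: int, frac_removed: float) -> str:
    """Build one training string. Early in training frac_removed is 0.0
    and every partial product is written out; later it rises toward 1.0
    and the explicit steps are dropped, forcing the model to carry the
    intermediate values in its hidden states instead."""
    b_digits = [int(d) for d in str(b)][::-1]  # least-significant digit first
    steps = [f"{a}x{db}x10^{j}={a * db * 10 ** j}" for j, db in enumerate(b_digits)]

    n_removed = round(frac_removed * len(steps))
    kept = steps[n_removed:]            # remove explicit steps from the front
    chain = " ; ".join(kept)
    return f"{a}*{b}: {chain} => {a * b}"


# Stage 0: the full chain of intermediate steps is visible.
print(make_icot_example(4738, 9215, frac_removed=0.0))
# Final stage: no intermediate steps; the answer must be produced directly.
print(make_icot_example(4738, 9215, frac_removed=1.0))
```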

The team also found elegant internal representations: digits encoded in Fourier-like bases, with digit-pair multiplication organized by a geometric operation related to the Minkowski sum. This wasn't hand-engineered; it emerged during training.
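As a rough picture of what a "Fourier-like" digit code looks like, each digit can be placed on a set of circles, one (cos, sin) pair per frequency. The frequencies and basis below are assumptions drawn from earlier interpretability work on modular arithmetic, not the representation reported in this paper.

```python
import numpy as np

def fourier_digit_code(d: int, freqs=(1, 2, 5)) -> np.ndarray:
    """Encode a base-10 digit as (cos, sin) pairs at a few frequencies.
    Illustrative only: the frequencies and basis are assumptions, not
    the learned representation described in the paper."""
    angles = 2 * np.pi * d * np.asarray(freqs) / 10.0
    return np.concatenate([np.cos(angles), np.sin(angles)])

codes = np.stack([fourier_digit_code(d) for d in range(10)])
print(codes.shape)  # (10, 6): each digit is a distinct point on three circles
```

In a code like this, combining two digits amounts to combining angles and frequencies, which is the flavor of geometric operation the paragraph above refers to.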

A simple fix without explicit CoT

If the blocker is missing guidance on intermediate values, provide it. Adding a small auxiliary objective that teaches the model to track running sums at each step lifted a 2-layer transformer from failure to 99% accuracy, with no explicit chain-of-thought tokens required. A minimal sketch of such an objective follows the list below.

  • Attention maps showed storage and retrieval of partial products, similar to ICoT's mechanism.
  • The model learned to track multiple digit pairs in parallel.
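Here is a minimal PyTorch sketch of what such an auxiliary objective could look like; the class name, target choice (one running-sum digit per position), and loss weight are assumptions for illustration, not the paper's exact recipe.

```python
import torch.nn as nn
import torch.nn.functional as F


class AuxiliaryRunningSumLoss(nn.Module):
    """Combine the usual next-token loss with a small auxiliary loss
    that makes hidden states predict the current running-sum digit."""

    def __init__(self, d_model: int, n_digit_classes: int = 10, weight: float = 0.1):
        super().__init__()
        self.probe_head = nn.Linear(d_model, n_digit_classes)  # tiny extra head
        self.weight = weight

    def forward(self, lm_logits, lm_targets, hidden_states, running_sum_targets):
        # Standard language-modeling loss on the output digits.
        lm_loss = F.cross_entropy(
            lm_logits.reshape(-1, lm_logits.size(-1)), lm_targets.reshape(-1)
        )
        # Auxiliary loss: hidden states must encode the running-sum digit.
        aux_logits = self.probe_head(hidden_states)
        aux_loss = F.cross_entropy(
            aux_logits.reshape(-1, aux_logits.size(-1)),
            running_sum_targets.reshape(-1),
        )
        return lm_loss + self.weight * aux_loss
```

The probe head adds only a single linear layer and can be discarded at inference, so the extra supervision costs almost nothing outside of training.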

Why this matters for your research

  • Scaling is not a substitute for mechanism. Tasks with strong long-range dependencies need training signals that create persistent memory and retrieval paths.
  • Auxiliary objectives beat brute force. Small, well-chosen targets (e.g., running sums, carries) can unlock capabilities that depth and data fail to produce.
  • Probe, don't guess. Hidden-state probes and attention analysis reveal whether the model learned the algorithm or is pattern-matching.
  • General beyond arithmetic. Long-range dependency pitfalls appear in language modeling, code generation, and sequential decision tasks.

Practical steps you can try

  • Add auxiliary losses for intermediate quantities (running sums, carries, partial products) during training.
  • Use curriculum or ICoT-style training: start with explicit steps; progressively remove them to force internalization.
  • Probe hidden states for decodable intermediates; track whether they remain stable across layers and positions (see the probe sketch after this list).
  • Inspect attention heads for time-consistent storage and retrieval locations tied to each output digit.
  • Evaluate on "jagged frontier" tasks: simple rules with long chains (multi-digit arithmetic, algorithmic reasoning) to detect local-optimum behavior early.
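For the probing step, a linear probe is usually enough to tell whether an intermediate quantity is decodable. The sketch below uses scikit-learn on stand-in data; the variable names and synthetic arrays are placeholders for real activations and labels extracted from your model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def probe_hidden_states(hidden_states: np.ndarray, intermediate_labels: np.ndarray) -> float:
    """Fit a linear probe to test whether an intermediate quantity
    (e.g., the current carry or running-sum digit) is decodable from a
    layer's hidden states. hidden_states: (n_examples, d_model);
    intermediate_labels: (n_examples,) integer labels. Held-out accuracy
    well above chance suggests the model is tracking the quantity."""
    X_train, X_test, y_train, y_test = train_test_split(
        hidden_states, intermediate_labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)
    return probe.score(X_test, y_test)


# Synthetic stand-in data: random states give ~10% accuracy (chance level).
rng = np.random.default_rng(0)
fake_states = rng.normal(size=(512, 256))
fake_labels = rng.integers(0, 10, size=512)
print(probe_hidden_states(fake_states, fake_labels))
```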

The broader takeaway

Architectural constraints and training objectives determine whether a model can learn processes, not just answers. As one of the study leads put it, we need to chart how these systems learn if we expect them to support critical decisions. The right guidance transforms failure cases into solved problems without bloating models.

Citation

Preprint: Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls. arXiv:2510.00184

Want deeper training on methods that actually transfer?

If you're building internal upskilling around model training, evaluation, and prompting, see curated curricula by skill at Complete AI Training.

