AI's next race: low-energy intelligence beyond transformers
The next phase of AI competition won't be won by whoever scales transformers the fastest. It will be won by whoever matches today's capabilities with a fraction of the energy use. That's the challenge laid out by Jennifer Chayes, dean of the College of Computing, Data Science, and Society at UC Berkeley, in a recent interview in Hong Kong.
In simple terms, a transformer is the core architecture that learns patterns from huge datasets by weighing relationships between tokens. A large language model (LLM) is a transformer trained at scale to generate and reason with human-like text; think ChatGPT or DeepSeek. For a quick primer, see this explainer on the transformer architecture.
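For readers who want to see "weighing relationships between tokens" concretely, here is a minimal single-head self-attention sketch in NumPy. The shapes, weight matrices, and toy inputs are illustrative assumptions, not any production model's configuration.

```python
# Minimal single-head self-attention sketch (illustrative, not any production model).
# Each token is a vector; attention scores decide how much each token "weighs"
# every other token when building its updated representation.
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model) token embeddings; Wq/Wk/Wv: learned projection matrices."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv          # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])   # pairwise token-to-token relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over tokens
    return weights @ v                        # weighted mix of value vectors

# Toy usage: 4 tokens, 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)   # (4, 8)
```

A full transformer stacks many such attention layers with feed-forward blocks, which is where most of the compute, and the energy, goes.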
Why the architecture question matters
"I would like to see alternatives to the transformer model to give us this kind of thinking without the high energy use that we have now," Chayes said. The hard part: the bottleneck is mathematical. No one has the full blueprint yet, and the risk is real-years of research could lead nowhere, which makes many experts hesitate to commit fully.
The stakes are not abstract. Training and inference budgets are colliding with climate and grid constraints. The field needs models that retain reasoning strength while cutting electricity bills and data center load.
Distillation is a practical lever: DeepSeek showed how
Chayes highlighted China's DeepSeek for leaning on "knowledge distillation" to train efficiently. In plain language, distillation is like a student querying a seasoned teacher until the student can match the teacher's performance. It's widely used, and it works. If you want a quick overview, here's a reference on knowledge distillation.
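To make the student-teacher idea concrete, here is a minimal sketch of a standard distillation loss in PyTorch: the student is pushed toward the teacher's temperature-softened output distribution alongside the usual hard-label loss. The temperature, mixing weight, and toy tensors are illustrative assumptions, not DeepSeek's actual recipe.

```python
# Minimal knowledge-distillation step (illustrative; not DeepSeek's pipeline).
# The student learns to match the teacher's softened output distribution,
# plus the usual hard-label cross-entropy.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft_targets = F.softmax(teacher_logits / T, dim=-1)       # teacher's soft labels
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    # KL term is scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)                # hard-label term
    return alpha * kd + (1 - alpha) * ce

# Toy usage: batch of 4 examples over a 10-way output.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

The appeal is that the student can be far smaller and cheaper to run than the teacher while retaining much of its behavior.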
DeepSeek-R1 reportedly trained on distilled data from Alibaba's Qwen and Meta's Llama models, landing a total training cost near US$5.58 million, about 1.1% of the estimated US$500 million for Llama 3.1. That gap is a signal: efficiency is now a competitive edge, not just a nice-to-have.
Science domains need more than data: they need design
Distillation isn't only for mainstream LLMs. Chayes noted it's increasingly useful in scientific research, including her own work in chemistry and materials, where data is sparse and you can't generate enough clean lab results to build a massive "foundation chemistry model."
The path forward: combine learning with experiment loops and thoughtful post-training. How you distill, fine-tune, and integrate measurements matters as much as raw dataset size.
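As a rough illustration of what such a loop can look like, here is a schematic "train, propose, measure" cycle in Python. The function names (run_experiment, fit_surrogate, acquisition) are placeholders for a lab's real instruments, surrogate model, and acquisition rule, not a specific published pipeline.

```python
# Schematic closed-loop "learn, propose, measure" cycle for sparse-data science.
# run_experiment, fit_surrogate, and acquisition are placeholders a real setup
# would replace with a lab instrument, a (possibly distilled) property predictor,
# and an uncertainty- or value-based scoring rule.
def active_learning_loop(candidates, run_experiment, fit_surrogate, acquisition,
                         rounds=5, batch=8):
    measured = {}                                    # candidate -> measured property
    for _ in range(rounds):
        model = fit_surrogate(measured)              # (re)train on all measurements so far
        pool = [c for c in candidates if c not in measured]
        pool.sort(key=lambda c: acquisition(model, c), reverse=True)
        for c in pool[:batch]:                       # most informative candidates first
            measured[c] = run_experiment(c)          # real measurement closes the loop
    return fit_surrogate(measured), measured
```

The design point is that each round of measurements feeds the next round of training, so a small, carefully chosen dataset can substitute for the massive corpora mainstream LLMs rely on.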
US-China dynamics: pressure, compute access, and talent
On export controls, Chayes said restrictions haven't obviously slowed Chinese researchers. If anything, constraints may push teams to innovate on efficiency under tight budgets and hardware limits. She also observed that some Chinese institutions, including Tsinghua University, report stronger access to compute than many US universities.
Meanwhile, American universities face a different headwind: talent outflow to industry. That leaves faculty competing with tech companies for people and resources, an uneven match.
Policy moves around Nvidia chips
Policy keeps shifting. Reuters reported on January 30 that China had been given approval to buy Nvidia's H200 AI chips, though Nvidia's CEO said the company had not received confirmation and Beijing was still finalizing the license. The US had earlier banned exports of the less capable H20 chips to China in April 2025, then reversed course in July 2025. As of this writing, Washington permits H200 exports to China.
A new Shaw Prize category for computer science
Before joining Berkeley in 2020, Chayes spent over two decades at Microsoft, founding the Theory Group at Microsoft Research and later establishing the New England and New York City labs. She has chaired the Turing Award selection committees.
She will chair the selection committee for the Shaw Prize's new computer science award, expanding a prize historically focused on mathematics, astronomy, and life science. The committee includes John Hennessy (Alphabet), Yann LeCun (Meta), and Harry Shum (formerly Microsoft AI and Research).
Chayes emphasized the committee's deep familiarity with AI progress in China. She worked with Kai-Fu Lee to open Microsoft Research Asia in Beijing in 1997-98 and has mentored many researchers across mainland labs for over two decades. Her view is clear: researchers from China often bring intense dedication and long hours, traits she appreciates.
The first computer science laureate will be named in spring 2027, based purely on scientific merit. Potential candidates span mainland China, Europe, and the US, including globally mobile scholars who have trained across regions.
She also mentioned long-standing ties with figures such as Ya-Qin Zhang (Tsinghua; former Baidu president) and Fei-Fei Li, with whom she serves on California's AI expert panel.
What this means for builders, researchers, and leaders
- Make energy use a core metric. Track efficiency per training step, per token, and per query, not just headline accuracy (a rough measurement sketch follows this list).
- Use distillation to shrink costs. Start with strong teachers, filter for quality, and validate student behavior under stress tests.
- In sparse-data fields, integrate experiment loops. Plan what to measure, when to fine-tune, and how to close the gap between simulation and the lab.
- Expect architecture shifts. Keep an eye on alternatives to transformers that promise similar reasoning with far lower energy demands.
- Factor policy into roadmaps. Access to specific accelerators may change on short notice; design for portability and efficiency.
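As a starting point for the first bullet, here is a rough sketch of measuring joules per generated token by sampling GPU power via NVML (through the pynvml package) while an inference call runs. The generate_fn callable and its token count are assumptions standing in for your own serving stack; production metering would also need per-device aggregation and idle-power baselines.

```python
# Rough energy-per-token meter: samples GPU power via NVML while generate_fn runs,
# then divides the estimated joules by the number of tokens produced.
# Assumes the pynvml package and a single NVIDIA GPU at index 0.
import time
import threading
import pynvml

def energy_per_token(generate_fn, prompt, interval_s=0.05):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    samples, stop = [], threading.Event()

    def sampler():
        while not stop.is_set():
            samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)  # milliwatts -> watts
            time.sleep(interval_s)

    t = threading.Thread(target=sampler)
    t.start()
    start = time.time()
    tokens = generate_fn(prompt)             # placeholder: returns number of tokens generated
    elapsed = time.time() - start
    stop.set(); t.join()
    pynvml.nvmlShutdown()

    avg_watts = sum(samples) / max(len(samples), 1)
    joules = avg_watts * elapsed              # average power x wall-clock time
    return joules / max(tokens, 1)            # joules per generated token
```

Logged next to accuracy and latency, a number like this turns "energy as a first-class constraint" from a slogan into a dashboard metric.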
The takeaway is simple: scale won the last round. Efficiency will decide the next one. The teams that treat energy as a first-class constraint and rethink the model stack accordingly will set the pace.