Titans + MIRAS: Helping AI have long-term memory
AI models break down on very long inputs because traditional attention scales quadratically with sequence length. Titans and the MIRAS framework attack that head-on by updating a model's core memory while it's running - no offline retraining loop, no static state that forgets the details.
Think of Titans as the tool and MIRAS as the blueprint. Together they combine RNN-like speed with transformer-level accuracy by learning what to keep (and what to drop) in real time.
Why long context is hard
Transformers compare every token to every other token. That gets expensive fast. Linear RNNs and state space models compress everything into a fixed-size state, which is efficient but loses nuance in very long streams like full documents, DNA, or multi-hour logs.
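A toy way to see the tradeoff (not either paper's code): full attention has to materialize a score for every token pair, so its cost grows with the square of the length, while a fixed-state recurrence carries the same small summary no matter how long the stream gets.

```python
import numpy as np

n, d = 4096, 64                      # sequence length, feature dimension
x = np.random.randn(n, d)

# Full attention: pairwise scores grow quadratically with length.
scores = x @ x.T                     # shape (n, n) -> ~16.7M entries for n = 4096

# Fixed-state recurrence: memory stays (d, d) no matter how long the stream is.
state = np.zeros((d, d))
for t in range(n):
    state += np.outer(x[t], x[t])    # compress everything into a d x d summary

print(scores.shape, state.shape)     # (4096, 4096) vs (64, 64)
```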
We need something that keeps the details that matter, scales linearly, and adapts on the fly. That's the gap Titans and MIRAS fill.
Titans in one line
Titans adds a neural long-term memory module - a deep multilayer perceptron - that summarizes the past and feeds that summary back into the context before attention runs. Attention can use it or ignore it. The key is that this memory is learned and updated during inference.
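Here is a minimal sketch of that idea, loosely in the spirit of the memory-as-context setup: an assumed NeuralMemory MLP is queried with the current segment, and its output is prepended to the tokens that attention sees. Module names and sizes are illustrative, not Titans' actual implementation.

```python
import torch
import torch.nn as nn

class NeuralMemory(nn.Module):
    """A deep MLP that maps a query to a retrieved summary of the past (hypothetical sketch)."""
    def __init__(self, dim: int, depth: int = 2):
        super().__init__()
        layers = []
        for _ in range(depth):
            layers += [nn.Linear(dim, dim), nn.SiLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, q: torch.Tensor) -> torch.Tensor:
        return self.net(q)            # retrieved memory summary

dim = 256
memory = NeuralMemory(dim)
attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

chunk = torch.randn(1, 128, dim)          # current segment of tokens
summary = memory(chunk)                   # read memory, using the segment as the query
ctx = torch.cat([summary, chunk], dim=1)  # memory tokens + current tokens
out, _ = attn(ctx, ctx, ctx)              # attention can use the summary or ignore it
```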
Surprise-driven updates (what gets stored, and why)
Titans uses a "surprise" signal based on the model's internal error (the gradient of the memory's loss on the incoming token). If the next token is expected, nothing major changes. If the input conflicts with what the model believes, the surprise is high and the memory updates immediately.
Two refinements make this practical: momentum, which carries recent surprise forward so related tokens are captured together, and adaptive forgetting (a form of weight decay) that clears space as sequences grow extremely long.
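A minimal sketch of what one such test-time update could look like, assuming a key/value association per token and fixed coefficients (lr, beta, decay) where the paper uses learned, data-dependent gates; the helper name update_memory is ours, not the paper's.

```python
import torch

def update_memory(memory, key, value, momentum, lr=0.1, beta=0.9, decay=0.05):
    """One test-time step of a surprise-driven memory update (illustrative sketch).

    memory:   the long-term memory module (e.g. a small MLP), updated in place
    key/value: the association the memory should store for this token
    momentum: running "surprise" carried over from recent tokens
    """
    # Surprise = gradient of the memory's prediction error on the new input.
    pred = memory(key)
    loss = torch.nn.functional.mse_loss(pred, value)
    grads = torch.autograd.grad(loss, list(memory.parameters()))

    new_momentum = []
    with torch.no_grad():
        for p, g, m in zip(memory.parameters(), grads, momentum):
            m = beta * m - lr * g          # momentum: carry recent surprise forward
            p.mul_(1.0 - decay)            # adaptive forgetting via weight decay
            p.add_(m)                      # write the surprising information
            new_momentum.append(m)
    return new_momentum

# Usage sketch:
# memory = NeuralMemory(dim)  # e.g. the MLP from the previous snippet
# momentum = [torch.zeros_like(p) for p in memory.parameters()]
# momentum = update_memory(memory, key_t, value_t, momentum)
```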
MIRAS: the general blueprint
MIRAS reframes sequence models as associative memories. Different architectures are just different answers to the same question: how do we combine fresh input with prior state without losing essential information? Every design comes down to four choices (sketched in code after the list):
- Memory architecture: What holds information (vector, matrix, or a deep network like Titans).
- Attentional bias: The internal objective that decides what is prioritized.
- Retention gate: The regularizer that balances new learning against keeping what matters.
- Memory algorithm: The optimizer used to update memory.
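To make the blueprint concrete, here is a toy instance of the four choices: a single matrix as the memory, squared error as the attentional bias, a Euclidean pull toward the previous state as the retention gate, and one gradient step as the memory algorithm. Real MIRAS variants swap in different choices at each slot; this is a sketch, not the paper's code.

```python
import torch

# Memory architecture: here, a single matrix M mapping keys to values.
d = 16
M = torch.zeros(d, d, requires_grad=True)
M_prev = M.detach().clone()

# Attentional bias: the objective deciding what gets prioritized (squared error here).
def attentional_bias(M, k, v):
    return ((M @ k - v) ** 2).sum()

# Retention gate: the regularizer balancing new learning against the old state.
def retention_gate(M, M_prev, lam=0.5):
    return lam * ((M - M_prev) ** 2).sum()

# Memory algorithm: the optimizer used to apply the update (one gradient step here).
def miras_step(M, M_prev, k, v, lr=0.1):
    loss = attentional_bias(M, k, v) + retention_gate(M, M_prev)
    (grad,) = torch.autograd.grad(loss, M)
    with torch.no_grad():
        M -= lr * grad
    return M.detach().clone()          # becomes M_prev for the next token

k, v = torch.randn(d), torch.randn(d)
M_prev = miras_step(M, M_prev, k, v)
```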
Moving beyond one-size-fits-all losses
Most models lean on mean squared error or dot-product similarity for both the attentional bias and the retention gate. That can overreact to outliers and limit flexibility. MIRAS opens a broader design space: non-Euclidean objectives, different regularizers, and more controlled updates.
Three MIRAS models (attention-free)
- YAAD: Uses Huber-style penalties to be less sensitive to outliers, like random typos or noisy spikes (see the sketch after this list).
- MONETA: Tests stricter generalized norms for prioritization and forgetting to stabilize long-term memory behavior.
- MEMORA: Constrains memory to act like a probability map, making each update balanced and predictable.
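To see why the Huber-style choice matters, this small comparison (just the loss behavior, not YAAD's actual update rule) shows how a single noisy spike dominates squared error but is tempered under a Huber penalty:

```python
import torch
import torch.nn.functional as F

# Squared error reacts quadratically to outliers; Huber caps their influence.
pred   = torch.tensor([0.1, 0.0, 5.0])      # one noisy spike in the last slot
target = torch.tensor([0.0, 0.0, 0.0])

mse_per_elem   = F.mse_loss(pred, target, reduction="none")
huber_per_elem = F.huber_loss(pred, target, reduction="none", delta=1.0)

print(mse_per_elem)     # tensor([0.0100, 0.0000, 25.0000])  -> the spike dominates
print(huber_per_elem)   # tensor([0.0050, 0.0000,  4.5000])  -> the spike is tempered
```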
What the experiments showed
Across language modeling datasets (C4, WikiText) and zero-shot reasoning benchmarks (HellaSwag, PIQA), Titans and the MIRAS variants beat strong baselines of comparable size, including Transformer++, Mamba-2, and Gated DeltaNet. Perplexity improved (lower is better; it measures how "surprised" a model is by text), and accuracy rose on the reasoning tasks.
Titans generalized beyond text to genomics and time-series forecasting. On extreme long-context tests like BABILong, it outperformed top baselines - including very large models - while using far fewer parameters. It also scaled to context windows beyond 2 million tokens.
Ablation studies highlighted a simple rule: for the same memory size, deeper memory modules worked better and kept performance steady as sequences grew.
Why this matters for applied research
- If your workloads involve very long documents, continuous logs, or streaming data, consider architectures that update memory at test time rather than storing a fixed state.
- Model behavior improves when the memory is deep, not just wide. Prioritize depth when tuning the memory's parameter budget.
- Use surprise thresholds and momentum to capture meaningful shifts, not just single spiky tokens. Pair with adaptive forgetting to manage capacity.
- When inputs are messy, losses like Huber (YAAD) help avoid overreacting to outliers. For stability, constraints like MEMORA's probability-style updates can keep learning smooth.
- Training remains parallelizable, and inference runs in linear time, which keeps deployment costs predictable.
Get practical with long-context AI
If you want structured ways to skill up on sequence modeling and long-context systems, browse Complete AI Training: Courses by job. Pick a path that matches your role and build from there.