Beyond Transformers: New AI Architectures Could Revolutionize Large Language Models

2025-01-19 04:01:02

In recent weeks, researchers from Google and Sakana unveiled two cutting-edge neural network designs that could upend the AI industry.

These technologies aim to challenge the dominance of transformers, the type of neural network that connects inputs and outputs based on context and that has defined AI for the past six years.

The new approaches are Google’s “Titans,” and “Transformer Squared,” which was designed by Sakana, a Tokyo AI startup known for using nature as its model for tech solutions. Indeed, both Google and Sakana tackled the transformer problem by studying the human brain. Titans uses different stages of memory, while Transformer Squared activates different expert modules independently, instead of engaging the whole model at once for every problem.

If the approaches hold up, the net result would be AI systems that are smarter, faster, and more versatile without necessarily being bigger or more expensive to run.

For context, the transformer architecture, the technology that gave ChatGPT the 'T' in its name, is designed for sequence-to-sequence tasks such as language modeling, translation, and image processing. Transformers rely on “attention mechanisms,” tools that weigh how important each piece of the input is in a given context, to model dependencies between input tokens. That lets them process data in parallel rather than sequentially, as the recurrent neural networks that dominated AI before transformers did. This technology gave models an understanding of context and marked a turning point in AI development.
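To make the attention idea concrete, here is a minimal sketch of scaled dot-product attention, the standard formulation used in transformers. The token embeddings are random toy data, and the function names are our own choices for illustration, not code from either research team.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention over a whole sequence at once."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n, n): relevance of every token to every other
    weights = softmax(scores, axis=-1)   # each row is a distribution that sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))              # 4 toy token embeddings of width 8 (made-up data)
out, weights = attention(X, X, X)        # self-attention: queries, keys, values all from X
```

All four tokens are handled in a single matrix product, which is the parallelism the paragraph above describes. A recurrent network would instead have to step through them one at a time.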

However, despite their remarkable success, transformers face significant challenges in scalability and adaptability. For models to be more flexible and versatile, they also need to be more powerful, and once they are trained, they cannot be improved unless developers come up with a new model or users rely on third-party tools. That’s why today, in AI, "bigger is better" is a general rule.

But this may change soon, thanks to Google and Sakana.

Google Research's Titans architecture takes a new approach to improving AI adaptability. Instead of modifying how models process information, Titans focuses on changing how they store and access it. The architecture introduces a neural long-term memory module that learns to memorize at test time, similar to how human memory works.

Currently, models read your entire prompt plus whatever output they have generated so far, predict a token, read everything again, predict the next token, and so on until they come up with the answer. They have an incredible short-term memory, but they suck at long-term memory. Ask them to remember things outside their context window, or very specific information in a bunch of noise, and they will probably fail.
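The read-predict-repeat loop can be sketched in a few lines. Everything here is a toy: `toy_next_token` is a hypothetical stand-in for a real language model (it is just a lookup table), and `context_window` is an assumed parameter name used to show that tokens outside the window are invisible to the model.

```python
def toy_next_token(context):
    """Hypothetical stand-in for a language model: looks only at the last visible token."""
    bigrams = {"the": "cat", "cat": "sat", "sat": "down", "down": "<eos>"}
    return bigrams.get(context[-1], "<eos>")

def generate(prompt, max_new_tokens=10, context_window=3):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # Every step re-reads the prompt plus everything generated so far,
        # but only the last `context_window` tokens are visible to the model.
        visible = tokens[-context_window:]
        nxt = toy_next_token(visible)
        if nxt == "<eos>":
            break
        tokens.append(nxt)
    return tokens

result = generate(["the"])
```

Anything that has scrolled out of `visible` simply does not exist for the model at that step, which is the long-term memory gap the paragraph above describes.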

Titans, on the other hand, combines three types of memory systems: short-term memory (similar to traditional transformers), long-term memory (for storing historical context), and persistent memory (for task-specific knowledge). This multi-tiered approach allows the model to handle sequences over 2 million tokens in length, far beyond what current transformers can process efficiently.
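One way to picture a memory that learns at test time is a small associative store that is nudged by gradient descent every time new information arrives during inference. The sketch below is only an illustration of that general idea, not Google's implementation: the `LinearMemory` class, the squared-error loss, and the `lr` and `decay` parameters are all our assumptions. It covers only the long-term tier. In the article's terms, short-term memory would be ordinary attention over recent tokens, and persistent memory would be fixed, task-specific parameters.

```python
import numpy as np

class LinearMemory:
    """Toy long-term memory: a matrix M trained so that M @ key approximates value."""

    def __init__(self, dim, lr=0.5, decay=0.0):
        self.M = np.zeros((dim, dim))
        self.lr = lr          # step size for each write (assumed)
        self.decay = decay    # fraction of old memory forgotten per write (assumed)

    def write(self, key, value):
        # Gradient of 0.5 * ||M @ key - value||^2 with respect to M.
        error = self.M @ key - value
        grad = np.outer(error, key)
        self.M = (1.0 - self.decay) * self.M - self.lr * grad
        return float(np.linalg.norm(error))   # prediction error before this update

    def read(self, key):
        return self.M @ key

rng = np.random.default_rng(0)
dim = 8
key = rng.normal(size=dim)
key /= np.linalg.norm(key)     # unit-length key keeps the step size well behaved
value = rng.normal(size=dim)

memory = LinearMemory(dim)
errors = [memory.write(key, value) for _ in range(20)]  # updates happen at inference time
recalled = memory.read(key)
```

The error returned by `write` shrinks with each pass, so the memory ends up recalling `value` from `key` even though no separate training phase ever ran.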

Image: Google

According to the research paper, Titans shows significant improvements in various tasks, including language modeling, common-sense reasoning, and genomics. The architecture has proven particularly effective at "needle-in-haystack" tasks, where it needs to locate specific information within very long contexts.

The system mimics how the human brain activates specific regions for different tasks and dynamically reconfigures its networks based on changing demands.

In other words, similar to how different neurons in your brain are specialized for distinct functions and are activated based on the task you're performing, Titans emulates this idea by incorporating interconnected memory systems. These systems (short-term, long-term, and persistent memories) work together to dynamically store, retrieve, and process information based on the task at hand.

Just two weeks after Google’s paper, a team of researchers from Sakana AI and the Institute of Science Tokyo introduced Transformer Squared, a framework that allows AI models to modify their behavior in real-time based on the task at hand. The system works by selectively adjusting only the singular components of their weight matrices during inference, making it more efficient than traditional fine-tuning methods.

Transformer Squared “employs a two-pass mechanism: first, a dispatch system identifies the task properties, and then task-specific 'expert' vectors, trained using reinforcement learning, are dynamically mixed to obtain targeted behavior for the incoming prompt,” according to the research paper.
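The two-pass idea can be sketched roughly as follows. This is a toy under stated assumptions, not Sakana's code: the keyword-based `dispatch` function, the task names, and the numbers in `EXPERTS` are all made up for illustration, and no reinforcement learning is performed here.

```python
import numpy as np

# Hypothetical expert vectors, one scaling factor per component (made-up numbers).
EXPERTS = {
    "math":   np.array([1.2, 0.9, 1.0, 1.1]),
    "coding": np.array([0.8, 1.3, 1.0, 0.9]),
}
KEYWORDS = {"math": ("solve", "integral"), "coding": ("function", "bug")}

def dispatch(prompt):
    """First pass: score each task by keyword hits and normalize to mixing weights."""
    words = prompt.lower().split()
    hits = {task: sum(w in words for w in kws) for task, kws in KEYWORDS.items()}
    total = sum(hits.values())
    if total == 0:
        return {task: 1.0 / len(hits) for task in hits}   # no signal: uniform mix
    return {task: h / total for task, h in hits.items()}

def mix_experts(weights):
    """Blend the expert vectors; the result would steer the second pass."""
    return sum(w * EXPERTS[task] for task, w in weights.items())

weights = dispatch("fix the bug in this function")
z = mix_experts(weights)
```

For this prompt the dispatcher puts all of its weight on the coding expert, so `z` equals the coding vector. A prompt that matches both tasks would get a blend of the two.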

It trades longer inference time (it thinks more) for specialization (knowing which expertise to apply).

Image: Sakana AI

What makes Transformer Squared particularly innovative is its ability to adapt without requiring extensive retraining. The system uses what the researchers call Singular Value Fine-tuning (SVF), which focuses on modifying only the essential components needed for a specific task. This approach significantly reduces computational demands while maintaining or improving performance compared to current methods.
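As a rough sketch of why adjusting only singular values is cheap: a weight matrix can be factored with a singular value decomposition, and a task can then be encoded as one scaling factor per singular value rather than a change to every weight. The code below assumes that reading of SVF. The matrix is random toy data and `z_task` is a made-up vector, not one learned by the researchers' method.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))               # stand-in for one weight matrix of a model

# Decompose once, up front: W = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(W, full_matrices=False)

def adapt(z):
    """Rebuild the weights with each singular value scaled by the matching entry of z."""
    return U @ np.diag(s * z) @ Vt

z_identity = np.ones_like(s)              # all ones: leaves the model unchanged
z_task = np.array([1.5, 1.0, 0.5, 1.0])   # hypothetical task vector (made up)

W_same = adapt(z_identity)
W_task = adapt(z_task)

trainable = z_task.size                   # numbers to learn per task for this matrix
full = W.size                             # numbers a full fine-tune would touch
```

Here the task vector has 4 entries against 24 weights in the full matrix, and the gap widens quickly as matrices grow, since the vector scales with the smaller dimension while the weight count scales with the product of both.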

In testing, Transformer Squared demonstrated remarkable versatility across different tasks and model architectures. The framework showed particular promise in handling out-of-distribution applications, suggesting it could help AI systems become more flexible and responsive to novel situations.

Here’s our attempt at an analogy. Your brain forms new neural connections when learning a new skill without having to rewire everything. When you learn to play piano, for instance, your brain doesn't need to rewrite all its knowledge—it adapts specific neural circuits for that task while maintaining other capabilities. Sakana’s idea was that developers don’t need to retrain the model’s entire network to adapt to new tasks.

Instead, the model selectively adjusts specific components (through Singular Value Fine-tuning) to become more efficient at particular tasks while maintaining its general capabilities.

Overall, the era of AI companies bragging over the sheer size of their models may soon be a relic of the past. If this new generation of neural networks gains traction, then future models won’t need to rely on massive scales to achieve greater versatility and performance.

Today, transformers dominate the landscape, often supplemented by external tools like Retrieval-Augmented Generation (RAG) or LoRAs to enhance their capabilities. But in the fast-moving AI industry, it only takes one breakthrough implementation to set the stage for a seismic shift—and once that happens, the rest of the field is sure to follow.

Edited by Andrew Hayward