Rice University faculty member wins big grant
Rice University’s Noah Harding Assistant Professor of Computer Science, Anastasios Kyrillidis, is using a recent Microsoft Research Award to explore two complementary machine learning (ML) research axes: efficient large-scale training of neural networks, and efficient adaptation of large neural network models to learn new tasks, with the overarching goal of continual learning.
Continual learning
He used the analogy of a family excursion to describe continual learning. “You and your family plan a visit to the zoo. Your children say ‘goodbye’ to the family dog before encountering animals they probably have not seen before in their life. After learning properties about those animals, they happily return home to play with their dog.
“This story makes a lot of sense for human beings, but for ML models that do a simple task (e.g., classifying animals), the leap is not trivial. You do not expect your kids to have forgotten the existence of your pet after going to the zoo; however, ML models easily forget how to classify previously seen objects in order to create flexibility for new, never-before-seen objects and tasks. This is the problem of catastrophic forgetting during training, and learning new tasks on the fly is the idea of continual learning.”
Kyrillidis said continual learning is a crucial component for more powerful and useful artificial intelligence, and his plan to explore interesting directions in continual learning is based on the work of one of his Rice CS Ph.D. students, Cameron Wolfe.
With the goal of continual learning in mind, Kyrillidis is looking more deeply at neural networks and focusing much of the Microsoft-funded work on two axes of research: efficient training and efficient adaptation for learning new tasks.
Efficient large-scale training of neural networks
He said, “There has been a long-term trend towards larger and more power-hungry Artificial Intelligence (AI) models. The well-known Generative Pre-trained Transformer-3 (GPT-3) model has around 175 billion parameters. The model powers over 300 applications, and ChatGPT is a popular variant of GPT-3. According to some reports, such models used thousands of megawatt-hours of energy to train, enough to power more than a thousand US homes for a year.
“As models become larger, the power requirements will increase similarly. Training the next generation of ML models will consume the same power that more than a million homes consume in a year.”
Following Kyrillidis’ projection, it is easy for the mind to wander into the realm of a dystopian future where AI power requirements exceed the United States’ domestic power usage. This alarming energy consumption forecast is widely known and motivates many scientists to research models and processes to mitigate it.
“The goal of this current project is to explore approximate training dynamics in ML models, in order to reveal any potential performance/cost trade-offs that could be valuable in such extreme scenarios and reduce the overall cost in both time and energy,” said Kyrillidis. In addition to more efficient training on initial tasks, Kyrillidis is also pondering ways to reduce the resource costs associated with learning new tasks.
Efficient adaptation of large neural network models to learn new tasks
“If you check the GitHub repositories available online, you will get an idea of the enormous number of ML models that each complete a specific task,” he said. “In the area of computer vision alone, we have solutions that classify specific objects, that classify generic classes of objects, that detect objects within images, that generate new images, etc.”
“However, all these models are often independent of each other; the user must manually make ‘executive’ decisions about which model is used and when. So the question of adapting to new tasks remains open: can we combine existing models into a bigger ‘meta-model’ that decides on the fly how to reuse existing knowledge, instead of learning the new task from scratch?”
This is the second axis of Kyrillidis’ Microsoft-funded research. The direction spans approaches such as mixture-of-experts (MoE) modules and adapters in neural networks. In MoE modules, the idea is to cleverly train and combine (sub)models that ‘vote’ at the end on the final answer. The idea behind adapters is to slightly modify, or adapt, existing models by changing only small parts of them or by adding a small number of additional parameters that need to be tuned.
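A minimal sketch can make these two patterns concrete. The hypothetical PyTorch code below is only an illustration under simple assumptions; the class names `Adapter` and `MoEHead`, the dimensions, and the toy backbone are inventions for this example, not code from the funded project. It shows an adapter as a small trainable bottleneck added on top of a frozen model, and a tiny MoE head whose experts ‘vote’ through a learned gate.

```python
# Hypothetical sketch of the two patterns described above (not the project's code):
# an adapter with few trainable parameters, and a small mixture-of-experts head.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: only these roughly 2*dim*bottleneck parameters are trained."""
    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        # Residual connection keeps the frozen model's behavior as the baseline.
        return x + self.up(torch.relu(self.down(x)))

class MoEHead(nn.Module):
    """Tiny mixture of experts: a learned gate weights each expert's 'vote'."""
    def __init__(self, dim: int, num_classes: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Linear(dim, num_classes) for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)           # (batch, experts)
        votes = torch.stack([e(x) for e in self.experts], -1)   # (batch, classes, experts)
        return torch.einsum("bce,be->bc", votes, weights)       # weighted vote

# Usage sketch: freeze a (stand-in) pretrained backbone, train only adapter + head.
backbone = nn.Sequential(nn.Linear(128, 128), nn.ReLU())
for p in backbone.parameters():
    p.requires_grad = False
model = nn.Sequential(backbone, Adapter(128), MoEHead(128, num_classes=10))
```

The appeal, in either case, is that the large pretrained model stays untouched while only a small number of new parameters are tuned for the new task.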
Kyrillidis said, “This is not a novel concept. It is a combination of ideas, of ‘fine-tuning’ plus ‘emergent abilities,’ where, especially at large scale, models learn features during initial training that are rich enough to be cleverly reused to complete tasks different from the ones they were initially trained on.”
Working at the intersection of both axes – training and learning new tasks – gives Kyrillidis a unique view of how machines can accomplish continual learning. He said researchers seek ways to go ‘one step further’ with every new generation of models, hoping to achieve a continuously adapting model that seamlessly incorporates incoming data and handles data distribution shifts.
“If an ML model has trained at length on classifying cars and we suddenly insert images of houses, it will do a really bad job, which is reasonable. But if we decide to ‘learn’ about houses, we need to make sure we do not ‘forget’ about cars,” said Kyrillidis.
“One way to do so is to keep in our memory some representative images of cars, so that we can always revisit that part of memory and not learn about cars from scratch. At the same time, memory is not infinite; otherwise, we could always save everything there, and never ‘forget’ any of the previous tasks we learned.”
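One common way to realize this idea of a bounded memory of representative examples is a small replay (rehearsal) buffer. The sketch below is a minimal, hypothetical illustration; the `ReplayMemory` class and its reservoir-sampling policy are assumptions made for this example, not the method used in the project.

```python
# Hypothetical rehearsal memory (illustrative only): keep a fixed-size set of
# past (input, label) pairs and mix a few of them into each new training batch.
import random

class ReplayMemory:
    """Fixed-budget memory of examples from earlier tasks."""
    def __init__(self, capacity: int = 200):
        self.capacity = capacity
        self.samples = []   # stored (input, label) pairs
        self.seen = 0       # total samples offered so far

    def add(self, item):
        # Reservoir sampling keeps a roughly uniform subset of everything seen,
        # so the memory stays representative even though it cannot grow.
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(item)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.samples[j] = item

    def replay(self, k: int):
        """Draw up to k stored pairs to mix into the current training batch."""
        return random.sample(self.samples, min(k, len(self.samples)))
```

Replaying a handful of stored ‘car’ images alongside each batch of ‘house’ images is exactly the revisiting of memory Kyrillidis describes, done under a fixed budget.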
Kyrillidis said, “In classical ML training, we usually perform a number of passes, or ‘epochs,’ over the available data. Now think about what you can achieve with a single pass over the data, and in a setting where the input stream might have a weird ordering: not necessarily images from random classes, but first the images of cars, then the images of houses, and so on.”
The challenge of continual learning, and of streaming continual learning in particular, is how to train an ML model so that each input sample is seen once and may be stored once; if it is erased from memory, or was never stored, it will never be seen again. And, of course, this should be done efficiently: one should not sacrifice excessive computational resources to achieve it.
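To make that single-pass constraint concrete, here is another hypothetical sketch; the function and variable names are illustrative assumptions, not the project’s method. The loop visits each streamed sample exactly once, may store it once in a bounded memory, and mixes a few stored samples into every update; anything not stored is never revisited.

```python
# Hypothetical single-pass streaming loop (illustrative only): each sample is
# seen once; a bounded memory may keep it; nothing outside memory is revisited.
import random
import torch
import torch.nn as nn

def train_one_pass(model, stream, capacity=200, replay_k=8, lr=1e-3):
    """One pass over `stream` of (x, y) pairs; y is a class-index tensor."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    memory = []                                    # bounded memory of past samples
    for x, y in stream:                            # single pass: each pair appears once
        replayed = random.sample(memory, min(replay_k, len(memory)))
        xs = torch.stack([x] + [m[0] for m in replayed])
        ys = torch.stack([y] + [m[1] for m in replayed])
        opt.zero_grad()
        loss_fn(model(xs), ys).backward()
        opt.step()
        if len(memory) < capacity:
            memory.append((x, y))                  # stored at most once; unstored samples are gone for good

# Usage sketch: a stream of (input, label) pairs arriving in an arbitrary order.
model = nn.Linear(128, 10)
stream = ((torch.randn(128), torch.tensor(i % 10)) for i in range(1000))
train_one_pass(model, stream)
```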