Machine learning involves so many interacting factors and variables that it can be difficult to diagnose issues or explain why a particular model behaves the way it does.
Researchers have shed light on how large language models, such as GPT-3, can learn new tasks without updating their parameters, despite never having been trained for those tasks. They show that these large models can write smaller linear models inside their hidden layers, and can then train those linear models to perform a new task using simple learning algorithms.
Large language models, such as OpenAI’s GPT-3, are massive neural networks capable of producing human-like text, ranging from poetry to programming code. These machine-learning models, which have been trained using massive amounts of internet data, take a small amount of input text and predict what text is likely to come next.
But that’s not all these models are capable of. Researchers are investigating a strange phenomenon known as in-context learning, in which a large language model learns to perform a task after only seeing a few examples – despite the fact that it has never been trained for that task. For example, someone could feed the model several example sentences with their sentiments (positive or negative), then prompt it with a new sentence, and the model would respond with the appropriate sentiment.
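To make the sentiment example above concrete, here is a minimal sketch of what such a few-shot prompt might look like. The reviews and labels are invented for illustration, and the resulting string could be fed to any text-completion language model.

```python
# A minimal sketch of the few-shot sentiment prompt described above. The
# reviews and labels are invented for illustration; the resulting string
# could be fed to any text-completion language model.
examples = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I wasted two hours of my life on this film.", "negative"),
    ("The soundtrack alone is worth the ticket price.", "positive"),
]
query = "The plot dragged and the acting felt wooden."

prompt = "\n".join(f"Review: {text}\nSentiment: {label}" for text, label in examples)
prompt += f"\nReview: {query}\nSentiment:"

print(prompt)
# The model is expected to continue the text with "negative" -- it infers
# the task from the examples alone, with no parameter updates.
```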
A machine-learning model, such as GPT-3, would typically need to be retrained with new data for this new task. The model’s parameters are updated during the training process as it processes new information to learn the task. But with in-context learning, the model’s parameters aren’t updated, so it seems like the model learns a new task without learning anything at all.
Scientists from MIT, Google Research, and Stanford University are working to unravel this mystery. They studied models that are very similar to large language models to see how they can learn without updating their parameters. The researchers' theoretical results show that these massive neural network models can contain smaller, simpler linear models buried inside them. The large model can then use a simple learning algorithm to train this smaller linear model to complete a new task, using only information already contained within the larger model, whose parameters remain fixed.
According to Ekin Akyürek, a computer science graduate student and lead author of a paper exploring this phenomenon, this research is an important step toward understanding the mechanisms behind in-context learning and opens the door to more exploration around the learning algorithms these large models can implement. Researchers may be able to enable models to complete new tasks without the need for costly retraining if they gain a better understanding of in-context learning.
“To fine-tune these models, you typically need to collect domain-specific data and perform complex engineering. But now we can just give it five examples and it will do what we want. So in-context learning is a fascinating phenomenon,” Akyürek says.
A model within a model
In the machine-learning research community, many scientists have come to believe that large language models can perform in-context learning because of how they are trained, Akyürek says. For instance, GPT-3 has hundreds of billions of parameters and was trained by reading huge swaths of text on the internet, from Wikipedia articles to Reddit posts. So, when someone shows the model examples of a new task, it has likely already seen something very similar because its training dataset included text from billions of websites. It repeats patterns it has seen during training, rather than learning to perform new tasks.
Akyürek hypothesized that in-context learners are not simply matching previously seen patterns, but are actually learning to perform new tasks. He and others had experimented by giving these models prompts built from synthetic data the models could not have seen anywhere before, and found that the models could still learn from just a few examples. Akyürek and his colleagues reasoned that these neural networks might contain smaller machine-learning models inside them, which the large models could train to complete a new task.
“That could account for almost all of the learning phenomena observed with these large models,” he says. To put this hypothesis to the test, the researchers used a transformer neural network model, which has the same architecture as GPT-3 but has been specifically trained for in-context learning.
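As an illustration of how a model might be set up for this kind of in-context learning, the sketch below generates synthetic regression prompts in which each prompt comes from a fresh random linear function. The dimensions and sampling choices are assumptions made for illustration, not the paper's exact experimental setup.

```python
import numpy as np

# Sketch of synthetic in-context regression prompts in the spirit of this
# line of work. The dimensions and sampling choices are illustrative
# assumptions, not the paper's exact setup. Each prompt is drawn from a
# fresh random linear function, so the task changes from prompt to prompt
# and cannot simply be memorized.
rng = np.random.default_rng(0)
dim, n_examples = 8, 16

def sample_prompt():
    w = rng.normal(size=dim)                 # hidden linear task for this prompt
    xs = rng.normal(size=(n_examples, dim))  # in-context inputs
    ys = xs @ w                              # their labels under the hidden task
    return xs, ys, w

xs, ys, w = sample_prompt()
print(xs.shape, ys.shape)
# A transformer trained on many such prompts must predict the label of the
# last x from the earlier (x, y) pairs alone -- that is, it must "learn"
# the hidden w in context, without any parameter updates at test time.
```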
By investigating the transformer’s architecture, they theoretically proved that it can write a linear model within its hidden states. A neural network is made up of many layers of interconnected data-processing nodes; the hidden states are the layers between the input and output layers. Their mathematical analysis shows that this linear model is written somewhere in the transformer’s earliest layers. The transformer can then update the linear model using simple learning algorithms.
In essence, the model simulates and trains a smaller version of itself.
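To make that idea concrete, the sketch below shows one candidate “simple learning algorithm” of the kind described: plain gradient descent on a least-squares objective over the in-context examples. It illustrates the general idea, not the exact update rule the transformer is proven to implement.

```python
import numpy as np

# One candidate "simple learning algorithm" of the kind described above:
# plain gradient descent on a least-squares objective over the in-context
# examples. This illustrates the general idea, not the exact update rule
# the transformer is proven to implement.
rng = np.random.default_rng(1)
dim, n = 8, 16
w_true = rng.normal(size=dim)           # the hidden task
X = rng.normal(size=(n, dim))           # in-context inputs
y = X @ w_true                          # their labels

w_hat = np.zeros(dim)                   # the implicit linear model starts blank
lr = 0.1
for _ in range(2000):                   # gradient descent on 0.5/n * ||X w - y||^2
    grad = X.T @ (X @ w_hat - y) / n
    w_hat -= lr * grad

x_query = rng.normal(size=dim)
print("prediction:", x_query @ w_hat)   # approximately matches...
print("target:    ", x_query @ w_true)  # ...the true label for the query
```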
Probing hidden layers
The researchers investigated this hypothesis by conducting probing experiments in the transformer’s hidden layers in order to recover a specific quantity. “In this case, we attempted to recover the actual linear model solution, and we were able to demonstrate that the parameter is written in the hidden states. This implies that the linear model is present somewhere,” he says.
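The following sketch shows the general shape of such a probing experiment: fit a linear probe that tries to read a target quantity out of a layer’s hidden states, then check its error on held-out prompts. The hidden states and targets below are random placeholders constructed so that the target is linearly decodable; in the actual experiments they would come from the trained transformer and the linear-model solution.

```python
import numpy as np

# Sketch of a probing experiment of the kind described above: fit a linear
# "probe" that tries to read a target quantity out of a layer's hidden
# states, then check its error on held-out prompts. The hidden states and
# targets here are random placeholders constructed so the target is
# linearly decodable; in the real experiment they would come from the
# trained transformer and the linear-model solution.
rng = np.random.default_rng(2)
n_prompts, hidden_dim, target_dim = 500, 64, 8

H = rng.normal(size=(n_prompts, hidden_dim))    # one hidden-state vector per prompt
W_secret = rng.normal(size=(hidden_dim, target_dim))
targets = H @ W_secret                          # quantity encoded linearly (by construction)

H_train, H_test = H[:400], H[400:]              # fit the probe on some prompts,
t_train, t_test = targets[:400], targets[400:]  # evaluate it on the rest

probe, *_ = np.linalg.lstsq(H_train, t_train, rcond=None)
error = np.mean((H_test @ probe - t_test) ** 2)
print("held-out probe error:", error)           # near zero => linearly decodable
```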
Building on this theoretical work, the researchers may be able to enable a transformer to perform in-context learning by adding just two layers to the neural network. There are still many technical details to iron out before that is possible, Akyürek cautions, but it could help engineers create models that can complete new tasks without retraining on new data.
Going forward, Akyürek plans to continue exploring in-context learning with functions that are more complex than the linear models studied in this paper. These experiments could also be applied to large language models to see whether their behaviors are also described by simple learning algorithms. He also wants to dig deeper into the types of pretraining data that can enable in-context learning.
“People can now see how these models can learn from examples thanks to this work. So, my hope is that it alters some people’s perceptions of in-context learning,” says Akyürek. “These models are not as stupid as people believe. These tasks are not simply memorized. They can learn new tasks, and we have demonstrated how.”