Summary: A new AI model, based on the PV-RNN framework, learns to generalize language and actions in a manner similar to toddlers by integrating vision, proprioception, and language instructions. Unlike large language models (LLMs) that rely on vast datasets, this system uses embodied interactions to achieve compositionality while requiring less data and computational power.
Researchers found the AI’s modular, transparent design helpful for studying how humans acquire cognitive skills like combining language and actions. The model offers insights into developmental neuroscience and could lead to safer, more ethical AI by grounding learning in behavior and transparent decision-making processes.
Key Facts:
- Toddler-Like Learning: The AI learns compositionality by integrating sensory inputs, language, and actions.
- Transparent Design: Its architecture allows researchers to study internal decision-making pathways.
- Practical Benefits: Requires less data than LLMs and highlights ethical, embodied AI development.
Source: OIST
We humans excel at generalization. If you teach a toddler to identify the color red by showing her a red ball, a red truck and a red rose, she will most likely correctly identify the color of a tomato, even if it is the first time she sees one.
An important milestone in learning to generalize is compositionality: the ability to compose and decompose a whole into reusable parts, like the redness of an object. How we get this ability is a key question in developmental neuroscience – and in AI research.
The earliest neural networks, which later evolved into the large language models (LLMs) revolutionizing our society, were developed to study how information is processed in our brains.
Ironically, as these models became more sophisticated, the information processing pathways within also became increasingly opaque, with some models today having trillions of tunable parameters.
But now, members of the Cognitive Neurorobotics Research Unit at the Okinawa Institute of Science and Technology (OIST) have created an embodied intelligence model with a novel architecture that allows researchers access to the various internal states of the neural network, and which appears to learn how to generalize in the same ways that children do.
Their findings have now been published in Science Robotics.
“This paper demonstrates a possible mechanism for neural networks to achieve compositionality,” says Dr. Prasanna Vijayaraghavan, first author of the study.
“Our model achieves this not by inference based on vast datasets, but by combining language with vision, proprioception, working memory, and attention – just like toddlers do.”
Perfectly imperfect
LLMs, built on a transformer network architecture, learn the statistical relationships between words that appear in sentences from vast amounts of text data. They are exposed to words in an enormous range of contexts, and from these statistics, they predict the most probable answer to a given prompt.
By contrast, the new model is based on a PV-RNN (Predictive coding inspired, Variational Recurrent Neural Network) framework, trained through embodied interactions integrating three simultaneous inputs related to different senses: vision, with a video of a robot arm moving colored blocks; proprioception, the sense of our limbs’ movement, with the joint angles of the robot arm as it moves; and a language instruction like “put red on blue.”
The model is then tasked to generate either a visual prediction and corresponding joint angles in response to a language instruction, or a language instruction in response to sensory input.
The system is inspired by the Free Energy Principle, which suggests that our brain continuously predicts sensory inputs based on past experiences and takes action to minimize the difference between prediction and observation.
This difference, quantified as ‘free energy’, is a measure of uncertainty, and by minimizing free energy, our brain maintains a stable state.
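As a rough illustration of this idea (not the PV-RNN itself), the sketch below minimizes a free-energy-like quantity for a single hidden variable under simple Gaussian assumptions. The function names, the linear prediction `weight * z`, and all numeric defaults are hypothetical choices made for this example.

```python
def free_energy(z, observation, weight=2.0, prior_mean=0.0,
                obs_var=1.0, prior_var=1.0):
    """Weighted prediction error plus deviation of z from its prior."""
    prediction_error = observation - weight * z   # observation vs. prediction
    prior_error = z - prior_mean                  # belief vs. prior expectation
    return (0.5 * prediction_error ** 2 / obs_var
            + 0.5 * prior_error ** 2 / prior_var)


def infer_latent(observation, weight=2.0, prior_mean=0.0,
                 obs_var=1.0, prior_var=1.0, step_size=0.05, n_steps=200):
    """Gradient descent on z; returns the final z and the free-energy trace."""
    z = prior_mean  # start from what the prior expects
    trace = [free_energy(z, observation, weight, prior_mean, obs_var, prior_var)]
    for _ in range(n_steps):
        prediction_error = observation - weight * z
        # Derivative of free_energy with respect to z.
        grad = (-weight * prediction_error / obs_var
                + (z - prior_mean) / prior_var)
        z -= step_size * grad
        trace.append(
            free_energy(z, observation, weight, prior_mean, obs_var, prior_var)
        )
    return z, trace


if __name__ == "__main__":
    z, trace = infer_latent(observation=3.0)
    print(f"inferred z: {z:.3f}")
    print(f"free energy: {trace[0]:.3f} -> {trace[-1]:.3f}")
```

Each update moves the hidden variable in the direction that shrinks the mismatch between prediction and observation, while the prior term keeps it from drifting arbitrarily far from earlier expectations.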
With its limited working memory and attention span, the AI mirrors human cognitive constraints, forcing it to process input and update its predictions in sequence rather than all at once, as LLMs do.
By studying the flow of information within the model, researchers can gain insights into how it integrates the various inputs to generate its simulated actions.
It is thanks to this modular architecture that the researchers have learned more about how infants may develop compositionality. As Dr. Vijayaraghavan recounts, “We found that the more exposure the model has to the same word in different contexts, the better it learns that word.
This mirrors real life, where a toddler will learn the concept of the color red much faster if she’s interacted with various red objects in different ways, rather than just pushing a red truck on multiple occasions.”
Opening the black box
“Our model requires a significantly smaller training set and much less computing power to achieve compositionality. It does make more mistakes than LLMs do, but it makes mistakes that are similar to how humans make mistakes,” says Dr. Vijayaraghavan.
It is exactly this feature that makes the model so useful to cognitive scientists, as well as to AI researchers trying to map the decision-making processes of their models.
While it serves a different purpose than the LLMs currently in use, and therefore cannot be meaningfully compared on effectiveness, the PV-RNN nevertheless shows how neural networks can be organized to offer greater insight into their information processing pathways: its relatively shallow architecture allows researchers to visualize the network’s latent state – the evolving internal representation of the information retained from the past and used in present predictions.
The model also addresses the Poverty of Stimulus problem, which posits that the linguistic input available to children is insufficient to explain their rapid language acquisition.
Despite having a very limited dataset, especially compared to LLMs, the model still achieves compositionality, suggesting that grounding language in behavior may be an important catalyst for the impressive language learning ability of children.
This embodied learning could moreover show the way to safer and more ethical AI in the future, both by improving transparency and by enabling the AI to better understand the effects of its actions.
Learning the word ‘suffering’ from a purely linguistic perspective, as LLMs do, would carry less emotional weight than it would for a PV-RNN, which learns the meaning through embodied experiences together with language.
“We are continuing our work to enhance the capabilities of this model and are using it to explore various domains of developmental neuroscience.
“We are excited to see what future insights into cognitive development and language learning processes we can uncover,” says Professor Jun Tani, head of the research unit and senior author on the paper.
How we acquire the intelligence to create our society is one of the great questions in science. While the PV-RNN hasn’t answered it, it opens new research avenues into how information is processed in our brain.
“By observing how the model learns to combine language and action,” summarizes Dr. Vijayaraghavan, “we gain insights into the fundamental processes that underlie human cognition.
“It has already taught us a lot about compositionality in language acquisition, and it showcases potential for more efficient, transparent, and safe models.”
About this AI and learning research news
Author: Jun Tani
Source: OIST
Contact: Jun Tani – OIST
Image: The image is credited to Neuroscience News
Original Research: Closed access.
“Development of compositionality through interactive learning of language and action of robots” by Prasanna Vijayaraghavan et al. Science Robotics
Abstract
Humans excel at applying learned behavior to unlearned situations. A crucial component of this generalization behavior is our ability to compose/decompose a whole into reusable parts, an attribute known as compositionality.
One of the fundamental questions in robotics concerns this characteristic: How can linguistic compositionality be developed concomitantly with sensorimotor skills through associative learning, particularly when individuals only learn partial linguistic compositions and their corresponding sensorimotor patterns?
To address this question, we propose a brain-inspired neural network model that integrates vision, proprioception, and language into a framework of predictive coding and active inference on the basis of the free-energy principle.
The effectiveness and capabilities of this model were assessed through various simulation experiments conducted with a robot arm.
Our results show that generalization in learning to unlearned verb-noun compositions is significantly enhanced when training variations of task composition are increased.
We attribute this to self-organized compositional structures in linguistic latent state space being influenced substantially by sensorimotor learning.
Ablation studies show that visual attention and working memory are essential to accurately generate visuomotor sequences to achieve linguistically represented goals.
These insights advance our understanding of mechanisms underlying development of compositionality through interactions of linguistic and sensorimotor experience.
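The kind of test the abstract describes, generalizing to verb-noun compositions never seen in training, can be sketched as a data split in which every word appears in training but some combinations are withheld. This is a minimal sketch under assumptions: the function name, the vocabulary, and the greedy selection are hypothetical and are not taken from the paper.

```python
import random
from itertools import product


def compositional_split(verbs, nouns, n_held_out, seed=0):
    """Withhold verb-noun pairs while keeping every word in the training set.

    Assumes the entries of `verbs` and `nouns` are unique.
    """
    rng = random.Random(seed)
    pairs = list(product(verbs, nouns))
    rng.shuffle(pairs)

    train = list(pairs)
    held_out = []
    for pair in pairs:
        if len(held_out) == n_held_out:
            break
        remaining = [p for p in train if p != pair]
        # Only withhold a pair if each verb and noun is still seen in training.
        verbs_covered = {v for v, _ in remaining} == set(verbs)
        nouns_covered = {n for _, n in remaining} == set(nouns)
        if verbs_covered and nouns_covered:
            train = remaining
            held_out.append(pair)
    return train, held_out


if __name__ == "__main__":
    # Hypothetical vocabulary, for illustration only.
    verbs = ["grasp", "move", "put"]
    nouns = ["red", "blue", "green"]
    train, held_out = compositional_split(verbs, nouns, n_held_out=3)
    print("train:", train)
    print("held out:", held_out)
```

A model trained only on `train` and evaluated on `held_out` is being asked to recombine familiar words in unfamiliar ways. Lowering `n_held_out` corresponds to increasing the variety of compositions seen during training.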