A new AI model, based on the PV-RNN (predictive-coding-inspired variational RNN) framework, learns to generalize language and actions in a manner similar to toddlers by integrating vision, proprioception, and language instructions. Unlike large language models (LLMs), which rely on vast datasets, this system uses embodied interactions to achieve compositionality while requiring less data and computational power.
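At its core, PV-RNN pairs deterministic recurrent states with stochastic latent variables and trains by minimizing a free-energy-style loss: reconstruction error plus a KL divergence between a learned prior and posterior, weighted by a "meta-prior" coefficient. The sketch below illustrates that idea for a single layer in PyTorch; it is a simplified illustration, not the authors' implementation. All names, dimensions, and the toy loop are hypothetical, the separate vision/proprioception/language streams are stood in for by one concatenated observation vector, and the posterior is amortized here for brevity, whereas the original PV-RNN optimizes per-sequence adaptive posterior variables through error regression.

```python
# Minimal, hypothetical sketch of one PV-RNN-style layer step:
# deterministic state d_t, stochastic latent z_t with a learned prior,
# and a free-energy loss (reconstruction + w * KL).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PVRNNCell(nn.Module):
    def __init__(self, d_dim, z_dim, obs_dim, tau=2.0):
        super().__init__()
        self.tau = tau  # time constant of the leaky-integrator (MTRNN-style) unit
        # Prior p(z_t | d_{t-1}); amortized stand-in for the posterior q(z_t | ...)
        self.prior = nn.Linear(d_dim, 2 * z_dim)
        self.posterior = nn.Linear(d_dim + obs_dim, 2 * z_dim)
        self.recur = nn.Linear(d_dim + z_dim, d_dim)
        self.decode = nn.Linear(d_dim, obs_dim)

    def step(self, d_prev, target=None):
        mu_p, logvar_p = self.prior(d_prev).chunk(2, dim=-1)
        if target is not None:  # training: sample from the posterior
            mu_q, logvar_q = self.posterior(
                torch.cat([d_prev, target], dim=-1)).chunk(2, dim=-1)
        else:                   # generation: fall back to the prior
            mu_q, logvar_q = mu_p, logvar_p
        z = mu_q + torch.randn_like(mu_q) * torch.exp(0.5 * logvar_q)
        # Leaky-integrator update keeps slow, continuous dynamics
        h = self.recur(torch.cat([d_prev, z], dim=-1))
        d = (1 - 1 / self.tau) * d_prev + (1 / self.tau) * torch.tanh(h)
        # KL(q || p) between diagonal Gaussians, summed over latent units
        kl = 0.5 * (logvar_p - logvar_q
                    + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                    - 1).sum(-1)
        return d, self.decode(d), kl

# Toy training step on a 5-step "multimodal" target sequence
cell = PVRNNCell(d_dim=32, z_dim=4, obs_dim=10)
d = torch.zeros(1, 32)
x = torch.randn(5, 1, 10)  # placeholder for vision + proprioception + language
loss, w = 0.0, 0.01        # w: the meta-prior weighting the KL term
for t in range(5):
    d, x_hat, kl = cell.step(d, target=x[t])
    loss = loss + F.mse_loss(x_hat, x[t]) + w * kl.mean()
loss.backward()
```

In the full architecture, each modality would feed its own branch of a multi-timescale hierarchy of such cells, and tuning the meta-prior w trades off deterministic, strongly generalizing dynamics against faithful reproduction of the training sequences.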