Dog-Inspired Robot: How Gestures Help AI Master the Art of “Fetch”

Summary: For humans, “fetch” is a simple game, but for robots, locating a specific object in a cluttered room is a computational nightmare. Researchers have found a surprising solution by looking at the world champions of fetch: dogs. By studying how dogs interpret human pointing gestures and gaze, the team developed a new AI framework called LEGS-POMDP.

This system allows robots to combine natural language (words) with physical gestures (pointing) to navigate uncertainty. In lab tests, the robot achieved an 89% success rate in finding the correct objects—dramatically outperforming systems that rely on words or vision alone.

Key Facts

  • Multimodal Reasoning: The robot doesn’t just “hear” the command; it uses a “cone of probability” derived from the human’s eye, elbow, and wrist alignment to narrow down where the object is located.
  • The POMDP Framework: Robots use a “Partially Observable Markov Decision Process” to handle uncertainty. If the robot isn’t sure what it’s seeing, it moves to get a better view rather than making a blind guess.
  • Canine Inspiration: The gesture model was built using insights from the Brown Dog Lab, which studies how dogs intuitively solve cooperation problems with humans through gaze and gesture.
  • Performance Boost: Combining language and gesture led to a nearly 90% accuracy rate, proving that “showing” is just as important as “telling” when interacting with AI.
  • Vision-Language Model (VLM): The system integrates AI that can “see” a scene and understand complex natural language descriptions simultaneously.

Source: Brown University

Whether in the kitchen or on a workshop floor, robot assistants that can fetch items for people could be extremely useful. Now, a team of Brown University researchers has developed a way of making robots better at figuring out exactly which items a user might want them to retrieve.

The new approach enables robots to use inputs from both human language and gesture as they reason about how to locate and retrieve target objects. In a study that will be presented on Tuesday, March 17, during the International Conference on Human-Robot Interaction in Edinburgh, Scotland, the researchers show that the approach had an 89% success rate in finding the correct object in complex environments, outperforming other object retrieval approaches.

By incorporating canine-inspired models of human pointing and gaze, researchers have enabled robots to navigate partially observable environments with unprecedented accuracy. Credit: Neuroscience News

“Searching for things requires a robot to navigate large environments,” said Ivy He, a graduate student at Brown and the study’s lead author. “With current technology, robots are pretty good at identifying objects, but when the environment is cluttered, things are moving around or things are hidden by other objects, that makes things much more difficult. So this work is about using both language and gesture to help in that search task.”

The research makes use of an approach to robot planning called a POMDP (partially observable Markov decision process), a mathematical framework that allows a robot to reason under uncertainty. In the real world, robots rarely have a perfect understanding of the world. Different types of objects can look similar. There may be more than one of a particular object in a room. Items might be partially or completely hidden from view.

To succeed in a search, a robot has to act even when it isn’t sure what it’s seeing. Without a way to manage that uncertainty, it might freeze. Or worse, it might make overconfident final decisions based on incomplete information.

A POMDP turns ambiguities into a probabilistic framework that helps the robot track how confident it is about what’s in the world, and update those beliefs according to new information, including information from large vision and language models. In the process, it can choose actions that help it learn more — for example, moving to get a better view — before committing to a final decision.
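The belief-tracking idea described above can be sketched in a few lines. This is an illustrative toy, not the paper's actual model: the locations, observation likelihoods, and confidence threshold are all assumptions made up for this example.

```python
# Minimal sketch of a POMDP-style belief update over candidate object
# locations. The locations, probabilities, and threshold are illustrative
# assumptions, not the researchers' actual model.

def update_belief(belief, likelihoods):
    """Bayes update: posterior is proportional to prior * P(observation | location)."""
    posterior = {loc: belief[loc] * likelihoods[loc] for loc in belief}
    total = sum(posterior.values())
    return {loc: p / total for loc, p in posterior.items()}

# Prior: the robot is equally unsure about three locations.
belief = {"table": 1 / 3, "shelf": 1 / 3, "floor": 1 / 3}

# A noisy camera glimpse suggests the object is probably on the table.
obs_likelihood = {"table": 0.7, "shelf": 0.2, "floor": 0.1}
belief = update_belief(belief, obs_likelihood)

# If the top belief is below a confidence threshold, the robot gathers
# more information (e.g., moves for a better view) instead of committing.
best_loc = max(belief, key=belief.get)
action = "grasp" if belief[best_loc] > 0.9 else "move_for_better_view"
```

After one observation the robot leans toward the table, but with only 70% confidence it would keep exploring rather than commit—exactly the "act to learn more" behavior the framework is designed to produce.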

The innovation in this latest research is a POMDP that incorporates inputs from both language and human gestures, such as pointing toward the object of interest. To incorporate the gesture component, He drew on insights from a Brown laboratory led by Associate Professor of Cognitive and Psychological Sciences Daphna Buchsbaum, on how the undisputed world champions of fetch — dogs — interpret human pointing.

Building on this expertise, He and Ph.D. student Madeline Pelgrim performed a study of the finer points of human pointing, as well as how dogs interpret pointing gestures. The study helped He to model the target of a pointing gesture within a cone of probability.

“What we have found is that humans use eye gaze to align with what they’re pointing to,” He said. “So it was natural to create a cone based on a connecting line from the eye to elbow to the wrist. That turns out to be a fairly good approximation of where someone is pointing.”
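The eye-to-wrist cone He describes can be sketched geometrically: cast a ray from the eye through the wrist, and treat objects within some angular half-width of that ray as candidate targets. The coordinates and the 15-degree half-angle below are assumptions for illustration, not values from the study.

```python
# Illustrative sketch of the "cone of probability" for pointing: the axis
# runs from the eye through the wrist, and an object counts as a candidate
# if it falls within a fixed angular half-width of that axis. The positions
# and half-angle are made-up assumptions.
import math

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def in_pointing_cone(eye, wrist, obj, half_angle_deg=15.0):
    """True if obj lies within the cone whose axis runs from eye through wrist."""
    axis = unit(tuple(w - e for e, w in zip(eye, wrist)))
    to_obj = unit(tuple(o - e for e, o in zip(eye, obj)))
    cos_angle = sum(a * b for a, b in zip(axis, to_obj))
    return cos_angle >= math.cos(math.radians(half_angle_deg))

eye = (0.0, 0.0, 1.6)    # eye position (meters)
wrist = (0.4, 0.0, 1.2)  # extended wrist position
mug = (2.0, 0.1, 0.0)    # roughly along the pointing ray
lamp = (0.0, 2.0, 1.0)   # well off to the side
```

Here `in_pointing_cone(eye, wrist, mug)` is true while the lamp falls outside the cone, so the gesture alone already rules out most of a cluttered room. A probabilistic version would weight objects by their angular distance from the axis rather than using a hard cutoff.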

Buchsbaum adds, “Our work in the Brown Dog Lab has shown just how sophisticated dogs are in their communication with humans, solving many of the cooperation problems we want robots to solve. This makes them a natural model for intuitive human-non-human cooperation. This work translates the dog’s intuitive understanding of human gaze and pointing into a probabilistic model, which allows the robot to handle the ambiguity inherent in human communication. It moves us closer to truly intuitive robotic assistants.”

He then combined the gesture model with a vision-language model, or VLM, an AI system designed to interpret visual scenes together with natural language descriptions. The result was a POMDP capable of incorporating both language and gesture for robot planning.
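One simple way to picture the combination is to treat the language match and the gesture cone as independent evidence and multiply per-object scores. This is a hedged sketch of that fusion idea; the object names and scores are invented, and the paper's actual fusion inside the POMDP may differ.

```python
# Toy sketch of multimodal fusion: per-object scores from a vision-language
# model and from the gesture cone are treated as independent likelihoods and
# multiplied, then renormalized. All scores here are illustrative assumptions.

def fuse(language_scores, gesture_scores):
    combined = {o: language_scores[o] * gesture_scores[o] for o in language_scores}
    total = sum(combined.values())
    return {o: s / total for o, s in combined.items()}

# "The red mug" is ambiguous by language alone: two red mugs match equally...
language_scores = {"red_mug_A": 0.45, "red_mug_B": 0.45, "blue_cup": 0.10}
# ...but only one of them lies inside the pointing cone.
gesture_scores = {"red_mug_A": 0.80, "red_mug_B": 0.15, "blue_cup": 0.05}

fused = fuse(language_scores, gesture_scores)
target = max(fused, key=fused.get)
```

The point of the example: either channel alone leaves ambiguity, but their product concentrates belief on the single object consistent with both what was said and where the person pointed.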

In lab experiments, the researchers asked a quadruped robot to find various objects scattered around the lab space. The experiments showed that the robot was able to locate the correct object nearly 90% of the time using combined gesture and language, far better than using either input alone.

For He and her coauthors, the research is a step toward robots that are able to operate side-by-side with people at home and in the workplace.

“The framework we developed helps pave the way for seamless multimodal human-robot interaction,” said research co-author Jason Liu, a postdoctoral researcher at MIT who worked on the project while completing his Ph.D. at Brown. “In the future, we can communicate with our assistant robots the same way people interact through language, gestures, eye gazes, demonstrations and much more.”

The work was supported through Brown’s AI Research Institute on Interaction for AI Assistants (ARIA), which is funded by the National Science Foundation.

“This is a really great illustration of how we can enable more natural and effective human-machine interaction by strengthening collaborations between computer science and cognitive science,” said Ellie Pavlick, an associate professor of computer science at Brown who leads ARIA. “Embracing what we know about how humans naturally want to communicate, and building systems aligned with those human tendencies and intuitions about behavior, is the right way forward.”

Funding: The work was supported by the National Science Foundation (2433429) and the Long-Term Autonomy for Ground and Aquatic Robotics program (GR5250131), and by the Office of Naval Research (N0001424-1-2784, N0001424-1-2603).

Key Questions Answered:

Q: Why do robots need to study dogs to find my keys?

A: Because humans are actually very “messy” communicators! We point vaguely or use words like “that thing over there.” Dogs have evolved over thousands of years to be experts at reading our body language and gaze. By teaching a robot to look at your eyes, elbow, and wrist—just like a dog does—scientists can help it understand exactly what “that thing” is, even in a messy room.

Q: Does the robot actually “think” like a dog?

A: Not exactly. It uses a mathematical framework called a POMDP. While a dog uses instinct, the robot uses probability. It calculates how confident it is about an object’s location. If its confidence is low, it “thinks” like a searcher: it will move around, peek under a table, or look from a different angle to gather more data before it makes its final choice.

Q: Is this just for fetching toys, or can it do real work?

A: This is a huge step toward robots in the home and workplace. Whether it’s a robot assistant in a hospital fetching a specific surgical tool or a home robot helping someone with mobility issues, being able to understand gestures means we don’t have to give perfectly programmed voice commands for every single task.


About this AI and robotics research news

Author: Kevin Stacey
Source: Brown University
Contact: Kevin Stacey – Brown University
Image: The image is credited to Neuroscience News

Original Research: The findings will be presented at the ACM/IEEE International Conference on Human-Robot Interaction (HRI)
