MIT Gives Robots a Spatial Memory Grounded in Reality

DAAAM links language, vision and three-dimensional maps.

CAMBRIDGE, United States | June 2026

Researchers at the Massachusetts Institute of Technology have developed a robotic memory system that allows machines to remember where objects are located and answer questions about their surroundings in natural language. The framework, called DAAAM, combines computer vision, three-dimensional mapping and large language models. Its name stands for Describe Anything, Anywhere, at Any Moment. The system is designed to help robots reason about space and time in a way that more closely resembles human memory.

A person working in a factory can often remember where a partially assembled component was left the previous evening. A robot operating in the same environment may recognize the object when it sees it, but struggle to connect that observation with a location and an earlier moment. DAAAM is intended to close that gap by building a language-based memory anchored to real sensor data. The robot can then respond to requests such as finding the component that was being assembled the night before.

The research was led by MIT graduate student Nicolas Gorlo and Lukas Schmid, a former MIT research scientist who is now a professor at the Nuremberg University of Technology. The work was supervised by Luca Carlone, an associate professor in MIT’s Department of Aeronautics and Astronautics and director of the SPARK Laboratory. The team presented the system at the Conference on Computer Vision and Pattern Recognition. The technical findings are also available as a preprint.

Traditional robotic maps are good at representing geometry but often contain little semantic detail. They may identify walls, rooms and obstacles without explaining what individual objects are or why they matter. Multimodal vision models can describe images with greater richness, but they usually process observations individually and may require too much computation for large environments. DAAAM combines these two capabilities within one structured memory.

As a robot moves through a space, the system detects objects and attaches detailed descriptions to them. It may record that a particular building is MIT’s Stata Center, identify its architectural characteristics and register nearby objects. In a bicycle parking area, it could note that five bicycles are present and that one red bicycle has a flat tire. The information is stored inside a three-dimensional map organized by spatial regions.

This structure allows the robot to connect descriptive and spatial information. It does not merely know that a damaged red bicycle exists, but also remembers that it was observed next to the Stata Center. That distinction is essential for useful interaction because human questions frequently combine objects, places and time. A robot must handle all three to give a useful answer.
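A rough way to picture such a memory is a map whose regions each hold a list of described objects. The Python sketch below is illustrative only and is not DAAAM's actual data model: the class names `ObjectMemory` and `RegionMap`, the coordinate units and the timestamps are all assumptions made for this example, and plain keyword matching stands in for any real language understanding.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ObjectMemory:
    """One remembered object: what it is, where it was, and when it was seen."""
    description: str
    position: tuple[float, float, float]  # assumed: metres in the map frame
    seen_at: float                        # assumed: timestamp in seconds


@dataclass
class RegionMap:
    """Hypothetical container that groups object memories by named region."""
    regions: dict[str, list[ObjectMemory]] = field(
        default_factory=lambda: defaultdict(list)
    )

    def add(self, region: str, obj: ObjectMemory) -> None:
        self.regions[region].append(obj)

    def find(self, region: str, keyword: str) -> list[ObjectMemory]:
        """Return objects in `region` whose description mentions `keyword`."""
        return [
            obj for obj in self.regions.get(region, [])
            if keyword.lower() in obj.description.lower()
        ]


memory = RegionMap()
memory.add("Stata Center",
           ObjectMemory("red bicycle with a flat tire", (12.0, 3.5, 0.0), 1000.0))
memory.add("Stata Center",
           ObjectMemory("blue bicycle", (12.8, 3.5, 0.0), 1002.0))

# Description, place and time come back together in one record.
hits = memory.find("Stata Center", "red bicycle")
```

Because every record carries a description, a position and a timestamp, a single lookup can answer what the object was, where it stood and when it was seen.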

DAAAM also addresses the speed limitations of previous systems. Earlier approaches capable of producing detailed descriptions could require several seconds to annotate only a few objects. That pace is inadequate when a robot may encounter hundreds of items during a short period of exploration. A practical memory system must process information continuously without interrupting navigation or delaying decisions.

The MIT framework groups nearby objects and uses an optimization method to select key frames from the robot’s camera feed. These are images offering the clearest view of several objects at the same time. The system can then describe multiple elements in parallel instead of analyzing each object through a separate image. This approach makes the process up to ten times faster in some conditions.
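The article does not specify which optimization method the team uses. As a stand-in, the sketch below uses a simple greedy set-cover heuristic: it keeps picking the frame that shows the most objects not yet covered. The function name `select_key_frames`, the frame identifiers and the object labels are hypothetical, and a real system would also have to weigh how clearly each object is visible.

```python
def select_key_frames(visible: dict[str, set[str]]) -> list[str]:
    """Greedily pick frames until every object appears in a chosen frame.

    `visible` maps a frame id to the set of object ids seen in that frame.
    This is a greedy set-cover heuristic, not the method used by DAAAM.
    """
    remaining = set().union(*visible.values())
    chosen = []
    while remaining:
        # Sort the frame ids so that ties are broken deterministically.
        best = max(sorted(visible), key=lambda f: len(visible[f] & remaining))
        chosen.append(best)
        remaining -= visible[best]
    return chosen


# Assumed toy data: four frames covering four objects.
visible = {
    "frame_01": {"bike_a", "bike_b"},
    "frame_02": {"bike_b", "bike_c", "sign"},
    "frame_03": {"sign"},
    "frame_04": {"bike_a"},
}
key_frames = select_key_frames(visible)
print(key_frames)  # ['frame_02', 'frame_01']
```

In this toy case two frames cover all four objects, so a description model would run on two images rather than four.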

Each object is annotated only once, reducing repeated computation as the robot continues moving through the environment. Organizing objects by region also makes the memory easier to search. In comparative testing, DAAAM achieved accuracy improvements ranging from 21 to 53 percent over previous methods. Those gains suggest that the combination of language and structured mapping can produce more reliable spatial reasoning.

Once the map has been created, the system must retrieve information efficiently from a potentially enormous database. DAAAM uses a large language model equipped with several search and data-retrieval tools. The model can decide whether a question requires semantic search, location-based retrieval or another method. This tool-based architecture is intended to reduce hallucinations by forcing the model to consult stored observations rather than inventing an answer.

A person might ask the robot about a sculpture seen near a university building. The system could search for objects described as sculptures or locate all items registered near that building. It can then combine those results and produce an answer within a few seconds. The response remains grounded in the robot’s own recorded experience.
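The sculpture example can be sketched as two small retrieval tools whose results are intersected. This is a simplified illustration, not DAAAM's implementation: the stored observations, the landmark coordinates, the function names and the five-metre radius are all invented for the example, keyword matching stands in for semantic search, and the tools are called directly where the real system would let a language model choose them.

```python
import math

# Assumed toy memory: each observation has an id, a description and a 2D position.
OBSERVATIONS = [
    {"id": 1, "description": "bronze sculpture of a figure", "position": (10.0, 2.0)},
    {"id": 2, "description": "bicycle rack", "position": (11.0, 3.0)},
    {"id": 3, "description": "steel sculpture", "position": (80.0, 40.0)},
]
LANDMARKS = {"university building": (12.0, 2.0)}


def search_by_description(keyword: str) -> list[dict]:
    """Tool 1: keyword match, standing in for semantic search."""
    return [o for o in OBSERVATIONS if keyword in o["description"]]


def search_near(landmark: str, radius: float) -> list[dict]:
    """Tool 2: every observation within `radius` of a known landmark."""
    center = LANDMARKS[landmark]
    return [o for o in OBSERVATIONS if math.dist(o["position"], center) <= radius]


def answer(keyword: str, landmark: str, radius: float = 5.0) -> list[dict]:
    """Combine both tools: matching objects that were also seen near the landmark."""
    near_ids = {o["id"] for o in search_near(landmark, radius)}
    return [o for o in search_by_description(keyword) if o["id"] in near_ids]


result = answer("sculpture", "university building")
print([o["id"] for o in result])  # [1]
```

If nothing stored matches, the function returns an empty list, which mirrors the idea of consulting recorded observations instead of inventing an answer.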

Carlone has compared the concept to a conversational AI system connected directly to the physical world. Instead of answering only from text-based training data, the robot responds through memories created from cameras, sensors and maps. It could answer a question such as where someone left a wallet because it previously observed the wallet and registered its location. The objective is not human consciousness, but a practical form of spatiotemporal recall.

The technology could have applications beyond industrial robotics. Augmented-reality systems may use similar memory structures to guide maintenance technicians toward damaged equipment or previously detected anomalies. Navigation tools could help travelers and pedestrians understand unfamiliar environments through natural-language questions. Domestic robots could also locate frequently misplaced objects or remember changes inside a home.

The research team now wants DAAAM to record events as well as static objects. A future version could remember that a door was opened, a machine stopped functioning or a package was moved from one area to another. Researchers also plan to include confidence levels in the system’s answers. This would allow a robot to distinguish between a certain observation and an incomplete or ambiguous memory.

Significant challenges remain before this type of memory becomes common in general-purpose robots. Large environments generate enormous quantities of visual and spatial data, making long-term storage and retrieval difficult. Privacy also becomes a concern when machines continuously record homes, workplaces or public spaces. Developers will need rules determining what should be remembered, how long it should be retained and who may access it.

DAAAM nevertheless represents an important change in how robotic intelligence is designed. Instead of treating perception, mapping and language as separate functions, it connects them through a shared memory. The robot does not simply see an object or label it, but remembers where and when it appeared. That capability could move machines closer to becoming useful partners able to understand the same physical context as the people around them.

Memory becomes intelligence when it remains connected to reality.
