
    From Text to Trajectory: How MIT’s AI Masters Language-Guided Navigation

    By Adam Zewe, Massachusetts Institute of Technology, August 17, 2024
    MIT researchers have created an AI navigation method that utilizes language to guide robots, employing a large language model to process textual descriptions of visual scenes and generate navigation steps, simplifying the training process and enhancing adaptability to different environments.

    Researchers from MIT and the MIT-IBM Watson AI Lab have developed a novel AI navigation method that converts visual data into language descriptions to aid robots in navigating complex tasks.

    This approach uses a large language model to generate synthetic training data and make navigation decisions based on language inputs. Although not outperforming visual-based models, it offers the advantage of being less resource-intensive and easier to adapt to various tasks and environments.

    Someday, you may want your home robot to carry a load of dirty clothes downstairs and deposit them in the washing machine in the far-left corner of the basement. The robot will need to combine your instructions with its visual observations to determine the steps it should take to complete this task.

    For an AI agent, this is easier said than done. Current approaches often rely on multiple hand-crafted machine-learning models, each tackling a different part of the task, and these take a great deal of human effort and expertise to build. Because such methods use visual representations to make navigation decisions directly, they demand massive amounts of visual training data, which are often hard to come by.

    Integrating Language Models for Enhanced Navigation

    To overcome these challenges, researchers from MIT and the MIT-IBM Watson AI Lab devised a navigation method that converts visual representations into pieces of language, which are then fed into one large language model that achieves all parts of the multistep navigation task.

    Rather than encoding visual features from images of a robot’s surroundings as visual representations, which is computationally intensive, their method creates text captions that describe the robot’s point of view. A large language model uses the captions to predict the actions a robot should take to fulfill a user’s language-based instructions.

    Because their method utilizes purely language-based representations, they can use a large language model to efficiently generate a huge amount of synthetic training data.

    While this approach does not outperform techniques that use visual features, it performs well in situations that lack enough visual data for training. The researchers found that combining their language-based inputs with visual signals leads to better navigation performance.

    “By purely using language as the perceptual representation, ours is a more straightforward approach. Since all the inputs can be encoded as language, we can generate a human-understandable trajectory,” says Bowen Pan, an electrical engineering and computer science (EECS) graduate student and lead author of a paper on this approach.

    Pan’s co-authors include his advisor, Aude Oliva, director of strategic industry engagement at the MIT Schwarzman College of Computing, MIT director of the MIT-IBM Watson AI Lab, and a senior research scientist in the Computer Science and Artificial Intelligence Laboratory (CSAIL); Phillip Isola, an associate professor of EECS and a member of CSAIL; senior author Yoon Kim, an assistant professor of EECS and a member of CSAIL; and others at the MIT-IBM Watson AI Lab and Dartmouth College. The research will be presented at the Conference of the North American Chapter of the Association for Computational Linguistics.

    Solving a Vision Problem With Language

    Because large language models are among the most powerful machine-learning models available, the researchers sought to incorporate them into the complex task known as vision-and-language navigation, Pan says.

    However, such models take text-based inputs and can’t process visual data from a robot’s camera. So, the team needed to find a way to use language instead.

    Their technique utilizes a simple captioning model to obtain text descriptions of a robot’s visual observations. These captions are combined with language-based instructions and fed into a large language model, which decides what navigation step the robot should take next.

    The large language model outputs a caption of the scene the robot should see after completing that step. This is used to update the trajectory history so the robot can keep track of where it has been.

    Designing User-Friendly AI Navigation

    The model repeats these processes to generate a trajectory that guides the robot to its goal, one step at a time.
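
    The caption, decide, and update cycle described above can be sketched as a short loop. This is only an illustrative sketch, not the researchers' implementation: the names build_prompt, navigate, get_caption, decide, and execute are hypothetical, and the captioning model and the large language model are stood in for by plain Python callables.

```python
from typing import Callable, List, Tuple

STOP = "stop"  # hypothetical action name that ends the episode


def build_prompt(instruction: str, history: List[str], caption: str) -> str:
    """Combine the instruction, trajectory history, and current caption into one text prompt."""
    past = "\n".join(f"Step {i + 1}: {h}" for i, h in enumerate(history)) or "(none)"
    return (
        f"Instruction: {instruction}\n"
        f"Trajectory so far:\n{past}\n"
        f"Current view: {caption}\n"
        "Next action:"
    )


def navigate(
    instruction: str,
    get_caption: Callable[[], str],              # stand-in for camera + captioning model
    decide: Callable[[str], Tuple[str, str]],    # stand-in for the LLM: prompt -> (action, expected view)
    execute: Callable[[str], None],              # stand-in for the robot's motion controller
    max_steps: int = 20,
) -> List[str]:
    """Run the loop one step at a time and return the text trajectory history."""
    history: List[str] = []
    for _ in range(max_steps):
        caption = get_caption()
        action, expected_view = decide(build_prompt(instruction, history, caption))
        if action == STOP:
            break
        execute(action)
        # The predicted caption of the next scene is what gets recorded in the history.
        history.append(f"{action} -> {expected_view}")
    return history


# Usage with scripted stand-ins instead of real models.
views = iter(["a hallway with a staircase ahead", "a basement with a washing machine"])
script = iter([("go down the stairs", "a basement with a washing machine"), (STOP, "")])
trajectory = navigate(
    "Carry the laundry to the washing machine in the basement.",
    get_caption=lambda: next(views),
    decide=lambda prompt: next(script),
    execute=lambda action: None,
)
print(trajectory)
```

    Because every entry in the history is plain text, the whole trajectory can be read back by a person, which is the property Pan highlights.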

    To streamline the process, the researchers designed templates so observation information is presented to the model in a standard form — as a series of choices the robot can make based on its surroundings.

    For instance, a caption might say “to your 30-degree left is a door with a potted plant beside it, to your back is a small office with a desk and a computer,” etc. The model chooses whether the robot should move toward the door or the office.
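
    A template of this kind might look like the following sketch. The wording of the phrases follows the example caption above, but the function names, the numbered-choice layout, and the convention that negative headings mean "left" are assumptions made for illustration, not the paper's actual template.

```python
from typing import List, Tuple


def describe_direction(heading_deg: int) -> str:
    """Turn a relative heading in degrees (negative = left, an assumed convention) into a phrase."""
    heading = ((heading_deg + 180) % 360) - 180  # normalize to the range [-180, 180)
    if heading == 0:
        return "straight ahead"
    if heading == -180:
        return "to your back"
    side = "left" if heading < 0 else "right"
    return f"to your {abs(heading)}-degree {side}"


def render_observation(views: List[Tuple[int, str]]) -> str:
    """Present each (heading, caption) pair as one numbered choice in a fixed format."""
    lines = [
        f"{i}. {describe_direction(heading)} is {caption}"
        for i, (heading, caption) in enumerate(views, start=1)
    ]
    return "\n".join(lines)


print(render_observation([
    (-30, "a door with a potted plant beside it"),
    (180, "a small office with a desk and a computer"),
]))
```

    Keeping the format fixed means the model always sees its options in the same shape, so it only has to pick one of the numbered choices.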

    “One of the biggest challenges was figuring out how to encode this kind of information into language in a proper way to make the agent understand what the task is and how they should respond,” Pan says.

    Advantages of Language

    In tests, the approach could not outperform vision-based techniques, but the researchers found that it offered several advantages.

    First, because text requires fewer computational resources to synthesize than complex image data, their method can be used to rapidly generate synthetic training data. In one test, they generated 10,000 synthetic trajectories based on 10 real-world, visual trajectories.
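
    One plausible way to expand a handful of text trajectories into many is few-shot prompting, sketched below. The article does not describe the exact procedure, so everything here is an assumption for illustration: the function names, the prompt wording, the number of examples per prompt, and the generate callable that stands in for a large language model.

```python
import random
from typing import Callable, List


def make_fewshot_prompt(examples: List[str]) -> str:
    """Show a few real trajectories, then ask for one new trajectory in the same format."""
    shots = "\n\n".join(f"Example {i}:\n{t}" for i, t in enumerate(examples, start=1))
    return f"{shots}\n\nWrite one new navigation trajectory in the same format:"


def synthesize(
    seeds: List[str],
    generate: Callable[[str], str],  # stand-in for a call to a large language model
    n_new: int,
    shots_per_prompt: int = 3,       # assumed value, not taken from the paper
    rng_seed: int = 0,
) -> List[str]:
    """Sample seed trajectories as in-context examples and collect n_new generated ones."""
    rng = random.Random(rng_seed)
    synthetic: List[str] = []
    for _ in range(n_new):
        examples = rng.sample(seeds, k=min(shots_per_prompt, len(seeds)))
        synthetic.append(generate(make_fewshot_prompt(examples)))
    return synthetic


# Usage with a dummy generator; a real run would call a language model instead.
seed_trajectories = [f"real trajectory {i}" for i in range(10)]
new_data = synthesize(seed_trajectories, lambda prompt: "generated trajectory", n_new=5)
print(len(new_data))
```

    Since both the inputs and outputs are text, scaling n_new up only costs more language-model calls, with no image rendering involved.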

    The technique can also bridge the gap that can prevent an agent trained with a simulated environment from performing well in the real world. This gap often occurs because computer-generated images can appear quite different from real-world scenes due to elements like lighting or color. But language that describes a synthetic versus a real image would be much harder to tell apart, Pan says.

    Also, the representations their model uses are easier for a human to understand because they are written in natural language.

    “If the agent fails to reach its goal, we can more easily determine where it failed and why it failed. Maybe the history information is not clear enough or the observation ignores some important details,” Pan says.

    In addition, their method could be applied more easily to varied tasks and environments because it uses only one type of input. As long as data can be encoded as language, they can use the same model without making any modifications.

    But one disadvantage is that their method naturally loses some information that would be captured by vision-based models, such as depth information.

    However, the researchers were surprised to see that combining language-based representations with vision-based methods improves an agent’s ability to navigate.

    “Maybe this means that language can capture some higher-level information that cannot be captured with pure vision features,” Pan says.

    This is one area the researchers want to continue exploring. They also want to develop a navigation-oriented captioner that could boost the method’s performance. In addition, they want to probe the ability of large language models to exhibit spatial awareness and see how this could aid language-based navigation.

    Reference: “LangNav: Language as a Perceptual Representation for Navigation” by Bowen Pan, Rameswar Panda, SouYoung Jin, Rogerio Feris, Aude Oliva, Phillip Isola and Yoon Kim, 30 March 2024, Computer Science > Computer Vision and Pattern Recognition.
    arXiv:2310.07889

    This research is funded, in part, by the MIT-IBM Watson AI Lab.
