AI2-THOR Publications

180 Publications, 529 Unique Authors
AI2-THOR: An Interactive 3D Environment for Visual AI
Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, Aniruddha Kembhavi, Abhinav Gupta, Ali Farhadi ArXiv 2017
We introduce The House Of inteRactions (THOR), a framework for visual AI research, available at this http URL. AI2-THOR consists of near photo-realistic 3D indoor scenes, where AI agents can navigate in the scenes and interact with objects to perform tasks. AI2-THOR enables research in many different domains including but not limited to deep reinforcement learning, imitation learning, learning by interaction, planning, visual question answering, unsupervised representation learning, object detection and segmentation, and learning models of cognition. The goal of AI2-THOR is to facilitate building visually intelligent models and push the research forward in this domain.
RoboTHOR: An Open Simulation-to-Real Embodied AI Platform
Visual recognition ecosystems (e.g. ImageNet, Pascal, COCO) have undeniably played a prevailing role in the evolution of modern computer vision. We argue that interactive and embodied visual AI has reached a stage of development similar to visual recognition prior to the advent of these ecosystems. Recently, various synthetic environments have been introduced to facilitate research in embodied AI. Notwithstanding this progress, the crucial question of how well models trained in simulation generalize to reality has remained largely unanswered. The creation of a comparable ecosystem for simulation-to-real embodied AI presents many challenges: (1) the inherently interactive nature of the problem, (2) the need for tight alignments between real and simulated worlds, (3) the difficulty of replicating physical conditions for repeatable experiments, and (4) the associated cost. In this paper, we introduce RoboTHOR to democratize research in interactive and embodied visual AI. RoboTHOR offers a framework of simulated environments paired with physical counterparts to systematically explore and overcome the challenges of simulation-to-real transfer, and a platform where researchers across the globe can remotely test their embodied models in the physical world. As a first benchmark, our experiments show there exists a significant gap between the performance of models trained in simulation when they are tested in both simulations and their carefully constructed physical analogs. We hope that RoboTHOR will spur the next stage of evolution in embodied computer vision.
ManipulaTHOR: A Framework for Visual Object Manipulation
The domain of Embodied AI has recently witnessed substantial progress, particularly in navigating agents within their environments. These early successes have laid the building blocks for the community to tackle tasks that require agents to actively interact with objects in their environment. Object manipulation is an established research domain within the robotics community and poses several challenges including manipulator motion, grasping and long-horizon planning, particularly when dealing with oft-overlooked practical setups involving visually rich and complex scenes, manipulation using mobile agents (as opposed to tabletop manipulation), and generalization to unseen environments and objects. We propose a framework for object manipulation built upon the physics-enabled, visually rich AI2-THOR framework and present a new challenge to the Embodied AI community known as ArmPointNav. This task extends the popular point navigation task [2] to object manipulation and offers new challenges including 3D obstacle avoidance, manipulating objects in the presence of occlusion, and multi-object manipulation that necessitates long term planning. Popular learning paradigms that are successful on PointNav challenges show promise, but leave a large room for improvement.
ProcTHOR: Large-Scale Embodied AI Using Procedural Generation
Massive datasets and high-capacity models have driven many recent advancements in computer vision and natural language understanding. This work presents a platform to enable similar success stories in Embodied AI. We propose PROCTHOR, a framework for procedural generation of Embodied AI environments. PROCTHOR enables us to sample arbitrarily large datasets of diverse, interactive, customizable, and performant virtual environments to train and evaluate embodied agents across navigation, interaction, and manipulation tasks. We demonstrate the power and potential of PROCTHOR via a sample of 10,000 generated houses and a simple neural model. Models trained using only RGB images on PROCTHOR, with no explicit mapping and no human task supervision produce state-of-the-art results across 6 embodied AI benchmarks for navigation, rearrangement, and arm manipulation, including the presently running Habitat 2022, AI2-THOR Rearrangement 2022, and RoboTHOR challenges. We also demonstrate strong 0-shot results on these benchmarks, via pre-training on PROCTHOR with no fine-tuning on the downstream benchmark, often beating previous state-of-the-art systems that access the downstream training data.
Target-driven visual navigation in indoor scenes using deep reinforcement learning
Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J. Lim, Abhinav Gupta, Li Fei-Fei, Ali Farhadi IEEE International Conference on Robotics and Automation 2016
Two less addressed issues of deep reinforcement learning are (1) lack of generalization capability to new goals, and (2) data inefficiency, i.e., the model requires several (and often costly) episodes of trial and error to converge, which makes it impractical to be applied to real-world scenarios. In this paper, we address these two issues and apply our model to target-driven visual navigation. To address the first issue, we propose an actor-critic model whose policy is a function of the goal as well as the current state, which allows better generalization. To address the second issue, we propose the AI2-THOR framework, which provides an environment with high-quality 3D scenes and a physics engine. Our framework enables agents to take actions and interact with objects. Hence, we can collect a huge number of training samples efficiently. We show that our proposed method (1) converges faster than the state-of-the-art deep reinforcement learning methods, (2) generalizes across targets and scenes, (3) generalizes to a real robot scenario with a small amount of fine-tuning (although the model is trained in simulation), (4) is end-to-end trainable and does not need feature engineering, feature matching between frames or 3D reconstruction of the environment.
IQA: Visual Question Answering in Interactive Environments
We introduce Interactive Question Answering (IQA), the task of answering questions that require an autonomous agent to interact with a dynamic visual environment. IQA presents the agent with a scene and a question, like: "Are there any apples in the fridge?" The agent must navigate around the scene, acquire visual understanding of scene elements, interact with objects (e.g. open refrigerators) and plan for a series of actions conditioned on the question. Popular reinforcement learning approaches with a single controller perform poorly on IQA owing to the large and diverse state space. We propose the Hierarchical Interactive Memory Network (HIMN), consisting of a factorized set of controllers, allowing the system to operate at multiple levels of temporal abstraction. To evaluate HIMN, we introduce IQUAD V1, a new dataset built upon AI2-THOR [35], a simulated photo-realistic environment of configurable indoor scenes with interactive objects. IQUAD V1 has 75,000 questions, each paired with a unique scene configuration. Our experiments show that our proposed model outperforms popular single controller based methods on IQUAD V1. For sample questions and results, please view our video:
ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
We present ALFRED (Action Learning From Realistic Environments and Directives), a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. ALFRED includes long, compositional tasks with non-reversible state changes to shrink the gap between research benchmarks and real-world applications. ALFRED consists of expert demonstrations in interactive visual environments for 25k natural language directives. These directives contain both high-level goals like “Rinse off a mug and place it in the coffee maker.” and low-level language instructions like “Walk to the coffee maker on the right.” ALFRED tasks are more complex in terms of sequence length, action space, and language than existing vision-and-language task datasets. We show that a baseline model based on recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.
Visual Semantic Navigation using Scene Priors
Wei Yang, Xiaolong Wang, Ali Farhadi, Abhinav Gupta, Roozbeh Mottaghi International Conference on Learning Representations 2018
How do humans navigate to target objects in novel scenes? Do we use the semantic/functional priors we have built over years to efficiently search and navigate? For example, to search for mugs, we search cabinets near the coffee machine and for fruits we try the fridge. In this work, we focus on incorporating semantic priors in the task of semantic navigation. We propose to use Graph Convolutional Networks for incorporating the prior knowledge into a deep reinforcement learning framework. The agent uses the features from the knowledge graph to predict the actions. For evaluation, we use the AI2-THOR framework. Our experiments show how semantic knowledge improves performance significantly. More importantly, we show improvement in generalization to unseen scenes and/or objects. The supplementary video can be accessed at the following link: this https URL.
Visual Semantic Planning Using Deep Successor Representations
Yuke Zhu, Daniel Gordon, Eric Kolve, Dieter Fox, Li Fei-Fei, Abhinav Gupta, Roozbeh Mottaghi, Ali Farhadi IEEE International Conference on Computer Vision 2017
A crucial capability of real-world intelligent agents is their ability to plan a sequence of actions to achieve their goals in the visual world. In this work, we address the problem of visual semantic planning: the task of predicting a sequence of actions from visual observations that transform a dynamic environment from an initial state to a goal state. Doing so entails knowledge about objects and their affordances, as well as actions and their preconditions and effects. We propose learning these through interacting with a visual and dynamic environment. Our proposed solution involves bootstrapping reinforcement learning with imitation learning. To ensure cross task generalization, we develop a deep predictive model based on successor representations. Our experimental results show near optimal results across a wide range of tasks in the challenging THOR environment.
Learning to Learn How to Learn: Self-Adaptive Visual Navigation Using Meta-Learning
Mitchell Wortsman, Kiana Ehsani, Mohammad Rastegari, Ali Farhadi, Roozbeh Mottaghi Computer Vision and Pattern Recognition 2019
Learning is an inherently continuous phenomenon. When humans learn a new task there is no explicit distinction between training and inference. As we learn a task, we keep learning about it while performing the task. What we learn and how we learn it varies during different stages of learning. Learning how to learn and adapt is a key property that enables us to generalize effortlessly to new settings. This is in contrast with conventional settings in machine learning where a trained model is frozen during inference. In this paper we study the problem of learning to learn at both training and test time in the context of visual navigation. A fundamental challenge in navigation is generalization to unseen scenes. In this paper we propose a self-adaptive visual navigation method (SAVN) which learns to adapt to new environments without any explicit supervision. Our solution is a meta-reinforcement learning approach where an agent learns a self-supervised interaction loss that encourages effective navigation. Our experiments, performed in the AI2-THOR framework, show major improvements in both success rate and SPL for visual navigation in novel scenes. Our code and data are available at:
Neural Task Graphs: Generalizing to Unseen Tasks From a Single Video Demonstration
De-An Huang, Suraj Nair, Danfei Xu, Yuke Zhu, Animesh Garg, Li Fei-Fei, Silvio Savarese, Juan Carlos Niebles Computer Vision and Pattern Recognition 2019
Our goal is to generate a policy to complete an unseen task given just a single video demonstration of the task in a given domain. We hypothesize that to successfully generalize to unseen complex tasks from a single video demonstration, it is necessary to explicitly incorporate the compositional structure of the tasks into the model. To this end, we propose Neural Task Graph (NTG) Networks, which use a conjugate task graph as the intermediate representation to modularize both the video demonstration and the derived policy. We empirically show NTG achieves inter-task generalization on two complex tasks: Block Stacking in BulletPhysics and Object Collection in AI2-THOR. NTG improves data efficiency with visual input and achieves strong generalization without the need for dense hierarchical supervision. We further show that similar performance trends hold when applied to real-world data. We show that NTG can effectively predict task structure on the JIGSAWS surgical dataset and generalize to unseen tasks.
SeGAN: Segmenting and Generating the Invisible
Objects often occlude each other in scenes; inferring their appearance beyond their visible parts plays an important role in scene understanding, depth estimation, object interaction and manipulation. In this paper, we study the challenging problem of completing the appearance of occluded objects. Doing so requires knowing which pixels to paint (segmenting the invisible parts of objects) and what color to paint them (generating the invisible parts). Our proposed novel solution, SeGAN, jointly optimizes for both segmentation and generation of the invisible parts of objects. Our experimental results show that: (a) SeGAN can learn to generate the appearance of the occluded parts of objects; (b) SeGAN outperforms state-of-the-art segmentation baselines for the invisible parts of objects; (c) trained on synthetic photo realistic images, SeGAN can reliably segment natural images; (d) by reasoning about occluder-occludee relations, our method can infer depth layering.
Rearrangement: A Challenge for Embodied AI
We describe a framework for research and evaluation in Embodied AI. Our proposal is based on a canonical task: Rearrangement. A standard task can focus the development of new techniques and serve as a source of trained models that can be transferred to other settings. In the rearrangement task, the goal is to bring a given physical environment into a specified state. The goal state can be specified by object poses, by images, by a description in language, or by letting the agent experience the environment in the goal state. We characterize rearrangement scenarios along different axes and describe metrics for benchmarking rearrangement performance. To facilitate research and exploration, we present experimental testbeds of rearrangement scenarios in four different simulation environments. We anticipate that other datasets will be released and new simulation platforms will be built to support training of rearrangement agents and their deployment on physical systems.
Look, Listen, and Act: Towards Audio-Visual Embodied Navigation
Chuang Gan, Yiwei Zhang, Jiajun Wu, Boqing Gong, Joshua B. Tenenbaum IEEE International Conference on Robotics and Automation 2020
A crucial ability of mobile intelligent agents is to integrate the evidence from multiple sensory inputs in an environment and to make a sequence of actions to reach their goals. In this paper, we attempt to approach the problem of Audio-Visual Embodied Navigation, the task of planning the shortest path from a random starting location in a scene to the sound source in an indoor environment, given only raw egocentric visual and audio sensory data. To accomplish this task, the agent is required to learn from various modalities, i.e., relating the audio signal to the visual environment. Here we describe an approach to audio-visual embodied navigation that takes advantage of both visual and audio pieces of evidence. Our solution is based on three key ideas: a visual perception mapper module that constructs its spatial memory of the environment, a sound perception module that infers the relative location of the sound source from the agent, and a dynamic path planner that plans a sequence of actions based on the audio-visual observations and the spatial memory of the environment to navigate toward the goal. Experimental results on a newly collected Visual-Audio-Room dataset using the simulated multi-modal environment demonstrate the effectiveness of our approach over several competitive baselines.
Episodic Transformer for Vision-and-Language Navigation
Alexander Pashevich, Cordelia Schmid, Chen Sun IEEE International Conference on Computer Vision 2021
Interaction and navigation defined by natural language instructions in dynamic environments pose significant challenges for neural agents. This paper focuses on addressing two challenges: handling long sequences of subtasks, and understanding complex human instructions. We propose Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions. To improve training, we leverage synthetic instructions as an intermediate representation that decouples understanding the visual appearance of an environment from the variations of natural language instructions. We demonstrate that encoding the history with a transformer is critical to solve compositional tasks, and that pretraining and joint training with synthetic instructions further improve the performance. Our approach sets a new state of the art on the challenging ALFRED benchmark, achieving 38.4% and 8.5% task success rates on seen and unseen test splits.
Shifting the Baseline: Single Modality Performance on Visual Navigation & QA
Jesse Thomason, Daniel Gordon, Yonatan Bisk North American Chapter of the Association for Computational Linguistics 2019
We demonstrate the surprising strength of unimodal baselines in multimodal domains, and make concrete recommendations for best practices in future research. Where existing work often compares against random or majority class baselines, we argue that unimodal approaches better capture and reflect dataset biases and therefore provide an important comparison when assessing the performance of multimodal techniques. We present unimodal ablations on three recent datasets in visual navigation and QA, seeing an up to 29% absolute gain in performance over published baselines.
ALFWorld: Aligning Text and Embodied Environments for Interactive Learning
Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, Matthew Hausknecht International Conference on Learning Representations 2021
Given a simple request (e.g., Put a washed apple in the kitchen fridge), humans can reason in purely abstract terms by imagining action sequences and scoring their likelihood of success, prototypicality, and efficiency, all without moving a muscle. Once we see the kitchen in question, we can update our abstract plans to fit the scene. Embodied agents require the same abilities, but existing work does not yet provide the infrastructure necessary for both reasoning abstractly and executing concretely. We address this limitation by introducing ALFWorld, a simulator that enables agents to learn abstract, text-based policies in TextWorld (Cote et al., 2018) and then execute goals from the ALFRED benchmark (Shridhar et al., 2020) in a rich visual environment. ALFWorld enables the creation of a new BUTLER agent whose abstract knowledge, learned in TextWorld, corresponds directly to concrete, visually grounded actions. In turn, as we demonstrate empirically, this fosters better agent generalization than training only in the visually grounded environment. BUTLER's simple, modular design factors the problem to allow researchers to focus on models for improving every piece of the pipeline (language understanding, planning, navigation, visual scene understanding, and so forth).
Two Body Problem: Collaborative Visual Task Completion
Collaboration is a necessary skill to perform tasks that are beyond one agent's capabilities. Addressed extensively in both conventional and modern AI, multi-agent collaboration has often been studied in the context of simple grid worlds. We argue that there are inherently visual aspects to collaboration which should be studied in visually rich environments. A key element in collaboration is communication that can be either explicit, through messages, or implicit, through perception of the other agents and the visual world. Learning to collaborate in a visual environment entails learning (1) to perform the task, (2) when and what to communicate, and (3) how to act based on these communications and the perception of the visual world. In this paper we study the problem of learning to collaborate directly from pixels in AI2-THOR and demonstrate the benefits of explicit and implicit modes of communication to perform visual tasks. Refer to our project page for more details:
A Persistent Spatial Semantic Representation for High-level Natural Language Instruction Execution
Valts Blukis, Chris Paxton, Dieter Fox, Animesh Garg, Yoav Artzi Conference on Robot Learning 2021
Natural language provides an accessible and expressive interface to specify long-term tasks for robotic agents. However, non-experts are likely to specify such tasks with high-level instructions, which abstract over specific robot actions through several layers of abstraction. We propose that key to bridging this gap between language and robot actions over long execution horizons are persistent representations. We propose a persistent spatial semantic representation method, and show how it enables building an agent that performs hierarchical reasoning to effectively execute long-term tasks. We evaluate our approach on the ALFRED benchmark and achieve state-of-the-art results, despite completely avoiding the commonly used step-by-step instructions. https://hlsm-alfred.
Unsupervised Reinforcement Learning of Transferable Meta-Skills for Embodied Navigation
Visual navigation is the task of training an embodied agent to intelligently navigate to a target object (e.g., television) using only visual observations. A key challenge for current deep reinforcement learning models lies in the requirement for a large amount of training data. It is exceedingly expensive to construct sufficient 3D synthetic environments annotated with the target object information. In this paper, we focus on visual navigation in the low-resource setting, where we have only a few training environments annotated with object information. We propose a novel unsupervised reinforcement learning approach to learn transferable meta-skills (e.g., bypass obstacles, go straight) from unannotated environments without any supervisory signals. The agent can then quickly adapt to visual navigation by learning a high-level master policy that combines these meta-skills when the visual-navigation-specific reward is provided. Experimental results show that our method significantly outperforms the baseline by a relative 53.34% on SPL, and further qualitative analysis demonstrates that our method learns transferable motor primitives for visual navigation.
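SPL, the metric reported in the abstract above, is commonly read as Success weighted by Path Length: each episode contributes its success indicator scaled by the ratio of the shortest-path length to the longer of the agent's path and the shortest path, and the contributions are averaged. The sketch below follows that standard definition from the embodied-navigation evaluation literature, not anything stated in the abstract itself; the function name, argument names, and sample episodes are hypothetical.

```python
from typing import Sequence


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    agent_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over episodes.

    Assumed definition: SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i),
    where S_i is the success flag, l_i the shortest-path length and
    p_i the length of the path the agent actually took.
    """
    if not (len(successes) == len(shortest_lengths) == len(agent_lengths)):
        raise ValueError("all inputs must have the same number of episodes")
    if len(successes) == 0:
        raise ValueError("at least one episode is required")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, agent_lengths):
        denom = max(taken, shortest)
        # Failed episodes contribute zero; so do degenerate zero-length ones.
        if success and denom > 0:
            total += shortest / denom
    return total / len(successes)


# Hypothetical episodes: an optimal success, a success that took twice
# the shortest path, and a failure.
example = spl([True, True, False], [5.0, 5.0, 5.0], [5.0, 10.0, 7.0])
```

Under this definition an agent is penalised both for failing and for reaching the target inefficiently, which is why SPL is usually reported alongside plain success rate.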
Visual Room Rearrangement
There has been significant recent progress in the field of Embodied AI with researchers developing models and algorithms enabling embodied agents to navigate and interact within completely unseen environments. In this paper, we propose a new dataset and baseline models for the task of Rearrangement. We particularly focus on the task of Room Rearrangement: an agent begins by exploring a room and recording objects’ initial configurations. We then remove the agent and change the poses and states (e.g., open/closed) of some objects in the room. The agent must restore the initial configurations of all objects in the room. Our dataset, named RoomR, includes 6,000 distinct rearrangement settings involving 72 different object types in 120 scenes. Our experiments show that solving this challenging interactive task that involves navigation and object interaction is beyond the capabilities of the current state-of-the-art techniques for embodied tasks and we are still very far from achieving perfect performance on these types of tasks.
AllenAct: A Framework for Embodied AI Research
The domain of Embodied AI, in which agents learn to complete tasks through interaction with their environment from egocentric observations, has experienced substantial growth with the advent of deep reinforcement learning and increased interest from the computer vision, NLP, and robotics communities. This growth has been facilitated by the creation of a large number of simulated environments (such as AI2-THOR, Habitat and CARLA), tasks (like point navigation, instruction following, and embodied question answering), and associated leaderboards. While this diversity has been beneficial and organic, it has also fragmented the community: a huge amount of effort is required to do something as simple as taking a model trained in one environment and testing it in another. This discourages good science. We introduce AllenAct, a modular and flexible learning framework designed with a focus on the unique requirements of Embodied AI research. AllenAct provides first-class support for a growing collection of embodied environments, tasks and algorithms, provides reproductions of state-of-the-art models and includes extensive documentation, tutorials, start-up code, and pre-trained models. We hope that our framework makes Embodied AI more accessible and encourages new researchers to join this exciting area. The framework can be accessed at: this https URL
Simple but Effective: CLIP Embeddings for Embodied AI
Contrastive language image pretraining (CLIP) encoders have been shown to be beneficial for a range of visual tasks from classification and detection to captioning and image manipulation. We investigate the effectiveness of CLIP visual backbones for Embodied AI tasks. We build incredibly simple baselines, named EmbCLIP, with no task specific architectures, inductive biases (such as the use of semantic maps), auxiliary tasks during training, or depth maps, yet we find that our improved baselines perform very well across a range of tasks and simulators. EmbCLIP tops the RoboTHOR ObjectNav leaderboard by a huge margin of 20 pts (Success Rate). It tops the iTHOR 1-Phase Rearrangement leaderboard, beating the next best submission, which employs Active Neural Mapping, and more than doubling the % Fixed Strict metric (0.08 to 0.17). It also beats the winners of the 2021 Habitat ObjectNav Challenge, which employ auxiliary tasks, depth maps, and human demonstrations, and those of the 2019 Habitat PointNav Challenge. We evaluate the ability of CLIP's visual representations at capturing semantic information about input observations (primitives that are useful for navigation-heavy embodied tasks) and find that CLIP's representations encode these primitives more effectively than ImageNet-pretrained backbones. Finally, we extend one of our baselines, producing an agent capable of zero-shot object navigation that can navigate to objects that were not used as targets during training. Our code and models are available at
Learning Affordance Landscapes for Interaction Exploration in 3D Environments
Tushar Nagarajan, Kristen Grauman Neural Information Processing Systems 2020
Embodied agents operating in human spaces must be able to master how their environment works: what objects can the agent use, and how can it use them? We introduce a reinforcement learning approach for exploration for interaction, whereby an embodied agent autonomously discovers the affordance landscape of a new unmapped 3D environment (such as an unfamiliar kitchen). Given an egocentric RGB-D camera and a high-level action space, the agent is rewarded for maximizing successful interactions while simultaneously training an image-based affordance segmentation model. The former yields a policy for acting efficiently in new environments to prepare for downstream interaction tasks, while the latter yields a convolutional neural network that maps image regions to the likelihood they permit each action, densifying the rewards for exploration. We demonstrate our idea with AI2-iTHOR. The results show agents can learn how to use new home environments intelligently and that it prepares them to rapidly address various downstream tasks like "find a knife and put it in the drawer." Project page: this http URL
TEACh: Task-driven Embodied Agents that Chat
Robots operating in human spaces must be able to engage in natural language interaction, both understanding and executing instructions, and using conversation to resolve ambiguity and correct mistakes. To study this, we introduce TEACh, a dataset of over 3,000 human-human, interactive dialogues to complete household tasks in simulation. A Commander with access to oracle information about a task communicates in natural language with a Follower. The Follower navigates through and interacts with the environment to complete tasks varying in complexity from "Make Coffee" to "Prepare Breakfast", asking questions and getting additional information from the Commander. We propose three benchmarks using TEACh to study embodied intelligence challenges, and we evaluate initial models' abilities in dialogue understanding, language grounding, and task execution.
FILM: Following Instructions in Language with Modular Methods
So Yeon Min, Devendra Singh Chaplot, Pradeep Ravikumar, Yonatan Bisk, Ruslan Salakhutdinov International Conference on Learning Representations 2022
Recent methods for embodied instruction following are typically trained end-to-end using imitation learning. This often requires the use of expert trajectories and low-level language instructions. Such approaches assume that neural states will integrate multimodal semantics to perform state tracking, building spatial memory, exploration, and long-term planning. In contrast, we propose a modular method with structured representations that (1) builds a semantic map of the scene and (2) performs exploration with a semantic search policy, to achieve the natural language goal. Our modular method achieves SOTA performance (24.46%) with a substantial (8.17% absolute) gap from previous work while using less data by eschewing both expert trajectories and low-level instructions. Leveraging low-level language, however, can further increase our performance (26.49%). Our findings suggest that an explicit spatial memory and a semantic search policy can provide a stronger and more general representation for state-tracking and guidance, even in the absence of expert trajectories or low-level instructions.
A Cordial Sync: Going Beyond Marginal Policies for Multi-Agent Embodied Tasks
Autonomous agents must learn to collaborate. It is not scalable to develop a new centralized agent every time a task's difficulty outpaces a single agent's abilities. While multi-agent collaboration research has flourished in gridworld-like environments, relatively little work has considered visually rich domains. Addressing this, we introduce the novel task FurnMove in which agents work together to move a piece of furniture through a living room to a goal. Unlike existing tasks, FurnMove requires agents to coordinate at every timestep. We identify two challenges when training agents to complete FurnMove: existing decentralized action sampling procedures do not permit expressive joint action policies and, in tasks requiring close coordination, the number of failed actions dominates successful actions. To confront these challenges we introduce SYNC-policies (synchronize your actions coherently) and CORDIAL (coordination loss). Using SYNC-policies and CORDIAL, our agents achieve a 58% completion rate on FurnMove, an impressive absolute gain of 25 percentage points over competitive decentralized baselines. Our dataset, code, and pretrained models are available at this https URL.
RoboCSE: Robot Common Sense Embedding
Angel Daruna, Weiyu Liu, Zsolt Kira, Sonia Chernova IEEE International Conference on Robotics and Automation 2019
Autonomous service robots require computational frameworks that allow them to generalize knowledge to new situations in a manner that models uncertainty while scaling to real-world problem sizes. The Robot Common Sense Embedding (RoboCSE) showcases a class of computational frameworks, multi-relational embeddings, that have not been leveraged in robotics to model semantic knowledge. We validate RoboCSE on a realistic home environment simulator (AI2-THOR) to measure how well it generalizes learned knowledge about object affordances, locations, and materials. Our experiments show that RoboCSE can perform prediction better than a baseline that uses pre-trained embeddings, such as Word2Vec, achieving statistically significant improvements while using orders of magnitude less memory than our Bayesian Logic Network baseline. In addition, we show that predictions made by RoboCSE are robust to significant reductions in data available for training as well as domain transfer to MatterPort3D, achieving statistically significant improvements over a baseline that memorizes training data.
Active Object Perceiver: Recognition-Guided Policy Learning for Object Searching on Mobile Robots
We study the problem of learning a navigation policy for a robot to actively search for an object of interest in an indoor environment solely from its visual inputs. While scene-driven visual navigation has been widely studied, prior efforts on learning navigation policies for robots to find objects are limited. The problem is often more challenging than target scene finding as the target objects can be very small in the view and can be in an arbitrary pose. We approach the problem from an active perceiver perspective, and propose a novel framework that integrates a deep neural network based object recognition module and a deep reinforcement learning based action prediction mechanism. To validate our method, we conduct experiments on both a simulation dataset (AI2-THOR) and a real-world environment with a physical robot. We further propose a new decaying reward function to learn the control policy specific to the object searching task. Experimental results validate the efficacy of our method, which outperforms competing methods in both average trajectory length and success rate.
Learning Object Relation Graph and Tentative Policy for Visual Navigation
Target-driven visual navigation aims at navigating an agent towards a given target based on the observation of the agent. In this task, it is critical to learn informative visual representation and robust navigation policy. Aiming to improve these two components, this paper proposes three complementary techniques, object relation graph (ORG), trial-driven imitation learning (IL), and a memory-augmented tentative policy network (TPN). ORG improves visual representation learning by integrating object relationships, including category closeness and spatial correlations, e.g., a TV usually co-occurs with a remote spatially. Both trial-driven IL and TPN underlie robust navigation policy, instructing the agent to escape from deadlock states, such as looping or being stuck. Specifically, trial-driven IL is a type of supervision used in policy network training, while TPN, mimicking the IL supervision in unseen environments, is applied in testing. Experiments in the artificial environment AI2-THOR validate that each of the techniques is effective. When combined, the techniques bring significant improvement over baseline methods in navigation effectiveness and efficiency in unseen environments. We report 22.8% and 23.5% increase in success rate and Success weighted by Path Length (SPL), respectively. The code is available at this https URL.
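For readers unfamiliar with the SPL metric reported in several abstracts on this page, here is a minimal sketch. It assumes the definition commonly used in the embodied-navigation literature: each successful episode scores the shortest path length divided by the larger of the shortest and actual path lengths, failures score zero, and the scores are averaged. The function name, argument names, and example episodes are hypothetical, and individual papers may compute the metric with small variations.

```python
from typing import Sequence


def spl(successes: Sequence[bool],
        shortest: Sequence[float],
        taken: Sequence[float]) -> float:
    """Success weighted by Path Length, averaged over episodes.

    Assumes every shortest-path length is strictly positive.
    """
    if not (len(successes) == len(shortest) == len(taken)):
        raise ValueError("all inputs must have one entry per episode")
    if not successes:
        raise ValueError("need at least one episode")

    total = 0.0
    for success, shortest_len, taken_len in zip(successes, shortest, taken):
        if success:
            # A successful episode scores 1.0 only when the agent's path
            # is as short as the shortest path; longer paths score less.
            total += shortest_len / max(taken_len, shortest_len)
        # Failed episodes contribute 0 regardless of path length.
    return total / len(successes)


# Hypothetical episodes, invented for illustration only:
# an optimal success, a success with a path twice as long, and a failure.
example = spl([True, True, False], [4.0, 5.0, 3.0], [4.0, 10.0, 6.0])
```

With these three made-up episodes the per-episode scores are 1.0, 0.5, and 0.0, so the average is 0.5, while the plain success rate would be 2/3.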
MOCA: A Modular Object-Centric Approach for Interactive Instruction Following
Performing simple household tasks based on language directives is very natural to humans, yet it remains an open challenge for an AI agent. Recently, an "interactive instruction following" task has been proposed to foster research in reasoning over long instruction sequences that requires object interactions in a simulated environment. It involves solving open problems in vision, language and navigation literature at each step. To address this multifaceted problem, we propose a modular architecture that decouples the task into visual perception and action policy, and name it MOCA, a Modular Object-Centric Approach. We evaluate our method on the ALFRED benchmark and empirically validate that it outperforms prior arts by significant margins in all metrics with good generalization performance (high success rate in unseen environments). Our code is available at
Vision-based Navigation Using Deep Reinforcement Learning
Deep reinforcement learning (RL) has been successfully applied to a variety of game-like environments. However, the application of deep RL to visual navigation with realistic environments is a challenging task. We propose a novel learning architecture capable of navigating an agent, e.g. a mobile robot, to a target given by an image. To achieve this, we have extended the batched A2C algorithm with auxiliary tasks designed to improve visual navigation performance. We propose three additional auxiliary tasks: predicting the segmentation of the observation image and of the target image and predicting the depth-map. These tasks enable the use of supervised learning to pre-train a major part of the network and to reduce the number of training steps substantially. The training performance has been further improved by increasing the environment complexity gradually over time. An efficient neural network structure is proposed, which is capable of learning for multiple targets in multiple environments. Our method navigates in continuous state spaces and on the AI2-THOR environment simulator surpasses the performance of state-of-the-art goal-oriented visual navigation methods from the literature.
Visual Object Search by Learning Spatial Context
We present a visual navigation approach that uses context information to navigate an agent to find and reach a target object. To learn context from the objects present in the scene, we transform visual information into an intermediate representation called context grid which essentially represents how much the object at the location is semantically similar to the target object. As this representation can encode the target object and other objects together, it allows us to navigate an agent in a human-inspired way: the agent will go to the likely place by seeing surrounding context objects in the beginning when the target is not visible and, once the target object comes into sight, it will reach the target quickly. Since context grid does not directly contain visual or semantic feature values that change according to introductions of new objects, such as new instances of the same object with different appearance or an object from a slightly different class, our navigation model generalizes well to unseen scenes/objects. Experimental results show that our approach outperforms previous approaches in navigating in unseen scenes, especially for broad scenes. We also evaluated human performances in the target-driven navigation task and compared with machine learning based navigation approaches including this work.
PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World
We propose PIGLeT: a model that learns physical commonsense knowledge through interaction, and then uses this knowledge to ground language. We factorize PIGLeT into a physical dynamics model, and a separate language model. Our dynamics model learns not just what objects are but also what they do: glass cups break when thrown, plastic ones don’t. We then use it as the interface to our language model, giving us a unified model of linguistic form and grounded meaning. PIGLeT can read a sentence, simulate neurally what might happen next, and then communicate that result through a literal symbolic representation, or natural language. Experimental results show that our model effectively learns world dynamics, along with how to communicate them. It is able to correctly forecast what happens next given an English sentence over 80% of the time, outperforming a 100x larger, text-to-text approach by over 10%. Likewise, its natural language summaries of physical interactions are also judged by humans as more accurate than LM alternatives. We present comprehensive analysis showing room for future work.
Hierarchical Task Learning from Language Instructions with Unified Transformers and Self-Monitoring
Despite recent progress, learning new tasks through language instructions remains an extremely challenging problem. On the ALFRED benchmark for task learning, the published state-of-the-art system only achieves a task success rate of less than 10% in an unseen environment, compared to the human performance of over 90%. To address this issue, this paper takes a closer look at task learning. In a departure from a widely applied end-to-end architecture, we decomposed task learning into three sub-problems: sub-goal planning, scene navigation, and object manipulation; and developed a model HiTUT (stands for Hierarchical Tasks via Unified Transformers) that addresses each sub-problem in a unified manner to learn a hierarchical task structure. On the ALFRED benchmark, HiTUT has achieved the best performance with a remarkably higher generalization ability. In the unseen environment, HiTUT achieves over 160% performance gain in success rate compared to the previous state of the art. The explicit representation of task structures also enables an in-depth understanding of the nature of the problem and the ability of the agent, which provides insight for future benchmark development and evaluation.
Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion
Language-guided robots performing home and office tasks must navigate in and interact with the world. Grounding language instructions against visual observations and actions to take in an environment is an open challenge. We present Embodied BERT (EmBERT), a transformer-based model which can attend to high-dimensional, multi-modal inputs across long temporal horizons for language-conditioned task completion. Additionally, we bridge the gap between successful object-centric navigation models used for non-interactive agents and the language-guided visual task completion benchmark, ALFRED, by introducing object navigation targets for EmBERT training. EmBERT achieves competitive performance on the ALFRED benchmark, and is the first model to use a full, pretrained BERT stack while handling the long-horizon, dense, multi-modal histories of ALFRED. Model code is available at the following link: amazon-research/embert
Learning Adaptive Language Interfaces through Decomposition
Our goal is to create an interactive natural language interface that efficiently and reliably learns from users to complete tasks in simulated robotics settings. We introduce a neural semantic parsing system that learns new high-level abstractions through decomposition: users interactively teach the system by breaking down high-level utterances describing novel behavior into low-level steps that it can understand. Unfortunately, existing methods either rely on grammars which parse sentences with limited flexibility, or neural sequence-to-sequence models that do not learn efficiently or reliably from individual examples. Our approach bridges this gap, demonstrating the flexibility of modern neural systems, as well as the one-shot reliable generalization of grammar-based methods. Our crowdsourced interactive experiments suggest that over time, users complete complex tasks more efficiently while using our system by leveraging what they just taught. At the same time, getting users to trust the system enough to be incentivized to teach high-level utterances is still an ongoing challenge. We end with a discussion of some of the obstacles we need to overcome to fully realize the potential of the interactive paradigm.
Skill Induction and Planning with Latent Language
Pratyusha Sharma, Antonio Torralba, Jacob Andreas Annual Meeting of the Association for Computational Linguistics 2022
We present a framework for learning hierarchical policies from demonstrations, using sparse natural language annotations to guide the discovery of reusable skills for autonomous decision-making. We formulate a generative model of action sequences in which goals generate sequences of high-level subtask descriptions, and these descriptions generate sequences of low-level actions. We describe how to train this model using primarily unannotated demonstrations by parsing demonstrations into sequences of named high-level sub-tasks, using only a small number of seed annotations to ground language in action. In trained models, natural language commands index a combinatorial library of skills; agents can use these skills to plan by generating high-level instruction sequences tailored to novel goals. We evaluate this approach in the ALFRED household simulation environment, providing natural language annotations for only 10% of demonstrations. It achieves performance comparable to state-of-the-art models on ALFRED success rate, outperforming several recent methods with access to ground-truth plans during training and evaluation.
Visual Navigation with Spatial Attention
Bar Mayo, Tamir Hazan, Ayellet Tal Computer Vision and Pattern Recognition 2021
This work focuses on object goal visual navigation, aiming at finding the location of an object from a given class, where in each step the agent is provided with an egocentric RGB image of the scene. We propose to learn the agent’s policy using a reinforcement learning algorithm. Our key contribution is a novel attention probability model for visual navigation tasks. This attention encodes semantic information about observed objects, as well as spatial information about their place. This combination of the "what" and the "where" allows the agent to navigate toward the sought-after object effectively. The attention model is shown to improve the agent’s policy and to achieve state-of-the-art results on commonly-used datasets.
VTNet: Visual Transformer Network for Object Goal Navigation
Object goal navigation aims to steer an agent towards a target object based on observations of the agent. It is of pivotal importance to design effective visual representations of the observed scene in determining navigation actions. In this paper, we introduce a Visual Transformer Network (VTNet) for learning informative visual representation in navigation. VTNet is a highly effective structure that embodies two key properties for visual representations: first, the relationships among all the object instances in a scene are exploited; second, the spatial locations of objects and image regions are emphasized so that directional navigation signals can be learned. Furthermore, we also develop a pre-training scheme to associate the visual representations with navigation signals, and thus facilitate navigation policy learning. In a nutshell, VTNet embeds object and region features with their location cues as spatial-aware descriptors and then incorporates all the encoded descriptors through attention operations to achieve informative representation for navigation. Given such visual representations, agents are able to explore the correlations between visual observations and navigation actions. For example, an agent would prioritize "turning right" over "turning left" when the visual representation emphasizes the right side of the activation map. Experiments in the artificial environment AI2-THOR demonstrate that VTNet significantly outperforms state-of-the-art methods in unseen testing environments.
CLIP on Wheels: Zero-Shot Object Navigation as Object Localization and Exploration
Households across the world contain arbitrary objects: from mate gourds and coffee mugs to sitars and guitars. Considering this diversity, robot perception must handle a large variety of semantic objects without additional fine-tuning to be broadly applicable in homes. Recently, zero-shot models have demonstrated impressive performance in image classification of arbitrary objects (i.e., classifying images at inference with categories not explicitly seen during training). In this paper, we translate the success of zero-shot vision models (e.g., CLIP) to the popular embodied AI task of object navigation. In our setting, an agent must find an arbitrary goal object, specified via text, in unseen environments coming from different datasets. Our key insight is to modularize the task into zero-shot object localization and exploration. Employing this philosophy, we design CLIP on Wheels (CoW) baselines for the task and evaluate each zero-shot model in both Habitat and RoboTHOR simulators. We find that a straightforward CoW, with CLIP-based object localization plus classical exploration, and no additional training, often outperforms learnable approaches in terms of success, efficiency, and robustness to dataset distribution shift. This CoW achieves 6.3% SPL in Habitat and 10.0% SPL in RoboTHOR, when tested zero-shot on all categories. On a subset of four RoboTHOR categories considered in prior work, the same CoW shows a 16.1 percentage point improvement in Success over the learnable state-of-the-art baseline.
Learning About Objects by Learning to Interact with Them
Martin Lohmann, Jordi Salvador, Aniruddha Kembhavi, Roozbeh Mottaghi Neural Information Processing Systems 2020
Much of the remarkable progress in computer vision has been focused around fully supervised learning mechanisms relying on highly curated datasets for a variety of tasks. In contrast, humans often learn about their world with little to no external supervision. Taking inspiration from infants learning from their environment through play and interaction, we present a computational framework to discover objects and learn their physical properties along this paradigm of Learning from Interaction. Our agent, when placed within the near photo-realistic and physics-enabled AI2-THOR environment, interacts with its world and learns about objects, their geometric extents and relative masses, without any external guidance. Our experiments reveal that this agent learns efficiently and effectively; not just for objects it has interacted with before, but also for novel instances from seen categories as well as novel object categories.
Visually-Grounded Planning without Vision: Language Models Infer Detailed Plans from High-level Instructions
Peter A. Jansen FINDINGS 2020
The recently proposed ALFRED challenge task aims for a virtual robotic agent to complete complex multi-step everyday tasks in a virtual home environment from high-level natural language directives, such as “put a hot piece of bread on a plate”. Currently, the best-performing models are able to complete less than 1% of these tasks successfully. In this work we focus on modeling the translation problem of converting natural language directives into detailed multi-step sequences of actions that accomplish those goals in the virtual environment. We empirically demonstrate that it is possible to generate gold multi-step plans from language directives alone without any visual input in 26% of unseen cases. When a small amount of visual information, the starting location in the virtual environment, is incorporated, our best-performing GPT-2 model successfully generates gold command sequences in 58% of cases, suggesting contextualized language models may provide strong planning modules for grounded virtual agents.
CORA: Benchmarks, Baselines, and Metrics as a Platform for Continual Reinforcement Learning Agents
Progress in continual reinforcement learning has been limited due to several barriers to entry: missing code, high compute requirements, and a lack of suitable benchmarks. In this work, we present CORA, a platform for Continual Reinforcement Learning Agents that provides benchmarks, baselines, and metrics in a single code package. The benchmarks we provide are designed to evaluate different aspects of the continual RL challenge, such as catastrophic forgetting, plasticity, ability to generalize, and sample-efficient learning. Three of the benchmarks utilize video game environments (Atari, Procgen, NetHack). The fourth benchmark, CHORES, consists of four different task sequences in a visually realistic home simulator, drawn from a diverse set of task and scene parameters. To compare continual RL methods on these benchmarks, we prepare three metrics in CORA: continual evaluation, forgetting, and zero-shot forward transfer. Finally, CORA includes a set of performant, open-source baselines of existing algorithms for researchers to use and expand on. We release CORA and hope that the continual RL community can benefit from our contributions, to accelerate the development of new continual RL algorithms.
RobustNav: Towards Benchmarking Robustness in Embodied Navigation
As an attempt towards assessing the robustness of embodied navigation agents, we propose RobustNav, a framework to quantify the performance of embodied navigation agents when exposed to a wide variety of visual – affecting RGB inputs – and dynamics – affecting transition dynamics – corruptions. Most recent efforts in visual navigation have typically focused on generalizing to novel target environments with similar appearance and dynamics characteristics. With RobustNav, we find that some standard embodied navigation agents significantly underperform (or fail) in the presence of visual or dynamics corruptions. We systematically analyze the kind of idiosyncrasies that emerge in the behavior of such agents when operating under corruptions. Finally, for visual corruptions in RobustNav, we show that while standard techniques to improve robustness such as data-augmentation and self-supervised adaptation offer some zero-shot resistance and improvements in navigation performance, there is still a long way to go in terms of recovering lost performance relative to clean "non-corrupt" settings, warranting more research in this direction. Our code is available at
What Should I Do Now? Marrying Reinforcement Learning and Symbolic Planning
Long-term planning poses a major difficulty to many reinforcement learning algorithms. This problem becomes even more pronounced in dynamic visual environments. In this work we propose Hierarchical Planning and Reinforcement Learning (HIP-RL), a method for merging the benefits and capabilities of Symbolic Planning with the learning abilities of Deep Reinforcement Learning. We apply HIP-RL to the complex visual tasks of interactive question answering and visual semantic planning and achieve state-of-the-art results on three challenging datasets all while taking fewer steps at test time and training in fewer iterations. Sample results can be found at
MGRL: Graph neural network based inference in a Markov network with reinforcement learning for visual navigation
Yi Lu, Yaran Chen, Dongbin Zhao, Dong Li Neurocomputing 2021
Visual navigation is an essential task for indoor robots and usually relies on a map to provide global information to the agent. Because traditional maps are tied to specific environments, map-based and map-building-based navigation methods are limited in new environments by the need to obtain maps. Although deep reinforcement learning navigation, which uses a non-map-based technique, achieves satisfactory performance, it lacks interpretability and a global view of the environment. Therefore, we propose a novel abstract map for the deep reinforcement learning navigation method with better global relative position information and more reasonable interpretability. The abstract map is modeled as a Markov network that explicitly represents the regularity of object arrangements, as influenced by people's activities in different environments. In addition, a knowledge graph is used to initialize the structure of the Markov network, providing a prior structure for the model and reducing the difficulty of model learning. Then, a graph neural network is adopted for probability inference in the Markov network. Furthermore, the update of the abstract map, including the knowledge graph structure and the parameters of the graph neural network, is combined into an end-to-end learning process trained by a reinforcement learning method. Finally, experiments in the AI2-THOR framework and a physical environment indicate that our algorithm greatly improves the navigation success rate in new environments, confirming its good generalization.
Look Wide and Interpret Twice: Improving Performance on Interactive Instruction-following Tasks
There is a growing interest in the community in making an embodied AI agent perform a complicated task while interacting with an environment following natural language directives. Recent studies have tackled the problem using ALFRED, a well-designed dataset for the task, but achieved only very low accuracy. This paper proposes a new method, which outperforms the previous methods by a large margin. It is based on a combination of several new ideas. One is a two-stage interpretation of the provided instructions. The method first selects and interprets an instruction without using visual information, yielding a tentative action sequence prediction. It then integrates the prediction with the visual information etc., yielding the final prediction of an action and an object. As the object’s class to interact is identified in the first stage, it can accurately select the correct object from the input image. Moreover, our method considers multiple egocentric views of the environment and extracts essential information by applying hierarchical attention conditioned on the current instruction. This contributes to the accurate prediction of actions for navigation. A preliminary version of the method won the ALFRED Challenge 2020. The current version achieves the unseen environment’s success rate of 4.45% with a single view, which is further improved to 8.37% with multiple views.
Learning Neuro-Symbolic Relational Transition Models for Bilevel Planning
In robotic domains, learning and planning are complicated by continuous state spaces, continuous action spaces, and long task horizons. In this work, we address these challenges with Neuro-Symbolic Relational Transition Models (NSRTs), a novel class of models that are data-efficient to learn, compatible with powerful robotic planning methods, and generalizable over objects. NSRTs have both symbolic and neural components, enabling a bilevel planning scheme where symbolic AI planning in an outer loop guides continuous planning with neural models in an inner loop. Experiments in four robotic planning domains show that NSRTs can be learned very data-efficiently, and then used for fast planning in new tasks that require up to 60 actions and involve many more objects than were seen during training.
Agent with the Big Picture: Perceiving Surroundings for Interactive Instruction Following
We address the interactive instruction following task [4, 9, 8] which requires an agent to navigate through an environment, interact with objects, and complete long-horizon tasks, following natural language instructions with egocentric vision. To successfully achieve a goal in the interactive instruction following task, the agent should infer a sequence of actions and object interactions. When performing actions, a small field of view often limits the agent’s understanding of an environment, leading to poor performance. Here, we propose to exploit surrounding views by additional observations from navigable directions to enlarge the field of view of the agent. In addition to the ample observations, while action prediction requires global semantic cues, object localization needs a pixel-level understanding of the environment, making them semantically different tasks. Thus, we design a model factorizing interactive perception and action policy in separate streams in a unified end-to-end framework. The proposed method outperforms the previous challenge winner method [7].
Learning hierarchical relationships for object-goal navigation
Direct search for objects as part of navigation poses a challenge for small items. Utilizing context in the form of object-object relationships enables efficient hierarchical search for targets. Most of the current approaches tend to directly incorporate sensory input into a reward-based learning approach, without learning about object relationships in the natural environment, and thus generalize poorly across domains. We present Memory-utilized Joint hierarchical Object Learning for Navigation in Indoor Rooms (MJOLNIR), a target-driven navigation algorithm, which considers the inherent relationship between target objects and the more salient contextual objects occurring in their surroundings. Extensive experiments conducted across multiple environment settings show an 82.9% and 93.5% gain over existing state-of-the-art navigation methods in terms of the success rate (SR) and success weighted by path length (SPL), respectively. We also show that our model learns to converge much faster than other algorithms, without suffering from the well-known overfitting problem. Additional details regarding the supplementary material and code are available at
Exploring the Task Cooperation in Multi-goal Visual Navigation
Learning to adapt to a series of different goals in visual navigation is challenging. In this work, we present a model-embedded actor-critic architecture for the multi-goal visual navigation task. To enhance the task cooperation in multi-goal learning, we introduce two new designs to the reinforcement learning scheme: inverse dynamics model (InvDM) and multi-goal co-learning (MgCl). Specifically, InvDM is proposed to capture the navigation-relevant association between state and goal, and provide additional training signals to relieve the sparse reward issue. MgCl aims at improving the sample efficiency and supports the agent to learn from unintentional positive experiences. Extensive results on the interactive platform AI2-THOR demonstrate that the proposed method converges faster than state-of-the-art methods while producing more direct routes to navigate to the goal. The video demonstration is available at:
PlaTe: Visually-Grounded Planning With Transformers in Procedural Tasks
In this work, we study the problem of how to leverage instructional videos to facilitate the understanding of human decision-making processes, focusing on training a model with the ability to plan a goal-directed procedure from real-world videos. Learning structured and plannable state and action spaces directly from unstructured videos is the key technical challenge of our task. There are two problems: first, the appearance gap between the training and validation datasets could be large for unstructured videos; second, these gaps lead to decision errors that compound over the steps. We address these limitations with Planning Transformer (PlaTe), which has the advantage of circumventing the compounding prediction errors that occur with single-step models during long model-based rollouts. Our method simultaneously learns the latent state and action information of assigned tasks and the representations of the decision-making process from human demonstrations. Experiments conducted on real-world instructional videos show that our method can achieve a better performance in reaching the indicated goal than previous algorithms. We also validated the possibility of applying procedural tasks on a UR-5 platform. Please see 1
Multi-agent Embodied Question Answering in Interactive Environments
We investigate a new AI task, Multi-Agent Interactive Question Answering, where several agents explore the scene jointly in interactive environments to answer a question. To cooperate efficiently and answer accurately, agents must be well-organized to have balanced work division and share knowledge about the objects involved. We address this new problem in two stages: Multi-Agent 3D Reconstruction in Interactive Environments and Question Answering. Our proposed framework features multi-layer structural and semantic memories shared by all agents, as well as a question answering model built upon a 3D-CNN network to encode the scene memories. During the reconstruction, agents simultaneously explore and scan the scene with a clear division of work, organized by next-viewpoint planning. We evaluate our framework on the IQuADv1 dataset and outperform the IQA baseline in a single-agent scenario. In multi-agent scenarios, our framework shows favorable speedups while maintaining high accuracy.
GridToPix: Training Embodied Agents with Minimal Supervision
Unnat Jain, Iou-Jen Liu, Svetlana Lazebnik, Aniruddha Kembhavi, Luca Weihs, Alexander Schwing IEEE International Conference on Computer Vision 2021
While deep reinforcement learning (RL) promises freedom from hand-labeled data, great successes, especially for Embodied AI, require significant work to create supervision via carefully shaped rewards. Indeed, without shaped rewards, i.e., with only terminal rewards, present-day Embodied AI results degrade significantly across Embodied AI problems from single-agent Habitat-based PointGoal Navigation (SPL drops from 55 to 0) and two-agent AI2-THOR-based Furniture Moving (success drops from 58% to 1%) to three-agent Google Football-based 3 vs. 1 with Keeper (game score drops from 0.6 to 0.1). As training from shaped rewards doesn’t scale to more realistic tasks, the community needs to improve the success of training with terminal rewards. For this we propose GRIDTOPIX: 1) train agents with terminal rewards in gridworlds that generically mirror Embodied AI environments, i.e., they are independent of the task; 2) distill the learned policy into agents that reside in complex visual worlds. Despite learning from only terminal rewards with identical models and RL algorithms, GRIDTOPIX significantly improves results across tasks: from PointGoal Navigation (SPL improves from 0 to 64) and Furniture Moving (success improves from 1% to 25%) to football gameplay (game score improves from 0.1 to 0.6). GRIDTOPIX even helps to improve the results of shaped reward training.
Bridging the Imitation Gap by Adaptive Insubordination
Why do agents often obtain better reinforcement learning policies when imitating a worse expert? We show that privileged information used by the expert is marginalized in the learned agent policy, resulting in an "imitation gap." Prior work bridges this gap via a progression from imitation learning to reinforcement learning. While often successful, gradual progression fails for tasks that require frequent switches between exploration and memorization skills. To better address these tasks and alleviate the imitation gap we propose 'Adaptive Insubordination' (ADVISOR), which dynamically reweights imitation and reward-based reinforcement learning losses during training, enabling switching between imitation and exploration. On a suite of challenging tasks, we show that ADVISOR outperforms pure imitation, pure reinforcement learning, as well as sequential combinations of these approaches.
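As a rough illustration of the loss reweighting described above, the sketch below blends a per-state imitation loss and a reinforcement learning loss using a weight in [0, 1]. The function name `blended_loss`, the weights, and the loss values are hypothetical; the abstract does not specify how ADVISOR computes its weights, so this is only a generic convex combination, not the authors' implementation.

```python
from typing import Sequence


def blended_loss(il_losses: Sequence[float],
                 rl_losses: Sequence[float],
                 weights: Sequence[float]) -> float:
    """Mean over states of w * IL loss + (1 - w) * RL loss.

    `weights` is a hypothetical per-state weight in [0, 1]:
    1.0 means "imitate only", 0.0 means "use the reward-based loss only".
    """
    if not (len(il_losses) == len(rl_losses) == len(weights)):
        raise ValueError("all inputs must have the same length")
    if len(il_losses) == 0:
        raise ValueError("inputs must be non-empty")

    total = 0.0
    for il, rl, w in zip(il_losses, rl_losses, weights):
        if not 0.0 <= w <= 1.0:
            raise ValueError("weights must lie in [0, 1]")
        total += w * il + (1.0 - w) * rl
    return total / len(il_losses)
```

With a weight of 1.0 in one state and 0.0 in another, the first state contributes only its imitation loss and the second only its reinforcement learning loss, which is the kind of per-state switching the abstract describes.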
Reinforcement Learning Based Navigation with Semantic Knowledge of Indoor Environments
Tai T. L. Nguyen, Do-Van Nguyen, T. Le International Conference on Knowledge and Systems Engineering 2019
Recent years have witnessed a major step toward applying artificial intelligence in autonomous robots. To build intelligent robots that navigate indoor environments, much research focuses on deep reinforcement learning, which helps robots learn and plan by themselves. Different network architectures have been proposed for training agents to navigate and find target objects in both real and simulated environments. Despite promising results, one key remaining challenge is that the agent has to perform well in unseen environments and with unseen objects. To solve this generalization problem, this work proposes a method that uses a prior knowledge graph capturing relationships between target objects. Experiments in simulated environments show that the proposed method not only enhances the learning process but also significantly improves agents' generalization. Compared to similar methods, the proposed method achieves competitive or even better performance while bringing computational advantages.
Towards Target-Driven Visual Navigation in Indoor Scenes via Generative Imitation Learning
We present a target-driven navigation system to improve mapless visual navigation in indoor scenes. Our method takes a multi-view observation of a robot and a target image as inputs at each time step to provide a sequence of actions that move the robot to the target without relying on odometry or GPS at runtime. The system is learned by optimizing a combinational objective encompassing three key designs. First, we propose that an agent conceives the next observation before making an action decision. This is achieved by learning a variational generative module from expert demonstrations. We then propose predicting static collision in advance, as an auxiliary task to improve safety during navigation. Moreover, to alleviate the training data imbalance problem of termination action prediction, we also introduce a target checking module to differentiate from augmenting navigation policy with a termination action. The three proposed designs all contribute to the improved training data efficiency, static collision avoidance, and navigation generalization performance, resulting in a novel target-driven mapless navigation system. Through experiments on a TurtleBot, we provide evidence that our model can be integrated into a robotic system and navigate in the real world. Videos and models can be found in the supplementary material.
Synthesize Policies for Transfer and Adaptation across Tasks and Environments
The ability to transfer in reinforcement learning is key towards building an agent of general artificial intelligence. In this paper, we consider the problem of learning to simultaneously transfer across both environments and tasks and, perhaps more importantly, of doing so by learning from only sparse (environment, task) pairs out of all the possible combinations. We propose a novel compositional neural network architecture which depicts a meta rule for composing policies from environment and task embeddings. Notably, one of the main challenges is to learn the embeddings jointly with the meta rule. We further propose new training methods to disentangle the embeddings, making them both distinctive signatures of the environments and tasks and effective building blocks for composing the policies. Experiments on GridWorld and THOR, in which the agent takes an egocentric view as input, show that our approach gives rise to high success rates on all the (environment, task) pairs after learning from only 40% of them.
Reinforcement Learning-Based Visual Navigation With Information-Theoretic Regularization
To enhance the cross-target and cross-scene generalization of target-driven visual navigation based on deep reinforcement learning (RL), we introduce an information-theoretic regularization term into the RL objective. The regularization maximizes the mutual information between navigation actions and visual observation transforms of an agent, thus promoting more informed navigation decisions. This way, the agent models the action-observation dynamics by learning a variational generative model. Based on the model, the agent generates (imagines) the next observation from its current observation and navigation target. This way, the agent learns to understand the causality between navigation actions and the changes in its observations, which allows the agent to predict the next action for navigation by comparing the current and the imagined next observations. Cross-target and cross-scene evaluations on the AI2-THOR framework show that our method attains at least 10% improvement of average success rate over some state-of-the-art models. We further evaluate our model in two real-world settings: navigation in unseen indoor scenes from a discrete Active Vision Dataset (AVD) and continuous real-world environments with a TurtleBot. We demonstrate that our navigation model is able to successfully achieve navigation tasks in these scenarios.
Artificial Agents Learn Flexible Visual Representations by Playing a Hiding Game
The ubiquity of embodied gameplay, observed in a wide variety of animal species including turtles and ravens, has led researchers to question what advantages play provides to the animals engaged in it. Mounting evidence suggests that play is critical in developing the neural flexibility for creative problem solving, socialization, and can improve the plasticity of the medial prefrontal cortex. Comparatively little is known regarding the impact of gameplay upon embodied artificial agents. While recent work has produced artificial agents proficient in abstract games, the environments these agents act within are far removed from the real world and thus these agents provide little insight into the advantages of embodied play. Hiding games have arisen in multiple cultures and species, and provide a rich ground for studying the impact of embodied gameplay on representation learning in the context of perspective taking, secret keeping, and false belief understanding. Here we are the first to show that embodied adversarial reinforcement learning agents playing cache, a variant of hide-and-seek, in a high-fidelity, interactive environment learn representations of their observations encoding information such as occlusion, object permanence, free space, and containment, on par with representations learnt by the most popular modern paradigm for visual representation learning, which requires large datasets independently labeled for each new task. Our representations are enhanced by intent and memory, through interaction and play, moving closer to biologically motivated learning strategies. These results serve as a model for studying how facets of vision and perspective taking develop through play, provide an experimental framework for assessing what is learned by artificial agents, and suggest that representation learning should move from static datasets and towards experiential, interactive learning.
Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments
This paper presents a novel approach for the Vision-and-Language Navigation (VLN) task in continuous 3D environments, which requires an autonomous agent to follow natural language instructions in unseen environments. Existing end-to-end learning-based VLN methods struggle at this task as they focus mostly on utilizing raw visual observations and lack the semantic spatio-temporal reasoning capabilities that are crucial in generalizing to new environments. In this regard, we present a hybrid transformer-recurrence model which focuses on combining classical semantic mapping techniques with a learning-based method. Our method creates a temporal semantic memory by building a top-down local ego-centric semantic map and performs cross-modal grounding to align map and language modalities to enable effective learning of VLN policy. Empirical results in a photo-realistic long-horizon simulation environment show that the proposed approach outperforms a variety of state-of-the-art methods and baselines with over 22% relative improvement in SPL in previously unseen environments.
Are We There Yet? Learning to Localize in Embodied Instruction Following
Embodied instruction following is a challenging problem requiring an agent to infer a sequence of primitive actions to achieve a goal environment state from complex language and visual inputs. Action Learning From Realistic Environments and Directives (ALFRED) is a recently proposed benchmark for this problem consisting of step-by-step natural language instructions to achieve subgoals which compose to an ultimate high-level goal. Key challenges for this task include localizing target locations and navigating to them through visual inputs, and grounding language instructions to visual appearance of objects. To address these challenges, in this study, we augment the agent’s field of view during navigation subgoals with multiple viewing angles, and train the agent to predict its relative spatial relation to the target location at each timestep. We also improve language grounding by introducing a pre-trained object detection module to the model pipeline. Empirical studies show that our approach exceeds the baseline model performance.
Visual Reaction: Learning to Play Catch With Your Drone
Kuo-Hao Zeng, Roozbeh Mottaghi, Luca Weihs, Ali Farhadi Computer Vision and Pattern Recognition 2020
In this paper we address the problem of visual reaction: the task of interacting with dynamic environments where the changes in the environment are not necessarily caused by the agent itself. Visual reaction entails predicting the future changes in a visual environment and planning accordingly. We study the problem of visual reaction in the context of playing catch with a drone in visually rich synthetic environments. This is a challenging problem since the agent is required to learn (1) how objects with different physical properties and shapes move, (2) what sequence of actions should be taken according to the prediction, (3) how to adjust the actions based on the visual feedback from the dynamic environment (e.g., when objects bounce off a wall), and (4) how to reason and act with an unexpected state change in a timely manner. We propose a new dataset for this task, which includes 30K throws of 20 types of objects in different directions with different forces. Our results show that our model that integrates a forecaster with a planner outperforms a set of strong baselines that are based on tracking as well as pure model-based and model-free RL baselines. The code and dataset are available at
Hierarchical Object-to-Zone Graph for Object Navigation
The goal of object navigation is to reach the expected objects according to visual information in the unseen environments. Previous works usually implement deep models to train an agent to predict actions in real-time. However, in the unseen environment, when the target object is not in egocentric view, the agent may not be able to make wise decisions due to the lack of guidance. In this paper, we propose a hierarchical object-to-zone (HOZ) graph to guide the agent in a coarse-to-fine manner, and an online-learning mechanism is also proposed to update HOZ according to the real-time observation in new environments. In particular, the HOZ graph is composed of scene nodes, zone nodes and object nodes. With the pre-learned HOZ graph, the real-time observation and the target goal, the agent can constantly plan an optimal path from zone to zone. In the estimated path, the next potential zone is regarded as sub-goal, which is also fed into the deep reinforcement learning model for action prediction. Our methods are evaluated on the AI2-THOR simulator. In addition to widely used evaluation metrics SR and SPL, we also propose a new evaluation metric of SAE that focuses on the effective action rate. Experimental results demonstrate the effectiveness and efficiency of our proposed method. The code is available at
NeoNav: Improving the Generalization of Visual Navigation via Generating Next Expected Observations
Qiaoyun Wu, Dinesh Manocha, Jun Wang, Kai Xu AAAI Conference on Artificial Intelligence 2020
We propose improving the cross-target and cross-scene generalization of visual navigation through learning an agent that is guided by conceiving the next observations it expects to see. This is achieved by learning a variational Bayesian model, called NeoNav, which generates the next expected observations (NEO) conditioned on the current observations of the agent and the target view. Our generative model is learned through optimizing a variational objective encompassing two key designs. First, the latent distribution is conditioned on current observations and the target view, leading to a model-based, target-driven navigation. Second, the latent space is modeled with a Mixture of Gaussians conditioned on the current observation and the next best action. Our use of mixture-of-posteriors prior effectively alleviates the issue of over-regularized latent space, thus significantly boosting the model generalization for new targets and in novel scenes. Moreover, the NEO generation models the forward dynamics of agent-environment interaction, which improves the quality of approximate inference and hence benefits data efficiency. We have conducted extensive evaluations on both real-world and synthetic benchmarks, and show that our model consistently outperforms the state-of-the-art models in terms of success rate, data efficiency, and generalization.
Continuous Scene Representations for Embodied AI
We propose Continuous Scene Representations (CSR), a scene representation constructed by an embodied agent navigating within a space, where objects and their relationships are modeled by continuous valued embeddings. Our method captures feature relationships between objects, composes them into a graph structure on-the-fly, and situates an embodied agent within the representation. Our key insight is to embed pair-wise relationships between objects in a latent space. This allows for a richer representation compared to discrete relations (e.g., [SUPPORT], [NEXT-TO]) commonly used for building scene representations. CSR can track objects as the agent moves in a scene, update the representation accordingly, and detect changes in room configurations. Using CSR, we outperform state-of-the-art approaches for the challenging downstream task of visual room rearrangement, without any task specific training. Moreover, we show the learned embeddings capture salient spatial details of the scene and show applicability to real world data. A summary video and code are available at
Improving Target-driven Visual Navigation with Attention on 3D Spatial Relationships
Yunlian Lv, Ning Xie, Yimin Shi, Zijiao Wang, Heng Tao Shen Neural Processing Letters 2020
Embodied artificial intelligence (AI) tasks shift from tasks focusing on internet images to active settings involving embodied agents that perceive and act within 3D environments. In this paper, we investigate target-driven visual navigation using deep reinforcement learning (DRL) in 3D indoor scenes, where the navigation task aims to train an agent that can intelligently make a series of decisions to arrive at a pre-specified target location from any possible starting position based only on egocentric views. However, most navigation methods currently struggle with several challenging problems, such as data efficiency, automatic obstacle avoidance, and generalization. The generalization problem means that the agent cannot transfer navigation skills learned from previous experience to unseen targets and scenes. To address these issues, we incorporate two designs into the classic DRL framework: attention on a 3D knowledge graph (KG) and a target skill extension (TSE) module. On the one hand, our proposed method combines visual features and 3D spatial representations to learn the navigation policy. On the other hand, the TSE module is used to generate sub-targets that allow the agent to learn from failures. Specifically, our 3D spatial relationships are encoded through the recently popular graph convolutional network (GCN). To reflect real-world settings, our work also considers the open action and adds actionable targets to conventional navigation situations. These more difficult settings are used to test whether the DRL agent really understands its task and navigation environment and can carry out reasoning. Our experiments, performed in AI2-THOR, show that our model outperforms the baselines in both SR and SPL metrics and improves generalization ability across targets and scenes.
ActioNet: An Interactive End-To-End Platform For Task-Based Data Collection And Augmentation In 3D Environment
The problem of task planning for artificial agents remains largely unsolved. While there has been increasing interest in data-driven approaches for the study of task planning for artificial agents, a significant remaining bottleneck is the dearth of large-scale comprehensive task-based datasets. In this paper, we present ActioNet, an interactive end-to-end platform for data collection and augmentation of task-based datasets in 3D environments. Using ActioNet, we collected a large-scale comprehensive task-based dataset, comprising over 3000 hierarchical task structures and videos. Using the hierarchical task structures, the videos are further augmented across 50 different scenes to give over 150,000 videos. To our knowledge, ActioNet is the first interactive end-to-end platform for such task-based dataset generation and the accompanying dataset is the largest task-based dataset of such comprehensive nature. The ActioNet platform and dataset will be made available to facilitate research in hierarchical task planning.
DialFRED: Dialogue-Enabled Agents for Embodied Instruction Following
Language-guided Embodied AI benchmarks requiring an agent to navigate an environment and manipulate objects typically allow one-way communication: the human user gives a natural language command to the agent, and the agent can only follow the command passively. We present DialFRED, a dialogue-enabled embodied instruction following benchmark based on the ALFRED benchmark. DialFRED allows an agent to actively ask questions to the human user; the additional information in the user's response is used by the agent to better complete its task. We release a human-annotated dataset with 53K task-relevant questions and answers and an oracle to answer questions. To tackle DialFRED, we propose a questioner-performer framework wherein the questioner is pre-trained with the human-annotated data and fine-tuned with reinforcement learning. Experimental results show that asking the right questions leads to significantly improved task performance. We make DialFRED publicly available and encourage researchers to propose and evaluate their solutions to building dialog-enabled embodied agents:
A Modular Vision Language Navigation and Manipulation Framework for Long Horizon Compositional Tasks in Indoor Environment
Homagni Saha, Fateme Fotouhif, Qisai Liu, Soumik Sarkar Frontiers in Robotics and AI 2021
In this paper we propose a new framework, MoViLan (Modular Vision and Language), for execution of visually grounded natural language instructions for day-to-day indoor household tasks. While several data-driven, end-to-end learning frameworks have been proposed for targeted navigation tasks based on the vision and language modalities, performance on recent benchmark data sets revealed the gap in developing comprehensive techniques for long horizon, compositional tasks (involving manipulation and navigation) with diverse object categories, realistic instructions and visual scenarios with non-reversible state changes. We propose a modular approach to deal with the combined navigation and object interaction problem without the need for strictly aligned vision and language training data (e.g., in the form of expert demonstrated trajectories). Such an approach is a significant departure from the traditional end-to-end techniques in this space and allows for a more tractable training process with separate vision and language data sets. Specifically, we propose a novel geometry-aware mapping technique for cluttered indoor environments, and a language understanding model generalized for household instruction following. We demonstrate a significant increase in success rates for long horizon, compositional tasks over recent works on the recently released benchmark data set ALFRED.
Modular Framework for Visuomotor Language Grounding
Natural language instruction following tasks serve as a valuable test-bed for grounded language and robotics research. However, data collection for these tasks is expensive and end-to-end approaches suffer from data inefficiency. We propose the structuring of language, acting, and visual tasks into separate modules that can be trained independently. Using a Language, Action, and Vision (LAV) framework removes the dependence of action and vision modules on instruction following datasets, making them more efficient to train. We also present a preliminary evaluation of LAV on the ALFRED task for visual and interactive instruction following.
Automata Guided Hierarchical Reinforcement Learning for Zero-shot Skill Composition
An obstacle that prevents the wide adoption of (deep) reinforcement learning (RL) in control systems is its need for a large number of interactions with the environment in order to master a skill. The learned skill usually generalizes poorly across domains and re-training is often necessary when presented with a new task. We present a framework that combines techniques in formal methods with hierarchical reinforcement learning (HRL). The set of techniques we provide allows for the convenient specification of tasks with logical expressions, learns hierarchical policies (meta-controller and low-level controllers) with well-defined intrinsic rewards using any RL methods and is able to construct new skills from existing ones without additional learning. We evaluate the proposed methods in a simple grid world simulation as well as simulation on a Baxter robot.
Communicative Learning with Natural Gestures for Embodied Navigation Agents with Human-in-the-Scene
Cheng-Ju Wu, Cheng-Ju Wu, Yixin Zhu, Jungseock Joo IEEE/RSJ International Conference on Intelligent Robots and Systems 2021
Human-robot collaboration is an essential research topic in artificial intelligence (AI), enabling researchers to devise cognitive AI systems and affording an intuitive means for users to interact with the robot. Of note, communication plays a central role. To date, prior studies in embodied agent navigation have only demonstrated that human languages facilitate communication by instructions in natural languages. Nevertheless, a plethora of other forms of communication is left unexplored. In fact, human communication originated in gestures and oftentimes is delivered through multimodal cues, e.g., “go there” with a pointing gesture. To bridge the gap and fill in the missing dimension of communication in embodied agent navigation, we propose investigating the effects of using gestures as the communicative interface instead of verbal cues. Specifically, we develop a VR-based 3D simulation environment, named Gesture-based THOR (Ges-THOR), based on the AI2-THOR platform. In this virtual environment, a human player is placed in the same virtual scene and shepherds the artificial agent using only gestures. The agent is tasked to solve the navigation problem guided by natural gestures with unknown semantics; we do not use any predefined gestures due to the diversity and versatile nature of human gestures. We argue that learning the semantics of natural gestures is mutually beneficial to learning the navigation task: learn to communicate and communicate to learn. In a series of experiments, we demonstrate that human gesture cues, even without predefined semantics, improve the object-goal navigation for an embodied agent, outperforming various state-of-the-art methods.
Contrasting Contrastive Self-Supervised Representation Learning Models
In the past few years, we have witnessed remarkable breakthroughs in self-supervised representation learning. Despite the success and adoption of representations learned through this paradigm, much is yet to be understood about how different training methods and datasets influence performance on downstream tasks. In this paper, we analyze contrastive approaches as one of the most successful and popular variants of self-supervised representation learning. We perform this analysis from the perspective of the training algorithms, pre-training datasets and end tasks. We examine over 700 training experiments including 30 encoders, 4 pre-training datasets and 20 diverse downstream tasks. Our experiments address various questions regarding the performance of self-supervised models compared to their supervised counterparts, current benchmarks used for evaluation, and the effect of the pre-training data on end task performance. We hope the insights and empirical evidence provided by this work will help future research in learning better visual representations.
Reasoning With Scene Graphs for Robot Planning Under Partial Observability
Robot planning in partially observable domains is difficult, because a robot needs to estimate the current state and plan actions at the same time. When the domain includes many objects, reasoning about the objects and their relationships makes robot planning even more difficult. In this letter, we develop an algorithm called scene analysis for robot planning (SARP) that enables robots to reason with visual contextual information toward achieving long-term goals under uncertainty. SARP constructs scene graphs, a factored representation of objects and their relations, using images captured from different positions, and reasons with them to enable context-aware robot planning under partial observability. Experiments have been conducted using multiple 3D environments in simulation, and a dataset collected by a real robot. In comparison to standard robot planning and scene analysis methods, in a target search domain, SARP improves both efficiency and accuracy in task completion.
Modularity Improves Out-of-Domain Instruction Following
We propose a modular architecture for following natural language instructions that describe sequences of diverse subgoals, such as navigating to landmarks or picking up objects. Standard, non-modular, architectures used in instruction following do not exploit subgoal compositionality and often struggle on out-of-distribution tasks and environments. In our approach, subgoal modules each carry out natural language instructions for a specific subgoal type. A sequence of modules to execute is chosen by learning to segment the instructions and predicting a subgoal type for each segment. When compared to standard sequence-to-sequence approaches on ALFRED, a challenging instruction following benchmark, we find that modularization improves generalization to environments unseen in training and to novel tasks.
Hierarchical and Partially Observable Goal-driven Policy Learning with Goals Relational Graph
Xin Ye, Yezhou Yang Computer Vision and Pattern Recognition 2021
We present a novel two-layer hierarchical reinforcement learning approach equipped with a Goals Relational Graph (GRG) for tackling partially observable goal-driven tasks, such as goal-driven visual navigation. Our GRG captures the underlying relations of all goals in the goal space through a Dirichlet-categorical process that facilitates: 1) the high-level network raising a sub-goal towards achieving a designated final goal; 2) the low-level network towards an optimal policy; and 3) the overall system generalizing to unseen environments and goals. We evaluate our approach with two settings of partially observable goal-driven tasks: a grid-world domain and a robotic object search task. Our experimental results show that our approach exhibits superior generalization performance on both unseen environments and new goals.
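For readers unfamiliar with the Dirichlet-categorical model mentioned above, the sketch below shows the standard conjugate update: given a Dirichlet prior and observed category counts, the posterior mean probability of each category is (prior + count) divided by the total. The abstract does not detail how the GRG uses this model, so the function name `dirichlet_posterior_mean` and the example numbers are illustrative assumptions only.

```python
from typing import List, Sequence


def dirichlet_posterior_mean(prior: Sequence[float],
                             counts: Sequence[int]) -> List[float]:
    """Posterior mean of categorical probabilities under a Dirichlet prior.

    prior:  Dirichlet concentration parameters (all must be > 0).
    counts: how often each category was observed (all must be >= 0).
    """
    if len(prior) != len(counts):
        raise ValueError("prior and counts must have the same length")
    if any(a <= 0 for a in prior):
        raise ValueError("concentration parameters must be positive")
    if any(c < 0 for c in counts):
        raise ValueError("counts must be non-negative")

    # Conjugacy: the posterior is Dirichlet(prior + counts).
    posterior = [a + c for a, c in zip(prior, counts)]
    total = sum(posterior)
    return [p / total for p in posterior]
```

Starting from a uniform prior, categories observed more often receive proportionally more probability mass, while unobserved categories keep a small non-zero share.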
Indoor Navigation for Mobile Agents: A Multimodal Vision Fusion Model
Indoor navigation is a challenging task for mobile agents. The latest vision-based indoor navigation methods have made remarkable progress in this field but do not fully leverage visual information for policy learning and struggle to perform well in unseen scenes. To address these limitations, we present a multimodal vision fusion model (MVFM). We implement a joint modality of different image recognition networks for navigation policy learning. The proposed model incorporates object detection for target searching, depth estimation for distance prediction, and semantic segmentation to depict the walkable region. By design, our model provides holistic vision knowledge for navigation. Evaluation on AI2-THOR indicates that MVFM improves on a strong baseline model by 3.49% in Success weighted by Path Length (SPL) and 4% in success rate. Compared with other state-of-the-art systems, MVFM leads in both SPL and success rate. Extensive experiments show the effectiveness of the proposed model.
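As a reading aid for the two metrics reported in entries like the one above, here is a minimal sketch of success rate and SPL under the commonly used definition, in which each successful episode contributes the ratio of the shortest-path length to the longer of the agent's path and the shortest path. The function and parameter names are illustrative assumptions, and this is not code from any of the listed papers.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the agent reached the target."""
    return sum(1 for s in successes if s) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    agent_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over all episodes.

    Assumed definition: mean over episodes of S_i * l_i / max(p_i, l_i),
    where S_i is the success flag, l_i the shortest-path length and
    p_i the length of the path the agent actually took.
    """
    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, agent_lengths):
        if not success:
            continue  # failed episodes contribute zero
        denom = max(taken, shortest)
        # Guard the degenerate case where the agent starts at the target.
        total += shortest / denom if denom > 0 else 1.0
    return total / len(successes)


if __name__ == "__main__":
    flags = [True, False, True]
    print(success_rate(flags))
    print(spl(flags, [2.0, 3.0, 4.0], [4.0, 3.0, 4.0]))
```

Because the ratio is capped at 1, SPL can never exceed the success rate; an agent that succeeds often but takes long detours scores well on success rate and poorly on SPL.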
Optimistic Agent: Accurate Graph-Based Value Estimation for More Successful Visual Navigation
M. M. K. Moghaddam, Qi Wu, Ehsan Abbasnejad, J. Shi IEEE Workshop/Winter Conference on Applications of Computer Vision 2020
We humans can impeccably search for a target object, given its name only, even in an unseen environment. We argue that this ability is largely due to three main reasons: the incorporation of prior knowledge (or experience), its adaptation to the new environment using observed visual cues and, most importantly, optimistic searching without giving up early. This is currently missing in state-of-the-art visual navigation methods based on Reinforcement Learning (RL). In this paper, we propose to use externally learned prior knowledge of relative object locations and integrate it into our model by constructing a neural graph. In order to efficiently incorporate the graph without increasing the state-space complexity, we propose the Graph-based Value Estimation (GVE) module. GVE provides a more accurate baseline for estimating the Advantage function in actor-critic RL algorithms. This results in reduced value estimation error and, consequently, convergence to a more optimal policy. Through empirical studies, we show that our agent, dubbed the optimistic agent, has a more realistic estimate of the state value during a navigation episode, which leads to a higher success rate. Our extensive ablation studies show the efficacy of our simple method, which achieves state-of-the-art results measured by the conventional visual navigation metrics, e.g. Success Rate (SR) and Success weighted by Path Length (SPL), in the AI2-THOR environment.
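For orientation, the role a baseline plays in actor-critic methods can be written out in generic textbook notation. This is an assumption-laden sketch, not the paper's own formulation: the symbols and the n-step return are standard choices, and how GVE actually computes the value estimate is not specified in the abstract.

```latex
% Generic n-step advantage estimate with a state-value baseline V
\hat{A}(s_t, a_t) = \hat{R}_t - V(s_t),
\qquad
\hat{R}_t = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k} + \gamma^{n} V(s_{t+n})
```

Since the value estimate appears both as the baseline and in the bootstrapped return, any error in it carries directly into the advantage estimate, which is why a more accurate value estimate can improve the policy update.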
Towards Optimal Correlational Object Search
In realistic applications of object search, robots will need to locate target objects in complex environments while coping with unreliable sensors, especially for small or hard-to-detect objects. In such settings, correlational information can be valuable for planning efficiently. Previous approaches that consider correlational information typically resort to ad-hoc, greedy search strategies. We introduce the Correlational Object Search POMDP (COS-POMDP), which models correlations while preserving optimal solutions with a reduced state space. We propose a hierarchical planning algorithm to scale up COS-POMDPs for practical domains. Our evaluation, conducted with the AI2-THOR household simulator and the YOLOv5 object detector, shows that our method finds objects more successfully and efficiently compared to baselines, particularly for hard-to-detect objects such as scrub brush and remote control.
Multimodal Aggregation Approach for Memory Vision-Voice Indoor Navigation with Meta-Learning
Vision and voice are two vital keys for agents’ interaction and learning. In this paper, we present a novel indoor navigation model called Memory Vision-Voice Indoor Navigation (MVV-IN), which receives voice commands and analyzes multimodal information of visual observation in order to enhance robots’ environment understanding. We make use of single RGB images taken by a first-view monocular camera. We also apply a self-attention mechanism to keep the agent focusing on key areas. Memory is important for the agent to avoid repeating certain tasks unnecessarily and to adapt adequately to new scenes; therefore, we make use of meta-learning. We have experimented with various functional features extracted from visual observation. Comparative experiments show that our methods outperform state-of-the-art baselines.
Semantic-Based Explainable AI: Leveraging Semantic Scene Graphs and Pairwise Ranking to Explain Robot Failures
When interacting in unstructured human environments, occasional robot failures are inevitable. When such failures occur, everyday people, rather than trained technicians, will be the first to respond. Existing natural language explanations hand-annotate contextual information from an environment to help everyday people understand robot failures. However, this methodology lacks generalizability and scalability. In our work, we introduce a more generalizable semantic explanation framework. Our framework autonomously captures the semantic information in a scene to produce semantically descriptive explanations for everyday users. To generate failure-focused explanations that are semantically grounded, we leverage both semantic scene graphs, to extract spatial relations and object attributes from an environment, and pairwise ranking. Our results show that these semantically descriptive explanations improve everyday users’ ability to both identify failures and provide assistance for recovery significantly more than the existing state-of-the-art context-based explanations.
Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models
Huy Ha, Shuran Song ArXiv 2022
We study open-world 3D scene understanding, a family of tasks that require agents to reason about their 3D environment with an open-set vocabulary and out-of-domain visual inputs, a critical skill for robots to operate in the unstructured 3D world. Towards this end, we propose Semantic Abstraction (SemAbs), a framework that equips 2D Vision-Language Models (VLMs) with new 3D spatial capabilities, while maintaining their zero-shot robustness. We achieve this abstraction using relevancy maps extracted from CLIP, and learn 3D spatial and geometric reasoning skills on top of those abstractions in a semantic-agnostic manner. We demonstrate the usefulness of SemAbs on two open-world 3D scene understanding tasks: 1) completing partially observed objects and 2) localizing hidden objects from language descriptions. Experiments show that SemAbs can generalize to novel vocabulary, materials/lighting, classes, and domains (i.e., real-world scans) from training on limited 3D synthetic data.
Pushing it out of the Way: Interactive Visual Navigation
We have observed significant progress in visual navigation for embodied agents. A common assumption in studying visual navigation is that the environments are static; this is a limiting assumption. Intelligent navigation may involve interacting with the environment beyond just moving forward/backward and turning left/right. Sometimes, the best way to navigate is to push something out of the way. In this paper, we study the problem of interactive navigation where agents learn to change the environment to navigate more efficiently to their goals. To this end, we introduce the Neural Interaction Engine (NIE) to explicitly predict the change in the environment caused by the agent’s actions. By modeling the changes while planning, we find that agents exhibit significant improvements in their navigational capabilities. More specifically, we consider two downstream tasks in the physics-enabled, visually rich AI2-THOR environment: (1) reaching a target while the path to the target is blocked, and (2) moving an object to a target location by pushing it. For both tasks, agents equipped with an NIE significantly outperform agents without an understanding of the effect of their actions, indicating the benefits of our approach. The code and dataset are available at
Towards Embodied Scene Description
Sinan Tan, Huaping Liu, Di Guo, Fuchun Sun Robotics: Science and Systems 2020
Embodiment is an important characteristic for all intelligent agents (creatures and robots), while existing scene description tasks mainly focus on analyzing images passively and the semantic understanding of the scenario is separated from the interaction between the agent and the environment. In this work, we propose the Embodied Scene Description, which exploits the embodiment ability of the agent to find an optimal viewpoint in its environment for scene description tasks. A learning framework with the paradigms of imitation learning and reinforcement learning is established to teach the intelligent agent to generate corresponding sensorimotor activities. The proposed framework is tested on both the AI2-THOR dataset and a real world robotic platform demonstrating the effectiveness and extendability of the developed method.
Hierarchical Control of Situated Agents through Natural Language
When humans perform a particular task, they do so hierarchically: splitting higher-level tasks into smaller sub-tasks. However, most works on natural language (NL) command of situated agents have treated the procedures to be executed as flat sequences of simple actions, or any hierarchies of procedures have been shallow at best. In this paper, we propose a formalism of procedures as programs, a method for representing hierarchical procedural knowledge for agent command and control aimed at enabling easy application to various scenarios. We further propose a modeling paradigm of hierarchical modular networks, which consist of a planner and reactors that convert NL intents to predictions of executable programs and probe the environment for information necessary to complete the program execution. We instantiate this framework on the IQA and ALFRED datasets for NL instruction following. Our model outperforms reactive baselines by a large margin on both datasets. We also demonstrate that our framework is more data-efficient, and that it allows for fast iterative development.
TIDEE: Tidying Up Novel Rooms using Visuo-Semantic Commonsense Priors
We introduce TIDEE, an embodied agent that tidies up a disordered scene based on learned commonsense object placement and room arrangement priors. TIDEE explores a home environment, detects objects that are out of their natural place, infers plausible object contexts for them, localizes such contexts in the current scene, and repositions the objects. Commonsense priors are encoded in three modules: i) visuo-semantic detectors that detect out-of-place objects, ii) an associative neural graph memory of objects and spatial relations that proposes plausible semantic receptacles and surfaces for object repositions, and iii) a visual search network that guides the agent’s exploration for efficiently localizing the receptacle-of-interest in the current scene to reposition the object. We test TIDEE on tidying up disorganized scenes in the AI2-THOR simulation environment. TIDEE carries out the task directly from pixel and raw depth input without ever having observed the same room beforehand, relying only on priors learned from a separate set of training houses. Human evaluations on the resulting room reorganizations show TIDEE outperforms ablative versions of the model that do not use one or more of the commonsense priors. On a related room rearrangement benchmark that allows the agent to view the goal state prior to rearrangement, a simplified version of our model outperforms a top-performing method by a large margin. Code and data are available at the project website.
(i) : A model common receptacle in the training set for the out-of-place object category.
(ii) WithoutMemex : A model that uses the scene but not the Memex for graph
(iii) 3DSmntMap2Place : A model proposes repositioning within the current scene by category label of out-of-place map farthest point set object placement The by confidence value and visited sequentially until any receptacle is found within the local of the proposed location.
(iv) RandomReceptacle : A model that selects as the target receptacle the first receptacle detected by a random exploration agent.
(v) AI2-THORPlacement : The location of the out-of-place (OOP) object in the original (tidy) AI2-THOR scene. The default object positions usually follow commonsense priors of scene arrangements.
(vi) MessyPlacement : The location of the OOP object in the messy scene.
A Deep Reinforcement Learning Based Mapless Navigation Algorithm Using Continuous Actions
In this paper, we propose a mapless navigation method for robots based on deep reinforcement learning with continuous actions, in order to investigate the effect of continuous actions on mapless robot navigation. Assuming that the robot's position can be easily obtained from an indoor localization system, the robot agent is trained in a simulation environment to learn a mapless navigation policy from only obstacle distances and the relative position of the target. Considering the constraints of robot motion in the real world, the agent chooses from a properly limited range of continuous actions, so it outputs a steering angle and a moving distance within those limits. Through experiments comparing against traditional discrete actions, we validate that continuous actions give the agent richer exploration, more flexible movements, and thus a higher probability of reaching the navigation target.
LEBP - Language Expectation & Binding Policy: A Two-Stream Framework for Embodied Vision-and-Language Interaction Task Learning Agents
People have long desired an embodied agent that can perform a task by understanding language instructions. Moreover, they also want to monitor agents and expect them to understand commands the way they intended. However, how to build such an embodied agent is still unclear. Recently, it has become possible to explore this problem with the Vision-and-Language Interaction benchmark ALFRED, which requires an agent to perform complicated daily household tasks following natural language instructions in unseen scenes. In this paper, we propose LEBP (Language Expectation with Binding Policy Module) to tackle ALFRED. LEBP contains a two-stream process: 1) It first uses a language expectation module to generate an expectation describing how to perform tasks by understanding the language instruction. The expectation consists of a sequence of sub-steps for the task (e.g., Pick an apple). The expectation allows people to access and check the agent's understanding of the instructions before it takes actual actions, in case the task might go wrong. 2) Then, it uses the binding policy module to bind the sub-steps in the expectation to actual actions in specific scenarios. Actual actions include navigation and object manipulation. Experimental results suggest our approach achieves performance comparable to currently published SOTA methods and avoids a large drop in performance from seen to unseen scenarios.
ForeSI: Success-Aware Visual Navigation Agent
M. M. K. Moghaddam, Ehsan Abbasnejad, Qi Wu, Qinfeng Shi, A. V. Hengel IEEE Workshop/Winter Conference on Applications of Computer Vision 2022
In this work, we present a method to improve the efficiency and robustness of previous model-free Reinforcement Learning (RL) algorithms for the task of object-goal visual navigation. Despite achieving state-of-the-art results, one of the major drawbacks of those approaches is the lack of a forward model that informs the agent about the potential consequences of its actions, i.e., being model-free. In this work, we augment model-free RL with such a forward model that can predict a representation of a future state, from the beginning of a navigation episode, if the episode were to be successful. Furthermore, for efficient training, we develop an algorithm to integrate a replay buffer into the model-free RL that alternates between training the policy and the forward model. We call our agent ForeSI; ForeSI is trained to imagine a future latent state that leads to success. By explicitly imagining such a state during navigation, our agent is able to take better actions, leading to two main advantages: first, in the absence of an object detector, ForeSI presents a more robust policy, i.e., it leads to about 5% absolute improvement on the Success Rate (SR); second, when combined with an off-the-shelf object detector to help better distinguish the target object, our method leads to about 3% absolute improvement on the SR and about 2% absolute improvement on Success weighted by inverse Path Length (SPL), i.e., presents higher efficiency.
LUMINOUS: Indoor Scene Generation for Embodied AI Challenges
Learning-based methods for training embodied agents typically require a large number of high-quality scenes that contain realistic layouts and support meaningful interactions. However, current simulators for Embodied AI (EAI) challenges only provide simulated indoor scenes with a limited number of layouts. This paper presents LUMINOUS, the first research framework that employs state-of-the-art indoor scene synthesis algorithms to generate large-scale simulated scenes for Embodied AI challenges. Further, we automatically and quantitatively evaluate the quality of generated indoor scenes via their ability to support complex household tasks. LUMINOUS incorporates a novel scene generation algorithm, Constrained Stochastic Scene Generation (CSSG), which achieves competitive performance with human-designed scenes. Within LUMINOUS, the EAI task executor, task instruction generation module, and video rendering toolkit can collectively generate a massive multimodal dataset of new scenes for the training and evaluation of Embodied AI agents. Extensive experimental results demonstrate the effectiveness of the data generated by LUMINOUS, enabling the comprehensive assessment of embodied agents on generalization and robustness. The full codebase and documentation of LUMINOUS are available at: https: //
VUSFA: Variational Universal Successor Features Approximator to Improve Transfer DRL for Target Driven Visual Navigation
In this paper, we show how novel transfer reinforcement learning techniques can be applied to the complex task of target driven navigation using the photorealistic AI2-THOR simulator. Specifically, we build on the concept of Universal Successor Features with an A3C agent. We introduce the novel architectural contribution of a Successor Feature Dependant Policy (SFDP) and adopt the concept of Variational Information Bottlenecks to achieve state of the art performance. VUSFA, our final architecture, is a straightforward approach that can be implemented using our open source repository. Our approach is generalizable, showed greater stability in training, and outperformed recent approaches in terms of transfer learning ability.
Multi-Agent Embodied Visual Semantic Navigation With Scene Prior Knowledge
In visual semantic navigation, the robot navigates to a target object using egocentric visual observations, given the class label of the target. It is a meaningful task that has inspired a surge of relevant research. However, most existing models are only effective for single-agent navigation, and a single agent has low efficiency and poor fault tolerance when conducting more complicated tasks. Multi-agent collaboration can improve efficiency and has strong application potential. In this letter, we propose multi-agent visual semantic navigation, in which multiple agents collaborate with others to find multiple target objects. It is a challenging task that requires agents to learn reasonable collaboration strategies to perform efficient exploration under the restrictions of communication bandwidth. We develop a hierarchical decision framework based on semantic mapping, scene prior knowledge, and a communication mechanism to solve this task. The experimental results in unseen scenes with both seen objects and unseen objects illustrate the higher accuracy and efficiency of the proposed model compared with the single-agent model.
Few-shot Subgoal Planning with Language Models
Lajanugen Logeswaran, Yao Fu, Honglak Lee North American Chapter of the Association for Computational Linguistics 2022
Pre-trained language models have shown successful progress in many text understanding benchmarks. This work explores the capability of these models to predict actionable plans in real-world environments. Given a text instruction, we show that language priors encoded in pre-trained models allow us to infer fine-grained subgoal sequences. In contrast to recent methods which make strong assumptions about subgoal supervision, our experiments show that language models can infer detailed subgoal sequences from few training sequences without any fine-tuning. We further propose a simple strategy to re-rank language model predictions based on interaction and feedback from the environment. Combined with pre-trained navigation and visual reasoning components, our approach demonstrates competitive performance on subgoal prediction and task completion in the ALFRED benchmark compared to prior methods that assume more subgoal supervision.
Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments
Difei Gao, Ruiping Wang, Ziyi Bai, Xilin Chen IEEE International Conference on Computer Vision 2021
Visual understanding goes well beyond the study of images or videos on the web. To achieve complex tasks in volatile situations, humans can deeply understand the environment, quickly perceive events happening around them, and continuously track objects’ state changes, all of which are still challenging for current AI systems. To equip AI systems with the ability to understand dynamic ENVironments, we build a video Question Answering dataset named Env-QA. Env-QA contains 23K egocentric videos, where each video is composed of a series of events about exploring and interacting in the environment. It also provides 85K questions to evaluate the ability to understand the composition, layout, and state changes of the environment presented by the events in videos. Moreover, we propose a video QA model, the Temporal Segmentation and Event Attention network (TSEA), which introduces event-level video representation and corresponding attention mechanisms to better extract environment information and answer questions. Comprehensive experiments demonstrate the effectiveness of our framework and show the formidable challenges of Env-QA in terms of long-term state tracking, multi-event temporal reasoning, and event counting.
Shaping embodied agent behavior with activity-context priors from egocentric video
Complex physical tasks entail a sequence of object interactions, each with its own preconditions, which can be difficult for robotic agents to learn efficiently solely through their own experience. We introduce an approach to discover activity-context priors from in-the-wild egocentric video captured with human-worn cameras. For a given object, an activity-context prior represents the set of other compatible objects that are required for activities to succeed (e.g., a knife and cutting board brought together with a tomato are conducive to cutting). We encode our video-based prior as an auxiliary reward function that encourages an agent to bring compatible objects together before attempting an interaction. In this way, our model translates everyday human experience into embodied agent skills. We demonstrate our idea using egocentric EPIC-Kitchens video of people performing unscripted kitchen activities to benefit virtual household robotic agents performing various complex tasks in AI2-iTHOR, significantly accelerating agent learning. Project page:
Counterfactual Depth from a Single RGB Image
We describe a method that predicts, from a single RGB image, a depth map that describes the scene when a masked object is removed. We call this "counterfactual depth"; it models hidden scene geometry together with the observations. Our method works for the same reason that scene completion works: the spatial structure of objects is simple. But we offer a much higher resolution representation of space than current scene completion methods, as we operate at pixel-level precision and do not rely on a voxel representation. Furthermore, we do not require RGBD inputs. Our method uses a standard encoder-decoder architecture, with a decoder modified to accept an object mask. We describe a small evaluation dataset that we have collected, which allows inference about what factors affect reconstruction most strongly. Using this dataset, we show that our depth predictions for masked objects are better than other baselines.
Object-oriented Targets for Visual Navigation using Rich Semantic Representations
When searching for an object, humans navigate through a scene using semantic information and spatial relationships. We look for an object using our knowledge of its attributes and relationships with other objects to infer the probable location. In this paper, we propose to tackle the visual navigation problem using rich semantic representations of the observed scene and object-oriented targets to train an agent. We show that both allow the agent to generalize to new targets and unseen scenes in a short amount of training time.
Reinforcement Learning for Sparse-Reward Object-Interaction Tasks in First-person Simulated 3D Environments
Learning how to execute complex tasks involving multiple objects in a 3D world is challenging when there is no ground-truth information about the objects or any demonstration to learn from. When an agent only receives a signal from task-completion, this makes it challenging to learn the object-representations which support learning the correct object-interactions needed to complete the task. In this work, we formulate learning an attentive object dynamics model as a classification problem, using random object-images to define incorrect labels for our object-dynamics model. We show empirically that this enables object-representation learning that captures an object's category (is it a toaster?), its properties (is it on?), and object-relations (is something inside of it?). With this, our core learner (a relational RL agent) receives the dense training signal it needs to rapidly learn object-interaction tasks. We demonstrate results in the 3D AI2-THOR simulated kitchen environment with a range of challenging food preparation tasks. We compare our method's performance to several related approaches and against the performance of an oracle: an agent that is supplied with ground-truth information about objects in the scene. We find that our agent achieves performance closest to the oracle in terms of both learning speed and maximum success rate.
What do navigation agents learn about their environment?
Kshitij Dwivedi, Gemma Roig, Aniruddha Kembhavi, Roozbeh Mottaghi Computer Vision and Pattern Recognition 2022
Today's state of the art visual navigation agents typically consist of large deep learning models trained end to end. Such models offer little to no interpretability about the learned skills or the actions of the agent taken in response to its environment. While past works have explored interpreting deep learning models, little attention has been devoted to interpreting embodied AI systems, which often involve reasoning about the structure of the environment, target characteristics and the outcome of one's actions. In this paper, we introduce the Interpretability System for Embodied agEnts (iSEE) for Point Goal and Object Goal navigation agents. We use iSEE to probe the dynamic representations produced by these agents for the presence of information about the agent as well as the environment. We demonstrate interesting insights about navigation agents using iSEE, including the ability to encode reachable locations (to avoid obstacles), visibility of the target, progress from the initial spawn location as well as the dramatic effect on the behaviors of agents when we mask out critical individual neurons.
Vision Memory for Target Object Navigation Using Deep Reinforcement Learning: An Empirical Study
Recently, a number of methods have combined deep neural networks and reinforcement learning to solve problems. Neural networks handle high-dimensional data well and are a good means of learning features, while reinforcement learning allows a system to learn optimal action control from experience and to adjust its behavior to new environments. The combination may thus be used in mobile robot navigation tasks, with a deep neural network for perception and reinforcement learning for control. We first investigate some issues in agents learning to navigate in indoor environments, such as how a robot memorizes visual information and how it discovers the environment. We show that agents can rely on certain positions with visual information in the navigation task to reduce training time and locate objects efficiently. Approaches based on deep reinforcement learning are discussed and proposed step by step to deal with each problem in the target object navigation task. The key ideas include adding checkpoints as vision memory and using an auxiliary learning task to help agents discover the environment.
A Simulator for Human-Robot Interaction in Virtual Reality
We present a suite of tools to model a robot, its sensors, and the surrounding environment in VR, with the goal of collecting training data for real-world robots. The virtual robot observes a rigged avatar created in our photogrammetry facility and embodying a VR user. We are particularly interested in verbal human/robot interactions, which can be combined with the robot’s sensor data for grounded language learning. Because virtual scenes, tasks, and robots are easily reconfigured compared to their physical analogs, our approach proves extremely versatile in preparing a wide range of robot scenarios for an array of use cases.
Target Driven Visual Navigation with Hybrid Asynchronous Universal Successor Representations
Being able to navigate to a target with minimal supervision and prior knowledge is critical to creating human-like assistive agents. Prior work on map-based and map-less approaches has limited generalizability. In this paper, we present a novel approach, Hybrid Asynchronous Universal Successor Representations (HAUSR), which overcomes the problem of generalizability to new goals by adapting recent work on Universal Successor Representations with Asynchronous Actor-Critic Agents. We show that the agent was able to successfully reach novel goals and that we were able to quickly fine-tune the network to adapt to new scenes. This opens up novel application scenarios where intelligent agents could learn from and adapt to a wide range of environments with minimal human input.
IFR-Explore: Learning Inter-object Functional Relationships in 3D Indoor Scenes
single-object or agent-object functionality to study a new kind of visual relationship that is also important to perceive and model: inter-object functional relationships (e.g., a switch on the wall turns on or off the light, a remote control operates the TV). Humans often spend little or no effort to infer these relationships, even when entering a new room, by using our strong prior knowledge (e.g., we know that buttons control electrical devices) or using only a few exploratory interactions in cases of uncertainty (e.g., multiple switches and lights in the same room). In this paper, we take the first step in building an AI system that learns inter-object functional relationships in 3D indoor environments, with key technical contributions of modeling prior knowledge by training over large-scale scenes and designing interactive policies for effectively exploring the training scenes and quickly adapting to novel test scenes. We create a new benchmark based on the AI2-THOR and PartNet datasets and perform extensive experiments that prove the effectiveness of our proposed method. Results show that our model successfully learns priors and fast-interactive-adaptation strategies for exploring inter-object functional relationships in complex 3D scenes. Several ablation studies further the usefulness of each proposed
Interactron: Embodied Adaptive Object Detection
Klemen Kotar, Roozbeh Mottaghi Computer Vision and Pattern Recognition 2022
Over the years, various methods have been proposed for the problem of object detection. Recently, we have witnessed great strides in this domain owing to the emergence of powerful deep neural networks. However, there are typically two main assumptions common among these approaches. First, the model is trained on a fixed training set and is evaluated on a pre-recorded test set. Second, the model is kept frozen after the training phase, so no further updates are performed after the training is finished. These two assumptions limit the applicability of these methods to real-world settings. In this paper, we propose Interactron, a method for adaptive object detection in an interactive setting, where the goal is to perform object detection in images observed by an embodied agent navigating in different environments. Our idea is to continue training during inference and adapt the model at test time without any explicit supervision via interacting with the environment. Our adaptive object detection model provides an 11.8 point improvement in AP (and 19.1 points in AP50) over DETR [5], a recent, high-performance object detector. Moreover, we show that our object detection model adapts to environments with completely different appearance characteristics, and its performance is on par with a model trained with full supervision for those environments. The code is available at:
Object Memory Transformer for Object Goal Navigation
This paper presents a reinforcement learning method for object goal navigation (ObjNav) where an agent navigates in 3D indoor environments to reach a target object based on long-term observations of objects and scenes. To this end, we propose Object Memory Transformer (OMT) that consists of two key ideas: 1) Object-Scene Memory (OSM) that enables storing long-term scene and object semantics, and 2) Transformer that attends to salient objects in the sequence of previously observed scenes and objects stored in OSM. This mechanism allows the agent to efficiently navigate in the indoor environment without prior knowledge about the environments, such as topological maps or 3D meshes. To the best of our knowledge, this is the first work that uses a long-term memory of object semantics in a goal-oriented navigation task. Experimental results conducted on the AI2-THOR dataset show that OMT outperforms previous approaches in navigating in unknown environments. In particular, we show that utilizing the long-term object semantics information improves the efficiency of navigation.
Memory-Based Parameterized Skills Learning for Mapless Visual Navigation
The recently-proposed reinforcement learning for mapless visual navigation can generate an optimal policy for searching different targets. However, most state-of-the-art deep reinforcement learning (DRL) models depend on hard rewards to learn the optimal policy, which can lead to the lack of previous diverse experiences. Moreover, these pre-trained DRL models cannot generalize well to un-trained tasks. To overcome these problems above, in this paper, we propose a Memory-based Parameterized Skills Learning (MPSL) model for mapless visual navigation. The parameterized skills in our MPSL are learned to predict critic parameters for un-trained tasks in actor-critic reinforcement learning, which can be achieved by transferring memory sequence knowledge from long short term memory network. In order to generalize into un-trained tasks, MPSL aims to capture more discriminative features by using a scene-specific layer. Finally, experiment results on an indoor photographic simulation framework AI2-THOR demonstrate the effectiveness of our proposed MPSL model, and the generalization ability to un-trained tasks.
Multi goals and multi scenes visual mapless navigation in indoor using meta-learning and scene priors
Fei Li, Chi Guo, Bin Luo, Huyin Zhang Neurocomputing 2021
The goal of visual mapless navigation is to navigate from a random starting point in a scene to a specified target in an unknown environment. A fundamental challenge in visual mapless navigation is generalizing to a novel environment, where the layout of the scenes and appearance of targets are unfamiliar. Furthermore, traditional navigation models are frozen during inference, resulting in poor adaptability. To address these issues, we propose a multi-goal and multi-scene visual mapless navigation model, which integrates meta-learning with spatial relationships between different object categories. In this way, our method not only improves the generalization on multiple goals in multiple scenes but also encourages effective navigation. Experimental results on the AI2-THOR dataset show that our approach significantly outperforms the state-of-the-art model SAVN by > 27.05% for the average success rate and by > 31.7% for the average SPL. Our source code and data of this paper are available at:
Knowledge-based Embodied Question Answering
In this paper, we propose a novel Knowledge-based Embodied Question Answering (K-EQA) task, in which the agent intelligently explores the environment to answer various questions with the knowledge. Unlike existing EQA work, which explicitly specifies the target object in the question, the agent can resort to external knowledge to understand more complicated questions such as “Please tell me what are objects used to cut food in the room?”, in which the agent must know the knowledge such as “knife is used
Deep Reinforcement Learning for Visual Semantic Navigation with Memory
Navigation is an important and highly complex activity for mobile robots in indoor environments. Approaches such as Deep Reinforcement Learning have been adopted for this purpose, on the premise of learning through experience and taking advantage of Deep Neural Networks such as Convolutional Networks, Graph Neural Networks, and Recurrent Networks. Based on the use of vision and semantic context applied in this work, the effects of adding Recurrent Networks to a learning-based navigation model are investigated, making it possible to learn better policies with the use of memory from past experiences. The results show that the proposed approach achieves better values on both qualitative and quantitative measures when compared to models without memory.
A Neural-Symbolic Approach for Object Navigation
Xiaotian Liu CVPR-W 2021
Object navigation refers to the task of discovering and locating objects in an unknown environment. End-to-end deep learning methods struggle at this task due to sparse rewards. In this work, we propose a simple neural-symbolic approach for object navigation in the AI2-THOR environment. Our method takes raw RGB images as input and uses a spatial memory graph as memory to store object and location information. The architecture consists of both a convolutional neural network for object detection and a spatial graph to represent the environment. By having a discrete graph representation of the environment, the agent can directly use search or planning algorithms as high-level reasoning engines. Model performance is evaluated on both task completion rate and steps required to reach target objects. Empirical results demonstrate that our approach can achieve performance close to the optimal. Our work builds a foundation for a neural-symbolic approach that can reason via unstructured visual cues.
Image Visual Sensor Used in Health-Care Navigation in Indoor Scenes Using Deep Reinforcement Learning (DRL) and Control Sensor Robot for Patients Data Health Information
Walead Kaled Seaman, S. Yavuz J. Medical Imaging Health Informatics 2021
Compared with traditional motion planners, deep reinforcement learning (DRL) has been applied more and more widely to sequential behavior control of mobile robots in indoor environments. Two issues of deep learning are addressed here: the inability to generalize to new sets of goals, and data inefficiency, that is, the model requires many (often costly) trial-and-error loops. Deep learning can impact a few key areas of medicine, and we explore how to build end-to-end systems; our discussion of computer vision focuses largely on medical imaging. In this paper, we address these two issues and apply the proposed model to target-driven visual navigation that generalizes to new goals. To tackle the first issue, we propose an actor-critic model whose policy is a function of the goal as well as the current state, which allows better generalization. To tackle the second issue, we use the AI2-THOR framework as the indoor simulation environment, which provides high-quality 3D scenes and a physics engine. Our framework allows agents to take actions and interact with objects. Hence, we are able to collect a large number of training samples efficiently with sequential decision making based on the RL framework. Healthcare and medicine in particular stand to benefit immensely from deep learning because of the sheer volume of data being generated. We used the behavioral cloning approach, which enables the agent to store an expert (or mentor) policy without relying on a reward function, for stability and generalization across targets.
SeanNet: Semantic Understanding Network for Localization Under Object Dynamics
We aim for domestic robots to operate indoors for long-term service. Under the object-level scene dynamics induced by human daily activities, a robot needs to robustly localize itself in the environment subject to scene uncertainties. Previous works have addressed visual-based localization in static environments, yet the object-level scene dynamics challenge existing methods on long-term deployment of the robot. This paper proposes SEmantic understANding Network (SeanNet) that enables robots to measure the similarity between two scenes on both visual and semantic aspects. We further develop a similarity-based localization method based on SeanNet for monitoring the progress of visual navigation tasks. In our experiments, we benchmarked SeanNet against baseline methods on scene similarity measures, as well as visual navigation performance once integrated with a visual navigator. We demonstrate that SeanNet outperforms all baseline methods, by robustly localizing the robot under object dynamics, thus reliably informing visual navigation about the task status.
Retrospectives on the Embodied AI Workshop
We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of-the-art models. We highlight commonalities between top approaches to the challenges and identify potential future directions for Embodied AI research.
Learning for Visual Navigation by Imagining the Success
Visual navigation is often cast as a reinforcement learning (RL) problem. Current methods typically result in a suboptimal policy that learns general obstacle avoidance and search behaviours. For example, in the target-object navigation setting, the policies learnt by traditional methods often fail to complete the task, even when the target is clearly within reach from a human perspective. In order to address this issue we propose to learn to imagine a latent representation of the successful (sub-)goal state. To do so, we have developed a module which we call Foresight Imagination (ForeSIT). ForeSIT is trained to imagine the recurrent latent representation of a future state that leads to success, e.g. either a sub-goal state that is important to reach before the target, or the goal state itself. By conditioning the policy on the generated imagination during training, our agent learns how to use this imagination to achieve its goal robustly. Our agent is able to imagine what the (sub-)goal state may look like (in the latent space) and can learn to navigate towards that state. We develop an efficient learning algorithm to train ForeSIT in an on-policy manner and integrate it into our RL objective. The integration is not trivial due to the constantly evolving state representation shared between both the imagination and the policy. We empirically observe that our method outperforms the state-of-the-art methods by a large margin in the commonly accepted benchmark AI2-THOR environment. Our method can be readily integrated or added to other model-free RL navigation frameworks.
Unbiased Directed Object Attention Graph for Object Navigation
Object navigation tasks require agents to locate specific objects in unknown environments based on visual information. Previously, graph convolutions were used to implicitly explore the relationships between objects. However, due to differences in visibility among objects, it is easy to generate biases in object attention. Thus, in this paper, we propose a directed object attention (DOA) graph to guide the agent in explicitly learning the attention relationships between objects, thereby reducing the object attention bias. In particular, we use the DOA graph to perform unbiased adaptive object attention (UAOA) on the object features and unbiased adaptive image attention (UAIA) on the raw images, respectively. To distinguish features in different branches, a concise adaptive branch energy distribution (ABED) method is proposed. We assess our methods on the AI2-THOR dataset. Compared with the state-of-the-art (SOTA) method, our method reports 7.4%, 8.1% and 17.6% increase in success rate (SR), success weighted by path length (SPL) and success weighted by action efficiency (SAE), respectively.
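Several abstracts in this list report success rate (SR) and success weighted by path length (SPL). As a rough illustration of how these two numbers are usually computed, here is a minimal sketch using the commonly cited SPL definition, SPL = (1/N) Σ S_i · l_i / max(p_i, l_i), where S_i is the binary success flag, l_i the shortest-path length, and p_i the length of the path the agent actually took. The function names and the sample episode values are hypothetical and are not taken from any of the listed papers; individual papers may define their metrics slightly differently.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the agent reached the target."""
    return sum(1 for s in successes if s) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    agent_lengths: Sequence[float],
) -> float:
    """Success weighted by path length, averaged over all episodes.

    Each successful episode contributes l / max(p, l), where l is the
    shortest-path length and p is the agent's path length. Failed
    episodes contribute 0.
    """
    total = 0.0
    for ok, shortest, taken in zip(successes, shortest_lengths, agent_lengths):
        if not ok:
            continue
        denom = max(taken, shortest)
        # Assumption: an episode that starts at the goal (zero-length
        # paths) counts as perfectly efficient.
        total += 1.0 if denom == 0 else shortest / denom
    return total / len(successes)


if __name__ == "__main__":
    # Made-up episodes, for illustration only.
    ok = [True, False, True]
    shortest = [2.0, 3.0, 4.0]
    taken = [4.0, 5.0, 4.0]
    print(f"SR  = {success_rate(ok):.3f}")
    print(f"SPL = {spl(ok, shortest, taken):.3f}")
```

Because the ratio is capped at 1 and failures score 0, SPL can never exceed SR on the same set of episodes.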
Utilising Prior Knowledge for Visual Navigation: Distil and Adapt
We, as humans, can impeccably navigate to localise a target object, even in an unseen environment. We argue that this impressive ability is largely due to incorporation of prior knowledge (or experience) and visual cues, which current visual navigation approaches lack. In this paper, we propose to use externally learned prior knowledge of object relations, which is integrated into our model via constructing a neural graph. To combine appropriate assessment of the states and the prior (knowledge), we propose to decompose the value function in the actor-critic reinforcement learning algorithm and incorporate the prior in the critic in a novel way that reduces the model complexity and improves model generalisation. Our approach outperforms the current state-of-the-art in the AI2-THOR visual navigation dataset.
ASC me to Do Anything: Multi-task Training for Embodied AI
Embodied AI has seen steady progress across a diverse set of independent tasks. While these varied tasks have different end goals, the basic skills required to complete them successfully overlap significantly. In this paper, our goal is to leverage these shared skills to learn to perform multiple tasks jointly. We propose Atomic Skill Completion (ASC), an approach for multi-task training for Embodied AI, where a set of atomic skills shared across multiple tasks are composed together to perform the tasks. The key to the success of this approach is a pre-training scheme that decouples learning of the skills from the high-level tasks making joint training effective. We use ASC to train agents within the AI2-THOR environment to perform four interactive tasks jointly, and find it to be remarkably effective. In a multi-task setting, ASC improves success rates by a factor of 2x on Seen scenes and 4x on Unseen scenes compared to no pre-training. Importantly, ASC enables us to train a multi-task agent that has a 52% higher Success Rate than training 4 independent single task agents. Finally, our hierarchical agents are more interpretable than traditional black box architectures.
JARVIS: A Neuro-Symbolic Commonsense Reasoning Framework for Conversational Embodied Agents
Building a conversational embodied agent to execute real-life tasks has been a long-standing yet quite challenging research goal, as it requires effective human-agent communication, multi-modal understanding, long-range sequential decision making, etc. Traditional symbolic methods have scaling and generalization issues, while end-to-end deep learning models suffer from data scarcity and high task complexity, and are often hard to explain. To benefit from both worlds, we propose JARVIS, a neuro-symbolic commonsense reasoning framework for modular, generalizable, and interpretable conversational embodied agents. First, it acquires symbolic representations by prompting large language models (LLMs) for language understanding and sub-goal planning, and by constructing semantic maps from visual observations. Then the symbolic module reasons for sub-goal planning and action generation based on task- and action-level common sense. Extensive experiments on the TEACh dataset validate the efficacy and efficiency of our JARVIS framework, which achieves state-of-the-art (SOTA) results on all three dialog-based embodied tasks, including Execution from Dialog History (EDH), Trajectory from Dialog (TfD), and Two-Agent Task Completion (TATC) (e.g., our method boosts the unseen Success Rate on EDH from 6.1% to 15.8%). Moreover, we systematically analyze the essential factors that affect the task performance and also demonstrate the superiority of our method in few-shot settings. Our JARVIS model ranks first in the Alexa Prize SimBot Public Benchmark Challenge.
Agent-Centric Relation Graph for Object Visual Navigation
Object visual navigation aims to steer an agent towards a target object based on visual observations of the agent. It is highly desirable to reasonably perceive the environment and accurately control the agent. In the navigation task, we introduce an Agent-Centric Relation Graph (ACRG) for learning the visual representation based on the relationships in the environment. ACRG is a highly effective and reasonable structure that consists of two relationships, i.e., the relationship among objects and the relationship between the agent and the target. On the one hand, we design the Object Horizontal Relationship Graph (OHRG) that stores the relative horizontal location among objects. Note that the vertical relationship is not involved in OHRG, and we argue that OHRG is suitable for the control strategy. On the other hand, we propose the Agent-Target Depth Relationship Graph (ATDRG) that enables the agent to perceive the distance to the target. To achieve ATDRG, we utilize image depth to represent the distance. Given the above relationships, the agent can perceive the environment and output navigation actions. Given the visual representations constructed by ACRG and position-encoded global features, the agent can capture the target position to perform navigation actions. Experimental results in the artificial environment AI2-THOR demonstrate that ACRG significantly outperforms other state-of-the-art methods in unseen testing environments.
Occupancy Map Prediction for Improved Indoor Robot Navigation
In the typical path planning pipeline for a ground robot, we build a map (e.g., an occupancy grid) of the environment as the robot moves around. While navigating indoors, a ground robot’s knowledge about the environment may be limited by the occlusions in its surroundings. Therefore, the map will have many as-yet-unknown regions that may need to be avoided by a conservative planner. Instead, if a robot is able to correctly infer what its surroundings and occluded regions look like, the navigation can be further optimized. In this work, we propose an approach using pix2pix and UNet to infer the occupancy grid in unseen areas near the robot as an image-to-image translation task. Our approach simplifies the task of occupancy map prediction for the deep learning network and reduces the amount of data required compared to similar existing methods. We show that the predicted map improves the navigation time in simulations over the existing approaches.
Robot in a China Shop: Using Reinforcement Learning for Location-Specific Navigation Behaviour
Robots need to be able to work in multiple different environments. Even when performing similar tasks, different behaviour should be deployed to best fit the current environment. In this paper, we propose a new approach to navigation, where it is treated as a multi-task learning problem. This enables the robot to learn to behave differently in visual navigation tasks for different environments while also learning shared expertise across environments. We evaluated our approach in both simulated environments and on real-world data. Our method allows our system to converge with a 26% reduction in training time, while also increasing accuracy.
Dynamic Value Estimation for Single-Task Multi-Scene Reinforcement Learning
Training deep reinforcement learning agents on environments with multiple levels / scenes / conditions from the same task, has become essential for many applications aiming to achieve generalization and domain transfer from simulation to the real world. While such a strategy is helpful with generalization, the use of multiple scenes significantly increases the variance of samples collected for policy gradient computations. Current methods continue to view this collection of scenes as a single Markov Decision Process (MDP) with a common value function; however, we argue that it is better to treat the collection as a single environment with multiple underlying MDPs. To this end, we propose a dynamic value estimation (DVE) technique for these multiple-MDP environments, motivated by the clustering effect observed in the value function distribution across different scenes. The resulting agent is able to learn a more accurate and scene-specific value function estimate (and hence the advantage function), leading to a lower sample variance. Our proposed approach is simple to accommodate with several existing implementations (like PPO, A3C) and results in consistent improvements for a range of ProcGen environments and the AI2-THOR framework based visual navigation task.
Messing Up 3D Virtual Environments: Transferable Adversarial 3D Objects
In the last few years, the scientific community has shown a remarkable and increasing interest in 3D Virtual Environments, training and testing Machine Learning-based models in realistic virtual worlds. On one hand, these environments could also become a means to study the weaknesses of Machine Learning algorithms, or to simulate training settings that allow Machine Learning models to gain robustness to 3D adversarial attacks. On the other hand, their growing popularity might also attract those that aim at creating adversarial conditions to invalidate the benchmarking process, especially in the case of public environments that allow the contribution from a large community of people. Most of the existing Adversarial Machine Learning approaches are focused on static images, and little work has been done in studying how to deal with 3D environments and how a 3D object should be altered to fool a classifier that observes it. In this paper, we study how to craft adversarial 3D objects by altering their textures, using a tool chain composed of easily accessible elements. We show that it is possible, and indeed simple, to create adversarial objects using off-the-shelf limited surrogate renderers that can compute gradients with respect to the parameters of the rendering process, and, to a certain extent, to transfer the attacks to more advanced 3D engines. We propose a saliency-based attack that intersects the two classes of renderers in order to focus the alteration to those texture elements that are estimated to be effective in the target engine, evaluating its impact in popular neural classifiers.
Object Manipulation via Visual Target Localization
Kiana Ehsani, Ali Farhadi, Aniruddha Kembhavi, Roozbeh Mottaghi European Conference on Computer Vision 2022
Object manipulation is a critical skill required for Embodied AI agents interacting with the world around them. Training agents to manipulate objects poses many challenges. These include occlusion of the target object by the agent’s arm, noisy object detection and localization, and the target frequently going out of view as the agent moves around in the scene. We propose Manipulation via Visual Object Location Estimation (m-VOLE), an approach that explores the environment in search for target objects, computes their 3D coordinates once they are located, and then continues to estimate their 3D locations even when the objects are not visible, thus robustly aiding the task of manipulating these objects throughout the episode. Our evaluations show a massive 3x improvement in success rate over a model that has access to the same sensory suite but is trained without the object location estimator, and our analysis shows that our agent is robust to noise in depth perception and agent localization. Importantly, our proposed approach relaxes several assumptions about idealized localization and perception that are commonly employed by recent works in navigation and manipulation – an important step towards training agents for object manipulation in the real world. Our code and data are available at
Head Pose as a Proxy for Gaze in Virtual Reality
In this work, we posit that a user’s head pose can serve as a proxy for gaze in a VR object selection task. We describe a study in which participants were asked to describe a series of objects in a known order, providing approximate labels for the focus of attention. The participants’ head pose was then evaluated as a function of the position and orientation of the headset, and how closely that pose matched the location of known objects was calculated. The object that most closely matched the gaze was then evaluated using a mean reciprocal ranking. We demonstrate that using a concept of gaze derived from head pose can be used to effectively narrow the set of objects that are the target of participants’ attention.
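The abstract above scores candidate objects with a mean reciprocal ranking. For readers unfamiliar with the metric, the following is a minimal sketch of the standard mean reciprocal rank (MRR) computation, not the authors' evaluation code: each trial yields a list of objects ordered from best to worst gaze match, and the trial scores 1/rank of the object the participant was actually describing. The function name and the object labels are hypothetical.

```python
from typing import Hashable, Sequence


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]],
    targets: Sequence[Hashable],
) -> float:
    """Average of 1/rank of the true target across trials.

    rankings[i] lists candidate objects for trial i, best match first.
    A trial whose target does not appear in its ranking contributes 0.
    """
    total = 0.0
    for ranked, target in zip(rankings, targets):
        for position, candidate in enumerate(ranked, start=1):
            if candidate == target:
                total += 1.0 / position
                break
    return total / len(targets)


if __name__ == "__main__":
    # Made-up trials: the described object is always "mug".
    rankings = [
        ["mug", "lamp", "book"],   # target ranked 1st
        ["lamp", "mug", "book"],   # target ranked 2nd
        ["book", "lamp", "mug"],   # target ranked 3rd
    ]
    targets = ["mug", "mug", "mug"]
    print(f"MRR = {mean_reciprocal_rank(rankings, targets):.4f}")
```

An MRR close to 1 means the head-pose ranking usually puts the described object first; lower values mean the target tends to appear further down the candidate list.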
SGL: Symbolic Goal Learning in a Hybrid, Modular Framework for Human Instruction Following
This paper investigates human instruction following for robotic manipulation via a hybrid, modular system with symbolic and connectionist elements. Symbolic methods build modular systems with semantic parsing and task planning modules for producing sequences of actions from natural language requests. Modern connectionist methods employ deep neural networks that learn visual and linguistic features for mapping inputs to a sequence of low-level actions, in an end-to-end fashion. The hybrid, modular system blends these two approaches to create a modular framework: it formulates instruction following as symbolic goal learning via deep neural networks followed by task planning via symbolic planners. Connectionist and symbolic modules are bridged with Planning Domain Definition Language. The vision-and-language learning network predicts its goal representation, which is sent to a planner for producing a task-completing action sequence. For improving the flexibility of natural language, we further incorporate implicit human intents with explicit human instructions. To learn generic features for vision and language, we propose to separately pretrain vision and language encoders on scene graph parsing and semantic textual similarity tasks. Benchmarking evaluates the impacts of different components of, or options for, the vision-and-language learning model and shows the effectiveness of pretraining strategies. Manipulation experiments conducted in the simulator AI2-THOR show the robustness of the framework to novel scenarios.
Target-driven indoor visual navigation using inverse reinforcement learning
Xitong Wang, Qiang Fang, Xin Xu Other Conferences 2020
Deep reinforcement learning has greatly simplified visual navigation by utilizing the end-to-end network training strategy. Unlike previous navigation methods, which build upon high-precision maps, deep reinforcement learning-based methods enable real-time navigation by taking only one image as input at a time. As such, deep reinforcement learning-based navigation methods are applicable to a variety of applications in the robotics and vision communities, thanks to their lightweight computational cost. Despite these advantages, however, these methods still suffer from inefficient data exploration and poor convergence in network training. In this paper, we propose to use inverse reinforcement learning to solve the problem, which can provide more accurate and efficient guidance for decision-making. The proposed method is able to learn a more effective reward function from less training data. Experiments demonstrated that the proposed method achieves a higher navigation success rate and produces paths that are more similar to the optimal ones compared to the reinforcement learning baselines.
ION: Instance-level Object Navigation
Visual object navigation is a fundamental task in Embodied AI. Previous works focus on the category-wise navigation, in which navigating to any possible instance of target object category is considered a success. Those methods may be effective to find the general objects. However, it may be more practical to navigate to the specific instance in our real life, since our particular requirements are usually satisfied with specific instances rather than all instances of one category. How to navigate to the specific instance has been rarely researched before and is typically challenging to current works. In this paper, we introduce a new task of Instance Object Navigation (ION), where instance-level descriptions of targets are provided and instance-level navigation is required. In particular, multiple types of attributes such as colors, materials and object references are involved in the instance-level descriptions of the targets. In order to allow the agent to maintain the ability of instance navigation, we propose a cascade framework with Instance-Relation Graph (IRG) based navigator and instance grounding module. To specify the different instances of the same object categories, we construct instance-level graph instead of category-level one, where instances are regarded as nodes, encoded with the representation of colors, materials and locations (bounding boxes). During navigation, the detected instances can activate corresponding nodes in IRG, which are updated with graph convolutional neural network (GCNN). The final instance prediction is obtained with the grounding module by selecting the candidates (instances) with maximum probability (a joint probability of category, color and material, obtained by corresponding regressors with softmax). 
For the task evaluation, we build a benchmark for instance-level object navigation on AI2-THOR simulator, where over 27,735 object instance descriptions and navigation groundtruth are automatically obtained through the interaction with the simulator. The proposed model outperforms the baseline in instance-level metrics, showing that our proposed graph model can guide instance object navigation, as well as leaving promising room for further improvement. The project is available at
A Simple Approach for Visual Rearrangement: 3D Mapping and Semantic Search
Physically rearranging objects is an important capability for embodied agents. Visual room rearrangement evaluates an agent’s ability to rearrange objects in a room to a desired goal based solely on visual input. We propose a simple yet effective method for this problem: (1) search for and map which objects need to be rearranged, and (2) rearrange each object until the task is complete. Our approach consists of an off-the-shelf semantic segmentation model, voxel-based semantic map, and semantic search policy to efficiently find objects that need to be rearranged. On the AI2-THOR Rearrangement Challenge, our method improves on current state-of-the-art end-to-end reinforcement learning-based methods that learn visual rearrangement policies from 0.53% correct rearrangement to 16.56%, using only 2.7% as many samples from the environment. the accuracy of the perception model, the budget for exploration, and the size of objects being rearranged.
Learning Embeddings that Capture Spatial Semantics for Indoor Navigation
Incorporating domain-specific priors in search and navigation tasks has shown promising results in improving generalization and sample complexity over end-to-end trained policies. In this work, we study how object embeddings that capture spatial semantic priors can guide search and navigation tasks in a structured environment. We know that humans can search for an object like a book or a plate in an unseen house, based on the spatial semantics of bigger objects detected. For example, a book is likely to be on a bookshelf or a table, whereas a plate is likely to be in a cupboard or dishwasher. We propose a method to incorporate such spatial semantic awareness in robots by leveraging pre-trained language models and multi-relational knowledge bases as object embeddings. We demonstrate using these object embeddings to search for a query object in an unseen indoor environment. We measure the performance of these embeddings in an indoor simulator (AI2-THOR). We further evaluate different pre-trained embeddings on Success Rate (SR) and Success weighted by Path Length (SPL). Code is available at:
On Grounded Planning for Embodied Tasks with Language Models
Language models (LMs) are shown to have commonsense knowledge of the physical world, which is fundamental for completing tasks in everyday situations. However, it is still an open question whether LMs have the ability to generate grounded, executable plans for embodied tasks. It is very challenging because LMs do not have an “eye” or “hand” to perceive a realistic environment. In this work, we present the first study on this important research question. We first present a novel problem formulation named G-PlanET, which takes as input a high-level goal and a table of objects in a specific environment. The expected output is a plan consisting of step-by-step instructions for agents to execute. To enable the study of this problem, we establish an evaluation protocol and devise a dedicated metric for assessing the quality of plans. In our extensive experiments, we show that adding flattened tables for encoding environments and using an iterative decoding strategy can both improve the LMs’ ability for grounded planning. Our analysis of the results also leads to interesting non-trivial findings.
Deep Reinforcement Learning Visual Navigation Model Integrating Memory-prediction Mechanism
Deep reinforcement learning (DRL) has been widely used in the field of visual navigation. However, due to the lack of adaptability of DRL to the new tasks, the generalization ability of current visual navigation model using DRL is not desired. In order to improve this deficiency, we introduce the memory-prediction mechanism. By enhancing the memory of the scene, and combining the past experience of navigation to predict the next state, a more reasonable action can be obtained. First, we pass the image features extracted during the navigation process to an LSTM, and use LSTM to memorize the scene information in the image features. Then, we combine all the information (including state, target, and action) of each time step in the navigation process, and pass the historical information of multiple time steps to another LSTM to predict the next state. The action performed by the robot is determined by the predicted state. We use the AI2-THOR framework to carry out experiments. The results show that the proposed method can improve the navigation performance of the DRL visual navigation model and improve its adaptability to new tasks.
Deep Reinforcement Learning with New-Field Exploration for Navigation in Detour Environment
Deep Reinforcement Learning (DRL) has made great progress in recent years with the development of many related research areas, such as Deep Learning. Researchers have trained agents to achieve human-level and even beyond human-level scores in video games by using DRL. In the field of robotics, DRL can also achieve satisfactory performance for the navigation task when the environment is relatively simple. However, when environments become complex, e.g., the detour ones, the DRL system often fails to attain good results. To tackle this problem, we propose an internal reward obtaining method called the New-Field-Explore (NFE) mechanism, which can navigate a robot from initial position to target position without collision in detour environments. We also present a benchmark suite based on the AI2-THOR environment for robot navigation in complex detour environments. The proposed method is evaluated in these environments by comparing the performance of state-of-the-art algorithms with or without the NFE mechanism. Experimental results show the above reward is effective for mobile robot navigation tasks in detour indoor environments.
Matching options to tasks using Option-Indexed Hierarchical Reinforcement Learning
The options framework in Hierarchical Reinforcement Learning breaks down overall goals into a combination of options or simpler tasks and associated policies, allowing for abstraction in the action space. Ideally, these options can be reused across different higher-level goals; indeed, such reuse is necessary to realize the vision of a continual learning agent that can effectively leverage its prior experience. Previous approaches have only proposed limited forms of transfer of pre-learned options to new task settings. We propose a novel option indexing approach to hierarchical learning (OI-HRL), where we learn an affinity function between options and the items present in the environment. This allows us to effectively reuse a large library of pretrained options, in zero-shot generalization at test time, by restricting goal-directed learning to only those options relevant to the task at hand. We develop a meta-training loop that learns the representations of options and environments over a series of HRL problems, by incorporating feedback about the relevance of retrieved options to the higher-level goal. We evaluate OI-HRL in two simulated settings – the CraftWorld and AI2-THOR environments – and show that we achieve performance competitive with oracular baselines, and substantial gains over a baseline that has the entire option pool available for learning the hierarchical policy.
Learning Parameterized Task Structure for Generalization to Unseen Entities
Real world tasks are hierarchical and compositional. Tasks can be composed of multiple subtasks (or sub-goals) that are dependent on each other. These subtasks are defined in terms of entities (e.g., "apple", "pear") that can be recombined to form new subtasks (e.g., "pickup apple", and "pickup pear"). To solve these tasks efficiently, an agent must infer subtask dependencies (e.g. an agent must execute "pickup apple" before "place apple in pot"), and generalize the inferred dependencies to new subtasks (e.g. "place apple in pot" is similar to "place apple in pan"). Moreover, an agent may also need to solve unseen tasks, which can involve unseen entities. To this end, we formulate parameterized subtask graph inference (PSGI), a method for modeling subtask dependencies using first-order logic with factored entities. To facilitate this, we learn parameter attributes in a zero-shot manner, which are used as quantifiers (e.g. is_pickable(X)) for the factored subtask graph. We show this approach accurately learns the latent structure on hierarchical and compositional tasks more efficiently than prior work, and show PSGI can generalize by modelling structure on subtasks unseen during adaptation.
Do People Trust Robots that Learn in the Home?
It is not scalable for assistive robotics to have all functionalities pre-programmed prior to user introduction. Instead, it is more realistic for agents to perform supplemental on-site learning. This opportunity to learn user and environment particularities is especially helpful for care robots that assist with individualized caregiver activities in residential or nursing home environments. Many assistive robots, ranging in complexity from Roomba to Pepper, already conduct some of their learning in the home, observable to the user. We lack an understanding of how witnessing this learning impacts the user. Thus, we propose to assess end-user attitudes towards the concept of embodied robots that conduct some learning in the home as compared to robots that are delivered fully-capable. In this virtual, between-subjects study, we recruit end users (care-givers and care-takers) from nursing homes, and investigate user trust in three different domains: navigation, manipulation, and preparation. Informed by the first study where we identify agent learning as a key factor in determining trust, we propose a second study to explore how to modulate that trust. This second, in-person study investigates the effectiveness of apologies, explanations of robot failure, and transparency of learning at improving trust in embodied learning robots.
Towards Disturbance-Free Visual Mobile Manipulation
Deep reinforcement learning has shown promising results on an abundance of robotic tasks in simulation, including visual navigation and manipulation. Prior work generally aims to build embodied agents that solve their assigned tasks as quickly as possible, while largely ignoring the problems caused by collision with objects during interaction. This lack of prioritization is understandable: there is no inherent cost in breaking virtual objects. As a result, “well-trained” agents frequently collide with objects before achieving their primary goals, a behavior that would be catastrophic in the real world. In this paper, we study the problem of training agents to complete the task of visual mobile manipulation in the ManipulaTHOR environment while avoiding unnecessary collision (disturbance) with objects. We formulate disturbance avoidance as a penalty term in the reward function, but find that directly training with such penalized rewards often results in agents being unable to escape poor local optima. Instead, we propose a two-stage training curriculum where an agent is first allowed to freely explore and build basic competencies without penalization, after which a disturbance penalty is introduced to refine the agent’s behavior. Results on testing scenes show that our curriculum not only avoids these poor local optima, but also leads to 10% absolute gains in success rate without disturbance, compared to our state-of-the-art baselines. Moreover, our curriculum is significantly more performant than a safe RL algorithm that casts collision avoidance as a constraint. Finally, we propose a novel disturbance-prediction auxiliary task that accelerates learning.
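The two-stage curriculum in this abstract (train without a disturbance penalty first, then switch the penalty on) can be sketched as a reward function whose penalty coefficient depends on the training step. This is a minimal sketch under stated assumptions: the hard switch at a fixed step, the parameter names and the default values are hypothetical and are not taken from the paper.

```python
def disturbance_coefficient(step, stage_one_steps, penalty=1.0):
    """Return 0 during the free-exploration stage, `penalty` afterwards.

    Assumption: the curriculum switches stages at a fixed training step.
    """
    return 0.0 if step < stage_one_steps else penalty

def curriculum_reward(task_reward, disturbance, step, stage_one_steps, penalty=1.0):
    """Task reward minus a disturbance penalty that is only active in stage two.

    disturbance: non-negative measure of how much the agent moved objects
                 it was not supposed to touch (hypothetical scalar).
    """
    coeff = disturbance_coefficient(step, stage_one_steps, penalty)
    return task_reward - coeff * disturbance
```

For example, `curriculum_reward(1.0, 0.5, step=100, stage_one_steps=1000)` leaves the task reward untouched, while the same call with `step=2000` subtracts the penalty.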
Prompter: Utilizing Large Language Model Prompting for a Data Efficient Embodied Instruction Following
Embodied Instruction Following (EIF) studies how mobile manipulator robots should be controlled to accomplish long-horizon tasks specified by natural language instructions. While most research on EIF is conducted in simulators, the ultimate goal of the field is to deploy the agents in real life. As such, it is important to minimize the data cost required for training an agent, to help the transition from sim to real. However, many studies only focus on the performance and overlook the data cost: modules that require separate training on extra data are often introduced without consideration of deployability. In this work, we propose FILM++, which extends the existing work FILM [1] with modifications that do not require extra data. While all data-driven modules are kept constant, FILM++ more than doubles FILM’s performance. Furthermore, we propose Prompter, which replaces FILM++’s semantic search module with language model prompting. Unlike FILM++’s implementation that requires training on extra sets of data, no training is needed for our prompting based implementation while achieving better or at least comparable performance. Prompter achieves 42.64% and 45.72% on the ALFRED benchmark with high-level instructions only and with step-by-step instructions, respectively, outperforming the previous state of the art by 6.57% and 10.31%.
A Framework to Co-Optimize Robot Exploration and Task Planning in Unknown Environments
Robots often need to accomplish complex tasks in unknown environments, which is a challenging problem, involving autonomous exploration for acquiring necessary scene knowledge and task planning. In traditional approaches, the agent first explores the environment to instantiate a complete planning domain and then invokes a symbolic planner to plan and perform high-level actions. However, task execution is inefficient since the two processes involve many repetitive states and actions. Hence, this letter proposes a framework to co-optimize robot exploration and task planning in unknown environments. To afford robot exploration and symbolic planning not being independent and separated, we design a unified structure named subtask, which is exploited to decompose the robot exploration and planning phases. To select the appropriate subtask each time, we develop a value function and a value-based scheduler to co-optimize exploration and task processing. Our framework is evaluated in a photo-realistic simulator with three complex household tasks, increasing task efficiency by 25%–29%.
Search for or Navigate to? Dual Adaptive Thinking for Object Navigation
“Search for” or “Navigate to”? When finding an object, the two choices always come up in our subconscious mind. Before seeing the target, we search for the target based on experience. After seeing the target, we remember the target location and navigate to it. However, recent methods in the object navigation field almost exclusively use object association to enhance the “search for” phase while neglecting the importance of the “navigate to” phase. Therefore, this paper proposes the dual adaptive thinking (DAT) method to flexibly adjust the different thinking strategies at different navigation stages. Dual thinking includes search thinking with the object association ability and navigation thinking with the target location ability. To make the navigation thinking more effective, we design the target-oriented memory graph (TOMG) to store historical target information and the target-aware multi-scale aggregator (TAMSA) to encode the relative target position. We assess our methods on the AI2-THOR dataset. Compared with the state-of-the-art (SOTA) method, our method reports 10.8%, 21.5% and 15.7% increase in success rate (SR), success weighted by path length (SPL) and success weighted by navigation efficiency (SNE), respectively.
Are you doing what I say? On modalities alignment in ALFRED
ALFRED is a recently proposed benchmark that requires a model to complete tasks in simulated house environments specified by instructions in natural language. We hypothesize that key to success is accurately aligning the text modality with visual inputs. Motivated by this, we inspect how well existing models can align these modalities using our proposed intrinsic metric, boundary adherence score (BAS). The results show the previous models are indeed failing to perform proper alignment. To address this issue, we introduce approaches aimed at improving model alignment and demonstrate how improved alignment improves end task performance.
Learning to Explore, Navigate and Interact for Visual Room Rearrangement
Intelligent agents for visual room rearrangement aim to reach a goal room configuration from a cluttered room configuration via a sequence of interactions. For successful visual room rearrangement, the agents need to learn to explore, navigate and interact within the surrounding environments. Contemporary methods for visual room rearrangement display unsatisfactory performance even with state-of-the-art techniques for embodied AI. One of the causes for the low performance arises from the expensive cost of learning in an end-to-end manner. To overcome the limitation, we design a three-phased modular architecture (TMA) for visual room rearrangement. TMA performs visual room rearrangement in three phases: the exploration phase, the inspection phase, and the rearrangement phase. The proposed TMA maximizes the performance by placing the learning modules along with hand-crafted feature engineering modules, retaining the advantage of learning while reducing the cost of learning.
Good Time to Ask: A Learning Framework for Asking for Help in Embodied Visual Navigation
In reality, it is often more efficient to ask for help than to search the entire space to find an object with an unknown location. We present a learning framework that enables an agent to actively ask for help in such embodied visual navigation tasks, where the feedback informs the agent of where the goal is in its view. To emulate the real-world scenario that a teacher may not always be present, we propose a training curriculum where feedback is not always available. We formulate an uncertainty measure of where the goal is and use empirical results to show that through this approach, the agent learns to ask for help effectively while remaining robust when feedback is not available.
Target-based Visual Navigation with Channel-aware Network
Huichao Li, X. Ren, Y. Lv DDCLS 2019
Visual navigation is major content of robot control, especially those based on target. We propose a channel-aware deep siamese actor-critic network for target-based visual navigation task. Compared with previous target-driven network, our model can obtain a joint representation of siamese network's output feature by using distance fusion method, and the approach significantly accelerates convergence of model's training. We improve the model performance to make the agent reach the goal in shorter path during the navigation by inserting a modified Squeeze-and-Excitation block in siamese layers, in which way the model can take dependencies between visual feature channels into consideration.
VSGM - Enhance robot task understanding ability through visual semantic graph
In recent years, developing AI for robotics has attracted much attention. The interaction of vision and language for robots is particularly difficult. We consider that giving robots an understanding of visual semantics and language semantics will improve inference ability. In this paper, we propose a novel method, VSGM (Visual Semantic Graph Memory), which uses the semantic graph to obtain better visual image features and improve the robot’s visual understanding ability. By providing prior knowledge to the robot and detecting the objects in the image, it predicts the correlation between the attributes of the object and the objects, converts them into a graph-based representation, and maps the objects in the image to a top-down egocentric map. Finally, the important object features of the current task are extracted by Graph Neural Networks. The method proposed in this paper is verified on the ALFRED (Action Learning From Realistic Environments and Directives) dataset. In this dataset, the robot needs to perform daily indoor household tasks following the required language instructions. After VSGM is added to the model, the task success rate can be improved by 6~10%.
Success-Aware Visual Navigation Agent
This work presents a method to improve the efficiency and robustness of the previous model-free Reinforcement Learning (RL) algorithms for the task of object-target visual navigation. Despite achieving the state-of-the-art results, one of the major drawbacks of those approaches is the lack of a forward model that informs the agent about the potential consequences of its actions, i.e. being model-free. In this work we take a step towards augmenting the model-free methods with a forward model that is trained along with the policy, using a replay buffer, and can predict a successful future state of an episode in a challenging 3D navigation environment. We develop a module that can predict a representation of a future state, from the beginning of a navigation episode, if the episode were to be successful; we call this ForeSIM module. ForeSIM is trained to imagine a future latent state that leads to success. Therefore, during navigation, the policy is able to take better actions leading to two main advantages: first, in the absence of an object detector, ForeSIM leads mainly to a more robust policy, e.g. about 5% absolute improvement on success rate; second, when combined with an off-the-shelf object detector to help better distinguish the target object, ForeSIM leads to about 3% absolute improvement on success rate and about 2% absolute improvement on Success weighted by inverse Path Length (SPL), i.e. higher efficiency.
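Several abstracts in this list report success rate (SR) and Success weighted by (inverse) Path Length (SPL). For reference, the sketch below computes both using the commonly cited SPL definition, the mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is the success indicator, l_i the shortest-path length and p_i the length of the path the agent took. The episode dictionary layout and key names are assumptions for illustration; individual papers may use slightly different conventions.

```python
def success_rate(episodes):
    """Fraction of episodes that ended in success."""
    return sum(1 for e in episodes if e["success"]) / len(episodes)

def spl(episodes):
    """Success weighted by path length, averaged over all episodes.

    Each episode is a dict with (hypothetical) keys:
      success  - bool, whether the agent reached the target
      shortest - length of the shortest path to the target
      taken    - length of the path the agent actually took
    """
    total = 0.0
    for e in episodes:
        if not e["success"]:
            continue  # failed episodes contribute zero
        denom = max(e["taken"], e["shortest"])
        # Guard against a zero-length episode (agent starts at the goal).
        total += 1.0 if denom == 0 else e["shortest"] / denom
    return total / len(episodes)
```

An agent that succeeds in one of two episodes while taking twice the shortest path has an SR of 0.5 and an SPL of 0.25, which shows how SPL penalizes inefficient paths even when the episode succeeds.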
Active Object Search
In this work, we investigate an Active Object Search (AOS) task that is not explicitly addressed in the literature. It aims to actively perform as few action steps as possible to search and locate the target object in a 3D indoor scene. Different from classic object detection that passively receives visual information, this task encourages an intelligent agent to perform active search via reasonable action planning; thus it can better recall the target objects, especially for the challenging situations that the target is far from the agent, blocked by an obstacle and out of view. To handle this AOS task, we formulate a reinforcement learning framework that consists of a 3D object detector, a state controller and a cross-modal action planner to work cooperatively to find out the target object with minimal action steps. During training, we design a novel cost-sensitive active search reward that penalizes inaccurate object search and redundant action steps. To evaluate this novel task, we construct an Active Object Search (AOS) benchmark that contains 5,845 samples from 30 diverse indoor scenes. We conduct extensive qualitative and quantitative evaluations on this benchmark to demonstrate the effectiveness of the proposed approach and analyze the key factors that contribute more to address this task.
EvEntS ReaLM: Event Reasoning of Entity States via Language Models
This paper investigates models of event implications. Specifically, how well models predict entity state-changes, by targeting their understanding of physical attributes. Nominally, Large Language models (LLM) have been exposed to procedural knowledge about how objects interact, yet our benchmarking shows they fail to reason about the world. Conversely, we also demonstrate that existing approaches often misrepresent the surprising abilities of LLMs via improper task encodings and that proper model prompting can dramatically improve performance of reported baseline results across multiple tasks. In particular, our results indicate that our prompting technique is especially useful for unseen attributes (out-of-domain) or when only limited data is available.
Learning to Act with Affordance-Aware Multimodal Neural SLAM
Recent years have witnessed an emerging paradigm shift toward embodied artificial intelligence, in which an agent must learn to solve challenging tasks by interacting with its environment. There are several challenges in solving embodied multimodal tasks, including long-horizon planning, vision-and-language grounding, and efficient exploration. We focus on a critical bottleneck, namely the performance of planning and navigation. To tackle this challenge, we propose a Neural SLAM approach that, for the first time, utilizes several modalities for exploration, predicts an affordance-aware semantic map, and plans over it at the same time. This significantly improves exploration efficiency, leads to robust long-horizon planning, and enables effective vision-and-language grounding. With the proposed Affordance-aware Multimodal Neural SLAM (AMSLAM) approach, we obtain more than 40% improvement over prior published work on the ALFRED benchmark and set a new state-of-the-art generalization performance at a success rate of 23.48% on the test unseen scenes.
SAILenv: Learning in Virtual Visual Environments Made Simple
Recently, researchers in Machine Learning algorithms, Computer Vision scientists, engineers and others, showed a growing interest in 3D simulators as a mean to artificially create experimental settings that are very close to those in the real world. However, most of the existing platforms to interface algorithms with 3D environments are often designed to setup navigation-related experiments, to study physical interactions, or to handle ad-hoc cases that are not thought to be customized, sometimes lacking a strong photorealistic appearance and an easy-to-use software interface. In this paper, we present a novel platform, SAILenv, that is specifically designed to be simple and customizable, and that allows researchers to experiment visual recognition in virtual 3D scenes. A few lines of code are needed to interface every algorithm with the virtual world, and non-3D-graphics experts can easily customize the 3D environment itself, exploiting a collection of photorealistic objects. Our framework yields pixel-level semantic and instance labeling, depth, and, to the best of our knowledge, it is the only one that provides motion-related information directly inherited from the 3D engine. The client-server communication operates at a low level, avoiding the overhead of HTTP-based data exchanges. We perform experiments using a state-of-the-art object detector trained on real-world images, showing that it is able to recognize the photorealistic 3D objects of our environment. The computational burden of the optical flow compares favourably with the estimation performed using modern GPU-based convolutional networks or more classic implementations. We believe that the scientific community will benefit from the easiness and high-quality of our framework to evaluate newly proposed algorithms in their own customized realistic conditions.
Object-oriented Map Exploration and Construction Based on Auxiliary Task Aided DRL
Environment exploration by autonomous robots through deep reinforcement learning (DRL) based methods has attracted more and more attention. However, existing methods usually focus on robot navigation to single or multiple fixed goals, while ignoring the perception and construction of external environments. In this paper, we propose a novel environment exploration task based on DRL, which requires a robot fast and completely perceives all objects of interest, and reconstructs their poses in a global environment map, as much as the robot can do. To this end, we design an auxiliary task aided DRL model, which is integrated with the auxiliary object detection and 6-DoF pose estimation components. The outcome of auxiliary tasks can improve the learning speed and robustness of DRL, as well as the accuracy of object pose estimation. Comprehensive experimental results on the indoor simulation platform AI2-THOR have shown the effectiveness and robustness of our method.
DoRO: Disambiguation of Referred Object for Embodied Agents
Robotic task instructions often involve a referred object that the robot must locate (ground) within the environment. While task intent understanding is an essential part of natural language understanding, less effort is made to resolve ambiguity that may arise while grounding the task. Existing works use vision-based task grounding and ambiguity detection, suitable for a fixed view and a static robot. However, the problem magnifies for a mobile robot, where the ideal view is not known beforehand. Moreover, a single view may not be sufficient to locate all the object instances in the given area, which leads to inaccurate ambiguity detection. Human intervention is helpful only if the robot can convey the kind of ambiguity it is facing. In this article, we present DoRO (Disambiguation of Referred Object), a system that can help an embodied agent to disambiguate the referred object by raising a suitable query whenever required. Given an area where the intended object is, DoRO finds all the instances of the object by aggregating observations from multiple views while exploring & scanning the area. It then raises a suitable query using the information from the grounded object instances. Experiments conducted with the AI2-THOR simulator show that DoRO not only detects the ambiguity more accurately but also raises verbose queries with more accurate information from the visual-language grounding.
Moment-based Adversarial Training for Embodied Language Comprehension
Shintaro Ishikawa, Komei Sugiura International Conference on Pattern Recognition 2022
In this paper, we focus on a vision-and-language task in which a robot is instructed to execute household tasks. Given an instruction such as "Rinse off a mug and place it in the coffee maker," the robot is required to locate the mug, wash it, and put it in the coffee maker. This is challenging because the robot needs to break down the instruction sentences into subgoals and execute them in the correct order. On the ALFRED benchmark, the performance of state-of-the-art methods is still far lower than that of humans. This is partially because existing methods sometimes fail to infer subgoals that are not explicitly specified in the instruction sentences. We propose Moment-based Adversarial Training (MAT), which uses two types of moments for perturbation updates in adversarial training. We introduce MAT to the embedding spaces of the instruction, subgoals, and state representations to handle their varieties. We validated our method on the ALFRED benchmark, and the results demonstrated that our method outperformed the baseline method for all the metrics on the benchmark.
Embodied Multi-Agent Task Planning from Ambiguous Instruction
Xinzhu Liu, Xinghang Li, Di Guo, Sinan Tan, Huaping Liu, F. Sun Robotics: Science and Systems XVIII 2022
In human-robot collaboration scenarios, a human would give robots an instruction that is intuitive for the human himself to accomplish. However, the instruction given to robots is likely ambiguous for them to understand as some information is implicit in the instruction. Therefore, it is necessary for the robots to jointly reason the operation details and perform the embodied multi-agent task planning given the ambiguous instruction. This problem exhibits significant challenges in both language understanding and dynamic task planning with the perception information. In this work, an embodied multi-agent task planning framework is proposed to utilize external knowledge sources and dynamically perceived visual information to resolve the high-level instructions, and dynamically allocate the decomposed tasks to multiple agents. Furthermore, we utilize the semantic information to perform environment perception and generate sub-goals to achieve the navigation motion. This model effectively bridges the difference between the simulation environment and the physical environment, thus it can be simultaneously applied in both simulation and physical scenarios and avoid the notorious sim2real problem. Finally, we build a benchmark dataset to validate the embodied multi-agent task planning problem, which includes three types of high-level instructions in which some target objects are implicit in instructions. We perform the evaluation experiments on the simulation platform and in physical scenarios, demonstrating that the proposed model can achieve promising results for multi-agent collaborative tasks.
Role of reward shaping in object-goal navigation
Deep reinforcement learning approaches have been a popular method for visual navigation tasks in the computer vision and robotics community of late. In most cases, the reward function has a binary structure, i.e., a large positive reward is provided when the agent reaches the goal state, and a negative step penalty is assigned for every other state in the environment. A sparse signal like this makes the learning process challenging, especially in big environments, where a large number of sequential actions need to be taken to reach the target. We introduce a reward shaping mechanism which gradually adjusts the reward signal based on distance to the goal. Detailed experiments conducted using the AI2-THOR simulation environment demonstrate the efficacy of the proposed approach for object-goal navigation tasks.
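The contrast between the binary reward and a distance-based shaped reward described in this abstract can be illustrated with a minimal sketch. It is not the authors' formulation: the function names, the constants (`goal_reward`, `step_penalty`, `shaping_scale`), and the use of the per-step reduction in distance as the shaping term are all assumptions made for illustration.

```python
import math


def binary_reward(reached_goal, goal_reward=10.0, step_penalty=-0.01):
    """Sparse reward: a large bonus at the goal, a small penalty elsewhere."""
    return goal_reward if reached_goal else step_penalty


def shaped_reward(prev_dist, curr_dist, reached_goal,
                  goal_reward=10.0, step_penalty=-0.01, shaping_scale=1.0):
    """Hypothetical shaped reward: the step penalty plus a bonus proportional
    to how much the last action reduced the distance to the goal."""
    if reached_goal:
        return goal_reward
    progress = prev_dist - curr_dist  # positive when the agent moved closer
    return step_penalty + shaping_scale * progress


def euclidean_distance(a, b):
    """Straight-line distance between two (x, z) positions."""
    return math.hypot(a[0] - b[0], a[1] - b[1])
```

With these assumed constants, a step that moves the agent closer to the goal earns more than the bare step penalty, and a step that moves it away earns less, so the agent receives a learning signal before it ever reaches the target.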
Visual Experience-Based Question Answering with Complex Multimodal Environments
Incheol Kim Mathematical Problems in Engineering 2020
This paper proposes a novel visual experience-based question answering problem (VEQA) and the corresponding dataset for embodied intelligence research that requires an agent to do actions, understand 3D scenes from successive partial input images, and answer natural language questions about its visual experiences in real time. Unlike the conventional visual question answering (VQA), the VEQA problem assumes both partial observability and dynamics of a complex multimodal environment. To address this VEQA problem, we propose a hybrid visual question answering system, VQAS, integrating a deep neural network-based scene graph generation model and a rule-based knowledge reasoning system. The proposed system can generate more accurate scene graphs for dynamic environments with some uncertainty. Moreover, it can answer complex questions through knowledge reasoning with rich background knowledge. Results of experiments using a photo-realistic 3D simulated environment, AI2-THOR, and the VEQA benchmark dataset prove the high performance of the proposed system.
Tracking and Planning with Spatial World Models
We introduce a method for real-time navigation and tracking with differentiably rendered world models. Learning models for control has led to impressive results in robotics and computer games, but this success has yet to be extended to vision-based navigation. To address this, we transfer advances in the emergent field of differentiable rendering to model-based control. We do this by planning in a learned 3D spatial world model, combined with a pose estimation algorithm previously used in the context of TSDF fusion, but now tailored to our setting and improved to incorporate agent dynamics. We evaluate over six simulated environments based on complex human-designed floor plans and provide quantitative results. We achieve up to 92% navigation success rate at a frequency of 15 Hz using only image and depth observations under stochastic, continuous dynamics.
We propose a framework to continuously learn object-centric representations for visual learning and understanding. Existing object-centric representations either rely on supervisions that individualize objects in the scene, or perform unsupervised disentanglement that can hardly deal with complex scenes in the real world. To mitigate the annotation burden and relax the constraints on the statistical complexity of the data, our method leverages interactions to effectively sample diverse variations of an object and the corresponding training signals while learning the object-centric representations. Throughout learning, objects are streamed one by one in random order with unknown identities, and are associated with latent codes that can synthesize discriminative weights for each object through a convolutional hypernetwork. Moreover, re-identification of learned objects and forgetting prevention are employed to make the learning process efficient and robust. We perform an extensive study of the key features of the proposed framework and analyze the characteristics of the learned representations. Furthermore, we demonstrate the capability of the proposed framework in learning representations that can improve label efficiency in downstream tasks. Our code and trained models will be made publicly available.
Understanding Natural Language in Context
Recent years have seen an increasing number of applications that have a natural language interface, either in the form of chatbots or via personal assistants such as Alexa (Amazon), Google Assistant, Siri (Apple), and Cortana (Microsoft). To use these applications, a basic dialog between the robot and the human is required. While this kind of dialog exists today mainly within "static" robots that do not make any movement in the household space, the challenge of reasoning about the information conveyed by the environment increases significantly when dealing with robots that can move and manipulate objects in our home environment. In this paper, we focus on cognitive robots [9], which have some knowledge-based models of the world and operate by reasoning and planning with this model. Thus, when the robot and the human communicate, there is already some formalism they can use – the robot’s knowledge representation formalism. Our goal in this research is to translate natural language utterances into this robot’s formalism, allowing much more complicated household tasks to be completed. We do so by combining off-the-shelf SOTA language models, planning tools, and the robot’s knowledge-base for better communication. In addition, we analyze different directive types and illustrate the contribution of the world’s context to the translation process.
Learning Safety-Aware Policy with Imitation Learning for Context-Adaptive Navigation
This paper presents an Imitation Learning (IL) based visual navigation system, which can guide robots navigating from a start position to a goal location without any explicit map. We pay close attention to the safety issue caused by partial observability and data distribution mismatch: when the robot meets incomplete or unfamiliar states, it is likely to perform an unsafe action, making lifelong robot navigation hard to achieve. In this paper, a sequence-to-sequence (Seq2seq) deep neural network is built to enhance the agent’s context-awareness in partially-observable conditions and boost the model’s adaptability to unseen scenarios. Additionally, we propose Uncertainty-Aware Imitation Learning (UAIL), which explicitly estimates model uncertainty and actively requests experts to label samples according to the uncertainty with On-Policy IL. Simulations demonstrated that the combined method, Safety-Aware Imitation Learning (SAIL), achieves 35.6% shorter expected moving steps and 22% fewer collisions in goal-driven visual navigation compared with current counterparts. With the learned safer policy, SAIL has been successfully adapted to unseen environments with minimal navigation performance loss.
Summarizing a virtual robot's past actions in natural language
We demonstrate the task of giving natural language summaries of a robotic agent’s actions in a virtual environment. Existing datasets that match robot actions with natural language descriptions designed for instruction following tasks can be repurposed to serve as a training ground for robot action summarization. We propose and test several methods of learning to generate such summaries, starting from either egocentric video frames of the robot taking actions or text representations of the actions, and find that a two-stage summarization process which uses structured language as an intermediate step improves accuracy. Quantitative and qualitative evaluations of the results are provided to serve as a baseline for future work.
Exploring multi-view perspectives on deep reinforcement learning agents for embodied object navigation in virtual home environments
Recent years have brought the exploration of embodied reinforcement learning agents in a variety of domains. One of the advantages of artificial agents is that they can obtain visual inputs simultaneously using multiple input devices. This work explores multi-view reinforcement learning for object navigation tasks in 3D rendered virtual home environments using AI2-THOR. We trained CNN-based Deep Q-learning embodied agents with egocentric, allocentric, and combined egocentric-allocentric perspectives to locate an object in an unknown environment. We compared the results of the three RL agents, and evaluated them by both reward improvement rate and reward obtained. We demonstrate that the egocentric perspective allows for faster reward accumulation in the earlier episodes, whereas the allocentric agents obtained better long-term rewards. Interesting results arise from the combined allocentric and egocentric perspective, where we found that the agent had the best overall results by harnessing the benefits of each perspective. The results show that while single perspective embodied agents each have their own advantages, combining both inputs yields the best overall reward. Our findings provide a foundation and benchmark for building embodied RL agents with multi-view perspectives.
Embodied Referring Expression for Manipulation Question Answering in Interactive Environment
Embodied agents are expected to perform more complicated tasks in an interactive environment, with the progress of Embodied AI in recent years. Existing embodied tasks including Embodied Referring Expression (ERE) and other QA-form tasks mainly focus on interaction in terms of linguistic instruction. Therefore, enabling the agent to manipulate objects in the environment for exploration actively has become a challenging problem for the community. To solve this problem, we introduce a new embodied task: Remote Embodied Manipulation Question Answering (REMQA) to combine ERE with manipulation tasks. In the REMQA task, the agent needs to navigate to a remote position and perform manipulation with the target object to answer the question. We build a benchmark dataset for the REMQA task in the AI2-THOR simulator. To this end, a framework with 3D semantic reconstruction and modular network paradigms is proposed. The evaluation of the proposed framework on the REMQA dataset is presented to validate its effectiveness.
Improving the Robustness to Variations of Objects and Instructions with a Neuro-Symbolic Approach for Interactive Instruction Following
An interactive instruction following task has been proposed as a benchmark for learning to map natural language instructions and first-person vision into sequences of actions to interact with objects in 3D environments. We found that an existing end-to-end neural model for this task tends to fail to interact with objects of unseen attributes and follow various instructions. We assume that this problem is caused by the high sensitivity of neural feature extraction to small changes in vision and language inputs. To mitigate this problem, we propose a neuro-symbolic approach that utilizes high-level symbolic features, which are robust to small changes in raw inputs, as intermediate representations. We verify the effectiveness of our model with the subtask evaluation on the ALFRED benchmark. Our experiments show that our approach significantly outperforms the end-to-end neural model by 9, 46, and 74 points in the success rate on the ToggleObject, PickupObject, and SliceObject subtasks in unseen environments respectively.
Multi-Agent Asynchronous Cooperation with Hierarchical Reinforcement Learning
Hierarchical multi-agent reinforcement learning (MARL) has shown significant learning efficiency by searching for policies over higher-level, temporally extended actions (options). However, standard policy gradient-based MARL methods have difficulty generalizing to option-based scenarios due to the asynchronous executions of multi-agent options. In this work, we propose a mathematical framework to enable policy gradient optimization over asynchronous multi-agent options by adjusting option-based policy distribution as well as trajectory probability. We study our method under a set of multi-agent cooperative setups with varying inter-dependency levels, and evaluate the effectiveness of our method on typical option-based multi-agent cooperation tasks.
Head Pose for Object Deixis in VR-Based Human-Robot Interaction
Modern robotics heavily relies on machine learning and has a growing need for training data. Advances and commercialization of virtual reality (VR) present an opportunity to use VR as a tool to gather such data for human-robot interactions. We present the Robot Interaction in VR simulator, which allows human participants to interact with simulated robots and environments in real-time. We are particularly interested in spoken interactions between the human and robot, which can be combined with the robot’s sensory data for language grounding. To demonstrate the utility of the simulator, we describe a study which investigates whether a user’s head pose can serve as a proxy for gaze in a VR object selection task. Participants were asked to describe a series of known objects, providing approximate labels for the focus of attention. We demonstrate that using a concept of gaze derived from head pose can be used to effectively narrow the set of objects that are the target of participants’ attention and linguistic descriptions.
Good Time to Ask: A Learning Framework for Asking for Help in Embodied Visual Navigation
In reality, it is often more efficient to ask for help than to search the entire space to find an object with an unknown location. We present a learning framework that enables an agent to actively ask for help in such embodied visual navigation tasks, where the feedback informs the agent of where the goal is in its view. To emulate the real-world scenario that a teacher may not always be present, we propose a training curriculum where feedback is not always available. We formulate an uncertainty measure of where the goal is and use empirical results to show that through this approach, the agent learns to ask for help effectively while remaining robust when feedback is not available.
Ask4Help: Learning to Leverage an Expert for Embodied Tasks
Embodied AI agents continue to become more capable every year with the advent of new models, environments, and benchmarks, but are still far away from being performant and reliable enough to be deployed in real, user-facing applications. In this paper, we ask: can we bridge this gap by enabling agents to ask for assistance from an expert such as a human being? To this end, we propose the Ask4Help policy that augments agents with the ability to request, and then use, expert assistance. Ask4Help policies can be efficiently trained without modifying the original agent’s parameters and learn a desirable trade-off between task performance and the amount of requested help, thereby reducing the cost of querying the expert. We evaluate Ask4Help on two different tasks, object goal navigation and room rearrangement, and see substantial improvements in performance using minimal help. On object navigation, an agent that achieves a 52% success rate is raised to 86% with 13% help, and for rearrangement, the state-of-the-art model with a 7% success rate is dramatically improved to 90.4% using 39% help. Human trials with Ask4Help demonstrate the efficacy of our approach in practical scenarios. We release the code for Ask4Help here: https: // .
Dialogue Object Search
We envision robots that can collaborate and communicate seamlessly with humans. It is necessary for such robots to decide both what to say and how to act, while interacting with humans. To this end, we introduce a new task, dialogue object search: A robot is tasked to search for a target object (e.g. fork) in a human environment (e.g., kitchen), while engaging in a “video call” with a remote human who has additional but inexact knowledge about the target’s location. That is, the robot conducts speech-based dialogue with the human, while sharing the image from its mounted camera. This task is challenging at multiple levels, from data collection, algorithm and system development, to evaluation. Despite these challenges, we believe such a task blocks the path towards more intelligent and collaborative robots. In this extended abstract, we motivate and introduce the dialogue object search task and analyze examples collected from a pilot study. We then discuss our next steps and conclude with several challenges on which we hope to receive feedback.
Evaluating Continual Learning Algorithms by Generating 3D Virtual Environments
Continual learning refers to the ability of humans and animals to incrementally learn over time in a given environment. Trying to simulate this learning process in machines is a challenging task, also due to the inherent difficulty in creating conditions for designing continuously evolving dynamics that are typical of the real-world. Many existing research works usually involve training and testing of virtual agents on datasets of static images or short videos, considering sequences of distinct learning tasks. However, in order to devise continual learning algorithms that operate in more realistic conditions, it is fundamental to gain access to rich, fully-customizable and controlled experimental playgrounds. Focussing on the specific case of vision, we thus propose to leverage recent advances in 3D virtual environments in order to approach the automatic generation of potentially lifelong dynamic scenes with photo-realistic appearance. Scenes are composed of objects that move along variable routes with different and fully customizable timings, and randomness can also be included in their evolution. A novel element of this paper is that scenes are described in a parametric way, thus allowing the user to fully control the visual complexity of the input stream the agent perceives. These general principles are concretely implemented exploiting a recently published 3D virtual environment. The user can generate scenes without the need of having strong skills in computer graphics, since all the generation facilities are exposed through a simple high-level Python interface. We publicly share the proposed generator.
On the Limits of Evaluating Embodied Agent Model Generalization Using Validation Sets
Natural language guided embodied task completion is a challenging problem since it requires understanding natural language instructions, aligning them with egocentric visual observations, and choosing appropriate actions to execute in the environment to produce desired changes. We experiment with augmenting a transformer model for this task with modules that effectively utilize a wider field of view and learn to choose whether the next step requires a navigation or manipulation action. We observed that the proposed modules resulted in improved, and in fact state-of-the-art performance on an unseen validation set of a popular benchmark dataset, ALFRED. However, our best model selected using the unseen validation set underperforms on the unseen test split of ALFRED, indicating that performance on the unseen validation set may not in itself be a sufficient indicator of whether model improvements generalize to unseen test sets. We highlight this result as we believe it may be a wider phenomenon in machine learning tasks but primarily noticeable only in benchmarks that limit evaluations on test splits, and highlights the need to modify benchmark design to better account for variance in model performance.
Leveraging Semantics for Incremental Learning in Multi-Relational Embeddings
Service robots benefit from encoding information in semantically meaningful ways to enable more robust task execution. Prior work has shown multi-relational embeddings can encode semantic knowledge graphs to promote generalizability and scalability, but only within a batched learning paradigm. We present Incremental Semantic Initialization (ISI), an incremental learning approach that enables novel semantic concepts to be initialized in the embedding in relation to previously learned embeddings of semantically similar concepts. We evaluate ISI on mined AI2-THOR and MatterPort3D datasets; our experiments show that on average ISI improves immediate query performance by 41.4%. Additionally, ISI methods on average reduced the number of epochs required to approach model convergence by 78.2%.
Action-Insensitive Embodied Visual Navigation
Embodied visual navigation is an important task that the agent learns to navigate to a specific target object based on egocentric visual observations, by performing specific actions in the environment. However, there exists a problem of mismatch between the training and testing action spaces through learning methods, and methods used to solve this problem have been scarcely developed. In this paper, we propose a novel problem of the action-insensitive embodied visual navigation task with different action spaces of the agent between the training and testing process. A robust adversary learning framework is built to learn a general and robust policy that can adapt properly to different action spaces. The proposed model in the first-stage adversary training learns a robust feature representation of the agent’s states and transfers the trained strategy to new action spaces with fewer training samples in the second-stage adaptation training. Experiments on 3D indoor scenes validate the effectiveness of the proposed approach.
Long-Sighted Imitation Learning for Partially Observable Control
Imitation Learning (IL) has facilitated many effective and efficient controllers for autonomous agents. Nevertheless, current methods suffer from severe partial observability problems when given incomplete observations, leading to short-sighted behaviors in decision-making tasks. To overcome these shortcomings, this paper presents a Long-Sighted Imitation Learning approach by expanding visual perception and memory size. Firstly, we utilize a deep siamese network that takes both the current observation and the goal state as input, which is especially effective on goal-oriented tasks. Furthermore, inspired by the success of Deep Recurrent Q-Network (DRQN), we introduce recurrency into imitation learning by appending a Gated Recurrent Unit (GRU) layer right after the last fully-connected layer. Extensive experiments on goal-oriented navigation tasks demonstrate that our method outperforms current counterparts.
Collision Anticipation via Deep Reinforcement Learning for Visual Navigation
Visual navigation is the ability of an autonomous agent to find its way in a large and complex environment based on visual information. It is indeed a fundamental problem in computer vision and robotics. In this paper, we propose a deep reinforcement learning approach which is able to learn to navigate a scene to reach a given visual target while anticipating possible collisions with the environment. Technically, we propose a map-less model, which follows an actor-critic reinforcement learning method where the reward function has been designed to be collision aware. We offer a thorough experimental evaluation of our solution in the AI2-THOR virtual environment, where the results show that our proposed method: (1) improves the state of the art in terms of number of steps and collisions; (2) is able to converge faster than a model which does not care about the collisions, simply searching for the shortest paths; and (3) offers an interesting generalization capability to reach visual targets that have never been seen during training.
DANLI: Deliberative Agent for Following Natural Language Instructions
Recent years have seen an increasing amount of work on embodied AI agents that can perform tasks by following human language instructions. However, most of these agents are reactive, meaning that they simply learn and imitate behaviors encountered in the training data. These reactive agents are insufficient for long-horizon complex tasks. To address this limitation, we propose a neuro-symbolic deliberative agent that, while following language instructions, proactively applies reasoning and planning based on its neural and symbolic representations acquired from past experience (e.g., natural language and egocentric vision). We show that our deliberative agent achieves greater than 70% improvement over reactive baselines on the challenging TEACh benchmark. Moreover, the underlying reasoning and planning processes, together with our modular framework, offer impressive transparency and explainability to the behaviors of the agent. This enables an in-depth understanding of the agent’s capabilities, which sheds light on challenges and opportunities for future embodied agents for instruction following. The code is available at https://github.com/sled-group/DANLI.
Planning Large-scale Object Rearrangement Using Deep Reinforcement Learning
Object rearrangement is about moving a set of objects from an initial state to a goal state through task and motion planning. Existing methods either show poor scalability in the number of objects they can handle, or do not generalize well across situations, or need explicit running buffers to avoid collisions during placements. In this paper, we propose a deep-RL based task planning method to solve large-scale object rearrangement problems. Given the source and target state of objects in the form of images, our method determines a collision-free object movement plan. Our method produces a feasible plan in a discrete-continuous action space, where picking a selected object is a discrete action followed by a set of continuous actions to place the object. We propose a novel hierarchical dense reward structure to train our deep-RL network to make our method more sample efficient using the AI2-THOR simulator. We show that our method works well on unseen publicly available datasets and on a publicly available simulation environment such as PyBullet, thereby demonstrating the superiority of our method in terms of generalizability. To the best of our knowledge, our method is the first one that demonstrates the rearrangement across different scenarios from 2D surfaces such as tabletops to 3D rooms with a large number of objects and without any explicit need of buffer space.