Article

Incorporating External Knowledge Reasoning for Vision-and-Language Navigation with Assistant’s Help

College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(14), 7053; https://doi.org/10.3390/app12147053
Submission received: 9 June 2022 / Revised: 5 July 2022 / Accepted: 9 July 2022 / Published: 13 July 2022

Abstract

Vision-and-Language Navigation (VLN) is a task designed to enable embodied agents to carry out natural language instructions in realistic environments. Most VLN tasks, however, are guided by an elaborate set of instructions described step by step. This setting deviates from real-world problems, in which humans only describe the object and its surroundings and allow the robot to ask for help when required. Vision-based Navigation with Language-based Assistance (VNLA) is a recently proposed task that requires an agent to navigate and find a target object according to a high-level language instruction. Due to the lack of step-by-step navigation guidance, the key to VNLA is to conduct goal-oriented exploration. In this paper, we design an Attention-based Knowledge-enabled Cross-modality Reasoning with Assistant’s Help (AKCR-AH) model to address the unique challenges of this task. AKCR-AH learns a generalized navigation strategy from three new perspectives: (1) external commonsense knowledge is incorporated into visual relational reasoning, so as to take the proper action at each viewpoint by learning the internal–external correlations among object- and room-entities; (2) a simulated human assistant is introduced in the environment, who provides direct intervention assistance when required; (3) a memory-based Transformer architecture is adopted as the policy framework to make full use of the history clues stored in memory tokens for exploration. Extensive experiments demonstrate the effectiveness of our method compared with other baselines.

1. Introduction

Recent advances in Computer Vision (CV) and Natural Language Processing (NLP) have garnered growing interest in developing general-purpose AI systems. Mapping natural language instructions, combined with the visual environment, to actions is vital for developing robotic and embodied agents that can mimic human behavior in the real world. Vision-and-Language Navigation (VLN) is one such task: it requires an agent to correctly navigate to a goal location in a photo-realistic simulation environment by following natural language instructions. Specifically, the agent is randomly embodied in a realistic 3D environment, such as that provided by the Matterport3D Simulator [1], and given a task instruction. At each time step t, the agent observes its surrounding environment and moves around until reaching the end point.
Various methods have been proposed to address the VLN task. With the rise of BERT-based models, recent work on VLN has shown strong performance by directly modeling cross-modal relationships with the Transformer [2]. There is also interactive VLN work based on dialogue, where an agent may ask for guidance and an oracle can respond to direct the agent to the destination [3]. These navigation approaches focus on reaching the target by following detailed step-by-step instructions. In real-world applications, however, people often prefer to provide concise instructions and expect robots to identify objects in the visual content and infer their relationships for self-exploration and autonomous decision-making. Ref. [4] proposed a valuable task named Vision-based Navigation with Language-based Assistance (VNLA) to further facilitate the general embodied-AI field. The goal of VNLA is to enable an agent in a realistic 3D environment to find a target object according to a given high-level instruction and to obtain additional language assistance when navigation becomes difficult. The language-based assistance from the advisor simulates actual human help, improving the agent’s navigation performance. However, several challenges must be tackled to achieve the VNLA task: (1) In pursuit of practicality, VNLA only annotates high-level instructions such as “Find an armchair in the living room”. This is more natural and closer to the needs of daily human life, but also more challenging: the agent needs to conduct goal-oriented exploration in the environment rather than strict instruction-following. (2) It may be hard for the agent to find the exact location of a target object using only the intuition formed by the visual and linguistic multimodal information it receives (this is hard even for humans). In particular, applying internal–external correlations among rooms and objects learned from limited environments to previously unseen scenes is not trivial, since there is no specific regularity in target object placement. (3) When the agent misunderstands the language assistance, i.e., the advisor’s original intent, it will deviate from the trajectory suggested by the advisor, further increasing the possibility of failure. (4) In real life, a robot that asks humans for help too frequently or at inappropriate times may annoy them to the point that they stop helping. On the other hand, helping a robot too frequently will cause its autonomy to drop dramatically, even though the task success rate will increase. Therefore, the robot needs to improve its autonomous decision-making performance and task success rate as much as possible with limited human assistance.
Efficiently solving the above-mentioned challenges can enable many potential applications. In this paper, we propose to combine the power of the Transformer and external knowledge to improve the agent’s performance in VNLA. Firstly, inspired by [5], we design a memory-based Transformer with variable-length input as the policy module for VNLA, where the visual and linguistic clues constitute a scene memory token $m_t$ at each time step t. The token $m_t$ is stored in the memory bank M in temporal order, which allows history information to be modeled explicitly, and the Transformer architecture naturally accommodates variable-length memory inputs to predict the next action. Information clues are effectively extracted from history tokens for the agent’s current decision-making through learnable multi-layer attention. Secondly, we bring external commonsense knowledge from ConceptNet [6] into the VNLA task for comprehensive room- and object-entity reasoning over visual and textual information before decision-making. Thirdly, we place a simulated human assistant in the environment, who can provide direct intervention assistance when requested, so as to realize “Human-on-the-Loop” indirectly; this is another form of introducing external knowledge. Last but not least, we take 360° panoramic RGB images of the agent in the horizontal direction as the visual input, which gives the agent a wider field of view for more precise reasoning and more accurate navigation. Specifically, we present Attention-based Knowledge-enabled Cross-modality Reasoning with Assistant’s Help (AKCR-AH), a novel approach that incorporates commonsense knowledge from ConceptNet together with simulated human assistance (knowledge) to help the agent find the target correctly. For the knowledge-enabled module, we perform room-entity and object-entity reasoning by learning room-to-room correlations and an internal–external parallel knowledge graph, respectively. We also consider the direct intervention of the simulated human assistant as a form of human knowledge involvement. For the attention-based module, we mainly design Multimodal Awareness and Internal–External Knowledge Fusion attention mechanisms, where the Multimodal Awareness attention mechanism has two branches that explicitly recognize rooms and objects from the instruction and visual input separately, so as to bridge the cross-modal semantic gap between them. Integrating the two significantly improves the navigation performance of the agent and narrows the disparity between AI and humans in real life. Figure 1 illustrates a brief example of the AKCR-AH navigation process.
We conducted experiments on the ASKNAV dataset [4] for the VNLA task, and the results show that AKCR-AH outperforms previous baseline methods. In addition, extensive ablation studies verify the contribution and parameter selection of each sub-component of our method.

2. Related Work

2.1. Natural-Language-Grounded Visual Navigation

Natural-language-grounded visual navigation tasks have drawn increasing research interest in recent years due to their practicality in real life, while also posing great challenges for vision–language understanding. Depending on the communication complexity [7] between the agent and the human, i.e., whether the navigation instruction is given once or multiple times, natural-language-grounded visual navigation tasks can be divided into two types: Vision-and-Language Navigation (VLN) and Vision-and-Dialog Navigation (VDN).

2.1.1. Vision-and-Language Navigation

VLN was first proposed by [1]: an agent with a first-person view as its observation follows a step-by-step natural language instruction to navigate through a 3D simulated environment to a goal location. Specifically, the navigation procedure can be viewed as a sequential decision-making process, in which an agent is spawned at a random location, receives a task expressed as a language instruction, and then navigates to the destination following the instruction. The given language instruction describes the agent’s trajectory in detail, such as “Walk toward the bed. When you get to the bed. Turn right and exit the room. Continue straight and enter the room straight ahead. Wait near the sink”; it can be decomposed into several meaningful pieces by rules, where each piece indicates a movement action, and the agent carries them out by formulating an action sequence [8]. Different from visual question answering [9], where the agent only faces a static image, VLN requires the agent to explore and understand a dynamic environment in order to learn to ground the language instruction to both visual observations and actions. A VLN episode is successful if the agent stops close to the target, following the instruction, within the specified time. A range of methods has been proposed to solve the VLN task. Ref. [10] proposed the RCM approach to enforce cross-modal grounding both locally and globally via a matching critic, which provides rewards for reinforcement learning to enhance the alignment of instructions and trajectories. Ref. [11] designed a visual–textual co-grounding module to highlight the instruction for the next action through visual observations and a progress monitor to reflect the progress. Ref. [12] used adversarial attacking to capture key information from language instructions to improve the robustness of navigation. To overcome the limited number of seen environments, [13] proposed a speaker-follower model to produce synthetic instructions for data augmentation and pragmatic inference, and another approach, EnvDrop [14], increases the diversity of synthetic data by randomly removing objects to produce a new “unseen environment”. Ref. [15] presented the EnvEdit method based on [14], which improves the generalization ability of the agent in unseen environments by editing the existing environments for data augmentation. Ref. [16] proposed a modular approach to VLN using topological maps, which uses attention mechanisms to predict a navigation plan in the map from the given natural language instruction and topological map. In recent years, with the development of natural language processing technology, the Transformer has been successfully applied in VLN [4,17,18,19,20] to improve navigation performance. Unlike VLN, which provides step-by-step instructions, some navigation tasks for localizing a remote object with high-level instructions have been presented, such as REVERIE [21] and SOON [22]. In REVERIE, an agent is required to find a remote object in another room that it is unable to see at the beginning. SOON designs a graph-based exploration method to achieve remote object navigation. To study the agents’ inner mechanisms for navigation decisions, Ref. [23] examined how agents understand multimodal information by conducting ablation diagnostic experiments.

2.1.2. Vision-and-Dialog Navigation

Building an agent that can efficiently interact with humans in dynamic environments is a long-term goal of the AI community. In daily life, when humans are on unfamiliar streets, they usually seek help and continue to navigate according to other humans’ responses. Of course, it is now possible to rely on the navigation map in a mobile phone to obtain a more accurate route, but in an urban area with dense buildings or a battlefield environment where communication suffers from electromagnetic interference, map positioning is often inaccurate or unavailable. Therefore, the ability of an agent to reason through dialogue will be a reliable and robust paradigm in practical applications. Navigation from Dialog History (NDH) is a task proposed on the CVDN dataset [3], which requires an agent to navigate according to a dialog history consisting of several question–answer pairs. The instructions in the CVDN dataset are longer and more complicated than those in VLN, making it harder for the agent to understand them and perform visual grounding. Ref. [24] proposed the Cross-modal Memory Network (CMN) to capture the hierarchical correlation between dialogue turns and sub-trajectories. Ref. [25] introduced NDH-FULL to provide enough supervision for the agent’s learning to improve path fidelity. Ref. [26] presented the LED task to determine the agent’s real-time location from dialog history. Ref. [27] designed TEACh, involving follower navigation and object interaction, as well as free-form dialog with the commander. VNLA [4] and HANNA [28] consider the object-finding task, which allows an agent to request help from an oracle when it gets lost. Different from NDH, which provides the global dialog history as the input instruction, VNLA and HANNA offer an environment where instructions change dynamically with the situation. This allows humans to help the agent when it gets lost at test time, enabling “human-on-the-loop” policy deployment, which is currently the most reliable paradigm for implementing AI in real life.

2.2. Vision–Language Reasoning with External Knowledge

External knowledge represented as a knowledge graph is necessary for correct reasoning and accurate navigation in VLN tasks. From the relationships among objects, it is possible to capture the correlation between the meaning of the scene and the agent’s egocentric viewpoint [29]. Commonly used knowledge graphs such as ConceptNet [6] and DBpedia [30] apply nodes and edges to represent the concepts of objects and the relationships between them, respectively. Graph Neural Networks (GNNs) [31] can represent knowledge in structured form, which enables interaction between visual and linguistic features. Ref. [32] built commonsense layouts for path planning and enforced semantic grounding of scenes as an auxiliary task at each step, then updated the semantically grounded navigator in unseen environments for better generalization. Ref. [33] introduced the Text-KVQA dataset and brought external knowledge into TextVQA to perform reasoning. Ref. [34] presented the V2C dataset, which generates commonsense captions directly from videos to describe latent aspects such as intentions, effects, and attributes. Ref. [35] proposed KE-GAN to generate reasonable scene parsing results by using ConceptNet to calculate a knowledge relation loss. Ref. [36] designed KERR to incorporate commonsense knowledge from ConceptNet for cross-modality reasoning. In addition, Ref. [37] built STLF, based on spatial–temporal–logical knowledge representation and object mapping, to enhance the reasoning abilities of the semantic web by introducing temporal and logical relations connecting physical objects during their mapping from reality to cyberspace. Thus, knowledge-based reasoning techniques are intuitively beneficial for VLN tasks.

2.3. Multimodal Transformers

Following the success of Transformer-based architectures on language and image classification and generation tasks, multimodal Transformer-style models have shown impressive results on several VLN tasks. A multimodal Transformer encodes multimodal information, mainly vision and language, as sequences of input tokens and concatenates them into a single input sequence. Additionally, a type embedding unique to each modality is added to distinguish among the input tokens of different modalities [38]. Ref. [39] designed PRESS to improve the language representation for VLN by using a pre-trained BERT encoder. Ref. [19] proposed PREVALENT to pretrain the Transformer in a self-supervised manner using image–text–action triplets from the R2R dataset. Ref. [40] developed VLN-BERT, a Transformer-based model trained on image–text pairs from the web to predict whether a trajectory matches an instruction. Ref. [41] developed R-VLNBERT, which can augment various VLBERT models with a recurrent function to model and leverage history-dependent state representations for addressing partially observable inputs. Ref. [20] designed an object-informed sequential BERT to encode visual perceptions and linguistic instructions. Ref. [18] presented Airbert to improve the performance of VLN with in-domain training strategies. Recently, HAMT [42] and the Episodic Transformer [43] explicitly modeled history information by directly encoding all past observations and actions, but this is fairly complex. In contrast, MTVM [5] proposes a Transformer with a variable-length memory framework to model history information explicitly by copying past activations into a memory bank, without the need to consider distances in the path.

2.4. Imitation Learning

Imitation learning is a framework for learning a behavior policy from teacher demonstrations. Usually, demonstrations are presented in the form of state–action trajectories, with each pair indicating the action to take at the state being visited [44]. Classic imitation learning algorithms include Behavior Cloning [45] and DAgger [46]. Such supervised learning has many limitations, and its effectiveness depends on the quality and scale of the expert demonstration data, so several improved algorithms have been proposed. Ref. [47] presented IRL, which finds the optimal action policy by learning a reward/cost function from expert data. However, the above algorithms all implicitly assume that the demonstrations are complete, i.e., that each demonstrated state is fully observable and its actions are available. This assumption hardly holds in real imitation learning tasks. Some work focuses on relaxing unrealistic assumptions about the teacher to alleviate this problem. Refs. [48,49] studied cases where teachers provide imperfect demonstrations. Ref. [4] introduced the I3L framework, in which the agent can actively ask to change the environment in the form of language instructions to facilitate its learning process. Ref. [50] constructed a policy that minimizes the number of times the agent queries the teacher. In addition, GAIL [51] learns expert actions by introducing the idea of GANs [52] into imitation learning.

3. Method

As robots are increasingly deployed in human daily life, situations will arise that they are not fully equipped to handle. To further analyze the human–robot relationship in society, Ref. [53] proposed the concept of a robot’s living space. When a robot performs tasks in an unseen scene, due to environmental uncertainties and the lack of domain knowledge, the robot might get lost or even take dangerous actions. At present, knowledge guidance and human–robot collaboration are two important ways to address these challenges. Knowledge has positive implications for enhancing data-driven machine learning: knowledge-enhanced machine learning models can significantly reduce the dependence on large-scale sample data and improve the utilization of prior knowledge. As a large-scale form of knowledge representation, the knowledge graph plays an important role in realizing machine cognitive intelligence such as natural language understanding. It is also critical for machine learning models to be able to explicitly model long sequences of history information for correct decision-making. Human–robot collaboration refers to integrating human intelligence into the robot’s decision-making process to improve the task success rate. It can be divided into two modes, “human-in-the-loop” and “human-on-the-loop”, according to the robot’s degree of autonomy. “Human-in-the-loop” means that the robot is fully controlled by humans during the task. “Human-on-the-loop” means that the robot makes autonomous decisions during the task, but when a decision is wrong or difficult, the robot can communicate with humans autonomously and allow them to intervene. The “human-on-the-loop” learning mode with human assistance can simultaneously improve the robot’s autonomy and task success rate while reducing communication complexity, and is therefore a reliable paradigm for AI. In this section, we introduce our proposed method AKCR-AH from three aspects: knowledge guidance, the Transformer with variable-length memory, and simulated human assistance.

3.1. Problem Setup and Overview

3.1.1. Setup

Our goal is to design a VLN agent that is capable of reasoning with a knowledge graph and requesting additional help when needed. The agent receives an initial language instruction and real-time visual observations as input signals, but does not know the map of the environment or its own specific location. The language instruction I is formulated as a sequence of words $\{\omega_0, \ldots, \omega_L\}$, where L is the number of words. The real-time visual observation is a 360° panoramic view consisting of 36 image views $\{o_{t,i}\}_{i=1}^{36}$, where each view $o_{t,i}$ is composed of an RGB image $v_{t,i}$ and its orientation $(\theta_{t,i}, \varphi_{t,i})$, with $\theta_{t,i}$ and $\varphi_{t,i}$ denoting the heading and elevation angles, respectively. The agent is initially spawned at a random location in the scene. A requester, simulating a real human user, asks the mobile agent to find an object in a specific room by sending a high-level language command (“Find [object(s)] in [room]”). There is at least one target object instance in the environment that satisfies the end-goal, so the task is always feasible. Within the specified number of time steps, if the agent stays within $R_S$ meters of the target object along the shortest path, the task is considered successfully fulfilled; here, $R_S$ is the success radius, a task-specific hyperparameter. A simulated human assistant that the agent can turn to is present in the environment at both training and evaluation time, providing the agent with single-step teacher decisions through direct intervention. During execution, the agent decides whether to ask the simulated human assistant for help based on prior general knowledge, provided the query budget allows. The query budget is a hyperparameter that determines the number of times the agent can ask.
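For concreteness, the following minimal sketch shows one way the panoramic observation could be represented in code; the 12 heading × 3 elevation discretization and all class and function names are illustrative assumptions rather than the exact implementation.

```python
import math
from dataclasses import dataclass
from typing import List

import torch


@dataclass
class ViewObservation:
    """One discretised view o_{t,i}: an RGB frame plus its orientation."""
    rgb: torch.Tensor   # v_{t,i}, shape [3, H, W]
    heading: float      # theta_{t,i} in radians
    elevation: float    # phi_{t,i} in radians


@dataclass
class PanoramicObservation:
    """Full 360-degree observation at time step t (12 headings x 3 elevations = 36 views)."""
    views: List[ViewObservation]


def make_dummy_panorama(h: int = 224, w: int = 224) -> PanoramicObservation:
    views = []
    for elevation in (-math.pi / 6, 0.0, math.pi / 6):   # look down / level / up
        for head_idx in range(12):                       # 30-degree heading increments
            heading = head_idx * (2 * math.pi / 12)
            views.append(ViewObservation(torch.zeros(3, h, w), heading, elevation))
    return PanoramicObservation(views)


panorama = make_dummy_panorama()
assert len(panorama.views) == 36
```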
In our example, as shown in Figure 1, the agent is lost and asks for help at Position 2; the simulated human assistant gives a “go forward” action so that the agent can navigate along the hallway to approach the dining room. Direct intervention assistance may not ensure that the end-goal is fulfilled, but is guaranteed to get the agent closer to the target location.

3.1.2. Overview

Here, we consider an embodied agent that learns to navigate inside real indoor environments by incorporating external commonsense knowledge and a simulated human assistant. The AKCR-AH framework mainly consists of three parts (see Figure 2): an Attention-based Knowledge-enabled Cross-modality Reasoning (AKCR) module, a memory-based Transformer, and a simulated human assistant. Specifically, at time step t, AKCR extracts visual and linguistic features for cross-modal knowledge reasoning to generate the scene memory token $m_t$. Then, the memory-based Transformer takes the history scene memory token sequence $\{m_i\}_{i=0}^{t-1}$ and the current token $m_t$ as input to produce the hidden state $h_t$ for predicting the action distribution $p_t$. When the agent has difficulty choosing the action $a_t$ from $p_t$ or makes mistakes, the simulated human assistant provides assistance through direct intervention.
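The overall control flow of one decision step can be summarized by the sketch below; all module interfaces (akcr, policy_transformer, assistant) are placeholders standing in for the components described above, not the actual implementation.

```python
import torch


def navigation_step(akcr, policy_transformer, memory_bank, instruction, panorama,
                    prev_hidden, assistant=None, ask_for_help=False):
    """One AKCR-AH decision step (all module interfaces here are illustrative).

    1. AKCR fuses the instruction, the current panorama, and external knowledge
       into the scene memory token m_t.
    2. The memory-based Transformer consumes [m_0, ..., m_t] and returns the
       hidden state h_t and the action distribution p_t.
    3. If help was requested and an assistant is available, the assistant's
       single-step teacher action overwrites the agent's own choice.
    """
    m_t = akcr(instruction, panorama, prev_hidden)          # scene memory token at step t
    tokens = torch.stack(memory_bank + [m_t], dim=0)        # variable-length memory input
    h_t, p_t = policy_transformer(tokens)                   # hidden state + action distribution
    action = int(torch.argmax(p_t, dim=-1))
    if ask_for_help and assistant is not None:
        action = assistant.teacher_action()                 # direct intervention
    memory_bank.append(m_t.detach())                        # update memory bank M
    return action, h_t
```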

3.2. Attention-Based Knowledge-Enabled Cross-Modality Reasoning

In this section, inspired by [36], we present an Attention-based Knowledge-enabled Cross-modality Reasoning (AKCR) model. Specifically, the language attention module encodes the language instruction I and extracts room- and object-related language features. For the navigation view $v_t$, AKCR detects an object set $X_{v_t}$ through the object detector D and uses it to directly predict the probability distribution of the room type $P^R_{v_t}$. Then, the object set $X_{v_t}$ is taken as an index to sample from ConceptNet to construct internal and external subgraphs $KG^I_{v_t}$ and $KG^E_{v_t}$, respectively, for parallel multi-step object-entity dynamic reasoning. Finally, the internal and external knowledge graphs are integrated through an attention mechanism to generate the object-level feature $F^o_t$.

3.2.1. Language Attention Module

To distinguish and extract different types of information in language instructions, we design two language attention networks for room- and object-related features, respectively. The initial task instruction I is first encoded by the Transformer-encoder together with a sequence position embedding to generate the encoded language representation $\hat{I} = \{\hat{e}_i\}_{i=1}^{L}$, where $\hat{e}_i$ represents the token vector of each word and L is the length of the instruction. Then, the encoded representation $\hat{I}$ is fed into the room- and object-aware attention networks to extract the room- and object-related language features $\hat{I}^R_t$ and $\hat{I}^o_t$ at each time step t. Specifically, the two networks are computed as follows:
$$\hat{I}^R_t = \sum_{i=1}^{L} \frac{\exp\left(\hat{e}_i W_R h_{t-1}\right)}{\sum_{j=1}^{L} \exp\left(\hat{e}_j W_R h_{t-1}\right)} \hat{e}_i$$
$$\hat{I}^o_t = \sum_{i=1}^{L} \frac{\exp\left(\hat{e}_i W_O h_{t-1}\right)}{\sum_{j=1}^{L} \exp\left(\hat{e}_j W_O h_{t-1}\right)} \hat{e}_i$$
where $W_R$ and $W_O$ denote the parameters of the room and object attention networks, respectively, and $h_{t-1}$ is the Transformer-decoder hidden state.
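A minimal PyTorch-style sketch of this soft attention over instruction tokens is given below; the module name, dimensions, and the use of a linear layer to represent $W_R$/$W_O$ are illustrative assumptions.

```python
import torch
import torch.nn as nn


class InstructionAttention(nn.Module):
    """Soft attention over encoded instruction tokens (one instance per branch,
    playing the role of the room-aware or object-aware network above)."""

    def __init__(self, token_dim: int = 512, hidden_dim: int = 512):
        super().__init__()
        self.W = nn.Linear(hidden_dim, token_dim, bias=False)   # stands in for W_R / W_O

    def forward(self, tokens: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        # tokens: [L, token_dim] encoded instruction; h_prev: [hidden_dim] decoder state h_{t-1}
        scores = tokens @ self.W(h_prev)          # e_i^T W h_{t-1}, one score per word
        weights = torch.softmax(scores, dim=0)    # attention weights over the L words
        return weights @ tokens                   # weighted sum -> room- or object-related feature


room_attn, obj_attn = InstructionAttention(), InstructionAttention()
tokens, h_prev = torch.randn(20, 512), torch.randn(512)
I_R, I_o = room_attn(tokens, h_prev), obj_attn(tokens, h_prev)
```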

3.2.2. Vision Attention Module

The vision attention module utilizes a Faster R-CNN [54] pre-trained on Visual Genome (VG) [55] to recognize room/object information from visual observations. At each time step t, AKCR uses the Faster R-CNN detector to detect the $N_{v_t}$ ($\leq 100$) most salient objects in the visual observation $v_t$, forming an object set $X_{v_t}$ with $|X_{v_t}| = N_{v_t}$. In addition, AKCR applies a feature extractor [56] to obtain the image feature $\hat{v}_t$ of the visual observation $v_t$; then, $\hat{v}_t$ is combined with the Transformer-decoder’s hidden state $h_{t-1}$ to generate the view-level feature $F^v_t$ through an attention mechanism:
$$F^v_t = \sum_{i=1}^{6} \frac{\exp\left(\hat{v}_{t,i} W_v h_{t-1}\right)}{\sum_{j=1}^{6} \exp\left(\hat{v}_{t,j} W_v h_{t-1}\right)} \hat{v}_{t,i}$$
where $v_t$ is indexed over the six action directions {left, right, up, down, forward, stop} of the agent during navigation and $W_v$ is a learnable parameter.
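As an illustration of how the detected object set $X_{v_t}$ could be obtained, the sketch below uses a COCO-pretrained detector from torchvision as a stand-in for the VG-pretrained Faster R-CNN; the score threshold and helper names are assumptions.

```python
import torch
import torchvision

# COCO-pretrained stand-in for the VG-pretrained Faster R-CNN used in the paper;
# the idea (detect objects per view, keep their category labels) is the same.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()


@torch.no_grad()
def detect_object_set(view_images, score_thresh: float = 0.5, max_objects: int = 100):
    """Return a detected object label set X_{v_t} for one panoramic observation."""
    outputs = detector(view_images)               # one dict (boxes/labels/scores) per view
    labels = []
    for out in outputs:
        keep = out["scores"] >= score_thresh
        labels.extend(out["labels"][keep].tolist())
    return sorted(set(labels))[:max_objects]      # at most 100 distinct categories


# Usage: view images are float tensors in [0, 1] with shape [3, H, W]
X_vt = detect_object_set([torch.rand(3, 224, 224) for _ in range(2)])
```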

3.2.3. Attention-Based Internal–External Parallel Knowledge Graph Reasoning

We design the AIEPKR module for room- and object-relation reasoning, where the object-relation reasoning is iteratively carried out by two parallel branches, i.e., external knowledge graph reasoning and internal task-specific knowledge graph reasoning, which are fused through an attention mechanism, as shown in Figure 3.
A. External knowledge graph reasoning.
AIEPKR uses a top-K query to retrieve the K most relevant knowledge facts for each object category in the task from ConceptNet to construct an external knowledge graph $KG^E = (X^E, E^E)$, where $X^E = \{x_i\}_{i=1}^{N^E}$ is the node set and $E^E = \{\varepsilon_{ij}\}_{i,j=1}^{N^E}$ is the edge set. Each node $x_i$ corresponds to an object $o_i$, and each edge $\varepsilon_{ij}$ denotes the relationship between objects $o_i$ and $o_j$. Besides, $H^E \in \mathbb{R}^{N^E \times D_w}$ denotes the node feature matrix, and $A^E \in \mathbb{R}^{N^E \times N^E}$ represents the weighted adjacency matrix, in which each element $A^E_{i,j}$ is pre-trained. The multi-step reasoning process over the external knowledge is:
$$H^{E(k)} = \delta\left(A^E H^{E(k-1)} W^{E(k)}\right), \quad H^{E(0)} = H^E$$
where $k$ denotes the $k$-th step of the graph convolution performed with a Graph Convolutional Network (GCN), $\delta(\cdot)$ is the activation function, and $W^{E(k)}$ is a learnable parameter.
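A minimal sketch of one such graph-convolution step is shown below; the dimensions, the choice of ReLU as $\delta(\cdot)$, and the random adjacency matrix are illustrative assumptions.

```python
import torch
import torch.nn as nn


class KnowledgeGCNStep(nn.Module):
    """One graph-convolution step H^(k) = delta(A H^(k-1) W^(k)) of the update above."""

    def __init__(self, dim: int = 300):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)   # W^{E(k)}

    def forward(self, A: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.W(A @ H))           # delta(.) taken as ReLU here


# K reasoning steps over N_E external nodes with D_w-dimensional features
N_E, D_w, K = 50, 300, 2
A_E = torch.softmax(torch.randn(N_E, N_E), dim=-1)   # stand-in for the pre-trained weights
H_E = torch.randn(N_E, D_w)
for step in [KnowledgeGCNStep(D_w) for _ in range(K)]:
    H_E = step(A_E, H_E)
```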
B. Internal knowledge graph reasoning.
The internal knowledge graph $KG^I = (X^I, E^I)$ includes only the 1600 categories that can be detected by the detector D, dynamically learning domain-specific knowledge from the dataset. $H^I \in \mathbb{R}^{1600 \times D_w}$ and $A^I \in \mathbb{R}^{1600 \times 1600}$ represent the node feature matrix and weighted adjacency matrix of $KG^I$, respectively, where $A^I$ is learnable. At time step t, the module samples from $KG^I$ using the detected object set $X_{v_t}$ as an index to construct a fully connected subgraph $KG^I_{v_t} = (X^I_{v_t}, E^I_{v_t})$. $H^I_{v_t} \in \mathbb{R}^{N_{v_t} \times D_w}$ denotes its node feature matrix, and $A^I_{v_t} \in \mathbb{R}^{N_{v_t} \times N_{v_t}}$ denotes its learnable adjacency matrix; $H^I_{v_t}$ and $A^I_{v_t}$ are sub-matrices of $H^I$ and $A^I$, respectively.
$KG^I_{v_t}$ dynamically extracts external knowledge from $KG^E$ to enhance and correct the internal knowledge reasoning. For the external knowledge extraction, we also take $X_{v_t}$ as the index to sample a sub-node feature matrix $H^{E(k)}_{v_t}$ from $H^{E(k)}$. Then, an attention mechanism fuses $H^{I(k)}_{v_t}$ with $H^{E(k)}_{v_t}$, which is formulated as:
$$\alpha^E_{k,i} = \frac{\exp\left(H^{E(k)}_{v_t}[i,:]\, W_h h_{t-1}\right)}{\exp\left(H^{I(k)}_{v_t}[i,:]\, W_h h_{t-1}\right) + \exp\left(H^{E(k)}_{v_t}[i,:]\, W_h h_{t-1}\right)}$$
$$\alpha^I_{k,i} = \frac{\exp\left(H^{I(k)}_{v_t}[i,:]\, W_h h_{t-1}\right)}{\exp\left(H^{I(k)}_{v_t}[i,:]\, W_h h_{t-1}\right) + \exp\left(H^{E(k)}_{v_t}[i,:]\, W_h h_{t-1}\right)}$$
$$\bar{H}^{I(k)}_{v_t}[i,:] = \alpha^I_{k,i} H^{I(k)}_{v_t}[i,:] + \alpha^E_{k,i} H^{E(k)}_{v_t}[i,:]$$
where $M[i,:]$ represents the $i$-th row of matrix $M$ and $W_h$ is a learnable parameter. The multi-step reasoning process over the internal knowledge is:
$$H^{I(k)}_{v_t} = \delta\left(A^I_{v_t} \bar{H}^{I(k-1)}_{v_t} W^{I(k)}\right), \quad H^{I(0)}_{v_t} = H^I_{v_t}$$
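As a concrete illustration of the fusion step above, the following sketch computes the row-wise attention weights between internal and external node features; the module name and dimensions are assumptions, and the fused output would then feed the internal GCN update.

```python
import torch
import torch.nn as nn


class InternalExternalFusion(nn.Module):
    """Row-wise attention fusion of internal and external node features."""

    def __init__(self, dim: int = 300, hidden_dim: int = 512):
        super().__init__()
        self.W_h = nn.Linear(hidden_dim, dim, bias=False)   # stands in for W_h

    def forward(self, H_int: torch.Tensor, H_ext: torch.Tensor, h_prev: torch.Tensor):
        # H_int, H_ext: [N_vt, dim] sub-matrices indexed by the detected object set X_{v_t}
        q = self.W_h(h_prev)                                       # W_h h_{t-1}
        s_int, s_ext = H_int @ q, H_ext @ q                        # one score per node row
        alpha = torch.softmax(torch.stack([s_int, s_ext]), dim=0)  # alpha^I, alpha^E per row
        return alpha[0].unsqueeze(-1) * H_int + alpha[1].unsqueeze(-1) * H_ext


fusion = InternalExternalFusion()
H_bar = fusion(torch.randn(8, 300), torch.randn(8, 300), torch.randn(512))  # fused node features
```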
In addition, a correlation is computed between the object-related language feature $\hat{I}^o_t$ and each category feature $H^E_i$ in the external knowledge graph through an attention mechanism, so that $\hat{I}^o_t$ integrates object-level clues from the external knowledge:
$$\hat{I}^o_t = \sum_{i=1}^{N^E} \frac{\exp\left(H^E_i W_f \hat{I}^o_t\right)}{\sum_{j=1}^{N^E} \exp\left(H^E_j W_f \hat{I}^o_t\right)} H^E_i$$
To obtain the final object-level feature $F^o_t$, the module takes the knowledge-enhanced $\hat{I}^o_t$ to further attend to $H^{I(K)}_{v_t}$:
$$F^o_t = \left[\frac{\exp\left(\hat{I}^o_t W_o H^{I(K)}_{v_t}[i,:]\right)}{\sum_{j=1}^{N_{v_t}} \exp\left(\hat{I}^o_t W_o H^{I(K)}_{v_t}[j,:]\right)}\right]_{i=1}^{N_{v_t}} \cdot H^{I(K)}_{v_t}$$
where $H^{I(K)}_{v_t}$ represents the final node feature matrix after $K$ reasoning steps.

3.2.4. Room Relation Reasoning

It is important for the agent to correctly navigate to the target room by sensing room-to-room correlations. AIEPKR first extracts the linguistic and visual room-aware features $P^R_I$ and $P^R_{v_t}$ from the room-related language feature $\hat{I}^R_t$ and the visual observation $v_t$, respectively, which provides room-level clues for further action reasoning. For linguistic awareness, $\hat{I}^R_t$ passes through a fully connected layer to directly predict the probability distribution of the target room type in the instruction:
$$P^R_I = \{p_j\}_{j=1}^{N_R}, \quad p_j = \frac{\exp\left(FC(\hat{I}^R_t)_j\right)}{\sum_{k=1}^{N_R} \exp\left(FC(\hat{I}^R_t)_k\right)}$$
where $N_R$ is the number of room types.
For visual awareness, since the room type is highly relevant to the objects placed in it, AIEPKR takes the object set $X_{v_t}$ to predict the probability distribution of the room type corresponding to $v_t$ through a fully connected layer:
$$P^R_{v_t} = \{p_j\}_{j=1}^{N_R}, \quad p_j = \frac{\exp\left(FC(X_{v_t})_j\right)}{\sum_{k=1}^{N_R} \exp\left(FC(X_{v_t})_k\right)}$$
Then, AIEPKR equips the agent with room reasoning ability by learning a room-to-room correlation matrix $A^R$, in which each element $A^R_{i,j}$ represents the confidence that the agent can reach the $j$-th room type from the $i$-th room type. The confidence score is generated via:
$$s_t = P^R_I A^R P^R_{v_t}$$
For each of the agent’s six optional actions during navigation, $s_t$ yields a corresponding confidence score $s_{t,i}$, which is repeated to form the confidence feature $c_{t,i} \in \mathbb{R}^{1 \times 128}$, representing the degree of confidence that the agent can effectively reach the target room by selecting $v_{t,i}$ as the next direction. Finally, the room-level feature $F^R_t$ is derived as:
$$F^R_t = \{c_{t,i}\}_{i=1}^{6}$$
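The room-level reasoning above can be sketched as follows; the multi-hot object encoding of each candidate view, the layer sizes, and the initialization of $A^R$ are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn


class RoomRelationReasoning(nn.Module):
    """Room-level confidence: room-type distributions from language and from each
    candidate direction, scored through a learnable room-to-room matrix A_R."""

    def __init__(self, lang_dim: int = 512, num_objects: int = 1600,
                 num_rooms: int = 26, conf_dim: int = 128):
        super().__init__()
        self.lang_fc = nn.Linear(lang_dim, num_rooms)
        self.view_fc = nn.Linear(num_objects, num_rooms)
        self.A_R = nn.Parameter(torch.eye(num_rooms))   # room-to-room correlation matrix
        self.conf_dim = conf_dim

    def forward(self, I_R: torch.Tensor, view_objects: torch.Tensor):
        # I_R: [lang_dim]; view_objects: [6, num_objects] multi-hot object sets per direction
        P_I = torch.softmax(self.lang_fc(I_R), dim=-1)            # target room distribution
        P_v = torch.softmax(self.view_fc(view_objects), dim=-1)   # room distribution per view
        s = P_v @ self.A_R @ P_I                                  # confidence s_{t,i}, shape [6]
        return s.unsqueeze(-1).repeat(1, self.conf_dim)           # F_t^R: [6, conf_dim]


room_reasoner = RoomRelationReasoning()
F_R = room_reasoner(torch.randn(512), torch.zeros(6, 1600))
```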

3.3. Memory-Based Transformer

Remembering history information is essential for the agent to make correct decisions during navigation. Inspired by [5], we introduce a memory-based Transformer module as the action policy framework to explicitly model history information. The information generated during reasoning is stored in the form of a scene memory token $m_t$, which is the concatenation of multiple features:
$$m_t = \left[F^v_t, F^o_t, \hat{I}^R_t, \hat{I}^o_t\right]$$
The history scene memory tokens are stored in the memory bank M in temporal order. At each time step t, the module takes the history scene memory token sequence $\{m_i\}_{i=0}^{t-1}$ from M together with the current token $m_t$ as input to generate the hidden state $h_t$:
$$h_t = f\left(m_0, m_1, \ldots, m_t\right)$$
where f denotes the Transformer. The agent’s visual observation $v_t$ is concatenated with its corresponding confidence feature $c_t$, so that the decision-making process incorporates room-level information. This concatenation and $h_t$ are then used to predict the action $a_t$ through an attention mechanism:
$$p_t = \mathrm{softmax}\left(\left[v_t, c_t\right] W_a h_t\right), \quad a_t = \arg\max p_t$$
where $W_a$ is a learnable parameter. At the end of each time step, the memory bank M is updated by appending the current token $m_t$.
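A minimal sketch of the memory-based policy is shown below; it uses a standard Transformer encoder over the stored tokens and a simplified linear action head in place of the attention between $[v_t, c_t]$ and $h_t$, so the layer sizes and head design are assumptions.

```python
import torch
import torch.nn as nn


class MemoryPolicy(nn.Module):
    """Transformer encoder over the stored scene memory tokens [m_0, ..., m_t];
    the output at the last position serves as h_t."""

    def __init__(self, token_dim: int = 512, n_heads: int = 8, n_actions: int = 6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(token_dim, n_actions)   # simplified stand-in for W_a

    def forward(self, memory_tokens: torch.Tensor):
        # memory_tokens: [1, t+1, token_dim] -- the whole memory bank M plus the current m_t
        h_t = self.encoder(memory_tokens)[:, -1]              # hidden state at the current step
        p_t = torch.softmax(self.action_head(h_t), dim=-1)    # distribution over six actions
        return h_t, p_t


policy = MemoryPolicy()
memory_bank = [torch.randn(512) for _ in range(4)]            # m_0 ... m_3
h_t, p_t = policy(torch.stack(memory_bank).unsqueeze(0))
a_t = int(torch.argmax(p_t, dim=-1))
```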

3.4. Modeling Human Help

We model human help by placing a simulated human assistant in the environment. The simulated human assistant guides the agent to take actions during both training and test time through direct intervention. When the agent sends a request signal, the simulated human assistant overwrites the agent’s decision with its own, making the agent take the action that the simulated human assistant wants; hence, direct intervention is always perfectly executed. The simulated human assistant acts as an oracle that, at each time step t, always chooses the action $a^D_t$ along the shortest path from the current location to the goal location:
$$a^D_t = \pi^*_{ass}(s_t)$$
where $\pi^*_{ass}$ is the simulated human assistant’s policy and $s_t$ is the current environment state.
To fit real-world applications, the agent should adaptively decide whether to ask for help during navigation [57]. In our design, the agent can ask the simulated human assistant for help when the heuristic-based rules are satisfied and the query budget is greater than 0. Inspired by [4], we introduce four heuristic rules as the query policy $\pi_{ask}$ to decide when the agent can ask for help:
  • The agent deviates from the navigation teacher path by more than $\delta$ meters. The distance is defined as the length from the agent’s current viewpoint to the closest viewpoint on the path.
  • The agent is “confused”, defined as the difference between the entropy of the uniform navigation distribution and the entropy of the navigation distribution computed by the agent being less than a threshold $\epsilon$.
  • The agent has remained at the same viewpoint for the last $\mu$ steps.
  • The agent is at a goal viewpoint, but the highest-probability action of the navigation distribution is not stop.
As the agent’s navigation performance improves, the number of help requests to the simulated human assistant should be reduced to improve the agent’s autonomy. The query budget $B_t$ and the agent’s asking indicator $\rho_{ask,t}$ are two independent variables, which together determine the dependent variable $assistant\_help$:
$$\rho_{ask,t} = \begin{cases} 1, & \text{if any of the four heuristic rules is satisfied} \\ 0, & \text{otherwise} \end{cases}$$
$$B_t = \begin{cases} B_{t-1} - 1, & \text{if } B_{t-1} > 0 \text{ and } \rho_{ask,t-1} = 1 \\ B_{t-1}, & \text{if } B_{t-1} > 0 \text{ and } \rho_{ask,t-1} = 0 \\ 0, & \text{otherwise} \end{cases}$$
$$assistant\_help = \begin{cases} 1, & \text{if } B_t > 0 \text{ and } \rho_{ask,t} > 0 \\ 0, & \text{otherwise} \end{cases}$$
We set the maximum query budget $B_0 = 10$ as a study design consideration to balance human participation while keeping the experimental condition controllable.
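A sketch of this query-and-budget logic is given below; the function names are illustrative, and the default thresholds follow the values reported later in the implementation details ($\delta = 8$, $\epsilon = 1$, $\mu = 9$, $B_0 = 10$).

```python
import math
from typing import List


def should_ask(dist_from_path: float, action_probs: List[float], steps_stuck: int,
               at_goal: bool, argmax_is_stop: bool,
               delta: float = 8.0, eps: float = 1.0, mu: int = 9) -> bool:
    """Heuristic query policy pi_ask: returns True if any of the four rules fires."""
    uniform_entropy = math.log(len(action_probs))
    entropy = -sum(p * math.log(p) for p in action_probs if p > 0)
    confused = (uniform_entropy - entropy) < eps               # rule 2
    return (dist_from_path > delta                             # rule 1
            or confused
            or steps_stuck >= mu                               # rule 3
            or (at_goal and not argmax_is_stop))               # rule 4


def update_budget(budget: int, asked_last_step: bool) -> int:
    """Budget update: asking consumes one unit; B_0 = 10 in our experiments."""
    if budget <= 0:
        return 0
    return budget - 1 if asked_last_step else budget


budget = 10
rho_ask = should_ask(9.5, [0.2, 0.2, 0.2, 0.2, 0.1, 0.1], 0, False, False)
assistant_help = int(budget > 0 and rho_ask)   # help is granted only if both conditions hold
```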
At time step t, if $assistant\_help$ equals 1, i.e., the simulated human assistant receives a help request from the agent, it uses the teacher action to overwrite the agent’s decision:
$$a_t = a^D_t$$
as shown in the upper right part of Figure 1.
The results in the experiments section demonstrate that the agent’s navigation performance is improved by indirectly implementing the “Human-on-the-Loop” function.

4. Environment and Data

4.1. Matterport3D Simulator

The Matterport3D simulator [1] is a large-scale machine learning research platform based on the Matterport3D dataset [58], developed for research on navigation agents in indoor environments. The Matterport3D dataset contains 90 building-scale scenes collected from real environments, consisting of 194,400 RGB-D images annotated with 3D reconstructions and 2D and 3D semantic segmentations. In the simulator, an agent navigates by iteratively selecting adjacent nodes from a pre-defined environment graph and adjusting the camera pose. At each viewpoint, the agent receives a rendered color image that captures the current observation [21]. The agent’s action space in the simulator contains six actions: left, right, up, down, forward, and stop. The simulator does not define or constrain the agent’s goals, reward functions, or any other additional context.

4.2. Dataset

We evaluated our approach on the ASKNAV dataset [4]. The ASKNAV dataset provides high-level language instructions describing only the end goal of the task, constructed as “Find [O] in [R]”, where [O] is one of 289 object labels and [R] is one of 26 room labels. The dataset is split into a training set of 61 environments with 94,798 instructions; a validation set of 11 environments with 4874 instructions in the seen split and 5005 in the unseen split; and a test set of the remaining 18 environments, with 4917 instructions in the seen split and 5001 in the unseen split.

5. Implementation

5.1. Notation

The agent maintains two policies: a navigation policy $\pi_{nav}$ and a query policy $\pi_{ask}$. The navigation policy is stochastic and outputs a distribution P over the action space. During navigation, an action a is determined in real time by either selecting the maximum-probability action of P or sampling from P. The query policy is based on the general-knowledge heuristic rules. The agent is supervised by the navigation teacher $\pi^*_{nav}$ during training and assisted by the simulated human assistant $\pi^*_{ass}$ when needed. Each data point in the dataset consists of a start viewpoint $X^{start}_d$, a start orientation $\psi^{start}_d$, a set of goal viewpoints $\{X^{end}_{d,i}\}$, an end-goal $e_d$, and the full map $M_d$ of the corresponding environment. At any time, the teacher and the simulated human assistant have access to the agent’s current pose and the information provided by the current data point [4].

5.2. Agent

The navigation policy framework is a Transformer-decoder module, which takes the token sequence $(m_0, m_1, \ldots, m_t)$ generated by the AIEPKR module as input to decode a series of actions $(a_0, a_1, \ldots, a_t)$. At time step t, if the agent asks the simulated human assistant for help, its action $a_t$ is overwritten by the assistant’s action $a^D_t$ through direct intervention:
$$a_t = \begin{cases} a^D_t, & \text{if } assistant\_help = 1 \\ a_t, & \text{otherwise} \end{cases}$$

5.3. Teacher

The navigation teacher always takes actions corresponding to the shortest route from the current viewpoint to the goal viewpoints. Given the initial pose of the agent, the navigation teacher first adjusts the heading and elevation angles using the camera-adjustment actions (left, right, up, down) until the forward action can be selected to advance the agent to the next viewpoint on the shortest route. When the distance to the target object is less than the success radius $R_s$, the teacher issues the stop action.

5.4. Learning

Due to the implementation of the simulated “human-on-the-loop” function, the agent learns the policy by mixing imitation learning with direct intervention. When the agent asks for help, the simulated human assistant will use the single-step teacher action for direct intervention. The simulated human assistant can be used during training and testing, while the imitation learning teacher can only be used during training. In direct intervention, the agent always takes the teacher actions. In most imitation learning algorithms, the agent learns with a mixed policy, which is equivalent to making use of a Bernoulli distribution sampler to decide whether the teacher should intervene. The learning objective of the agent is to minimize the expected loss of its induced state distribution:
$$\hat{\pi}_{nav} = \arg\min_{\pi_{nav}} \mathbb{E}_{s \sim p}\left[L\left(s, \pi_{nav}, \pi_{ask}\right)\right]$$
where L is the loss function and p is the state distribution induced by the agent.

5.5. Training

Our training objective is composed of two different parts: the imitation learning loss and the room type classification loss.

5.5.1. Imitation Learning Loss

We used the student-forcing training strategy. At time step t, the model predicts the agent’s probability distribution $p_{t,a}$ over the action space, obtains the teacher action $a^*_t$, and selects the agent’s action $a_t$ by randomly sampling from $p_{t,a}$. The imitation learning loss is defined as follows:
$$L_{IL} = -\sum_{t=1}^{T} a^*_t \log p_{t,a}$$
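The following sketch illustrates one student-forcing step under these definitions; the helper name and the small epsilon added for numerical stability are assumptions.

```python
import torch


def student_forcing_step(p_t: torch.Tensor, teacher_action: int):
    """One student-forcing step: the executed action is sampled from the agent's own
    distribution p_t, while supervision comes from the teacher action a_t^*."""
    executed_action = int(torch.multinomial(p_t, num_samples=1))   # rollout follows the student
    loss_t = -torch.log(p_t[teacher_action] + 1e-8)                # negative log-likelihood term
    return executed_action, loss_t


# L_IL accumulates these per-step terms over the trajectory
p_t = torch.softmax(torch.randn(6), dim=-1)
a_t, l_t = student_forcing_step(p_t, teacher_action=4)
```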

5.5.2. Room Classification Loss

The room classification loss includes two parts: a language classification loss and a visual classification loss. We denote by $R^*$ the ground-truth goal room type and by $R^*_t$ the ground-truth room types of the agent’s optional action directions at each time step t. The room classification loss is defined as follows:
$$L_R = -\sum_{t=1}^{T} \left( P^{R^*}_I \log P^R_I + P^{R^*_t}_{v_t} \log P^{R_t}_{v_t} \right)$$

5.5.3. Total Loss

The final objective is defined as:
$$L = \lambda_1 L_{IL} + \lambda_2 L_R$$
where $\lambda_i$ ($i = 1, 2$) are the trade-off weights of the total loss and T is the final navigation time step.
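Putting the two terms together, a minimal sketch of the combined objective (with $\lambda_1 = 1$ and $\lambda_2 = 0.8$ as in our experiments) could look as follows; the tensor shapes and helper name are illustrative, and the language-side room prediction is treated per step here only for simplicity.

```python
import torch


def total_loss(p_actions, teacher_actions, P_I, R_star, P_v, R_t_star,
               lam1: float = 1.0, lam2: float = 0.8):
    """Combine the imitation-learning loss with the two room-classification terms.

    p_actions, P_I, P_v: [T, num_classes] probability tensors; the targets are [T]
    index tensors (teacher actions and ground-truth room types)."""
    eps = 1e-8
    L_IL = -torch.log(p_actions + eps).gather(1, teacher_actions.unsqueeze(1)).sum()
    L_R = (-torch.log(P_I + eps).gather(1, R_star.unsqueeze(1)).sum()
           - torch.log(P_v + eps).gather(1, R_t_star.unsqueeze(1)).sum())
    return lam1 * L_IL + lam2 * L_R


T, A, NR = 5, 6, 26
loss = total_loss(torch.softmax(torch.randn(T, A), -1), torch.randint(0, A, (T,)),
                  torch.softmax(torch.randn(T, NR), -1), torch.randint(0, NR, (T,)),
                  torch.softmax(torch.randn(T, NR), -1), torch.randint(0, NR, (T,)))
```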

6. Experiments

6.1. Experimental Setup

We compared our designed AKCR-AH model with the following baseline polices:
  • No Query (NQ): never asks for help.
  • No Knowledge (NK): does not introduce knowledge graph for reasoning.
  • No Query and Knowledge (NQK): has neither of the above capabilities.
  • VNLA-Direct: follows VNLA with direct intervention [4].
In our experiments, we only considered realizing the simulated “human-on-the-loop” function through direct intervention, because in practical tasks such as autonomous driving and air combat, direct human intervention is often more effective and faster at improving the decision-making performance of human–machine collaboration.

6.1.1. Evaluation Metrics

We mainly applied three metrics to evaluate performance: Success Rate (SR), Room-finding Success Rate (RSR), and Navigation Error (NE). The Success Rate (SR) is the percentage of episodes in which the agent reaches a position within a certain threshold distance of the target. The Room-finding Success Rate (RSR) is the percentage of episodes in which the final position lies in the goal room type. Navigation Error (NE) measures the distance between the target and the stop location. The Success Rate (SR) is the key metric for the task.
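For clarity, the metrics could be computed over a set of finished episodes as in the following sketch; the function names are illustrative, and the success radius follows $R_s = 2$ m from our setup.

```python
from typing import List


def success_rate(final_dists: List[float], radius: float = 2.0) -> float:
    """SR: fraction of episodes whose final position lies within the success radius."""
    return sum(d <= radius for d in final_dists) / len(final_dists)


def room_success_rate(final_rooms: List[str], goal_rooms: List[str]) -> float:
    """RSR: fraction of episodes that terminate in the goal room type."""
    return sum(f == g for f, g in zip(final_rooms, goal_rooms)) / len(final_rooms)


def navigation_error(final_dists: List[float]) -> float:
    """NE: mean distance between the stop location and the target."""
    return sum(final_dists) / len(final_dists)


dists = [1.5, 4.0, 0.8]   # shortest-path distances to the target at stopping time
print(success_rate(dists), navigation_error(dists))
```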

6.1.2. Implementation Details

Our AKCR-AH model was implemented using PyTorch and trained on a single Titan RTX GPU for 200,000 iterations. The batch size was set to 100 throughout training, and the Adam optimizer with a learning rate of $10^{-4}$ was used for updating. Images in the scenes were encoded by a pre-trained ResNet-152. The channel dimensions of the features were set to $D_w = 300$ and $D_h = 512$. The number of actions provided by the simulated human assistant each time (k) was 1, and the success radius $R_s$ of the task was fixed at 2 m. The training loss weights were set to $\lambda_1 = 1$ and $\lambda_2 = 0.8$. The value of top-K was 5. In the heuristic-based rules, the deviation threshold $\delta$, uncertainty threshold $\epsilon$, and non-moving threshold $\mu$ were 8, 1, and 9, respectively.
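These hyperparameters could be wired up as in the brief sketch below; the placeholder model and variable names are illustrative only.

```python
import torch

# Placeholder model standing in for the full AKCR-AH network
model = torch.nn.Linear(512, 6)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # Adam, learning rate 1e-4

BATCH_SIZE = 100              # batch size throughout training
NUM_ITERATIONS = 200_000      # total training iterations
LAMBDA_1, LAMBDA_2 = 1.0, 0.8 # loss trade-off weights
TOP_K = 5                     # ConceptNet facts retrieved per object category
SUCCESS_RADIUS_M = 2.0        # R_s
QUERY_BUDGET = 10             # B_0
DELTA, EPSILON, MU = 8, 1, 9  # heuristic-rule thresholds
```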

6.2. Results and Analysis

6.2.1. Learning Processes

We compared the learning curves of different baselines, as shown in Figure 4. After 15,000 iterations, the performance of AKCR-AH surpasses VNLA-Direct in SR and NE metrics, and gradually stabilizes. AKCR-AH takes about 5.5 days to complete 20,000 iterations of training.

6.2.2. Main Results

Our main test results are presented in Figure 5. Overall, empowering the agent with the ability to use external knowledge for reasoning and to ask for help greatly boosts its performance. From Figure 5, we find that our proposed AKCR-AH model improves the SR and NE metrics on the seen VNLA tasks, demonstrating the importance of incorporating commonsense knowledge and realizing the simulated “human-on-the-loop” function in VLN tasks. Specifically, we improved [SR, NE] on Test Seen by [2.36%, 0.29 m] compared to VNLA-DIRECT (a lower NE means better model performance), but there was still a certain gap on Test Unseen. The results of the other three ablation baselines in Figure 5 examine the contribution of each component of AKCR-AH, further demonstrating that incorporating the external knowledge graph and direct intervention assistance is beneficial for improving the agent’s navigation performance. Note that since success rates tend to take a long time to converge on Test Seen, we compared agents that achieved comparable success rates on Val Seen (the difference in success rates is no more than 0.5%).
The relatively small gap between NK and AKCR-AH indicates that the simulated human assistant module is the more useful component for improving the agent’s policy: NK far outperforms NQ and NQK in all metrics on Test. The external knowledge reasoning module also plays an important role: compared with NQK, NQ with the knowledge graph further improves all metrics on Test. Our AKCR-AH blends the two to achieve superior performance, outperforming the state-of-the-art VNLA baseline in the SR and NE metrics on Test Seen. On Test Unseen, AKCR-AH still shows a certain performance gap compared to VNLA-DIRECT. We believe this is because the external knowledge reasoning still needs better generalization to unseen scenes, and because of computing resource limitations. However, the clear performance gain over the ablation baselines on Test Unseen shows that AKCR-AH benefits from prior knowledge and additional assistance.

7. Conclusions

In this paper, we propose an Attention-based Knowledge-enabled Cross-modality Reasoning with Assistant’s Help (AKCR-AH) model to address the unique challenges of the VNLA task. AKCR-AH mainly includes three parts: an Attention-based Knowledge-enabled Cross-modality Reasoning (AKCR) module, a memory-based Transformer, and a simulated human assistant. AKCR extracts room/object clues from vision and language, then conducts room- and object-entity reasoning by applying graph-based knowledge. The memory-based Transformer framework is used as the action policy module, enabling the agent to explicitly leverage history information to improve the accuracy of decision-making. Meanwhile, we also place a simulated human assistant in the environment, which can assist the agent through direct intervention when needed, so as to realize the function of “Human-on-the-Loop” in a simulated way. Extensive experiments demonstrate the effectiveness of our proposed method.
In the future, we will explore how to enable the agent to decide what type of information to ask, so that the simulated human assistant can provide more efficient assistance. Moreover, we will extend the model with an environment-agnostic learning framework to improve the generalization performance in unseen environments. Finally, we will also investigate how to transfer from the simulator to the real world.

Author Contributions

X.L. and Y.Z. proposed the method; X.L. and W.Y. designed and performed the experiments; X.L., J.L., and W.Y. analyzed the experimental data and wrote the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available in the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Anderson, P.; Wu, Q.; Teney, D.; Bruce, J.; Johnson, M.; Sünderhauf, N.; Reid, I.; Gould, S.; Van Den Hengel, A. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3674–3683. [Google Scholar]
  2. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  3. Thomason, J.; Murray, M.; Cakmak, M.; Zettlemoyer, L. Vision-and-dialog navigation. In Proceedings of the Conference on Robot Learning, Cambridge, MA, USA, 16–18 November 2020; pp. 394–406. [Google Scholar]
  4. Nguyen, K.; Dey, D.; Brockett, C.; Dolan, B. Vision-based navigation with language-based assistance via imitation learning with indirect intervention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 12527–12537. [Google Scholar]
  5. Lin, C.; Jiang, Y.; Cai, J.; Qu, L.; Haffari, G.; Yuan, Z. Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation. arXiv 2021, arXiv:2111.05759. [Google Scholar]
  6. Speer, R.; Chin, J.; Havasi, C. Conceptnet 5.5: An open multilingual graph of general knowledge. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
  7. Gu, J.; Stefani, E.; Wu, Q.; Thomason, J.; Wang, X.E. Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions. arXiv 2022, arXiv:2203.12667. [Google Scholar]
  8. Wu, W.; Chang, T.; Li, X. Visual-and-language navigation: A survey and taxonomy. arXiv 2021, arXiv:2108.11544. [Google Scholar]
  9. Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C.L.; Parikh, D. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 2425–2433. [Google Scholar]
  10. Wang, X.; Huang, Q.; Celikyilmaz, A.; Gao, J.; Shen, D.; Wang, Y.F.; Wang, W.Y.; Zhang, L. Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 6629–6638. [Google Scholar]
  11. Ma, C.Y.; Lu, J.; Wu, Z.; AlRegib, G.; Kira, Z.; Socher, R.; Xiong, C. Self-monitoring navigation agent via auxiliary progress estimation. arXiv 2019, arXiv:1901.03035. [Google Scholar]
  12. Lin, B.; Zhu, Y.; Long, Y.; Liang, X.; Ye, Q.; Lin, L. Adversarial reinforced instruction attacker for robust vision-language navigation. arXiv 2021, arXiv:2107.11252. [Google Scholar]
  13. Fried, D.; Hu, R.; Cirik, V.; Rohrbach, A.; Andreas, J.; Morency, L.P.; Berg-Kirkpatrick, T.; Saenko, K.; Klein, D.; Darrell, T. Speaker-follower models for vision-and-language navigation. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar]
  14. Tan, H.; Yu, L.; Bansal, M. Learning to navigate unseen environments: Back translation with environmental dropout. arXiv 2019, arXiv:1904.04195. [Google Scholar]
  15. Li, J.; Tan, H.; Bansal, M. EnvEdit: Environment Editing for Vision-and-Language Navigation. arXiv 2022, arXiv:2203.15685. [Google Scholar]
  16. Chen, K.; Chen, J.K.; Chuang, J.; Vázquez, M.; Savarese, S. Topological planning with Transformers for vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 11276–11286. [Google Scholar]
  17. Hong, Y.; Wu, Q.; Qi, Y.; Rodriguez-Opazo, C.; Gould, S. Vln bert: A recurrent vision-and-language bert for navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 1643–1653. [Google Scholar]
Figure 1. A demonstration of the AKCR-AH navigation process. Initially, the agent stands in the bedroom at “1” and is asked to “Find a clock in the dining room”. The agent begins navigating with knowledge-graph reasoning but gets lost at “2”, in the hallway by the bedroom door, so it signals the simulated human assistant for help. Upon request, the assistant provides a single-step teacher action, “Go Forward”, through direct intervention. The agent then continues navigating, passes “3”, and reaches the living room. At “4”, the agent gets lost once more; since its query budget is not yet exhausted, it asks the assistant for help again. Finally, the agent finds the target object in the dining room at “5” without further assistance and stops.
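To make the interaction pattern in Figure 1 concrete, the following is a minimal Python sketch of the help-on-request loop: the agent follows its own policy, and only when it judges itself lost and still has queries left does it execute the assistant’s single-step teacher action. All interfaces here (env, agent, assistant, is_lost, and so on) are illustrative assumptions, not the paper’s released implementation.

```python
# Hypothetical sketch of the Figure 1 help-on-request loop; all interfaces are
# assumptions for illustration, not the paper's actual code.

def navigate_with_assistance(env, agent, assistant, instruction,
                             query_budget=2, max_steps=40):
    """Run one episode in which the agent may request direct intervention."""
    obs = env.reset(instruction)
    for _ in range(max_steps):
        if agent.is_lost(obs) and query_budget > 0:
            # Direct intervention: a single-step teacher action, e.g. "Go Forward".
            action = assistant.teacher_action(env.state())
            query_budget -= 1
        else:
            action = agent.act(obs, instruction)
        obs, done = env.step(action)
        if done:  # the agent stops once it believes the target object is found
            break
    return env.success()
```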
Figure 2. The overall framework of AKCR-AH. The language and vision attention modules extract room/object features from the instruction and the visual observations, respectively. AIEPKR performs room- and object-entity reasoning, integrating graph-based commonsense knowledge from ConceptNet to generate a memory token. The memory-based Transformer then models the sequence of memory tokens to predict the next action. The simulated human assistant provides help when the heuristic-based asking rules are met.
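As a rough illustration of the Figure 2 pipeline, the PyTorch sketch below strings together cross-modal attention, a stand-in AIEPKR fusion layer, and a memory-based Transformer. The layer choices, dimensions, and the way a pooled ConceptNet-derived knowledge embedding is injected are assumptions made for exposition and do not reproduce the authors’ architecture.

```python
# Illustrative PyTorch sketch of the Figure 2 pipeline; module choices and
# dimensions are assumptions, not the authors' released model.
import torch
import torch.nn as nn

class AKCRAHPolicy(nn.Module):
    def __init__(self, d_model=512, n_actions=6, n_layers=2, n_heads=8):
        super().__init__()
        # Cross-modal attention stand-ins for the language / vision attention modules
        self.lang_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.vis_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # AIEPKR stand-in: fuses attended language/vision context with an external
        # knowledge embedding into a single memory token
        self.aiepkr = nn.Linear(3 * d_model, d_model)
        # Memory-based Transformer over the sequence of past memory tokens
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.memory_transformer = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, instr_tokens, view_feats, knowledge_emb, memory_tokens):
        # instr_tokens: (B, L, D), view_feats: (B, V, D), knowledge_emb: (B, D),
        # memory_tokens: (B, T, D) history of previous steps' memory tokens
        query = view_feats.mean(dim=1, keepdim=True)                 # (B, 1, D)
        lang_ctx, _ = self.lang_attn(query, instr_tokens, instr_tokens)
        vis_ctx, _ = self.vis_attn(query, view_feats, view_feats)
        fused = torch.cat([lang_ctx, vis_ctx, knowledge_emb.unsqueeze(1)], dim=-1)
        memory_token = self.aiepkr(fused)                            # (B, 1, D)
        seq = torch.cat([memory_tokens, memory_token], dim=1)        # append to history
        hidden = self.memory_transformer(seq)
        logits = self.action_head(hidden[:, -1])                     # next-action logits
        return logits, memory_token
```

For example, `AKCRAHPolicy()(torch.randn(1, 20, 512), torch.randn(1, 36, 512), torch.randn(1, 512), torch.randn(1, 5, 512))` returns the action logits for the current step together with the new memory token to append to the history.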
Figure 3. Illustration of the AIEPKR module.
Figure 4. Comparison of the learning curves during training. AK represents our AKCR-AH model.
Figure 5. Comparison of different ablation baselines on VNLA. SR (%), RSR (%), and NE (m) are reported for both Test Seen and Test Unseen scenes. Except for NE, higher values indicate better results.
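For readers who want to recompute such numbers from raw episode logs, the sketch below shows one plausible way to derive SR, RSR, and NE. The 2 m success radius, the reading of RSR as a room-level success rate, and the field names are assumptions, not the paper’s exact evaluation protocol.

```python
# Hedged sketch of metric computation from episode logs; thresholds and field
# names are assumptions, not the paper's exact protocol.
from dataclasses import dataclass
from typing import List

@dataclass
class Episode:
    dist_to_goal: float      # shortest-path distance (m) from stop point to target
    ended_in_goal_room: bool

def evaluate(episodes: List[Episode], success_radius: float = 2.0):
    n = len(episodes)
    sr = 100.0 * sum(e.dist_to_goal <= success_radius for e in episodes) / n  # SR (%)
    rsr = 100.0 * sum(e.ended_in_goal_room for e in episodes) / n             # RSR (%)
    ne = sum(e.dist_to_goal for e in episodes) / n                            # NE (m)
    return sr, rsr, ne
```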
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
