The task of Visual Question Answering (VQA) demands that an agent correctly answer a previously unseen question about a previously unseen image. The fact that neither the question nor the image is specified until test time means that the agent must embody most of the achievements of Computer Vision and Natural Language Processing, and many of those of Artificial Intelligence.

VQA is typically framed as a purely supervised learning problem. A large training set of example questions, images, and their correct answers is used to train a method that maps a question and image to scores over a predetermined, fixed vocabulary of possible answers, using maximum likelihood [39]. This approach has inherent scalability issues, as it attempts to represent all world knowledge within the finite set of parameters of a model such as a deep neural network. Consequently, a trained VQA system can only be expected to produce correct answers to questions drawn from a distribution very similar to that of the training set. Extending the model's knowledge or expanding its domain coverage is only possible by retraining it from scratch, which is computationally costly at best. This approach is thus fundamentally incapable of fulfilling the ultimate promise of VQA: answering general questions about general images.

Fig. 1. This paper considers visual question answering in a meta learning setting. The model is initially trained on a small set of questions and answers, and is provided with an additional, possibly large, support set of examples at test time. The model must learn to learn, i.e. to exploit the additional data on-the-fly, without the need for retraining. Notably, performance improves as additional and more relevant examples are included.

As a solution to these issues we propose a meta-learning approach to the problem. The meta learning approach implies that the model learns to learn, i.e. it learns to use a set of examples provided at test time to answer the given question (Fig. 1). Those examples are questions and images, each with their correct answer, such as might form part of the training set in a traditional setting. They are referred to here as the support set. Importantly, the support set is not fixed. Note also that the support set may be large, and that the majority of its elements may have no relevance to the current question. It is provided to the model at test time, and can be expanded with additional examples to increase the capabilities of the model. The model we propose ‘learns to learn’ in that it is able to identify and exploit the relevant examples within a potentially large support set dynamically, at test time. Providing the model with more information thus does not require retraining, and the ability to exploit such a support set greatly improves the practicality and scalability of the system. Indeed, it is ultimately desirable for a practical VQA system to be adaptable to new domains and to continuously improve as more data becomes available. That vision is a long term objective and this work takes only a small step in that direction.

There is significant practical interest in the meta-learning approach to VQA. It can ultimately enable the following scenarios, which are well outside the reach of traditional approaches:

  • models using constantly expanding support data (e.g. from knowledge bases, surveillance imagery, medical data, etc.) with no need for constant retraining;

  • models using support data too large to be captured within the weights of the model, e.g. data from web searches;

  • models trained and distributed without encapsulating sensitive data, for privacy or security reasons; after training on sanitized data, the model is provided with the sensitive information only at test time.

Our central technical contribution is to adapt a state-of-the-art VQA model [34] to the meta learning setting. The resulting model is a deep neural network that uses sets of dynamic parameters – also known as fast weights – determined at test time depending on the provided support set. The dynamic parameters allow the network to adaptively modify its computations and adapt its behaviour to the support set. We perform a detailed study to evaluate the effectiveness of those techniques under various regimes of training and support set sizes. Those experiments are based on the VQA v2 benchmark, for which we propose data splits appropriate for studying a meta learning setting.

A completely new capability demonstrated by the resulting system is to learn to produce novel answers (i.e. answers not seen during training), which are demonstrated only by instances of the support set provided at test time. In addition to this new capability, the system exhibits a behaviour qualitatively distinct from that of existing VQA systems in its improved handling of rare answers. Since datasets for VQA exhibit a heavy class imbalance, with a small number of answers being much more frequent than the rest, models optimized for current benchmarks are prone to fall back on frequent “safe” answers. In contrast, the proposed model is inherently less likely to fall victim to dataset biases, and exhibits a higher recall over rare answers. The proposed model does not surpass existing methods on the common aggregate accuracy metric, as is to be expected given that it does not overfit to dataset biases, but it nonetheless exhibits desirable traits overall.

The contributions of this paper are summarized as follows.

  1. We re-frame VQA as a meta learning task, in which the model is provided at test time with a support set of supervised examples (questions and images with their correct answers).

  2. We describe a neural network architecture and training procedure able to leverage the meta learning scenario. The model is based on a state-of-the-art VQA system and takes inspiration from techniques in the recent meta learning literature, namely prototypical networks [33] and meta networks [24].

  3. We provide an experimental evaluation of the proposed model in different regimes of training and support set sizes and across variations in design choices.

  4. Our results demonstrate the unique capability of the model to produce novel answers, i.e. answers never seen during training, by learning from support instances, an improved recall of rare answers, and a better sample efficiency than existing models.

1 Related Work

Visual Question Answering Visual question answering has gathered significant interest from the computer vision community [6], as it constitutes a practical setting to evaluate deep visual understanding. In addition to visual parsing, VQA requires the comprehension of a text question, and combined reasoning over vision and language, sometimes on the basis of external or common-sense knowledge. See [39] for a recent survey of methods and datasets.

VQA is always approached in a supervised setting, using large datasets [6, 15, 22, 44] of human-proposed questions with their correct answers to train a machine learning model. The VQA-real and VQA v2 datasets [6, 15] have served as popular benchmarks by which to evaluate and compare methods. Despite the large scale of those datasets, e.g. more than 650,000 questions in VQA v2, several limitations have been recognized. These relate to the dataset bias (i.e. the non-uniform, long-tailed distribution of answers) and the question-conditioned bias (which makes answers easy to guess from the question alone, without the image). For example, the answer “yes” is particularly prominent in [6] compared to “no”, and questions starting with “How many” can be answered correctly with “two” more than 30% of the time [15]. These issues hinder development in the field by encouraging methods that fare well on common questions and concepts, rather than on rare answers or more complicated questions. The aggregate accuracy metric used to compare methods is thus a poor indication of a method's capability for visual understanding. Improvements to datasets have been introduced [1, 15, 43], including VQA v2, but they only partially solve the evaluation problems. Interest has also grown in the handling of rare words and answers [29, 35]. The model proposed in this paper is inherently less prone to incorporating dataset biases than existing methods, and shows superior performance in handling rare answers. It accomplishes this by keeping a memory made up of explicit representations of training and support instances.

VQA with Additional Data In the classical supervised setting, a fixed set of questions and answers is used to train a model once and for all. With few exceptions, the performance of such a model is fixed as it cannot use additional information at test time. Among those exceptions, [38, 40] use an external knowledge base to gather non-visual information related to the input question. In [35], the authors use visual information from web searches in the form of exemplar images of question words, and better handle rare and novel words appearing in questions as a result. In [34], the same authors use similar images from web searches to obtain visual representations of candidate answers.

Those methods use ad-hoc, engineered techniques to incorporate external knowledge into the VQA model. In comparison, this paper presents a much more general approach. We expand the model's knowledge with data provided in the form of additional supervised examples (questions and images with their correct answers). A demonstration of the broader generality of our framework over the works above is its ability to produce novel answers, i.e. answers never observed during initial training and learned only from test-time examples.

Recent works on text-based question answering have investigated the retrieval of external information with reinforcement learning [8, 25, 26]. Those works are tangentially related and complementary to the approach explored in this paper.

Meta Learning and Few Shot Learning The term meta learning broadly refers to methods that learn to learn, i.e. that train models to make better use of training data. It applies to approaches including the learning of gradient descent-like algorithms such as [5, 13, 17, 30] for faster training or fine-tuning of neural networks, and the learning of models that can be directly fed training examples at test time [7, 33, 36]. The method we propose falls into the latter category. Most works on meta learning are motivated by the challenge of one-shot and few-shot visual recognition, where the task is to classify an image into categories defined by a few examples each. Our meta learning setting for VQA bears many similarities to this task: VQA is treated as a classification task, and we are provided, at test time, with examples that illustrate the possible answers – possibly a small number per answer. Most few-shot learning methods are, however, not directly applicable to our setting, due to the large number of classes (i.e. possible answers), the heavy class imbalance, and the need to integrate into an architecture suitable for VQA. For example, recent works such as [36] propose efficient training procedures that are only suitable for a small number of classes.

Our model uses a set of memories within a neural network to store the activations computed over the support set. Similarly, Kaiser et al.  [19] store past activations to remember “rare events”, which was notably evaluated on machine translation. Our model also uses network layers parametrized by dynamic weights, also known as fast weights. Those are determined at test time depending on the actual input to the network. Dynamic parameters have a long history in neural networks [32] and have been used previously for few-shot recognition [7] and for VQA [27]. One of the memories within our network stores the gradient of the loss with respect to static weights of the network, which is similar to the Meta Networks model proposed by Munkhdalai et al.  [24]. Finally, our output stage produces scores over possible answers by similarity to prototypes representing the output classes (answers). This follows a similar idea to the Prototypical Networks [33].

Continuum Learning An important outcome of framing VQA in a meta learning setting is to develop models capable of improving as more data becomes available. This touches the fields of incremental [12, 31] and continuum learning [2, 23, 42]. Those works focus on the fine-tuning of a network with new training data, output classes and/or tasks. In comparison, our model does not modify itself over time and cannot experience negative domain shift or catastrophic forgetting, which are a central concern of continuum learning [21]. Our approach is rather to use such additional data on-the-fly, at test time, i.e. without iterative retraining. An important motivation for our framework is its potential to apply to support data of a different nature than question/answer examples, which would allow the model to leverage general, non VQA-specific data, e.g. from knowledge bases or web searches. We consider this an important direction for future work.

Fig. 2. Overview of the proposed model. We obtain an embedding of the input question and image following [34]; our contributions concern the mapping of this embedding to scores over a set of candidate answers. First, a non-linear transformation (implemented as a gated hyperbolic tangent layer) is parametrized by static and dynamic weights. Static weights are learned like traditional weights by gradient descent, while dynamic ones are determined based on the actual input and a memory of candidate dynamic weights filled by processing the support set. Second, a similarity measure compares the resulting feature vector to a set of prototypes, each representing a specific candidate answer. Static prototypes are learned like traditional weights, while dynamic prototypes are determined by processing the support set. Dashed lines indicate data flow during the processing of the support set. See Sect. 3 for details.

2 VQA in a Meta Learning Setting

The traditional approach to VQA is a supervised setting described as follows. A model is trained to map an input question \(\mathsf {Q}\) and image \(\mathsf {I}\) to scores over candidate answers [39]. The model is trained to maximize the likelihood of correct answers over a training set \({\mathcal {T}}\) of triplets \((\mathsf {Q},\mathsf {I},\hat{{\varvec{s}}})\), where \(\hat{{\varvec{s}}}\in [0,1]^A\) represents the vector of ground truth scores over the predefined set of A possible answers. At test time, the model is evaluated on another triplet \((\mathsf {Q}',\mathsf {I}',\hat{{\varvec{s}}}')\) from an evaluation or test set \({\mathcal {E}}\). The model predicts scores \({\varvec{s}}'\) over the set of candidate answers, which can be compared to the ground truth \(\hat{{\varvec{s}}}'\) for evaluation purposes.

We extend the formulation above to a meta learning setting by introducing an additional support set \({\mathcal {S}}\) of similar triplets \((\mathsf {Q}'',\mathsf {I}'',\hat{{\varvec{s}}}'')\), provided to the model at test time. At a minimum, we define the support set to include the training examples themselves, i.e. \({\mathcal {S}}={\mathcal {T}}\). More interestingly, the support set can include novel examples \({\mathcal {S'}}\) provided at test time, which constitute additional data to learn from, such that \({\mathcal {S}}={\mathcal {T}}\cup {\mathcal {S'}}\). The triplets \((\mathsf {Q},\mathsf {I},\hat{{\varvec{s}}})\) in the support set can also include novel answers, never seen in the training set. In that case, the ground truth score vectors \(\hat{{\varvec{s}}}\) of the other elements of the support set are simply padded with zeros to match the larger size \(A'\) of the extended set of answers.
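To make the setting concrete, here is a minimal sketch in Python of the required bookkeeping; the toy triplets and vocabulary sizes are hypothetical, and only the zero-padding of score vectors reflects the formulation above.

```python
import numpy as np

def pad_scores(s_hat: np.ndarray, num_answers: int) -> np.ndarray:
    """Zero-pad a ground truth score vector from [0,1]^A to [0,1]^A'."""
    padded = np.zeros(num_answers, dtype=s_hat.dtype)
    padded[: len(s_hat)] = s_hat
    return padded

A, A_prime = 3, 5                     # training answers; extended vocabulary A'
# Toy triplets (question, image, scores) standing in for (Q, I, s_hat).
train_set = [("what color is the car?", "img0", np.array([1.0, 0.0, 0.0]))]
novel_set = [("how many cars?", "img1", np.array([0.0, 0.0, 0.0, 1.0, 0.0]))]
# Pad training scores to the extended answer set, then form S = T ∪ S'.
support_set = [(q, i, pad_scores(s, A_prime)) for (q, i, s) in train_set] + novel_set
```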

The following sections describe a deep neural network that can take advantage of the support set at test time. To leverage the information contained in the support set, the model must learn to utilize these examples on-the-fly at test time, without retraining of the whole model.

3 Proposed Model

The proposed model (Fig. 2) is a deep neural network that extends the state-of-the-art VQA system of Teney et al. [34]. Their system implements the joint embedding approach common to most modern VQA models [18, 20, 39, 41], followed by a multi-label classifier over candidate answers. Conceptually, we separate the architecture into (1) the embedding part that encodes the input question and image, and (2) the classifier part that handles the reasoning and actual question answering. The contributions of this paper address only the second part. Our contributions are orthogonal to developments on the embedding part, which could also benefit e.g. from advanced attention mechanisms or other computer vision techniques [3, 37, 39]. We follow the implementation of [34] for the embedding part. For concreteness, let us mention that the question embedding uses GloVe word vectors [28] and a Gated Recurrent Unit (GRU) [10]. The image embedding uses features from a Convolutional Neural Network (CNN) with bottom-up attention [3] and question-guided attention over those features. See [34] for details.

For the remainder of this paper, we abstract the embedding to modules that produce the question and image vectors \({\varvec{q}}\) and \({\varvec{v}}\in \mathbb {R}^D\), respectively. They are combined with a Hadamard (element-wise) product into \({\varvec{h}}= {\varvec{q}}~\circ ~ {\varvec{v}}\), which forms the input to the classifier, on which we now focus. The role of the classifier is to map \({\varvec{h}}\) to a vector of scores \({\varvec{s}}\in [0,1]^A\) over the candidate answers. We propose a definition of the classifier that generalizes the implementation of traditional models such as [34]. The input to the classifier \({\varvec{h}}\in \mathbb {R}^D\) is first passed through a non-linear transformation \(f_{\varvec{\theta }}: \mathbb {R}^D\rightarrow \mathbb {R}^D\), then through a mapping to scores over the set of candidate answers \(g_ \Phi : \mathbb {R}^D\rightarrow [0,1]^A\). This produces a vector of predicted scores \({\varvec{s}}= g_ \Phi (f_{\varvec{\theta }}({\varvec{h}}))\). In traditional models, the two functions correspond to a stack of non-linear layers for \(f_{\varvec{\theta }}\), and a linear layer followed by a softmax or sigmoid for \(g_ \Phi \). We now show how to extend \(f_{\varvec{\theta }}\) and \(g_ \Phi \) to take advantage of the meta learning setting.
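As a minimal illustration (PyTorch), the classifier stage can be sketched as follows; `f_theta` and `g_phi` are placeholders for the two stages defined in the next sections.

```python
import torch

def classify(q: torch.Tensor, v: torch.Tensor, f_theta, g_phi) -> torch.Tensor:
    """h = q ∘ v (Hadamard product), then s = g_Φ(f_θ(h))."""
    h = q * v                  # q, v in R^D
    return g_phi(f_theta(h))   # scores in [0,1]^A over candidate answers
```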

3.1 Non-linear Transformation \(f_{\varvec{\theta }}(\cdot )\)

The role of the non-linear transformation \(f_{\varvec{\theta }}({\varvec{h}})\) is to map the embedding \({\varvec{h}}\) of the question/image to a representation suitable for the following (typically linear) classifier. This transformation can be implemented in a neural network with any type of non-linear layers; our contributions are agnostic to this implementation choice. We follow [34] and use a gated hyperbolic tangent layer [11], defined as

$$\begin{aligned} f_{\varvec{\theta }}({\varvec{h}}) ~~=~~ \sigma ( W {\varvec{h}}+ \varvec{b}) ~\circ ~ \text {tanh}\,(W' {\varvec{h}}+ \varvec{b}' ) \end{aligned}$$
(1)

where \(\sigma \) is the logistic activation function, \(W, W' \in \mathbb {R}^{D \times D}\) are learned weights, \(\varvec{b}, \varvec{b}' \in \mathbb {R}^D\) are learned biases, and \(\circ \) is the Hadamard (element-wise) product. We define the parameters \({\varvec{\theta }}\) as the concatenation of the vectorized weights and biases, i.e. \({\varvec{\theta }}=[W_:;W'_:;\varvec{b};\varvec{b}']\), where colons denote the vectorization of matrices. The vector \({\varvec{\theta }}\) thus contains all weights and biases used by the non-linear transformation. A traditional model would learn the weights \({\varvec{\theta }}\) by backpropagation and gradient descent on the training set, and they would be held static during test time. We propose instead to adaptively adjust the weights at test time, depending on the input \({\varvec{h}}\) and the available support set. Concretely, we use a combination of static parameters \({\varvec{\theta }}^\mathsf {s}\) learned in the traditional manner, and dynamic ones \({\varvec{\theta }}^\mathsf {d}\) determined at test time. They are combined as \({\varvec{\theta }}={\varvec{\theta }}^\mathsf {s} \,+\, {\varvec{w}}{\varvec{\theta }}^\mathsf {d}\), with \({\varvec{w}}\in \mathbb {R}^D\) a vector of learned weights. The dynamic weights can therefore be seen as an adjustment made to the static ones depending on the input \({\varvec{h}}\).
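A possible implementation of this layer (PyTorch sketch; the packing of \({\varvec{\theta }}\) into a flat vector follows the concatenation defined above, and the static/dynamic combination is indicated as a comment since the retrieval of \({\varvec{\theta }}^\mathsf {d}\) is described next):

```python
import torch

def gated_tanh(h: torch.Tensor, theta: torch.Tensor, D: int) -> torch.Tensor:
    """Gated hyperbolic tangent layer (Eq. 1), with parameters packed
    as theta = [W_: ; W'_: ; b ; b']."""
    W  = theta[: D * D].view(D, D)
    W2 = theta[D * D : 2 * D * D].view(D, D)
    b  = theta[2 * D * D : 2 * D * D + D]
    b2 = theta[2 * D * D + D :]
    return torch.sigmoid(h @ W.t() + b) * torch.tanh(h @ W2.t() + b2)

# At test time: theta = theta_s + w * theta_d, i.e. the static weights plus
# the learned-weighted dynamic adjustment retrieved from the memory (Eq. 2).
```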

A set of candidate dynamic weights are maintained in an associative memory \(\mathcal{M}\). This memory is a large set (as large as the support set, see Sect. 3.2) of key/value pairs \(\mathcal{M}=\{(\tilde{{\varvec{h}}_i},\tilde{{\varvec{\theta }}}^\mathsf {d}_i)\}_{i\in 1\ldots |{\mathcal {S}}|}\). The interpretation for \(\tilde{{\varvec{\theta }}}^\mathsf {d}_i\) is of dynamic weights suited to an input similar to \(\tilde{{\varvec{h}}_i}\). Therefore, at test time, we retrieve appropriate dynamic weights \({\varvec{\theta }}^\mathsf {d}\) by soft key matching:

$$\begin{aligned} {\varvec{\theta }}^\mathsf {d} ~~=~~ \sum _{i=1}^{|{\mathcal {S}}|} \mathop {\text {softmax}}\limits _{i} \big ( d_\mathsf {cos}({\varvec{h}}, \tilde{{\varvec{h}}}_i) \big ) ~ \tilde{{\varvec{\theta }}}^\mathsf {d}_i \end{aligned}$$
(2)

where \(d_\mathsf {cos}(\cdot ,\cdot )\) is the cosine similarity function. We therefore retrieve a weighted sum, in which the similarity of \({\varvec{h}}\) with the memory keys \(\tilde{{\varvec{h}}_i}\) serves to weight the memory values \(\tilde{{\varvec{\theta }}}^\mathsf {d}_i\). In practice and for computational reasons, the softmax function cuts off after the top k largest values, with k in the order of a thousand elements (see Sect. 4). We detail in Sect. 3.2 how the memory is filled by processing the support set. Note that the above formulation can be made equivalent to the original model in [34] by using only static weights (\({\varvec{\theta }}={\varvec{\theta }}^\mathsf {s}\)). This serves as a baseline in our experiments (see Sect. 4).
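A sketch of this retrieval in PyTorch, under the assumption that the memory keys and values are stacked into tensors:

```python
import torch
import torch.nn.functional as F

def retrieve_dynamic_weights(h, keys, values, k=1000):
    """Soft key matching (Eq. 2): weight the stored candidate dynamic
    weights by a softmax over cosine similarities with the input h,
    truncated to the top k keys for efficiency.
    keys: (N, D) stored embeddings;  values: (N, P) stored weight vectors."""
    sims = F.cosine_similarity(h.unsqueeze(0), keys, dim=1)  # (N,)
    top, idx = sims.topk(min(k, sims.numel()))
    return torch.softmax(top, dim=0) @ values[idx]           # (P,)
```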

Mapping to Candidate Answers \(g_ \Phi (\cdot )\) The function \(g_ \Phi ({\varvec{h}}')\) maps the output of the non-linear transformation to a vector of scores \({\varvec{s}}\in [0,1]^A\) over the set of candidate answers. It is traditionally implemented as a simple affine or linear transformation (i.e. a matrix multiplication). We generalize the definition of \(g_ \Phi (\cdot )\) by interpreting it as a similarity measure between its input \({\varvec{h}}'\) and prototypes \( \Phi =\{\varvec{\phi }_i^a\}_{i,a}\) representing the possible answers. In traditional models, each prototype corresponds to one row of the weight matrix. Our general formulation allows one or several prototypes per possible answer a, written \(\{\varvec{\phi }_i^a\}_{i=1}^{N^a}\) (where a indexes candidate answers and i indexes the \(N^a\) support examples having a as a correct answer). Intuitively, the prototypes represent the typical expected feature vector when a is a correct answer. The score for a is therefore obtained as the similarity between the provided \({\varvec{h}}'\) and the corresponding prototypes of a. When multiple prototypes are available, the similarities are averaged. Concretely, we define

$$\begin{aligned} g^a_ \Phi ({\varvec{h}}') ~~=~~ \sigma \Big ( \, \frac{1}{N^a} \sum ^{N^a}_{i=1} d({\varvec{h}}', \varvec{\phi }_i^a) \, + b'' \, \Big ) \end{aligned}$$
(3)

where \(d(\cdot ,\cdot )\) is a similarity measure, \(\sigma \) is a sigmoid (logistic) activation function that maps the similarities to [0, 1], and \(b''\) is a learned bias term. Traditional models that use a matrix multiplication [18, 34, 35] correspond to a \(g_ \Phi (\cdot )\) that uses the dot product as its similarity function. In comparison, our definition generalizes to multiple prototypes per answer and to different similarity measures. Our experiments evaluate the dot product and weighted \(L_1\) and \(L_2\) norms of vector differences:

$$\begin{aligned} d_\mathsf {dot}({\varvec{h}},{\varvec{\theta }})&~~=~~ {\varvec{h}}^\intercal \, {\varvec{\theta }}\end{aligned}$$
(4)
$$\begin{aligned} d_\mathsf {L1}({\varvec{h}},{\varvec{\theta }})&~~=~~ {\varvec{w}}'''^\intercal \, \left| {\varvec{h}}- {\varvec{\theta }}\right| \end{aligned}$$
(5)
$$\begin{aligned} d_\mathsf {L2}({\varvec{h}},{\varvec{\theta }})&~~=~~ {\varvec{w}}'''^\intercal \, ( {\varvec{h}}- {\varvec{\theta }})^2 \end{aligned}$$
(6)

where \({\varvec{w}}''' \in \mathbb {R}^D\) is a vector of learned weights applied coordinate-wise.
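The three similarity measures and the averaged-prototype scoring of Eq. 3 can be sketched as follows (PyTorch; `w` stands for the learned \({\varvec{w}}'''\) and `b` for the bias \(b''\)):

```python
import torch

def d_dot(h, phi):                                # Eq. 4
    return h @ phi

def make_d_l1(w):                                 # Eq. 5: w in R^D, learned
    return lambda h, phi: w @ (h - phi).abs()

def make_d_l2(w):                                 # Eq. 6: weighted squared diff.
    return lambda h, phi: w @ (h - phi).pow(2)

def score_answer(h_prime, prototypes, b, d=d_dot):
    """Eq. 3: average similarity of h' to the N^a prototypes of answer a,
    plus a learned bias, squashed to [0, 1] with a sigmoid."""
    sims = torch.stack([d(h_prime, p) for p in prototypes])
    return torch.sigmoid(sims.mean() + b)
```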

Our model uses two sets of prototypes, the static \( \Phi ^\mathsf {s}\) and the dynamic \( \Phi ^\mathsf {d}\). The static ones are learned during training as traditional weights by backpropagation and gradient descent, and held fixed at test time. The dynamic ones are determined at test time by processing the provided support set (see Sect. 3.2). Thereafter, all prototypes \( \Phi = \Phi ^\mathsf {s} \,\cup \, \Phi ^\mathsf {d}\) are used interchangeably. Note that our formulation of \(g_ \Phi (\cdot )\) can be made equivalent to the original model of [34] by using only static prototypes (\( \Phi = \Phi ^\mathsf {s}\)) and the dot-product similarity measure \(d_\mathsf {dot}(\cdot ,\cdot )\). This will serve as a baseline in our experiments (Sect. 4).

Finally, the output of the network is attached to a cross-entropy loss \(\mathscr {L}({\varvec{s}},\hat{{\varvec{s}}})\) between the predicted and ground truth scores, for training the model end-to-end [34].

3.2 Processing of Support Set

Both functions \(f_{\varvec{\theta }}(\cdot )\) and \(g_ \Phi (\cdot )\) defined above use dynamic parameters that are dependent on the support set. Our model processes the entire support set in a forward and backward pass through the network as described below. This step is to be carried out once at test time, prior to making predictions on any instance of the test set. At training time, it is repeated before every epoch to account for the evolving static parameters of the network as training progresses (see the algorithm in the supplementary material).

We pass all elements of the support set \({\mathcal {S}}\) through the network in mini-batches for both a forward and backward pass. The evaluation of \(f_{\varvec{\theta }}(\cdot )\) and \(g_ \Phi (\cdot )\) uses only static weights and prototypes, i.e. \({\varvec{\theta }}={\varvec{\theta }}^\mathsf {s}\) and \(\varvec{\phi }=\varvec{\phi }^\mathsf {s}\). To fill the memory \(\mathcal{M}\), we collect, for every element of the support set, its feature vector \({\varvec{h}}\) and the gradient \(\nabla _{{\varvec{\theta }}^\mathsf {s}} \mathscr {L}\) of the final loss relative to the static weights \({\varvec{\theta }}^\mathsf {s}\). This effectively captures the adjustments that a gradient descent algorithm would make to those weights for that particular example. The pair \(({\varvec{h}},\nabla _{{\varvec{\theta }}^\mathsf {s}} \mathscr {L})\) is added to the memory \(\mathcal{M}\), which thus holds \(|{\mathcal {S}}|\) elements at the end of the process.

To determine the set of dynamic prototypes \(\varvec{\phi }^\mathsf {d}\), we collect the feature vectors \({\varvec{h}}'=f_{\varvec{\theta }}({\varvec{h}})\) over all instances of the support set, then average them over instances having the same correct answer. Concretely, the dynamic prototype for answer a is obtained as \(\varvec{\phi }^a = \frac{1}{N^a} \sum _{i:\hat{s}_i^a=1} {\varvec{h}}'_i\).
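A condensed sketch of this support-set pass (PyTorch; `model.embed`, `model.f_theta`, `model.g_phi`, `model.loss` and `model.theta_s` are hypothetical interface names, and mini-batching is omitted for clarity):

```python
import torch

def process_support_set(support_set, model):
    """Fill the memory M with (h, grad of loss wrt theta_s) pairs, and
    average the features h' per answer into dynamic prototypes, using
    static parameters only."""
    memory, feats = [], {}
    for q, img, s_hat in support_set:
        h = model.embed(q, img)                        # joint embedding
        h_prime = model.f_theta(h, dynamic=False)      # static weights only
        loss = model.loss(model.g_phi(h_prime, dynamic=False), s_hat)
        grad = torch.autograd.grad(loss, model.theta_s)[0]
        memory.append((h.detach(), grad.detach()))     # key/value pair for M
        for a in s_hat.nonzero().flatten().tolist():   # answers with s_hat^a = 1
            feats.setdefault(a, []).append(h_prime.detach())
    prototypes = {a: torch.stack(f).mean(0) for a, f in feats.items()}
    return memory, prototypes
```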

During training, we must balance the need for data to train the static parameters of the network, and the need for an “example” support set, such that the network can learn to use novel data. If the network is provided with a fixed, constant support set, it will overfit to that input and be unable to make use of novel examples at test time. Our training procedure uses all available data as the training set \({\mathcal {T}}\), and we form a different support set \({\mathcal {S}}\) at each training epoch as a random subset of \({\mathcal {T}}\). The procedure is summarized in the algorithm provided in the supplementary material. Note that in practice, it is parallelized to process instances in mini-batches rather than individually.
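The resulting training loop can be sketched as follows (the `minibatches` helper and `training_step` method are hypothetical); the essential point is the fresh random support subset drawn at every epoch:

```python
import random

def train(model, train_set, num_epochs, support_size=1000):
    for epoch in range(num_epochs):
        # Resample the support set so the model cannot overfit to a fixed one.
        support = random.sample(train_set, support_size)
        model.memory, model.dyn_prototypes = process_support_set(support, model)
        for batch in minibatches(train_set, batch_size=256):
            model.training_step(batch)   # standard backprop on static parameters
```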

4 Experiments

We perform a series of experiments to evaluate (1) how effectively the proposed model and its different components can use the support set, (2) how useful novel support instances are for VQA, and (3) whether the model learns different aspects of a dataset than classical VQA methods trained in the classical setting.

Datasets The VQA v2 dataset [15] serves as the principal current benchmark for VQA. Its heavy class imbalance among answers makes it very difficult to draw meaningful conclusions or perform a qualitative evaluation, however. We therefore additionally propose a series of experiments on a subset referred to as VQA-Numbers. It includes all questions marked in VQA v2 as “number” questions, which are further cleaned up to remove answers appearing fewer than 1,000 times in the training set, and to remove questions that do not have an unambiguous answer (we keep only those whose ground truth scores contain a single element equal to 1.0). Questions from the original validation set of VQA v2 are used for evaluation, and the original training set (45,965 questions after clean-up) is used for training, support, and validation. The precise data splits will be made publicly available. Most importantly, the resulting set of candidate answers corresponds to the seven numbers from 0 to 6. See details in the supplementary material.

Metrics The standard metric for evaluation on VQA v2 is the accuracy, defined, using the notations of Sect. 2, as \(\frac{1}{|{\mathcal {E}}|} \sum _i \hat{s}_i^{a^\star _i}\), where \(\hat{s}_i\) are the ground truth scores and \({a^\star _i}={arg\!max}_a \, s_i^a\) is the answer of highest predicted score. We also define the recall of an answer a as \(\sum _{i:a^\star _i=a} \hat{s}_i^{a} \,/\, \sum _i \hat{s}_i^a\), i.e. the fraction of the ground truth occurrences of a that is recovered by the predictions. We look at the recall averaged (uniformly) over all possible answers to better reflect performance across a variety of answers, rather than on the most common ones.
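A sketch of both metrics as defined above (NumPy; the (N, A) array layout is an assumption):

```python
import numpy as np

def vqa_metrics(pred, gt):
    """pred, gt: (N, A) arrays of predicted / ground truth answer scores.
    Returns (accuracy, recall averaged uniformly over answers)."""
    a_star = pred.argmax(axis=1)                      # highest-scored answer
    accuracy = gt[np.arange(len(gt)), a_star].mean()  # (1/|E|) sum_i s_hat_i^{a*_i}
    recalls = []
    for a in range(gt.shape[1]):
        total = gt[:, a].sum()                        # ground truth mass of answer a
        if total > 0:
            recalls.append(gt[a_star == a, a].sum() / total)
    return accuracy, float(np.mean(recalls))
```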

Implementation Our implementation is based on the code provided by the authors of [34]. Details non-specific to our contributions can be found there. We initialize all parameters, in particular the static weights and static prototypes, as if they were those of a linear layer in a traditional architecture, following Glorot and Bengio [14]. During training, the support set is subsampled (Sect. 3.2) to yield a set of 1,000 elements. We use, per answer, one or two static prototypes, and zero or one dynamic prototype (as noted in the experiments). All experiments use an embedding dimension D=128 and mini-batches of 256 instances. Experiments with VQA v2 use a set of candidate answers restricted to those with at least 16 training occurrences, giving 1,960 possible answers [34]. Past works have shown that small differences in implementation can have a noticeable impact on performance. To ensure fair comparisons, we therefore repeated all evaluations of the baseline [34] with our code and preprocessing. Results are therefore not directly comparable with those reported in [34]. In particular, we do not use the Visual Genome dataset [22] for training.

Table 1. On VQA-Numbers, ablative evaluation, trained and evaluated on all answers. See discussion in Sect. 4.1.

4.1 VQA-Numbers

Ablative Evaluation We first evaluate the components of the proposed model in comparison to the state-of-the-art model of [34], which serves as a baseline: it is equivalent to our model with one static prototype per answer, the dot-product similarity, and no dynamic parameters. We train and evaluate on all 7 answers. To provide the baseline with a fair chance, we train all models with standard supersampling [9, 16], i.e. selecting training examples with equal probability with respect to their correct answer. In these experiments, the support set is equal to the training set.

As reported in Table 1, the proposed dynamic weights improve over the baseline, and the dynamic prototypes bring an additional improvement. We compare different choices for the similarity function. Interestingly, swapping the dot product in the baseline for an L2 distance has a negative impact. When using two static prototypes, however, the L2 distance proves superior to the L1 distance or the dot product. This is consistent with [33], where a prototypical network also performed best with an L2 distance.

Additional Support Set and Novel Answers We now evaluate the ability of the model to exploit support data never seen until test time (see Fig. 3). We train the same models designed for 7 candidate answers, but only provide them with training data for a subset of them. The proposed model is additionally provided with a complete support set, covering all 7 answers. Each reported result is averaged over 10 runs. The set of k answers excluded from training is randomized across runs but identical to all models for a given k.

Fig. 3. On VQA-Numbers, performance of the proposed model and ablations, with training data for subsets of the 7 answers. (Left) Performance on all answers. (Right) Performance on answers not seen in training; only the model with dynamic prototypes makes this setting possible. Remarkably, a model trained on two answers (2/7) maintains a capacity to learn about all others. The chance baseline is shown as horizontal dashes.

The proposed model proves superior to the baseline and all other ablations (Fig. 3, left). The dynamic prototypes are particularly beneficial. With very little training data, the use of dynamic weights is less effective and sometimes even detrimental. We hypothesize that the model may then suffer from overfitting due to the additional learned parameters. When evaluated on novel answers (not seen during training and only present in the test-time support set), the dynamic prototypes provide a remarkable ability to learn those answers from the support set alone (Fig. 3, right). Their efficacy is particularly strong when only a single novel answer has to be learned. Remarkably, a model trained on only two answers maintains some capacity to learn about all others (average recall of \(17.05\%\), versus the chance baseline of \(14.28\%\)). Note that we cannot claim that the model is able to count with those novel numbers, but at the very least it is able to associate those answers with particular images/questions (possibly exploiting question-conditioned biases).

4.2 VQA v2

We performed experiments on the complete VQA v2 dataset. We report results of different ablations, trained with 50% or 100% of the official training set and evaluated on the validation set as in [34]. The proposed model uses the remainder of the official training set as additional support data at test time. The complexity and varying quality of this dataset do not lead to clear-cut conclusions from the standard accuracy metric (see Table 2). The answer recall leads to more consistent observations that align with those made on VQA-Numbers. Both the dynamic weights and the dynamic prototypes provide a consistent advantage (Fig. 4). Each technique is beneficial in isolation, but their combination generally performs best. Individually, the dynamic prototypes appear more impactful than the dynamic weights. Note that our experiments on VQA v2 aim at quantifying the effect of the contributions in the meta learning setting; we did not seek to maximize absolute performance in the traditional benchmark setting.

Fig. 4. On VQA v2, performance using varying amounts of training data. See Sect. 4.2.

Fig. 5. On VQA v2, difference in answer recall between the proposed model (Table 2, last row, last column) and the baseline (Table 2, first row, last column). Each blue bar corresponds to one of the candidate answers, sorted by decreasing number of occurrences in the training set (gray background, units not displayed). The two models show qualitatively different behaviour: the baseline is effective with frequent answers, but the proposed model fares better (mostly positive values) in the long tail of rare answers.

Table 2. On VQA v2, evaluation of the proposed model and ablations (question accuracy/answer recall). The full proposed model exhibits qualitatively different strengths than the classical approach [34], producing a generally higher recall (averaged over possible answers) and lower accuracy (averaged over questions). In these experiments, the objective for a “perfect” meta learning model would be to match the performance of the baseline trained with 100% of the data (row 1, right column), while using less training data and the remainder as support (last row, left column).

To obtain a better insight into the predictions of the model, we examine the individual recall of possible answers. We compare the values with those obtained by the baseline. The difference (Fig. 5) indicates which of the two models provides the best predictions for every answer. We observe a qualitatively different behaviour between the models. While the baseline is most effective with frequent answers, the proposed model fares better (mostly positive values) in the long tail of rare answers. This corroborates previous discussions on dataset biases [15, 18, 43] which classical models are prone to overfit to. The proposed model is inherently more robust to such behaviour.

5 Conclusions and Future Work

We have devised a new approach to VQA by framing it as a meta learning task. This approach enables us to provide the model with supervised data at test time, thereby allowing the model to adapt or improve as more data is made available. We believe this view could lead to the development of scalable VQA systems better suited to practical applications. We proposed a deep learning model that takes advantage of the meta learning scenario, and demonstrated a range of benefits: improved recall of rare answers, better sample efficiency, and a unique capability to learn to produce novel answers, i.e. answers never seen during training, learned only from support instances.

The learning-to-learn approach we propose enables a far greater separation of the question answering method from the information used in the process than has previously been possible. Our contention is that this separation is essential if vision-and-language methods are to move beyond benchmarks to tackle real problems, because embedding all of the information a method needs to answer real questions in the model weights is impractical.

Even though the proposed model is able to use novel support data, the experiments showed room for improvement, since a model initially trained on the same amount of data still shows superior performance. Practical considerations should also be addressed to apply this model at a larger scale, in particular the handling of the memory of dynamic weights, which currently grows linearly with the support set. Clustering schemes could be envisioned to reduce its size [33], and hashing methods [4, 19] could improve the efficiency of the content-based retrieval.

Generally, the handling of additional data at test time opens the door to VQA systems that interact with other sources of information. While the proposed model was demonstrated with a support set of questions/answers, the principles extend to any type of data obtained at test time e.g. from knowledge bases or web searches. This would drastically enhance the scalability of VQA systems.