Abstract
To answer a natural language question, a computational system needs three main capabilities. First, it must be able to analyze the question into a structured query, revealing its component parts and how they are combined. Second, it must have access to relevant knowledge sources, such as databases, texts or images. Third, it must be able to execute the query on these knowledge sources. This paper focuses on the first capability and presents a novel approach to semantically parsing questions expressed in natural language. The method uses a computational construction grammar model to map questions onto their executable semantic representations. We demonstrate and evaluate the methodology on the CLEVR visual question answering benchmark task. Our system achieves 100% accuracy, effectively solving the language understanding part of the benchmark. Additionally, we show how this solution can be embedded in a full visual question answering system, in which a question is answered by executing its semantic representation on an image. The main advantages of the approach are (i) its transparency and interpretability, (ii) its extensibility, and (iii) the fact that it does not rely on any annotated training data.
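The pipeline the abstract describes, mapping a question onto an executable semantic representation and then executing it on a scene, can be illustrated with a minimal sketch. This is not the paper's Fluid Construction Grammar system: the scene encoding, the primitive operations, and the list-of-steps program format below are simplifying assumptions, loosely modeled on CLEVR-style functional programs.

```python
# Hypothetical symbolic scene: each object is a feature dictionary.
scene = [
    {"shape": "cube", "color": "red", "size": "large", "material": "metal"},
    {"shape": "sphere", "color": "blue", "size": "small", "material": "rubber"},
    {"shape": "cube", "color": "blue", "size": "small", "material": "rubber"},
]

# Assumed primitive operations of the procedural semantics.
def filter_attr(objs, attr, value):
    """Keep only the objects whose attribute matches the given value."""
    return [o for o in objs if o[attr] == value]

def count(objs):
    """Return the number of objects in the current set."""
    return len(objs)

def query_attr(objs, attr):
    """Return an attribute of a uniquely identified object."""
    assert len(objs) == 1, "query expects a unique referent"
    return objs[0][attr]

def execute(program, scene):
    """Execute a functional program, a list of (op, arg...) steps, on a scene."""
    result = scene
    for op, *args in program:
        if op == "filter":
            result = filter_attr(result, *args)
        elif op == "count":
            result = count(result)
        elif op == "query":
            result = query_attr(result, *args)
        else:
            raise ValueError(f"unknown operation: {op}")
    return result

# "How many blue objects are there?" -> semantic representation -> answer
program = [("filter", "color", "blue"), ("count",)]
print(execute(program, scene))  # -> 2

# "What material is the red cube?"
program = [("filter", "color", "red"), ("filter", "shape", "cube"),
           ("query", "material")]
print(execute(program, scene))  # -> metal
```

In the paper's setting, the hard part is the first step, producing the program from the question; the sketch only illustrates the second step, execution on a knowledge source.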
Funding source: FWO
Award Identifier / Grant number: 1SB6219N
Funding statement: We would like to thank Roxana Radulescu, Mathieu Reymond and Kyriakos Efthymiadis for the brainstorming sessions that have led to this publication. We also thank Remi van Trijp for his constructive feedback on earlier versions of this paper. Finally, we are grateful to the two anonymous reviewers for their valuable comments that greatly improved the final version of this paper. This work was supported by the Research Foundation Flanders (FWO), funder id: http://dx.doi.org/10.13039/501100003130, through grant 1SB6219N.
References
Abou-Assaleh, T., N. Cercone & V. Keselj. 2005. Question-answering with relaxed unification. In Proceedings of the Conference of the Pacific Association for Computational Linguistics, volume 5. Tokyo, Japan: Pacific Association for Computational Linguistics.
Agrawal, A., D. Batra & D. Parikh. 2016. Analyzing the behavior of visual question answering models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 1955–1960. Austin, TX, USA: Association for Computational Linguistics. https://doi.org/10.18653/v1/D16-1203.
Andreas, J., M. Rohrbach, T. Darrell & D. Klein. 2016a. Learning to compose neural networks for question answering. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1545–1554. San Diego, CA, USA: Association for Computational Linguistics. https://doi.org/10.18653/v1/N16-1181.
Andreas, J., M. Rohrbach, T. Darrell & D. Klein. 2016b. Neural module networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 39–48. Las Vegas, NV, USA: IEEE. https://doi.org/10.1109/CVPR.2016.12.
Frank, A., H.-U. Krieger, F. Xu, H. Uszkoreit, B. Crysmann, B. Jörg & U. Schäfer. 2007. Question answering from structured knowledge sources. Journal of Applied Logic 5(1). 20–48. https://doi.org/10.1016/j.jal.2005.12.006.
Hoffmann, T. & G. Trousdale. 2013. Construction Grammar: Introduction. In Thomas Hoffmann & Graeme Trousdale (eds.), The Oxford Handbook of Construction Grammar. Oxford: Oxford University Press. https://doi.org/10.1093/oxfordhb/9780195396683.001.0001.
Hu, R., J. Andreas, M. Rohrbach, T. Darrell & K. Saenko. 2017. Learning to reason: End-to-end module networks for visual question answering. In Proceedings of the 2017 IEEE International Conference on Computer Vision, 804–813. Venice, Italy: IEEE. https://doi.org/10.1109/ICCV.2017.93.
Johnson, J., B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-Fei, C. L. Zitnick & R. Girshick. 2017a. Inferring and executing programs for visual reasoning. In Proceedings of the 2017 IEEE International Conference on Computer Vision, 3008–3017. Venice, Italy: IEEE. https://doi.org/10.1109/ICCV.2017.325.
Johnson, J., B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick & R. Girshick. 2017b. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 1988–1997. Honolulu, HI, USA: IEEE. https://doi.org/10.1109/CVPR.2017.215.
Li, P. & L. Liao. 2012. Web question answering based on CCG parsing and DL ontology. In 8th International Conference on Information Science and Digital Content Technology, volume 1, 212–217. Jeju, Korea: IEEE.
Malinowski, M., M. Rohrbach & M. Fritz. 2015. Ask your neurons: A neural-based approach to answering questions about images. In Proceedings of the 2015 IEEE International Conference on Computer Vision, 1–9. Santiago, Chile: IEEE. https://doi.org/10.1109/ICCV.2015.9.
Mao, J., C. Gan, P. Kohli, J. B. Tenenbaum & J. Wu. 2019. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. In International Conference on Learning Representations. New Orleans, LA, USA: OpenReview. https://openreview.net/forum?id=rJgMlhRctm.
Mascharka, D., P. Tran, R. Soklaski & A. Majumdar. 2018. Transparency by design: Closing the gap between performance and interpretability in visual reasoning. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 4942–4950. Salt Lake City, UT, USA: IEEE. https://doi.org/10.1109/CVPR.2018.00519.
McFetridge, P., F. Popowich & D. Fass. 1996. An analysis of compounds in HPSG (head-driven phrase structure grammar) for database queries. Data & Knowledge Engineering 20(2). 195–209. https://doi.org/10.1016/S0169-023X(96)00033-X.
Noh, H., P. Hongsuck Seo & B. Han. 2016. Image question answering using convolutional neural network with dynamic parameter prediction. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 30–38. Las Vegas, NV, USA: IEEE. https://doi.org/10.1109/CVPR.2016.11.
Ren, M., R. Kiros & R. S. Zemel. 2015. Exploring models and data for image question answering. In Proceedings of the 28th International Conference on Neural Information Processing Systems, volume 2, 2953–2961. Montreal, Canada: MIT Press.
Shamsfard, M. & M. A. Yarmohammadi. 2010. A semantic approach to extract the final answer in SBUQA question answering system. International Journal of Digital Content Technology and its Applications 4(7). 165–176. https://doi.org/10.4156/jdcta.vol4.issue7.16.
Spranger, M., S. Pauw, M. Loetzsch & L. Steels. 2012. Open-ended procedural semantics. In L. Steels & M. Hild (eds.), Language Grounding in Robots, 159–178. New York: Springer. https://doi.org/10.1007/978-1-4614-3064-3_8.
Steels, L. 2007. The recruitment theory of language origins. In C. Lyon, C. L. Nehaniv & A. Cangelosi (eds.), Emergence of Language and Communication, 129–151. Berlin: Springer. https://doi.org/10.1007/978-1-84628-779-4_7.
Steels, L. (ed.). 2011. Design Patterns in Fluid Construction Grammar. Amsterdam: John Benjamins. https://doi.org/10.1075/cal.11.
Steels, L. 2017. Basics of Fluid Construction Grammar. Constructions and Frames 9(2). 178–225. https://doi.org/10.1075/bct.106.cf.00002.ste.
Xu, H. & K. Saenko. 2016. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In European Conference on Computer Vision, 451–466. Amsterdam, The Netherlands: Springer. https://doi.org/10.1007/978-3-319-46478-7_28.
Yang, Z., X. He, J. Gao, L. Deng & A. Smola. 2016. Stacked attention networks for image question answering. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 21–29. Las Vegas, NV, USA: IEEE. https://doi.org/10.1109/CVPR.2016.10.
Yarmohammadi, M. A., M. Shamsfard, M. A. Yarmohammadi & M. Rouhizadeh. 2008. SBUQA question answering system. In Computer Society of Iran Computer Conference, 316–323. Kish Island, Iran: Springer. https://doi.org/10.1007/978-3-540-89985-3_39.
Yi, K., J. Wu, C. Gan, A. Torralba, P. Kohli & J. Tenenbaum. 2018. Neural-symbolic VQA: Disentangling reasoning from vision and language understanding. In Advances in Neural Information Processing Systems, 1031–1042. Montreal, Canada: MIT Press.
Zettlemoyer, L. S. & M. Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI'05), 658–666. Edinburgh, Scotland: AUAI Press.
Zhang, P., Y. Goyal, D. Summers-Stay, D. Batra & D. Parikh. 2016. Yin and yang: Balancing and answering binary visual questions. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 5014–5022. Las Vegas, NV, USA: IEEE. https://doi.org/10.1109/CVPR.2016.542.
Zhou, B., Y. Tian, S. Sukhbaatar, A. Szlam & R. Fergus. 2015. Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167.
© 2019 Walter de Gruyter GmbH, Berlin/Boston