Abstract
The text-based visual question answering (TextVQA) task requires answering questions based on the objects and text in an image, which involves joint reasoning over three modalities: the question, the visual objects, and the text in the image. Recent approaches to TextVQA feed the three modalities jointly into a transformer. However, these implicit reasoning methods do not make full use of the multi-modal information, especially the visual modality. To this end, we propose a novel TextVQA model that reasons explicitly, in a human-like manner. First, the relevance between each object and the question is computed. Then, the object modality is fused into the text modality, weighted by the obtained relevance. Finally, the amended text modality is used to predict the answer. In contrast to previous multi-modal free-fusion strategies, our method makes the reasoning process more explicit and robust. Moreover, a prior-based loss is proposed to constrain the object-question relevance. Extensive experiments on several benchmark datasets demonstrate the superior performance of our hierarchical reasoning framework over current state-of-the-art methods.
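The three reasoning steps named in the abstract can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the dot-product relevance with a softmax, the additive fusion, the argmax answer selection, the function names, and the two-dimensional feature vectors. The prior-based loss is not sketched because the abstract gives no details about it.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def object_question_relevance(question_vec, object_vecs):
    """Step 1 (assumed form): one relevance weight per object."""
    return softmax([dot(question_vec, o) for o in object_vecs])


def fuse_objects_into_text(text_vecs, object_vecs, relevance):
    """Step 2 (assumed form): add the relevance-weighted object summary
    to every text-token feature."""
    dim = len(object_vecs[0])
    summary = [
        sum(w * o[d] for w, o in zip(relevance, object_vecs))
        for d in range(dim)
    ]
    return [[t[d] + summary[d] for d in range(dim)] for t in text_vecs]


def predict_answer(question_vec, amended_text_vecs):
    """Step 3 (assumed form): index of the text token whose amended
    feature scores highest against the question."""
    scores = [dot(question_vec, t) for t in amended_text_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical toy features: one question, two objects, two text tokens.
question = [1.0, 0.0]
objects = [[2.0, 0.0], [0.0, 2.0]]
text_tokens = [[0.0, 1.0], [1.0, 0.0]]

relevance = object_question_relevance(question, objects)
amended = fuse_objects_into_text(text_tokens, objects, relevance)
answer_index = predict_answer(question, amended)
```

In this toy setting the first object aligns with the question and therefore receives the larger relevance weight, so it dominates the summary that is fused into the text features before the answer is chosen.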
Acknowledgements
This work was supported by the National Key R&D Program of China (Grant No. 2020YFC2008700).
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Li, C., Du, Q., Wang, Q., Jin, Y. (2021). Learning Hierarchical Reasoning for Text-Based Visual Question Answering. In: Farkaš, I., Masulli, P., Otte, S., Wermter, S. (eds) Artificial Neural Networks and Machine Learning – ICANN 2021. ICANN 2021. Lecture Notes in Computer Science, vol 12893. Springer, Cham. https://doi.org/10.1007/978-3-030-86365-4_25
DOI: https://doi.org/10.1007/978-3-030-86365-4_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86364-7
Online ISBN: 978-3-030-86365-4
eBook Packages: Computer Science, Computer Science (R0), Springer Nature Proceedings Computer Science