Learning Hierarchical Reasoning for Text-Based Visual Question Answering

Li, Caiyuan; Du, Qinyi; Wang, Qingqing; Jin, Yaohui

doi:10.1007/978-3-030-86365-4_25

Caiyuan Li¹²,
Qinyi Du¹²,
Qingqing Wang¹² &
…
Yaohui Jin¹²

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 12893))

Included in the following conference series:

International Conference on Artificial Neural Networks

3002 Accesses
2 Citations
3 Altmetric

Abstract

Text-based visual question answering (TextVQA) task needs to answer questions based on the objects and text information in image, which involves the joint reasoning over three modalities - question, visual objects, and text in image. Recent approaches on textVQA regard three modalities as joint input of transformers. However, these implicit reasoning methods do not make full use of multi-modal information, especially visual modality. To this end, we propose a novel model for textVQA based on reasoning explicitly in human-like mode. Firstly, the relevance between different objects and question is obtained. Then, the object modality is fused into the text modality weighted by obtained relevance. Finally, the amended text modality is used to predict the answer. In contrast to previous multi-modal free fusion strategy, our method can make the reasoning process more explicit and robust. Moreover, a prior-based loss is proposed to constrain object-question relevance. Extensive experimental results on several benchmark datasets well demonstrate the superior performance of our hierarchical reasoning framework over current state-of-the-art methods.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+

from $39.99 /Month

Starting from 10 chapters or articles per month
Access and download chapters and articles from more than 300k books and 2,500 journals
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Two-Stage Multimodality Fusion for High-Performance Text-Based Visual Question Answering

Two-Stage Reasoning Network with Modality Decomposition for Text VQA

Multiple Interaction Learning with Question-Type Prior Knowledge for Constraining Answer Search Space in Visual Question Answering

References

Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18–22, 2018, pp. 6077–6086. IEEE Computer Society (2018)
Google Scholar
Antol, S., et al.: VQA: visual question answering. In: ICCV (2015)
Google Scholar
Ben-younes, H., Cadène, R., Cord, M., Thome, N.: MUTAN: multimodal tucker fusion for visual question answering. In: ICCV (2017)
Google Scholar
Biten, A.F., et al.: Scene text visual question answering. In: ICCV (2019)
Google Scholar
Biten, A.F., et al.: ICDAR 2019 competition on scene text visual question answering. In: ICDAR, pp. 1563–1570
Google Scholar
Chen, Y.-C., et al.: UNITER: UNiversal image-TExt representation learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020, Part XXX. LNCS, vol. 12375, pp. 104–120. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_7
Chapter Google Scholar
Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL (2019)
Google Scholar
Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M.: Multimodal compact bilinear pooling for visual question answering and visual grounding. In: Su, J., Carreras, X., Duh, K. (eds.) EMNLP (2016)
Google Scholar
Gao, C., et al.: Structured multimodal attentions for TextVQA. CoRR abs/2006.00753 (2020)
Google Scholar
Gao, D., Li, K., Wang, R., Shan, S., Chen, X.: Multi-modal graph neural network for joint reasoning on vision and scene text. In: CVPR (2020)
Google Scholar
Han, W., Huang, H., Han, T.: Finding the evidence: localization-aware answer prediction for text visual question answering. CoRR abs/2010.02582 (2020)
Google Scholar
Hu, R., Singh, A., Darrell, T., Rohrbach, M.: Iterative answer prediction with pointer-augmented multimodal transformers for TextVQA. In: CVPR (2020)
Google Scholar
Li, L.H., Yatskar, M., Yin, D., Hsieh, C., Chang, K.: VisualBERT: a simple and performant baseline for vision and language. CoRR abs/1908.03557 (2019)
Google Scholar
Li, X., et al.: Oscar: object-semantics aligned pre-training for vision-language tasks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020, Part XXX. LNCS, vol. 12375, pp. 121–137. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_8
Chapter Google Scholar
Lu, J., Yang, J., Batra, D., Parikh, D.: Hierarchical question-image co-attention for visual question answering. In: NIPS (2016)
Google Scholar
Mishra, A., Shekhar, S., Singh, A.K., Chakraborty, A.: OCR-VQA: visual question answering by reading text in images. In: ICDAR (2019)
Google Scholar
Peters, M.E., et al.: Deep contextualized word representations. In: Walker, M.A., Ji, H., Stent, A. (eds.) Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1–6, 2018, vol. 1 (Long Papers), pp. 2227–2237. Association for Computational Linguistics (2018)
Google Scholar
Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS (2015)
Google Scholar
Singh, A., et al.: Pythia-a platform for vision & language research. In: SysML Workshop, NeurIPS, vol. 2018 (2018)
Google Scholar
Singh, A., et al.: Towards VQA models that can read. In: CVPR (2019)
Google Scholar
Su, W., et al.: VL-BERT: pre-training of generic visual-linguistic representations. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net (2020)
Google Scholar
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: CVPR (2016)
Google Scholar
Xu, H., Saenko, K.: Ask, attend and answer: exploring question-guided spatial attention for visual question answering. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016, Part VII. LNCS, vol. 9911, pp. 451–466. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46478-7_28
Chapter Google Scholar
Yao, T., Pan, Y., Li, Y., Qiu, Z., Mei, T.: Boosting image captioning with attributes. In: IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22–29, 2017, pp. 4904–4912. IEEE Computer Society (2017)
Google Scholar

Download references

Acknowledgements

This work was supported by the National Key R&D Program of China (Grant No. 2020YFC2008700).

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
Caiyuan Li, Qinyi Du, Qingqing Wang & Yaohui Jin

Authors

Caiyuan Li
View author publications
Search author on:PubMed Google Scholar
Qinyi Du
View author publications
Search author on:PubMed Google Scholar
Qingqing Wang
View author publications
Search author on:PubMed Google Scholar
Yaohui Jin
View author publications
Search author on:PubMed Google Scholar

Corresponding author

Correspondence to Yaohui Jin.

Editor information

Editors and Affiliations

Comenius University in Bratislava, Bratislava, Slovakia
Igor Farkaš
iMotions A/S, Copenhagen, Denmark
Paolo Masulli
University of Tübingen, Tübingen, Baden-Württemberg, Germany
Sebastian Otte
Universität Hamburg, Hamburg, Germany
Stefan Wermter

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Li, C., Du, Q., Wang, Q., Jin, Y. (2021). Learning Hierarchical Reasoning for Text-Based Visual Question Answering. In: Farkaš, I., Masulli, P., Otte, S., Wermter, S. (eds) Artificial Neural Networks and Machine Learning – ICANN 2021. ICANN 2021. Lecture Notes in Computer Science(), vol 12893. Springer, Cham. https://doi.org/10.1007/978-3-030-86365-4_25

Download citation

DOI: https://doi.org/10.1007/978-3-030-86365-4_25
Published: 07 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86364-7
Online ISBN: 978-3-030-86365-4
eBook Packages: Computer ScienceComputer Science (R0)Springer Nature Proceedings Computer Science

Learning Hierarchical Reasoning for Text-Based Visual Question Answering

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

Two-Stage Multimodality Fusion for High-Performance Text-Based Visual Question Answering

Two-Stage Reasoning Network with Modality Decomposition for Text VQA

Multiple Interaction Learning with Question-Type Prior Knowledge for Constraining Answer Search Space in Visual Question Answering

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Keywords

Publish with us

Profiles

Subscribe and save

Buy Now

Learning Hierarchical Reasoning for Text-Based Visual Question Answering

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

Two-Stage Multimodality Fusion for High-Performance Text-Based Visual Question Answering

Two-Stage Reasoning Network with Modality Decomposition for Text VQA

Multiple Interaction Learning with Question-Type Prior Knowledge for Constraining Answer Search Space in Visual Question Answering

Explore related subjects

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Keywords

Publish with us

Profiles