
Learning Hierarchical Reasoning for Text-Based Visual Question Answering

  • Conference paper
Artificial Neural Networks and Machine Learning – ICANN 2021 (ICANN 2021)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12893)

Abstract

The text-based visual question answering (TextVQA) task requires answering questions based on the objects and text in an image, which involves joint reasoning over three modalities: the question, the visual objects, and the text in the image. Recent TextVQA approaches treat the three modalities as the joint input of a transformer. However, these implicit reasoning methods do not make full use of the multi-modal information, especially the visual modality. To this end, we propose a novel TextVQA model that reasons explicitly, in a human-like manner. First, the relevance between each object and the question is obtained. Then, the object modality is fused into the text modality, weighted by the obtained relevance. Finally, the amended text modality is used to predict the answer. In contrast to previous free multi-modal fusion strategies, our method makes the reasoning process more explicit and robust. Moreover, a prior-based loss is proposed to constrain the object-question relevance. Extensive experiments on several benchmark datasets demonstrate the superior performance of our hierarchical reasoning framework over current state-of-the-art methods.
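The three reasoning steps named in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not specify the scoring function, the fusion operator, or the answer decoder, so the scaled dot-product relevance, the additive fusion, and the pointer-style answer selection below are all assumptions, as are the function names. The prior-based loss is left out because the abstract gives no detail about it.

```python
import math


def dot(a, b):
    """Inner product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def object_question_relevance(question, objects):
    """Step 1 (assumed form): scaled dot-product relevance of each object to the question."""
    d = len(question)
    return softmax([dot(question, o) / math.sqrt(d) for o in objects])


def fuse_objects_into_text(text_tokens, objects, relevance):
    """Step 2 (assumed form): add a relevance-weighted object summary to every text token."""
    d = len(objects[0])
    summary = [sum(r * o[k] for r, o in zip(relevance, objects)) for k in range(d)]
    return [[t[k] + summary[k] for k in range(d)] for t in text_tokens]


def predict_answer_index(question, amended_text):
    """Step 3 (assumed form): pick the amended text token that best matches the question."""
    scores = [dot(question, t) for t in amended_text]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy features: two detected objects and two OCR tokens in a 2-d feature space.
question = [1.0, 0.0]
objects = [[2.0, 0.0], [0.0, 2.0]]
text_tokens = [[0.0, 1.0], [1.0, 0.0]]

relevance = object_question_relevance(question, objects)
amended = fuse_objects_into_text(text_tokens, objects, relevance)
answer = predict_answer_index(question, amended)
```

In this toy setup the first object aligns with the question and therefore receives the larger relevance weight, so it dominates the summary that is added to the text tokens before the answer is selected.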

Acknowledgements

This work was supported by the National Key R&D Program of China (Grant No. 2020YFC2008700).

Author information

Corresponding author

Correspondence to Yaohui Jin.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Li, C., Du, Q., Wang, Q., Jin, Y. (2021). Learning Hierarchical Reasoning for Text-Based Visual Question Answering. In: Farkaš, I., Masulli, P., Otte, S., Wermter, S. (eds) Artificial Neural Networks and Machine Learning – ICANN 2021. ICANN 2021. Lecture Notes in Computer Science, vol 12893. Springer, Cham. https://doi.org/10.1007/978-3-030-86365-4_25
