{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,10]],"date-time":"2025-11-10T21:18:31Z","timestamp":1762809511145,"version":"3.41.0"},"reference-count":74,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2023,12,11]],"date-time":"2023-12-11T00:00:00Z","timestamp":1702252800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62106247"],"award-info":[{"award-number":["62106247"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,4,30]]},"abstract":"<jats:p>Video question answering (VideoQA) is challenging as it requires reasoning about natural language and multimodal interactive relations. Most existing methods apply attention mechanisms to extract interactions between the question and the video or to extract effective spatio-temporal relational representations. However, these methods neglect the implication of relations between intra- and inter-modal interactions for multimodal learning, and they fail to fully exploit the synergistic effect of multiscale semantics in answer reasoning. In this article, we propose a novel hierarchical synergy-enhanced multimodal relational network (HMRNet) to address these issues. 
Specifically, we devise (i) a compact and unified relation-oriented interaction module that explores the relation between intra- and inter-modal interactions to enable effective multimodal learning; and (ii) a hierarchical synergistic memory unit that leverages a memory-based interaction scheme to complement and fuse multimodal semantics at multiple scales to achieve synergistic enhancement of answer reasoning. With careful design of each component, our HMRNet has fewer parameters and is computationally efficient. Extensive experiments and qualitative analyses demonstrate that the HMRNet is superior to previous state-of-the-art methods on eight benchmark datasets. We also demonstrate the effectiveness of the different components of our method.<\/jats:p>","DOI":"10.1145\/3630101","type":"journal-article","created":{"date-parts":[[2023,10,25]],"date-time":"2023-10-25T21:38:30Z","timestamp":1698269910000},"page":"1-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Hierarchical Synergy-Enhanced Multimodal Relational Network for Video Question Answering"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7445-5567","authenticated-orcid":false,"given":"Min","family":"Peng","sequence":"first","affiliation":[{"name":"Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, China and Chongqing School, University of Chinese Academy of Sciences, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1141-6020","authenticated-orcid":false,"given":"Xiaohu","family":"Shao","sequence":"additional","affiliation":[{"name":"Beijing IDRIVERPLUS Technology Co., Ltd, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9117-8282","authenticated-orcid":false,"given":"Yu","family":"Shi","sequence":"additional","affiliation":[{"name":"Chongqing Institute of 
Green and Intelligent Technology, Chinese Academy of Sciences, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4451-5327","authenticated-orcid":false,"given":"Xiangdong","family":"Zhou","sequence":"additional","affiliation":[{"name":"Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,12,11]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i8.16822"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.279"},{"key":"e_1_3_1_4_2","first-page":"103","volume-title":"Proceedings of SSST-8, 8th Workshop on Syntax, Semantics, and Structure in Statistical Translation","author":"Cho Kyunghyun","year":"2014","unstructured":"Kyunghyun Cho, Bart van Merri\u00ebnboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder\u2013decoder approaches. In Proceedings of SSST-8, 8th Workshop on Syntax, Semantics, and Structure in Statistical Translation. ACL, Doha, Qatar, 103\u2013111."},{"key":"e_1_3_1_5_2","first-page":"636","volume-title":"Proceedings of the 30th International Joint Conference on Artificial Intelligence, IJCAI-21","author":"Dang Long Hoang","year":"2021","unstructured":"Long Hoang Dang, Thao Minh Le, Vuong Le, and Truyen Tran. 2021. Hierarchical object-oriented spatio-temporal reasoning for video question answering. In Proceedings of the 30th International Joint Conference on Artificial Intelligence, IJCAI-21. 
IJCAI, Montreal, Canada, 636\u2013642."},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00210"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00688"},{"key":"e_1_3_1_8_2","doi-asserted-by":"crossref","unstructured":"Lianli Gao Tangming Chen Xiangpeng Li Pengpeng Zeng Lei Zhao and Yuan-Fang Li. 2021. Generalized pyramid co-attention with learnable aggregation net for video question answering. Pattern Recognition 120 C (2021) 108145.","DOI":"10.1016\/j.patcog.2021.108145"},{"key":"e_1_3_1_9_2","doi-asserted-by":"crossref","unstructured":"Lianli Gao Yu Lei Pengpeng Zeng Jingkuan Song Meng Wang and Heng Tao Shen. 2021. Hierarchical representation network with auxiliary tasks for video captioning and video question answering. IEEE Transactions on Image Processing 31 (2021) 202\u2013215.","DOI":"10.1109\/TIP.2021.3120867"},{"key":"e_1_3_1_10_2","doi-asserted-by":"crossref","unstructured":"Lianli Gao Xuanhan Wang Jingkuan Song and Yang Liu. 2020. Fused GRU with semantic-temporal attention for video captioning. Neurocomputing 395 (2020) 222\u2013228.","DOI":"10.1016\/j.neucom.2018.06.096"},{"key":"e_1_3_1_11_2","first-page":"6391","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Gao Lianli","year":"2019","unstructured":"Lianli Gao, Pengpeng Zeng, Jingkuan Song, Yuan-Fang Li, Wu Liu, Tao Mei, and Heng Tao Shen. 2019. Structured two-stream attention network for video question answering. In Proceedings of the AAAI Conference on Artificial Intelligence. AAAI, 6391\u20136398."},{"key":"e_1_3_1_12_2","doi-asserted-by":"crossref","unstructured":"Mao Gu Zhou Zhao Weike Jin Richang Hong and Fei Wu. 2021. Graph-based multi-interaction network for video question answering. 
IEEE Transactions on Image Processing 30 (2021) 2758\u20132770.","DOI":"10.1109\/TIP.2021.3051756"},{"key":"e_1_3_1_13_2","first-page":"973","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing","author":"Guo Zhicheng","year":"2021","unstructured":"Zhicheng Guo, Jiaxuan Zhao, Licheng Jiao, Xu Liu, and Lingling Li. 2021. Multi-scale progressive attention network for video question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. ACL, Bangkok, Thailand, 973\u2013978."},{"key":"e_1_3_1_14_2","doi-asserted-by":"crossref","unstructured":"Zhaoyu Guo Zhou Zhao Weike Jin Zhicheng Wei Min Yang Nannan Wang and Nicholas Jing Yuan. 2021. Multi-turn video question generation via reinforced multi-choice attention network. IEEE Transactions on Circuits and Systems for Video Technology 31 5 (2021) 1697\u20131710.","DOI":"10.1109\/TCSVT.2020.3014775"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00685"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6737"},{"key":"e_1_3_1_18_2","doi-asserted-by":"crossref","unstructured":"Yunseok Jang Yale Song Chris Dongjoo Kim Youngjae Yu Youngjin Kim and Gunhee Kim. 2019. Video question answering with spatio-temporal reasoning. 
International Journal of Computer Vision 127 10 (2019) 1385\u20131412.","DOI":"10.1007\/s11263-019-01189-x"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.149"},{"key":"e_1_3_1_20_2","first-page":"11101","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Jiang Jianwen","year":"2020","unstructured":"Jianwen Jiang, Ziqiang Chen, Haojie Lin, Xibin Zhao, and Yue Gao. 2020. Divide and conquer: Question-guided spatio-temporal contextual attention for video question answering. In Proceedings of the AAAI Conference on Artificial Intelligence. AAAI, 11101\u201311108."},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6767"},{"key":"e_1_3_1_22_2","doi-asserted-by":"crossref","unstructured":"Weike Jin Zhou Zhao Xiaochun Cao Jieming Zhu Xiuqiang He and Yueting Zhuang. 2021. Adaptive spatio-temporal graph enhanced vision-language representation for video QA. IEEE Transactions on Image Processing 30 (2021) 5477\u20135489.","DOI":"10.1109\/TIP.2021.3076556"},{"key":"e_1_3_1_23_2","doi-asserted-by":"crossref","unstructured":"Weike Jin Zhou Zhao Yimeng Li Jie Li Jun Xiao and Yueting Zhuang. 2019. Video question answering via knowledge-based progressive spatial-temporal attention network. ACM Transactions on Multimedia Computing Communications and Applications 15 2s Article 52(2019) 22 pages. 52","DOI":"10.1145\/3321505"},{"key":"e_1_3_1_24_2","first-page":"4171","volume-title":"Proceedings of the NAACL-HLT","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL-HLT. 
ACL, 4171\u20134186."},{"key":"e_1_3_1_25_2","first-page":"673","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Kim Kyung-Min","year":"2018","unstructured":"Kyung-Min Kim, Seong-Ho Choi, Jin-Hwa Kim, and Byoung-Tak Zhang. 2018. Multimodal dual attention memory for video story question answering. In Proceedings of the European Conference on Computer Vision. ECCV, Munich, Germany, 673\u2013688."},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00999"},{"key":"e_1_3_1_27_2","doi-asserted-by":"crossref","unstructured":"Thao Minh Le Vuong Le Svetha Venkatesh and Truyen Tran. 2021. Hierarchical conditional relation networks for multimodal video question answering. International Journal of Computer Vision 129 11 (2021) 3027\u20133050.","DOI":"10.1007\/s11263-021-01514-3"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00725"},{"key":"e_1_3_1_29_2","first-page":"1369","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing","author":"Lei Jie","year":"2018","unstructured":"Jie Lei, Licheng Yu, Mohit Bansal, and Tamara Berg. 2018. TVQA: Localized, compositional video question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. ACL, Brussels, Belgium, 1369\u20131379."},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3350971"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33018658"},{"key":"e_1_3_1_32_2","doi-asserted-by":"crossref","unstructured":"Xinrui Li Aming Wu and Yahong Han. 2022. Complementary spatiotemporal network for video question answering. 
Multimedia Systems 28 1 (2022) 161\u2013169.","DOI":"10.1007\/s00530-021-00805-6"},{"key":"e_1_3_1_33_2","first-page":"6135","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Liang Junwei","year":"2018","unstructured":"Junwei Liang, Lu Jiang, Liangliang Cao, Li-Jia Li, and Alexander G. Hauptmann. 2018. Focal visual-text attention for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. CVPR, 6135\u20136143."},{"key":"e_1_3_1_34_2","first-page":"3","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV \u201918)","author":"Lin Tianwei","year":"2018","unstructured":"Tianwei Lin, Xu Zhao, Haisheng Su, Chongjing Wang, and Ming Yang. 2018. Bsn: Boundary sensitive network for temporal action proposal generation. In Proceedings of the European Conference on Computer Vision (ECCV \u201918). ECCV, Munich, Germany, 3\u201319."},{"key":"e_1_3_1_35_2","first-page":"1698","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Liu Fei","year":"2021","unstructured":"Fei Liu, Jing Liu, Weining Wang, and Hanqing Lu. 2021. HAIR: Hierarchical visual-semantic relational reasoning for video question answering. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. ICCV, Online, 1698\u20131707."},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413649"},{"key":"e_1_3_1_37_2","doi-asserted-by":"crossref","unstructured":"Yun Liu Xiaoming Zhang Feiran Huang Shixun Shen Peng Tian Lang Li and Zhoujun Li. 2022. Dynamic self-attention with vision synchronization networks for video question answering. Pattern Recognition 132 C (2022) 108959.","DOI":"10.1016\/j.patcog.2022.108959"},{"key":"e_1_3_1_38_2","unstructured":"Xiang Long Gerard de Melo Dongliang He Fu Li Zhizhen Chi Shilei Wen and Chuang Gan. 2022. 
Purely attention based local feature integration for video classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 44 4 (2022) 2140\u20132154."},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/P14-5010"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01527"},{"key":"e_1_3_1_41_2","first-page":"1276","volume-title":"Proceedings of the 31st International Joint Conference on Artificial Intelligence (IJCAI \u201922)","author":"Peng Min","year":"2022","unstructured":"Min Peng, Chongyang Wang, Yuan Gao, Yu Shi, and Xiang-Dong Zhou. 2022. Multilevel hierarchical network with multiscale sampling for video question answering. In Proceedings of the 31st International Joint Conference on Artificial Intelligence (IJCAI \u201922). IJCAI, Messe Wien, Vienna, Austria, 1276\u20131282."},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1162"},{"key":"e_1_3_1_43_2","unstructured":"Mengye Ren Ryan Kiros and Richard Zemel. 2015. Exploring models and data for image question answering. in Neural Information Processing Systems 2 (2015) 2953\u20132961."},{"key":"e_1_3_1_44_2","doi-asserted-by":"crossref","unstructured":"Shaoqing Ren Kaiming He Ross Girshick and Jian Sun. 2016. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 6 (2016) 1137\u20131149.","DOI":"10.1109\/TPAMI.2016.2577031"},{"key":"e_1_3_1_45_2","first-page":"6167","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing","author":"Seo Ahjeong","year":"2021","unstructured":"Ahjeong Seo, Gi-Cheon Kang, Joonhan Park, and Byoung-Tak Zhang. 2021. Attend what you need: Motion-appearance synergistic networks for video question answering. 
In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. ACL, Online, 6167\u20136177."},{"key":"e_1_3_1_46_2","first-page":"16877","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Seo Paul Hongsuck","year":"2021","unstructured":"Paul Hongsuck Seo, Arsha Nagrani, and Cordelia Schmid. 2021. Look before you speak: Visually contextualized utterances. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. CVPR, Online, 16877\u201316887."},{"key":"e_1_3_1_47_2","first-page":"5998","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems. Curran Associates, Inc., 5998\u20136008."},{"key":"e_1_3_1_48_2","first-page":"7380","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Wang Bo","year":"2018","unstructured":"Bo Wang, Youjiang Xu, Yahong Han, and Richang Hong. 2018. Movie question answering: Remembering the textual cues for layered visual contents. In Proceedings of the AAAI Conference on Artificial Intelligence. AAAI, 7380\u20137387."},{"key":"e_1_3_1_49_2","doi-asserted-by":"crossref","unstructured":"Jianyu Wang Bing-Kun Bao and Changsheng Xu. 2022. DualVGR: A dual-visual graph reasoning unit for video question answering. 
IEEE Transactions on Multimedia 24 (2022) 3369\u20133380.","DOI":"10.1109\/TMM.2021.3097171"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00965"},{"key":"e_1_3_1_51_2","first-page":"2804","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Xiao Junbin","year":"2022","unstructured":"Junbin Xiao, Angela Yao, Zhiyuan Liu, Yicong Li, Wei Ji, and Tat-Seng Chua. 2022. Video as conditional graph hierarchy for multi-granular question answering. In Proceedings of the AAAI Conference on Artificial Intelligence. AAAI, Vancouver, Canada, 2804\u20132812."},{"key":"e_1_3_1_52_2","first-page":"39","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Xiao Junbin","year":"2022","unstructured":"Junbin Xiao, Pan Zhou, Tat-Seng Chua, and Shuicheng Yan. 2022. Video graph transformer for video question answering. In Proceedings of the European Conference on Computer Vision. Springer, Cham, 39\u201358."},{"key":"e_1_3_1_53_2","doi-asserted-by":"crossref","first-page":"8188","DOI":"10.18653\/v1\/2022.emnlp-main.561","volume-title":"Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing","author":"Xiao Shaoning","year":"2022","unstructured":"Shaoning Xiao, Long Chen, Kaifeng Gao, Zhao Wang, Yi Yang, Zhimeng Zhang, and Jun Xiao. 2022. Rethinking multi-modal alignment in multi-choice VideoQA from feature and sample perspectives. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. ACL, Abu Dhabi, United Arab Emirates, 8188\u20138198."},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i4.16406"},{"key":"e_1_3_1_55_2","doi-asserted-by":"crossref","unstructured":"Shaoning Xiao Yimeng Li Yunan Ye Long Chen Shiliang Pu Zhou Zhao Jian Shao and Jun Xiao. 2020. Hierarchical temporal fusion of multi-grained attention features for video question answering. 
Neural Processing Letters 52 2 (2020) 993\u20131003.","DOI":"10.1007\/s11063-019-10003-1"},{"key":"e_1_3_1_56_2","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123427"},{"key":"e_1_3_1_57_2","first-page":"9878","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Xu Li","year":"2021","unstructured":"Li Xu, He Huang, and Jun Liu. 2021. Sutd-trafficqa: A question answering benchmark and an efficient network for video reasoning over traffic events. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. CVPR, Online, 9878\u20139888."},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00498"},{"key":"e_1_3_1_59_2","doi-asserted-by":"crossref","unstructured":"Hongyang Xue Zhou Zhao and Deng Cai. 2017. Unifying the video and question attentions for open-ended video question answering. IEEE Transactions on Image Processing 26 12 (2017) 5656\u20135666.","DOI":"10.1109\/TIP.2017.2746267"},{"key":"e_1_3_1_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00171"},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV45572.2020.9093596"},{"key":"e_1_3_1_62_2","doi-asserted-by":"publisher","DOI":"10.1145\/3077136.3080655"},{"key":"e_1_3_1_63_2","doi-asserted-by":"crossref","unstructured":"Ting Yu Jun Yu Zhou Yu Qingming Huang and Qi Tian. 2021. Long-term video question answering via multimodal hierarchical memory attentive networks. IEEE Transactions on Circuits and Systems for Video Technology 31 3 (2021) 931\u2013944.","DOI":"10.1109\/TCSVT.2020.2995959"},{"key":"e_1_3_1_64_2","doi-asserted-by":"crossref","unstructured":"Ting Yu Jun Yu Zhou Yu and Dacheng Tao. 2019. Compositional attention networks with two-stream fusion for video question answering. 
IEEE Transactions on Image Processing 29 (2019) 1204\u20131218.","DOI":"10.1109\/TIP.2019.2940677"},{"key":"e_1_3_1_65_2","first-page":"26462","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Yu Weijiang","year":"2021","unstructured":"Weijiang Yu, Haoteng Zheng, Mengfei Li, Lei Ji, Lijun Wu, Nong Xiao, and Nan Duan. 2021. Learning from inside: Self-driven siamese sampling and reasoning for video question answering. In Proceedings of the Advances in Neural Information Processing Systems. Curran Associates, Inc., Vancouver, Canada, 26462\u201326474."},{"key":"e_1_3_1_66_2","first-page":"9127","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Yu Zhou","year":"2019","unstructured":"Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. 2019. Activitynet-qa: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence. AAAI, 9127\u20139134."},{"key":"e_1_3_1_67_2","first-page":"8807","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zadeh Amir","year":"2019","unstructured":"Amir Zadeh, Michael Chan, Paul Pu Liang, Edmund Tong, and Louis-Philippe Morency. 2019. Social-iq: A question answering benchmark for artificial social intelligence. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. CVPR, 8807\u20138817."},{"key":"e_1_3_1_68_2","first-page":"23634","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Zellers Rowan","year":"2021","unstructured":"Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, and Yejin Choi. 2021. MERLOT: Multimodal neural script knowledge models. In Proceedings of the Advances in Neural Information Processing Systems. 
Curran Associates, Inc., Vancouver, Canada, 23634\u201323651."},{"key":"e_1_3_1_69_2","doi-asserted-by":"crossref","unstructured":"Pengpeng Zeng Haonan Zhang Lianli Gao Jingkuan Song and Heng Tao Shen. 2022. Video question answering with prior knowledge and object-sensitive learning. IEEE Transactions on Image Processing 31 (2022) 5936\u20135948.","DOI":"10.1109\/TIP.2022.3205212"},{"key":"e_1_3_1_70_2","doi-asserted-by":"crossref","unstructured":"Zheng-Jun Zha Jiawei Liu Tianhao Yang and Yongdong Zhang. 2019. Spatiotemporal-textual co-attention network for video question answering. ACM Transactions on Multimedia Computing Communications and Applications 15 2s Article 53(2019) 18 pages. 53","DOI":"10.1145\/3320061"},{"key":"e_1_3_1_71_2","unstructured":"Hao Zhang Aixin Sun Wei Jing Liangli Zhen Joey Tianyi Zhou and Rick Siow Mong Goh. 2021. Natural language video localization: A revisit in span-based question answering framework. IEEE Transactions on Pattern Analysis and Machine Intelligence 44 8 (2021) 4252\u20134266."},{"key":"e_1_3_1_72_2","doi-asserted-by":"crossref","unstructured":"Haonan Zhang Pengpeng Zeng Yuxuan Hu Jin Qian Jingkuan Song and Lianli Gao. 2023. Learning visual question answering on controlled semantic noisy labels. Pattern Recognition 138 C (2023) 109339.","DOI":"10.1016\/j.patcog.2023.109339"},{"key":"e_1_3_1_73_2","doi-asserted-by":"crossref","unstructured":"Jipeng Zhang Jie Shao Rui Cao Lianli Gao Xing Xu and Heng Tao Shen. 2022. Action-centric relation transformer network for video question answering. IEEE Transactions on Circuits and Systems for Video Technology 32 1 (2022) 63\u201374.","DOI":"10.1109\/TCSVT.2020.3048440"},{"key":"e_1_3_1_74_2","doi-asserted-by":"crossref","unstructured":"Zhou Zhao Zhu Zhang Shuwen Xiao Zhenxin Xiao Xiaohui Yan Jun Yu Deng Cai and Fei Wu. 2019. Long-form video question answering via dynamic hierarchical reinforced networks. 
IEEE Transactions on Image Processing 28 12 (2019) 5939\u20135952.","DOI":"10.1109\/TIP.2019.2922062"},{"key":"e_1_3_1_75_2","doi-asserted-by":"crossref","unstructured":"Yueting Zhuang Dejing Xu Xin Yan Wenzhuo Cheng Zhou Zhao Shiliang Pu and Jun Xiao. 2020. Multichannel attention refinement for video question answering. ACM Transactions on Multimedia Computing Communications and Applications 16 1s Article 24(2020) 23 pages. 24","DOI":"10.1145\/3366710"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3630101","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3630101","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T22:50:56Z","timestamp":1750287056000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3630101"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,12,11]]},"references-count":74,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,4,30]]}},"alternative-id":["10.1145\/3630101"],"URL":"https:\/\/doi.org\/10.1145\/3630101","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2023,12,11]]},"assertion":[{"value":"2023-05-25","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-10-21","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-12-11","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}