A paper on Visual Question Answering by Mingrui Lao has been accepted by one of the top neural network journals in the world.
Multi-Stage Hybrid Embedding Fusion Network for Visual Question Answering
Multimodal fusion is a crucial component of Visual Question Answering
(VQA), which involves joint understanding and semantic integration between
visual and textual information. Existing VQA learning frameworks focus
mainly on the Latent Embedding Fusion (LEF) method, which projects visual
and textual features into a common latent space and fuses them with
element-wise multiplication. In this paper, we aim to achieve multiple
and fine-grained multimodal interactions to enhance fusion performance.
To this end, we propose a Multi-stage Hybrid Embedding Fusion (MHEF)
network that improves fusion in two ways. First, we introduce a Dual
Embedding Fusion (DEF) approach that transforms one modal input into
the reciprocal embedding space before integration, and the DEF is further
combined with the LEF to form a novel Hybrid Embedding Fusion (HEF).
Second, we design a Multi-stage Fusion Structure (MFS) for the HEF to form
the MHEF network, so as to obtain diverse and profound fusion features for
answer prediction. By jointly training the multi-stage framework, we
not only improve the performance of each single stage, but also gain a
further accuracy boost when integrating the prediction results from all
stages. Extensive experiments verify that both the proposed HEF and MFS
are beneficial to multimodal fusion. The full MHEF model outperforms the
baseline LEF model by 2% in accuracy, and achieves promising performance
on the VQA-v1 and VQA-v2 datasets.
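
As a rough illustration of the baseline LEF idea described in the abstract (project each modality into a common latent space, then fuse by element-wise multiplication), here is a minimal sketch in plain Python. It is not the authors' implementation: the names linear, lef_fuse and random_matrix, the toy dimensions, the random weights and the tanh non-linearity are all assumptions made for this example, and the DEF, HEF and MFS components are not shown because the abstract does not specify them in enough detail.

```python
import math
import random


def linear(x, weights):
    """Project vector x with a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def lef_fuse(visual, textual, w_v, w_q):
    """Map both inputs into one latent space, then multiply element-wise.

    The tanh non-linearity is an assumption of this sketch; the abstract
    only states projection followed by element-wise multiplication.
    """
    v_latent = [math.tanh(h) for h in linear(visual, w_v)]
    q_latent = [math.tanh(h) for h in linear(textual, w_q)]
    return [a * b for a, b in zip(v_latent, q_latent)]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for learned projection matrices."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
VISUAL_DIM, TEXT_DIM, LATENT_DIM = 8, 6, 4  # toy sizes, not from the paper

w_v = random_matrix(LATENT_DIM, VISUAL_DIM, rng)  # visual -> latent
w_q = random_matrix(LATENT_DIM, TEXT_DIM, rng)    # question -> latent

visual = [rng.random() for _ in range(VISUAL_DIM)]   # stand-in image feature
textual = [rng.random() for _ in range(TEXT_DIM)]    # stand-in question feature

fused = lef_fuse(visual, textual, w_v, w_q)
print(len(fused))  # one fused value per latent dimension
```

In a real VQA model the fused vector would be passed to an answer classifier and the projection weights would be learned jointly with it; here they are random placeholders so that the example stays self-contained.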