As outlined in Richard Mayer's Cognitive Theory of Multimedia Learning, our brain processes multimedia signals through two distinct channels, auditory and visual, in sensory memory, as depicted in Figure (a). Sensory memory integrates these signals with prior knowledge through words, transforming new multimedia information into long-term memory. Notably, 1) multimedia signals share channels in the brain, and 2) words function as the reasoning interface.
Inspired by these insights, we categorize diverse modalities into two types: knowledge modalities and the interface modality. Knowledge modalities, primarily derived from raw sensors, contribute knowledge in diverse formats: for example, images and depth maps offer visual knowledge, while audio and video provide auditory and spatiotemporal knowledge. The language modality, developed by humans, is inherently more abstract and naturally serves as the interface modality, facilitating learning, reasoning, and the coordination of knowledge. To this end, we design an omni-modal learning architecture, illustrated in Figure (b), with two distinct branches: one for knowledge modalities and one for the interface modality, i.e., natural language. The two branches are aligned through a novel generative reasoning method.
Masked modeling has shown great success in single-modality, general-purpose understanding, while contrastive learning learns transferable features from paired modality tuples. We combine the strengths of both to achieve general-purpose omni-modal understanding and learn transferable, universal representations.
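To make this concrete, the sketch below combines an InfoNCE-style contrastive loss with a masked feature-regression loss. Both loss forms, the function names (contrastive_loss, masked_modeling_loss), and the temperature value are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(z_multi, z_text, temperature=0.07):
    """Symmetric InfoNCE between pooled omni-modal and text features of shape (B, D)."""
    z_multi = F.normalize(z_multi, dim=-1)
    z_text = F.normalize(z_text, dim=-1)
    logits = z_multi @ z_text.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(z_multi.size(0), device=z_multi.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


def masked_modeling_loss(pred, target, mask):
    """Regress masked token features; `mask` is a float tensor, 1 where a token was masked."""
    per_token = (pred - target).pow(2).mean(dim=-1)      # (B, L) per-token MSE
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```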
Overview of the Multimodal Context Pretraining Paradigm. We use a shared ViT to extract features from all knowledge modalities, while a separate branch employs a text encoder for captions. We concatenate these multimodal sequences into a multimodal context and perform contrastive learning and masked modeling on it.
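A minimal sketch of this two-branch extraction, assuming modality-specific linear tokenizers feeding a shared transformer trunk; OmniModalEncoder, the modality names, and all hyperparameters are placeholders, not the released implementation.

```python
import torch
import torch.nn as nn


class OmniModalEncoder(nn.Module):
    """Shared ViT-style trunk applied to tokenized knowledge modalities."""

    def __init__(self, dim=768, depth=12, heads=12):
        super().__init__()
        # One lightweight tokenizer per knowledge modality, projecting raw
        # patch features to the shared embedding width.
        self.tokenizers = nn.ModuleDict({
            name: nn.LazyLinear(dim)
            for name in ("image", "depth", "normal", "audio", "video")
        })
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)  # shared across modalities

    def forward(self, inputs: dict) -> dict:
        # inputs[name]: (B, L_name, raw_dim) pre-patchified tensors per modality
        return {name: self.backbone(self.tokenizers[name](x))
                for name, x in inputs.items()}


# The second branch is a standard text encoder (e.g., a BERT-style model)
# producing caption features z_T; any off-the-shelf encoder fits here.
```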
As mentioned in § 3.1, we build a dataset of multimodal paired data $\{(x_I, x_D, x_N, x_{T_I}),\, (x_A, x_{T_A}),\, (x_V, x_{T_V})\}$. We then employ the omni-modal encoder $f(\cdot;\theta)$ to extract features $z_I, z_A, z_V, z_D, z_N$, and the text encoder to extract text features $z_T$. The multimodal context is then constructed through a top-down design, formulated as:
$$z_I = [z_I^1, z_I^2, \ldots, z_I^{L_I}] + E_{C_I}, \quad \text{and analogously for each knowledge modality},$$
$$z = [\,z_I + E_{M_I},\; z_A + E_{M_A},\; z_V + E_{M_V},\; z_D + E_{M_D},\; z_N + E_{M_N}\,] + E_{\mathrm{Pos}},$$
where $E_{C_I}$ is sliced to the token sequence length of the corresponding modality. Meanwhile, the text features of the corresponding captions can be directly concatenated, sharing their position embeddings $E_{\mathrm{Pos}'}$:
$$z_T = [z_{T_I}, z_{T_A}, z_{T_V}] + E_{\mathrm{Pos}'}.$$
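The context construction above might be sketched as follows. MultimodalContext, the embedding-table sizes, and the exact way $E_{C}$ is sliced to each modality's length are assumptions for illustration; only the ordering of operations mirrors the equations.

```python
import torch
import torch.nn as nn

MODALITIES = ["image", "audio", "video", "depth", "normal"]


class MultimodalContext(nn.Module):
    """Builds the multimodal context z and concatenated text features z_T."""

    def __init__(self, dim=768, max_len=4096, max_text_len=512):
        super().__init__()
        self.context_emb = nn.ParameterDict(   # E_C, one table per modality, sliced to L_M
            {m: nn.Parameter(torch.zeros(max_len, dim)) for m in MODALITIES})
        self.modality_emb = nn.ParameterDict(  # E_M, one vector per modality
            {m: nn.Parameter(torch.zeros(dim)) for m in MODALITIES})
        self.pos_emb = nn.Parameter(torch.zeros(max_len, dim))            # E_Pos
        self.text_pos_emb = nn.Parameter(torch.zeros(max_text_len, dim))  # E_Pos'

    def forward(self, feats: dict, text_feats: list):
        # feats[m]: (B, L_m, D) features from the omni-modal encoder
        parts = []
        for m in MODALITIES:
            z_m = feats[m] + self.context_emb[m][: feats[m].size(1)]  # z_M + E_C_M
            parts.append(z_m + self.modality_emb[m])                  # + E_M_M
        z = torch.cat(parts, dim=1)
        z = z + self.pos_emb[: z.size(1)]                             # + E_Pos
        z_t = torch.cat(text_feats, dim=1)                            # [z_T_I, z_T_A, z_T_V]
        z_t = z_t + self.text_pos_emb[: z_t.size(1)]                  # + E_Pos'
        return z, z_t
```

In practice, attention masks for variable-length modalities would also be needed; they are omitted here for brevity.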
We evaluate our model on three kinds of benchmarks: 1) Single-modality Understanding (following previous practice in fine-tuning and zero-shot settings for classification and forecasting tasks), 2) Cross-modality Understanding (following BEiT-3 and VAST in fine-tuning and dataset splits for captioning, QA, and retrieval tasks), and 3) Multimodal Understanding with Large Language Models (following LLaVA, VideoChat, and OneLLM in multimodal zero-shot QA).
@article{zhang2024explore,
title={Explore the Limits of Omni-modal Pretraining at Scale},
author={Zhang, Yiyuan and Li, Handong and Liu, Jing and Yue, Xiangyu},
journal={arXiv preprint arXiv:2406.09412},
year={2024}
}