As outlined in Richard Mayer's Cognitive Theory of Multimedia Learning, our brain processes multimedia signals through two distinct channels, auditory and visual, in sensory memory, as depicted in Figure (a). Sensory memory integrates these signals with prior knowledge through words, transforming new multimedia information into long-term memory. Notably, 1) multimedia signals share channels in the brain, and 2) words function as the reasoning interface.
Inspired by these insights, we categorize diverse modalities into two types: knowledge modalities and the interface modality. Knowledge modalities, primarily derived from raw sensors, contribute knowledge in diverse formats: for example, images and depth maps offer visual knowledge, while audio and video provide auditory and spatiotemporal knowledge. The language modality, developed by humans, is inherently more abstract and naturally serves as the interface modality, facilitating learning, reasoning, and the coordination of knowledge. To this end, we design an omni-modal learning architecture, illustrated in Figure (b), with two distinct branches: one for knowledge modalities and one for the interface modality, i.e., natural language. The two branches are aligned through a novel generative reasoning method.
Masked modeling has shown great success in single-modality, general-purpose understanding, while contrastive learning learns transferable features from paired modality tuples. We combine the strengths of both to achieve general-purpose omni-modal understanding and learn transferable, universal representations.
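To make this concrete, the sketch below combines an InfoNCE-style contrastive loss with a masked feature-regression loss. Both loss forms, the function names (contrastive_loss, masked_modeling_loss), and the temperature value are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(z_multi, z_text, temperature=0.07):
    """Symmetric InfoNCE between pooled omni-modal and text features of shape (B, D)."""
    z_multi = F.normalize(z_multi, dim=-1)
    z_text = F.normalize(z_text, dim=-1)
    logits = z_multi @ z_text.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(z_multi.size(0), device=z_multi.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


def masked_modeling_loss(pred, target, mask):
    """Regress masked token features; `mask` is a float tensor, 1 where a token was masked."""
    per_token = (pred - target).pow(2).mean(dim=-1)      # (B, L) per-token MSE
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```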
Overview of the Multimodal Context Pretraining Paradigm. We use a shared ViT to extract features from all knowledge modalities, while a separate branch employs a text encoder for captions. We concatenate these multimodal sequences into a multimodal context and perform contrastive learning and masked modeling on it.
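A minimal sketch of this two-branch extraction, assuming modality-specific linear tokenizers feeding a shared transformer trunk; OmniModalEncoder, the modality names, and all hyperparameters are placeholders, not the released implementation.

```python
import torch
import torch.nn as nn


class OmniModalEncoder(nn.Module):
    """Shared ViT-style trunk applied to tokenized knowledge modalities."""

    def __init__(self, dim=768, depth=12, heads=12):
        super().__init__()
        # One lightweight tokenizer per knowledge modality, projecting raw
        # patch features to the shared embedding width.
        self.tokenizers = nn.ModuleDict({
            name: nn.LazyLinear(dim)
            for name in ("image", "depth", "normal", "audio", "video")
        })
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)  # shared across modalities

    def forward(self, inputs: dict) -> dict:
        # inputs[name]: (B, L_name, raw_dim) pre-patchified tensors per modality
        return {name: self.backbone(self.tokenizers[name](x))
                for name, x in inputs.items()}


# The second branch is a standard text encoder (e.g., a BERT-style model)
# producing caption features z_T; any off-the-shelf encoder fits here.
```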
As mentioned in § 3.1, we build a dataset of multimodal paired data $\{(x_I, x_D, x_N, x_{T_I}),\, (x_A, x_{T_A}),\, (x_V, x_{T_V})\}$. We then employ the omni-modal encoder $f(\cdot;\theta)$ to extract features $z_I, z_A, z_V, z_D, z_N$, and the text encoder to extract text features $z_T$. The multimodal context is then constructed through a top-down design, formulated as:
$$z_I = [z_I^1, z_I^2, \ldots, z_I^{L_I}] + E_{C_I}, \quad \text{and analogously for each knowledge modality},$$
$$z = [\,z_I + E_{M_I},\; z_A + E_{M_A},\; z_V + E_{M_V},\; z_D + E_{M_D},\; z_N + E_{M_N}\,] + E_{\mathrm{Pos}},$$
where $E_{C_I}$ is sliced to the token sequence length of the corresponding modality. Meanwhile, the text features of the corresponding captions can be directly concatenated, sharing their position embeddings $E_{\mathrm{Pos}'}$:
$$z_T = [z_{T_I}, z_{T_A}, z_{T_V}] + E_{\mathrm{Pos}'}.$$
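The context construction above might be sketched as follows. MultimodalContext, the embedding-table sizes, and the exact way $E_{C}$ is sliced to each modality's length are assumptions for illustration; only the ordering of operations mirrors the equations.

```python
import torch
import torch.nn as nn

MODALITIES = ["image", "audio", "video", "depth", "normal"]


class MultimodalContext(nn.Module):
    """Builds the multimodal context z and concatenated text features z_T."""

    def __init__(self, dim=768, max_len=4096, max_text_len=512):
        super().__init__()
        self.context_emb = nn.ParameterDict(   # E_C, one table per modality, sliced to L_M
            {m: nn.Parameter(torch.zeros(max_len, dim)) for m in MODALITIES})
        self.modality_emb = nn.ParameterDict(  # E_M, one vector per modality
            {m: nn.Parameter(torch.zeros(dim)) for m in MODALITIES})
        self.pos_emb = nn.Parameter(torch.zeros(max_len, dim))            # E_Pos
        self.text_pos_emb = nn.Parameter(torch.zeros(max_text_len, dim))  # E_Pos'

    def forward(self, feats: dict, text_feats: list):
        # feats[m]: (B, L_m, D) features from the omni-modal encoder
        parts = []
        for m in MODALITIES:
            z_m = feats[m] + self.context_emb[m][: feats[m].size(1)]  # z_M + E_C_M
            parts.append(z_m + self.modality_emb[m])                  # + E_M_M
        z = torch.cat(parts, dim=1)
        z = z + self.pos_emb[: z.size(1)]                             # + E_Pos
        z_t = torch.cat(text_feats, dim=1)                            # [z_T_I, z_T_A, z_T_V]
        z_t = z_t + self.text_pos_emb[: z_t.size(1)]                  # + E_Pos'
        return z, z_t
```

In practice, attention masks for variable-length modalities would also be needed; they are omitted here for brevity.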
We evaluate our model on three kinds of benchmarks: 1) Single-modality Understanding (following previous practice in fine-tuning and zero-shot settings for classification and forecasting tasks), 2) Cross-modality Understanding (following BEiT-3 and VAST in fine-tuning and dataset splits for captioning, QA, and retrieval tasks), and 3) Multimodal Understanding with Large Language Models (following LLaVA, VideoChat, and OneLLM in multimodal zero-shot QA).
@article{zhang2024explore,
title={Explore the Limits of Omni-modal Pretraining at Scale},
author={Zhang, Yiyuan and Li, Handong and Liu, Jing and Yue, Xiangyu},
journal={arXiv preprint arXiv:2406.09412},
year={2024}
}