Explore the Limits of Omni-modal Pretraining at Scale

1Multimedia Lab, The Chinese University of Hong Kong
2School of Artificial Intelligence, University of Chinese Academy of Sciences

3Institute of Automation, Chinese Academy of Sciences
4Shanghai AI Laboratory

Omni-modal Pretraining. We propose collecting large-scale omni-modal paired data, including text, image, video, depth, and normal maps, to learn universal representations.

How does the human brain perform coherent multimodal cognition?


As outlined in Richard Mayer's Cognitive Theory of Multimedia Learning, our brain processes multimedia signals through two distinct channels, auditory and visual, in sensory memory, as depicted in Figure (a). The sensory memory integrates these signals with prior knowledge through words, transforming new multimedia information into long-term memory. Notably, 1) multimedia signals in the brain share processing channels, and 2) words function as the reasoning interface in our brain.
Inspired by these insights, we categorize diverse modalities into two types: knowledge modality and interface modality. Knowledge modalities, primarily derived from raw sensors, contribute knowledge in diverse formats. For example, images and depth maps offer visual knowledge, while audio and video provide auditory and spatiotemporal knowledge. The language modality, developed by humans, is inherently more abstract and naturally functions as the interface modality, facilitating learning, reasoning, and the coordination of knowledge. To this end, we design an omni-modal learning architecture, illustrated in Figure (b), with two distinct branches: one for knowledge modalities and one for the interface modality, i.e. natural language. The knowledge and interface modalities are aligned through a novel generative reasoning method.
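
To make the two-branch design concrete, here is a minimal PyTorch sketch; the module choices, depths, and dimensions are our own illustrative assumptions, not the released implementation:

import torch
import torch.nn as nn

class OmniModalModel(nn.Module):
    """Toy two-branch model: a shared encoder for knowledge modalities
    (image, video, audio, depth, normal) and a text encoder for the
    interface modality. All hyperparameters here are illustrative."""

    def __init__(self, dim=768):
        super().__init__()
        # Shared Transformer encoder for all knowledge modalities.
        k_layer = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        self.knowledge_encoder = nn.TransformerEncoder(k_layer, num_layers=12)
        # Separate encoder for the interface modality (natural language).
        t_layer = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(t_layer, num_layers=6)

    def forward(self, knowledge_tokens, text_tokens):
        # knowledge_tokens: (B, L_k, dim) tokenized image/video/audio/depth/normal patches
        # text_tokens:      (B, L_t, dim) embedded caption tokens
        z_knowledge = self.knowledge_encoder(knowledge_tokens)
        z_text = self.text_encoder(text_tokens)
        return z_knowledge, z_text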

Our Contributions

1. We collect an omni-modal dataset that includes comprehensively paired video, audio, visual captions, audio captions, and estimated depth and normal maps. We are also working on preparing more paired modalities.
2. We propose a scalable pretraining paradigm, Multimodal Context (MiCo), which empowers models to learn universal representations across any modality.
3. We conduct extensive experiments on 10 single-modality tasks, 25 cross-modality understanding tasks covering retrieval, question answering, and captioning, and 18 MLLM benchmarks. Our models achieve 37 new state-of-the-art results.
In brief, this is a first step toward exploring multimodal scaling laws: pairing the MiCo-pretrained ViT with LLMs outperforms LLaVA-1.5 on vision benchmarks and nearly all MLLMs on zero-shot video, audio, and 3D QA tasks.


Evolution of Pretraining Paradigms



Masked modeling has shown great success in single-modality, general-purpose understanding, while contrastive learning distinguishes transferable features using modality tuples. We aim to achieve general-purpose omni-modal understanding and learn transferable, universal representations.


Method: Multimodal Context


Overview of the Multimodal Context pretraining paradigm. We use a shared ViT for multimodal feature extraction, while a separate branch employs a text encoder. We concatenate the resulting multimodal sequences into multimodal contexts and perform contrastive learning and masked modeling on them.
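
As a rough sketch of how the two objectives could be combined on top of the extracted features (the pooling, loss forms, and equal weighting below are assumptions for illustration, not the exact recipe from the paper):

import torch
import torch.nn.functional as F

def contrastive_loss(z_knowledge, z_text, temperature=0.07):
    # Symmetric InfoNCE between mean-pooled knowledge and text features.
    k = F.normalize(z_knowledge.mean(dim=1), dim=-1)   # (B, D)
    t = F.normalize(z_text.mean(dim=1), dim=-1)        # (B, D)
    logits = k @ t.t() / temperature                    # (B, B) similarity matrix
    labels = torch.arange(k.size(0), device=k.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def masked_modeling_loss(pred_tokens, target_tokens, mask):
    # Regress the features of masked positions; `mask` is a boolean (B, L) tensor.
    return F.mse_loss(pred_tokens[mask], target_tokens[mask])

# Hypothetical joint objective with equal weights (the weighting is an assumption):
# loss = contrastive_loss(z_knowledge, z_text) + masked_modeling_loss(pred, target, mask)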

Multimodal Context Construction (See paper for exact details)

As mentioned in § 3.1, we build a dataset with multimodal paired data $\{(x_I, x_D, x_N, x_{T_I}), (x_A, x_{T_A}), (x_V, x_{T_V})\}$. We employ the omni-modal encoder $f(\cdot\,;\theta)$ to extract features $z_I$, $z_A$, $z_V$, $z_D$, and $z_N$, and the text encoder to extract text features $z_T$. We then construct the context with a top-down design:

  1. All multimodal embeddings share the same position embeddings $E_{Pos}$ to build a modality-fused context relationship across diverse modalities.
  2. Each modality-specific context is then labeled with modality embeddings, including $E_{M_I}$, $E_{M_A}$, $E_{M_V}$, $E_{M_D}$, $E_{M_N}$, etc., to indicate the modality type.
  3. Within the same modality context, we employ context embeddings $E_{C_I}$ to construct uni-modal context relationships.

Thus, the construction of the multimodal context can be formulated as:

$$z_I = [z_{I_1}, z_{I_2}, \ldots, z_{I_{L_I}}] + E_{C_I}, \quad \text{and likewise for each modality},$$

$$z = [\,z_I + E_{M_I},\; z_A + E_{M_A},\; z_V + E_{M_V},\; z_D + E_{M_D},\; z_N + E_{M_N}\,] + E_{Pos},$$

where the length of $E_{C_I}$ matches the token-sequence length of the corresponding modality. Meanwhile, the text features of the modality-specific captions can simply be concatenated, with their position embeddings $E_{Pos}$ also shared:

$$z_T = [z_{T_I}, z_{T_A}, z_{T_V}] + E_{Pos}.$$
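
The embedding bookkeeping above can be sketched as follows; this is a hypothetical PyTorch illustration, where the token lengths, dimensions, and set of modalities are placeholders:

import torch
import torch.nn as nn

dim, max_len = 768, 2048                       # placeholder sizes
modalities = ["image", "audio", "video", "depth", "normal"]

# Shared position embeddings E_Pos, per-modality embeddings E_M,
# and context embeddings E_C for uni-modal context relationships.
pos_embed = nn.Parameter(torch.zeros(1, max_len, dim))
modality_embed = nn.ParameterDict({m: nn.Parameter(torch.zeros(1, 1, dim)) for m in modalities})
context_embed = nn.Parameter(torch.zeros(1, max_len, dim))

def build_multimodal_context(features):
    # features: dict mapping modality name -> tensor of shape (B, L_m, dim)
    chunks = []
    for m, z in features.items():
        z = z + context_embed[:, : z.size(1)]      # uni-modal context embedding E_C
        z = z + modality_embed[m]                  # modality-type embedding E_M
        chunks.append(z)
    z_all = torch.cat(chunks, dim=1)               # concatenate into one multimodal context
    return z_all + pos_embed[:, : z_all.size(1)]   # shared position embeddings E_Pos

The key point is that context embeddings act within each modality, modality embeddings tag each chunk, and a single position embedding table is shared across the concatenated context.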


Experiment

We evaluate our model on three kinds of benchmarks: 1) single-modality understanding (following previous practice for fine-tuning and zero-shot settings on classification and forecasting tasks), 2) cross-modality understanding (following BEiT-3 and VAST for fine-tuning and dataset splits on captioning, QA, and retrieval tasks), and 3) multimodal understanding with large language models (following LLaVA, VideoChat, and OneLLM on multimodal zero-shot QA).

Extensive experiments on 10 single-modality tasks, 25 cross-modality understanding tasks covering retrieval, question answering, and captioning, and 18 MLLM benchmarks demonstrate the superiority of MiCo. Our models achieve 37 new state-of-the-art results.

BibTeX

If you find our work useful, please cite our paper. The BibTeX entry is provided below:
@article{zhang2024explore,
        title={Explore the Limits of Omni-modal Pretraining at Scale},
        author={Zhang, Yiyuan and Li, Handong and Liu, Jing and Yue, Xiangyu},
        journal={arXiv preprint arXiv:2406.09412},
        year={2024}
}