[Research Seminar] Foundation of Mixture of Experts in Complex and Massive AI Models
About the Talk:
Since the release of the original Transformer model, extensive efforts have been devoted to scaling up model complexity to take advantage of massive datasets and advanced computing resources. To go beyond simply increasing network depth and width, Sparse Mixture-of-Experts (SMoE) has emerged as an appealing solution for scaling Large Language Models. By modularizing the network and activating only a subset of experts per input, SMoE keeps computational costs constant while scaling up model complexity, which often results in improved performance. Despite this initial success, effective SMoE training is notoriously challenging because of the representation collapse issue, where all experts converge to similar representations or all tokens are routed to only a few experts. As a result, SMoE often suffers from limited representation capacity and wasteful parameter usage. In this talk, to address the core challenge of representation collapse, we first propose a novel competition mechanism for training SMoE, which enjoys the same convergence rate as the optimal estimator in hindsight. Second, we develop CompeteSMoE, a scalable and effective strategy for training SMoE via competition. CompeteSMoE employs a router trained to predict the competition outcome in a scheduled manner, so the router can learn a high-quality routing policy that is relevant to the current task. Lastly, we conduct extensive experiments to demonstrate the strong learning capabilities of CompeteSMoE and show its promising scalability to large-scale architectures.
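To make the routing ideas concrete, the sketch below shows a minimal sparse MoE layer with top-k gating, plus a toy "competition" target in which every expert processes a token and the router is trained to predict the winner. The class and parameter names (SparseMoE, num_experts, top_k) and the norm-based competition score are illustrative assumptions for exposition only; they are not the CompeteSMoE implementation described in the talk.

```python
# Minimal sketch of a sparse MoE layer with top-k routing and a toy
# competition target for the router (assumptions, not CompeteSMoE itself).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=64, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                           nn.Linear(4 * d_model, d_model))
             for _ in range(num_experts)]
        )
        self.router = nn.Linear(d_model, num_experts)  # learned routing policy
        self.top_k = top_k

    def forward(self, x):                      # x: (tokens, d_model)
        logits = self.router(x)                # (tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):            # only top-k experts are activated
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * self.experts[e](x[mask])
        return out, logits

    def competition_targets(self, x):
        # Toy competition: run all experts and score each by output norm;
        # the router is then trained to predict the winner (an assumption
        # standing in for the talk's actual competition criterion).
        scores = torch.stack(
            [expert(x).norm(dim=-1) for expert in self.experts], dim=-1)
        return scores.argmax(dim=-1)           # (tokens,)

# Usage sketch: a distillation-style router loss against the competition outcome.
x = torch.randn(16, 64)
moe = SparseMoE()
y, logits = moe(x)
router_loss = F.cross_entropy(logits, moe.competition_targets(x))
```

In this toy setup the dense competition pass is expensive, which is why a scheduled approach, where the router only occasionally learns from full competition outcomes, is the natural design choice the talk alludes to.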
In the second part of the talk, we introduce a novel mixture-of-experts (MoE) framework, which we call FuseMoE, for handling a variable number of input modalities, an open challenge in multimodal fusion due to scalability issues and the lack of a unified approach for addressing missing modalities. FuseMoE incorporates sparsely gated MoE layers in its fusion component, which are adept at managing distinct tasks and learning an optimal modality partitioning. In addition, FuseMoE surpasses previous Transformer-based methods in scalability, accommodating an arbitrary number of input modalities. Furthermore, FuseMoE routes each modality to designated experts that specialize in that specific data type. This allows FuseMoE to handle scenarios with missing modalities by dynamically adjusting the influence of the experts primarily responsible for the absent data, while still utilizing the available modalities. Lastly, another key innovation in FuseMoE is the integration of a novel Laplace gating function, which not only theoretically ensures better convergence rates than the Softmax gating function, but also delivers better predictive performance. We demonstrate that, compared to existing methods, our approach is better able to integrate diverse input modalities with varying missingness and irregular sampling on three challenging ICU prediction tasks.
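As a deliberately simplified illustration of the gating comparison, the snippet below contrasts a standard Softmax gate, which scores experts by inner products, with a Laplace-style gate that scores them by distance to learned expert embeddings. The exact functional form of FuseMoE's Laplace gate is an assumption here; the sketch only conveys the idea of replacing dot-product scores with a Laplace-kernel-style score.

```python
# Hedged sketch: Softmax gating vs. a Laplace-style gating function.
# The specific form exp(-||x - w_e||) is an illustrative assumption,
# not necessarily the gate defined in the FuseMoE paper.
import torch
import torch.nn.functional as F

def softmax_gate(x, w):
    # x: (tokens, d), w: (num_experts, d); expert scores are inner products.
    return F.softmax(x @ w.t(), dim=-1)

def laplace_gate(x, w):
    # Scores decay with the L2 distance between a token and each expert
    # embedding, so gating depends on proximity rather than dot products.
    dist = torch.cdist(x, w)                  # (tokens, num_experts)
    return F.softmax(-dist, dim=-1)

x = torch.randn(4, 16)
w = torch.randn(8, 16)
print(softmax_gate(x, w).sum(dim=-1))         # each row of gating weights sums to 1
print(laplace_gate(x, w).sum(dim=-1))
```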
About the Speaker:
Nhat Ho is currently an Assistant Professor of Data Science and Statistics at the University of Texas at Austin. He is a core member of the University of Texas at Austin Machine Learning Laboratory and senior personnel of the Institute for Foundations of Machine Learning. He currently serves as an associate editor and area chair for several prestigious journals and conferences in machine learning and statistics. His current research focuses on the interplay of four principles of machine learning and data science: interpretability of models (deep generative models, convolutional neural networks, Transformers, model misspecification); stability and scalability of optimization and sampling algorithms (computational optimal transport, (non-)convex optimization in statistical settings, sampling and variational inference, federated learning); and heterogeneity of data (Bayesian nonparametrics, mixture and hierarchical models).