
[Research Seminar] Foundation of Mixture of Experts in Complex and Massive AI Models

About the Talk:

Since the release of the original Transformer model, extensive effort has been devoted to scaling up model complexity to take advantage of massive datasets and advanced computing resources. To go beyond simply increasing network depth and width, Sparse Mixture-of-Experts (SMoE) has emerged as an appealing solution for scaling Large Language Models. By modularizing the network and activating only a subset of experts per input, SMoE offers roughly constant computational cost while scaling up model capacity, which often results in improved performance. Despite this initial success, effective SMoE training is notoriously challenging because of representation collapse, where all experts converge to similar representations or tokens are routed to only a few experts. As a result, SMoE often suffers from limited representation capability and wasteful parameter usage. In this talk, to address this core challenge of representation collapse, we first propose a novel competition mechanism for training SMoE, which enjoys the same convergence rate as the optimal estimator in hindsight. Second, we develop CompeteSMoE, a scalable and effective strategy for training SMoE via competition. CompeteSMoE employs a router trained, on a schedule, to predict the competition outcome, so the router learns a high-quality routing policy relevant to the current task. Lastly, we conduct extensive experiments that demonstrate the strong learning capabilities of CompeteSMoE and show its promising scalability to large-scale architectures.
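
For readers unfamiliar with sparse routing, the sketch below shows the basic mechanism the abstract builds on: a top-k gated MoE layer that evaluates only a few experts per token. It is a minimal NumPy illustration with assumed names (`smoe_layer`, `router_w`, toy tanh experts), not the CompeteSMoE implementation; CompeteSMoE additionally trains the router to predict the outcome of a competition among experts rather than relying on learned routing scores alone.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def smoe_layer(tokens, router_w, expert_ws, k=2):
    """Minimal top-k sparse MoE forward pass (illustrative only).

    tokens:    (n_tokens, d_model) input representations
    router_w:  (d_model, n_experts) router projection
    expert_ws: list of (d_model, d_model) weights, one per expert
    Only the k highest-scoring experts are evaluated per token, so
    compute stays roughly constant as the number of experts grows.
    """
    scores = tokens @ router_w                      # (n_tokens, n_experts)
    topk = np.argsort(scores, axis=-1)[:, -k:]      # indices of chosen experts
    out = np.zeros_like(tokens)
    for t, x in enumerate(tokens):
        chosen = topk[t]
        gates = softmax(scores[t, chosen])          # renormalize over the top-k
        for g, e in zip(gates, chosen):
            out[t] += g * np.tanh(x @ expert_ws[e])  # toy expert: one dense layer
    return out

# Toy usage: 4 tokens, 8 experts, 2 active experts per token.
rng = np.random.default_rng(0)
d, n_experts = 16, 8
tokens = rng.normal(size=(4, d))
router_w = rng.normal(size=(d, n_experts)) * 0.1
expert_ws = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]
print(smoe_layer(tokens, router_w, expert_ws).shape)  # (4, 16)
```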

In the second part of the talk, we introduce FuseMoE, a novel mixture-of-experts (MoE) framework for handling a variable number of input modalities, which has remained an open challenge in multimodal fusion due to scalability issues and the lack of a unified approach for addressing missing modalities. FuseMoE incorporates sparsely gated MoE layers in its fusion component, which are adept at managing distinct tasks and learning an optimal modality partitioning. In addition, FuseMoE surpasses previous Transformer-based methods in scalability, accommodating an unlimited number of input modalities. Furthermore, FuseMoE routes each modality to designated experts that specialize in that specific data type. This allows FuseMoE to handle scenarios with missing modalities by dynamically adjusting the influence of the experts responsible for the absent data while still utilizing the available modalities. Another key innovation in FuseMoE is a novel Laplace gating function, which not only theoretically ensures better convergence rates than softmax gating but also delivers better predictive performance. Finally, we demonstrate that, compared to existing methods, our approach is better able to integrate diverse input modalities with varying missingness and irregular sampling on three challenging ICU prediction tasks.
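
As a rough illustration of the gating change mentioned above, the snippet below contrasts a standard softmax gate with a Laplace-style gate that scores experts by negative Euclidean distance to per-expert keys. The function names, shapes, and exact gate form here are illustrative assumptions for this sketch and are not taken from the FuseMoE code.

```python
import numpy as np

def softmax_gate(x, expert_keys):
    """Standard softmax gating: logits are inner products <x, key_e>."""
    logits = expert_keys @ x
    z = np.exp(logits - logits.max())
    return z / z.sum()

def laplace_gate(x, expert_keys):
    """Laplace-style gating: expert weights decay with the Euclidean
    distance between the token and each expert key, exp(-||x - key_e||),
    then normalize. (Illustrative form; the exact FuseMoE
    parameterization may differ.)"""
    dists = np.linalg.norm(expert_keys - x, axis=-1)
    w = np.exp(-(dists - dists.min()))   # shift for numerical stability
    return w / w.sum()

rng = np.random.default_rng(1)
x = rng.normal(size=8)                  # one fused token representation
expert_keys = rng.normal(size=(4, 8))   # one key per expert
print(np.round(softmax_gate(x, expert_keys), 3))
print(np.round(laplace_gate(x, expert_keys), 3))
```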

About the Speaker: 

Nhat Ho is currently an Assistant Professor of Data Science and Statistics at the University of Texas at Austin. He is a core member of the University of Texas at Austin Machine Learning Laboratory and senior personnel of the Institute for Foundations of Machine Learning. He currently serves as an associate editor and area chair for several prestigious journals and conferences in machine learning and statistics. His research focuses on the interplay of four principles of machine learning and data science: interpretability of models (deep generative models, convolutional neural networks, Transformers, model misspecification); stability and scalability of optimization and sampling algorithms (computational optimal transport, (non-)convex optimization in statistical settings, sampling and variational inference, federated learning); and heterogeneity of data (Bayesian nonparametrics, mixture and hierarchical models).

Recent Events

Open Innovation Conference
December 06, 2024

Welcome to the exciting 1st Annual International Open Innovation Conference, hosted by VinUniversity in collaboration with the University of Oxford’s Saïd Business School, Cornell University’s SC Johnson College of Business, and Duke University’s Center for International Development. This event will take place at VinUniversity on December 6-7, 2024. This year’s conference will explore […]

The 16th Asian Conference on Machine Learning
December 04, 2024

ACML 2024 provides a leading international forum for researchers in machine learning and related fields to share their new ideas, progress and achievements. It will take place in Hanoi, Vietnam from 5-8 December 2024. Read more HERE. Important Dates (Conference Track): 03 July 2024 - Submission deadline; 14 August 2024 - Reviews released to […]

[Research Seminar] Scalable Uncertainty: AI Needs to Know What it Doesn’t Know
2:00 - 4:00 PM, July 15, 2024

About the Talk: Large models fuel impressive capabilities.  However, they currently do not know what they do not know, which is critical to the important challenges of exploration and alignment.  Knowing what is not known is essential, for example, to gathering informative data.  In this talk, I will discuss the need for scalable uncertainty and […]