Sound is a vital component of multimodal perception. For any system, whether a voice assistant, a next-generation security monitor, or an autonomous agent, to operate effectively, it must demonstrate a full range of auditory capabilities. These include transcribing, classifying, retrieving, reasoning over, segmenting, clustering, reranking, and reconstructing audio.
These capabilities all depend on converting raw audio into an intermediate representation: an embedding. Yet research on improving the auditory capabilities of multimodal perception models has been fragmented, and important questions remain unanswered: How can we compare performance across domains such as human speech and bioacoustics? What performance potential are we leaving on the table? Could a single, general-purpose embedding serve as the foundation for all of these capabilities?
To investigate these questions and accelerate progress toward capable auditory machine intelligence, we created the Massive Sound Embedding Benchmark (MSEB), introduced at NeurIPS 2025.
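To make the notion of an embedding concrete, the following minimal sketch mean-pools log-magnitude spectrogram frames into a single fixed-size vector. It is a toy stand-in for the learned encoders MSEB is designed to evaluate; the function name, parameters, and dimensions are illustrative and not part of MSEB.

```python
import numpy as np

def toy_audio_embedding(waveform: np.ndarray, frame_len: int = 400,
                        hop: int = 160, embedding_dim: int = 64) -> np.ndarray:
    """Map a raw mono waveform to one fixed-size vector (toy encoder).

    Assumes the clip is at least `frame_len` samples long. A real system
    would use a learned model here instead of hand-crafted spectra.
    """
    # Split the waveform into overlapping frames.
    n_frames = max(1, 1 + (len(waveform) - frame_len) // hop)
    frames = np.stack([waveform[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    # Windowed FFT magnitude per frame.
    spectra = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=-1))
    # Log-compress and keep a fixed number of frequency bins.
    features = np.log1p(spectra[:, :embedding_dim])
    # Mean-pool over time to get one vector per clip, then L2-normalize.
    embedding = features.mean(axis=0)
    return embedding / (np.linalg.norm(embedding) + 1e-9)

# Example: one second of 16 kHz "audio" -> a 64-dimensional embedding.
clip = np.random.randn(16000).astype(np.float32)
print(toy_audio_embedding(clip).shape)  # (64,)
```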
MSEB provides the structure to answer these questions by:
- Standardizing assessment of a comprehensive set of eight real-world capabilities that we believe every human-like intelligent system should possess.
- Providing an open and extensible framework that lets researchers seamlessly integrate and evaluate any type of model, from traditional unimodal models to cascaded systems to holistic multimodal embedding models (see the sketch after this list).
- Establishing clear performance targets that objectively highlight research opportunities beyond current state-of-the-art methods.
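To illustrate what "integrate and evaluate any type of model" could look like in practice, here is a hypothetical sketch. The names below (`MyAudioEncoder`, `evaluate_retrieval`) are illustrative and not MSEB's actual API; the example reuses the toy encoder from the earlier sketch and scores a retrieval-style task with recall@1.

```python
import numpy as np

class MyAudioEncoder:
    """Hypothetical wrapper exposing the kind of interface an embedding
    benchmark typically expects: a batch of waveforms in, a matrix of
    fixed-size, L2-normalized embedding vectors out."""

    def __init__(self, embedding_dim: int = 64):
        self.embedding_dim = embedding_dim

    def embed(self, waveforms: list) -> np.ndarray:
        # Replace this with calls into your own model (unimodal encoder,
        # cascaded pipeline, or multimodal embedding model).
        return np.stack([toy_audio_embedding(w, embedding_dim=self.embedding_dim)
                         for w in waveforms])

def evaluate_retrieval(encoder, queries, documents, relevant_idx):
    """Hypothetical recall@1 for a retrieval-style task: a query counts as
    correct if its nearest document embedding is the labeled relevant one."""
    q = encoder.embed(queries)          # (num_queries, dim)
    d = encoder.embed(documents)        # (num_documents, dim)
    nearest = (q @ d.T).argmax(axis=1)  # cosine similarity via dot product
    return float((nearest == np.asarray(relevant_idx)).mean())
```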
Our preliminary experiments confirm that current sound representations are far from universal and reveal substantial headroom (i.e., the gap to the maximum achievable performance) across all eight tasks.