ASLP-lab/OSUM · Apply for community grant: Academic project (gpu)

We have developed OSUM (Open Speech Understanding Model), a novel approach to building Speech Understanding Language Models (SULMs) with limited academic resources. OSUM aims to bridge the gap between industry-scale models and academic research by providing a transparent and efficient framework for training SULMs. It is trained on tens of thousands of hours of multi-task data in a multi-stage process, producing a model capable of multi-level audio understanding. We see this model as a foundation for eventually achieving comprehensive audio understanding.

We would like to host a public demo page where people can try OSUM and see its performance for themselves. Inference requires approximately 18 GB of GPU memory because the base model is a 7B-parameter LLM. For more details, please refer to the paper: https://arxiv.org/abs/2501.13306.
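
For context on the memory figure, here is a rough back-of-the-envelope sketch. It assumes the 7B LLM weights are held in 16-bit precision (2 bytes per parameter); this is an assumption for illustration, not something stated above. The split of the remaining memory is not measured here, and would plausibly cover the speech encoder, activations, and cache.

```python
# Rough estimate only. ASSUMPTION: LLM weights are stored in 16-bit
# precision (2 bytes per parameter). The 7B and 18 GB figures come from
# the post; everything else is illustrative.

def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

LLM_PARAMS = 7e9           # 7B-parameter base LLM (from the post)
REPORTED_TOTAL_GB = 18.0   # approximate inference footprint (from the post)

fp16_weights_gb = weight_memory_gb(LLM_PARAMS, 2)
remaining_gb = REPORTED_TOTAL_GB - fp16_weights_gb

print(f"LLM weights at 16-bit: about {fp16_weights_gb:.0f} GB")
print(f"Left for everything else: about {remaining_gb:.0f} GB")
```

Under that assumption the LLM weights alone account for most of the reported footprint, which is why a GPU with well over 16 GB of memory is requested.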

The project also open-sources its full training and inference code to facilitate further development in this field. The official code repository is at: https://github.com/ASLP-lab/OSUM.