audio-ai-hub

The hub for audio AI research.

Papers, open models, benchmarks and datasets across audio LLMs · speech recognition · speech synthesis · music & audio generation.

121 entries
11 categories
2026-05 latest

All entries

BEA-Dialogue+ · 2026-05

Scaling Conversational Hungarian ASR: The BEA-Dialogue+ Corpus

Dataset Resource · Dataset · Audio In · Hungarian

BEA-Dialogue+ is an expanded conversational Hungarian ASR corpus that relaxes the strictly speaker-disjoint split of BEA-Dialogue while preserving separation of the primary speakers, yielding 200 hours of transcribed natural conversation (up from 85). It enables a controlled study of the trade-off between additional training data and speaker overlap, evaluated with Whisper- and FastConformer-based models.

Chatterbox-Flash · 2026-05

Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS

Speech Synthesis · Model · Audio In · Audio Out

Chatterbox-Flash is a zero-shot TTS model created by fine-tuning a pretrained autoregressive TTS decoder into a block-diffusion decoder, enabling parallel token generation within each block while retaining block-by-block streaming. It introduces two inference-time techniques—prior-calibrated scoring and an early-decoding schedule—to counter the long-tail token bias that otherwise degrades parallel decoding quality.

MindVoice · 2026-05

MindVoice: Reconstructing Intelligible Speech from Non-invasive Neural Signals with Pretrained Priors

Multimodal · Model · Audio Out

MindVoice is a neuro-to-speech reconstruction framework that recovers intelligible speech from noisy, spatially-blurred non-invasive neural recordings by leveraging pretrained models to compensate for incomplete semantic and acoustic information. It targets safe, scalable speech brain-computer interfaces, moving past prior methods that produced spectrally-similar but unintelligible output.

SURE · 2026-05

A Unified and Reproducible Experimentation Framework for Speech Understanding

Benchmark · Audio In

SURE is a unified experimentation framework for speech understanding that standardizes prediction formats, normalization, and scoring to make evaluations comparable across paradigms, from conventional pipelines to Speech LLMs. It adds an agent-assisted training-conversion flow that maps papers and code into versioned, runnable training pipelines on matched open-data subsets.

SwanSphere · 2026-05

Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer

Audio Generation · Model · Audio Out

SwanSphere is a unified streaming framework for high-fidelity spatial audio generation from panoramic videos and text prompts. It uses a causal autoregressive diffusion transformer for low-latency streaming synthesis and a Spatial Video-Audio Contrastive (SVAC) learning strategy to align the video encoder with acoustic spatial cues.

UNISON · 2026-05

UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion

Audio Generation · Model · Audio In · Audio Out

UNISON is a latent diffusion framework that unifies speech generation, sound generation, and audio editing in a single set of weights, covering text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound generation, and scene-level/timed editing. It uses layer-wise deep LLM fusion, injecting hidden states from a frozen MLLM into corresponding MM-DiT blocks for depth-matched semantic conditioning.

UniAudio-Token · 2026-05

UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception

Model and Methods · Speech Tokenizer · Audio In

UniAudio-Token augments single-codebook semantic speech tokenizers with general audio perception without sacrificing speech ability, addressing the 'acoustic blindness' of linguistically-focused tokenizers. It introduces Semantic-Acoustic Primitives (SAP) that decompose audio into linguistic content, vocal attributes, and auditory-scene primitives, plus a content-aware Semantic-Acoustic Equilibrium (SAE) gating mechanism.

MiniCPM-o · 2026-04

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

ModelBest (OpenBMB), Tsinghua University

Multimodal · Omni-Modal LLM · Audio In · Audio Out · Multilingual · ★ 25,457

MiniCPM-o 4.5 is OpenBMB's compact (8B-class) full-duplex omni-modal LLM supporting real-time vision, speech, and text interaction with low-latency streaming TTS, designed for on-device and edge deployment.

Fun-ASR-Nano · 2025-12

Fun-ASR Technical Report

FunAudioLLM Team, Tongyi Lab, Alibaba Group

Model and Methods · Speech Recognition Model · Audio In · Multilingual · ★ 1,197

End-to-end LLM-based ASR (SenseVoice Encoder + Transformer Adaptor + Qwen3-0.6B LLM + CTC Decoder) from the FunAudioLLM team. Trained on tens of millions of hours of real speech, supports 31 languages, 7 Chinese dialects, 26 regional accents, lyrics recognition, hotwords, timestamps, and speaker diarization. Streaming inference accelerated via vLLM (up to 393x realtime).
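
As a rough reading of the throughput figure above, assuming "393x realtime" means audio duration divided by processing time (the entry does not define it), the arithmetic can be sketched as follows. The helper name `processing_seconds` is illustrative and not part of Fun-ASR.

```python
def processing_seconds(audio_seconds: float, speedup: float) -> float:
    """Wall-clock time to transcribe audio at a given realtime speedup.

    Assumption: speedup = audio duration / processing time.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


one_hour = 3600.0
# 3600 / 393 is roughly 9.16 seconds for one hour of audio.
print(f"{processing_seconds(one_hour, 393.0):.2f} s")
```

Because the entry says "up to", this is a best case rather than a typical figure.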

Qwen3-Omni · 2025-09

Qwen3-Omni Technical Report

Qwen Team, Alibaba Group

Model and Methods · Omni-Modal LLM · Audio In · Audio Out · Multilingual · ★ 3,798

Qwen3-Omni is the third-generation omni-modal LLM from Alibaba, scaling up the Thinker-Talker design with stronger multilingual ASR, audio understanding, and real-time speech generation across 100+ input and 30+ output languages.

ACORN · 2025-07

Teaching Physical Awareness to LLMs through Sounds

NIO

Model and Methods · Model · Audio In

ACORN explores and validates the feasibility of teaching LLMs to understand the physical world through sounds.

Audio Flamingo 3 · 2025-07

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

NVIDIA

Model and Methods · Audio LLM · Audio In · Multilingual · ★ 1,137

Audio Flamingo 3 (AF3) is the third generation of NVIDIA's fully-open audio LLM, supporting longer audio context (up to ~10 min), think-then-answer reasoning, and stronger multilingual coverage. Training data, weights, and recipes are all released.

DIFFA · 2025-07

DIFFA: Large Language Diffusion Models Can Listen and Understand

Nankai University (NKU-HLT)

Model and Methods · Model · Audio In · Multilingual · ★ 79

DIFFA explores whether large language diffusion models (rather than autoregressive LLMs) can be adapted to listen to and understand audio, building an audio-conditioned diffusion language model and showing it can match autoregressive counterparts on audio understanding tasks.

OpenS2S · 2025-07

OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model

CASIA

Chatbot · Spoken Dialogue Model · Audio In · Audio Out · Multilingual · ★ 118

OpenS2S is a fully open-source end-to-end empathetic speech-to-speech LLM, releasing training data, training recipe, and model weights, with explicit attention to paralinguistic empathy in spoken dialogue.

Step-Audio 2 · 2025-07

Step-Audio 2 Technical Report

Step-Audio Team, StepFun

Model and Methods · Audio LLM · Audio In · Audio Out · Multilingual · ★ 1,450

Step-Audio 2 is the successor to Step-Audio, scaling the unified speech understanding-and-generation LLM with stronger emotion, paralinguistics, and real-time interaction. Supports both bilingual (Chinese / English) and multilingual end-to-end speech dialogue.

Voxtral · 2025-07

Voxtral

Mistral AI

Speech Recognition · Audio Understanding Model · Audio In · Multilingual

Voxtral is Mistral AI's open audio LLM family (3B and 24B) for speech transcription, multilingual understanding, and Q&A over long-form audio — released with permissive weights and competitive performance against closed-source ASR systems.

CMI-Bench · 2025-06

CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following

Queen Mary University of London

Benchmark · Audio In · ★ 18

This work presents CMI-Bench, a benchmark that evaluates audio-text LLMs on diverse music tasks by reformatting traditional MIR annotations into instruction-following formats. It highlights performance gaps and biases, offering a foundation for improving music-aware LLMs.
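
To illustrate what reformatting a traditional MIR annotation into an instruction-following example might look like, here is a minimal sketch. The field names, the prompt wording, and the `genre_annotation_to_instruction` helper are hypothetical and are not taken from CMI-Bench itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InstructionExample:
    instruction: str  # natural-language task prompt shown to the model
    audio_path: str   # clip the prompt refers to
    answer: str       # reference answer derived from the original label


def genre_annotation_to_instruction(
    audio_path: str, label: str, choices: List[str]
) -> InstructionExample:
    """Wrap a (clip, genre label) pair as an instruction-following example."""
    if label not in choices:
        raise ValueError(f"label {label!r} is not among the choices")
    options = ", ".join(choices)
    prompt = f"Listen to the clip and name its genre. Options: {options}."
    return InstructionExample(instruction=prompt, audio_path=audio_path, answer=label)


# Hypothetical annotation: a clip labelled "jazz" in a three-way genre task.
example = genre_annotation_to_instruction("clip_001.wav", "jazz", ["rock", "jazz", "pop"])
print(example.instruction)
print(example.answer)
```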

PAL · 2025-06

PAL: Probing Audio Encoders via LLMs - A Study of Information Transfer from Audio Encoders to LLMs

CVSSP, PAI@University of Surrey UK, MBZUAI Abu Dhabi

Model and Methods · Model · Audio In · Multilingual · ★ 12

PAL investigates strategies for integrating audio encoders with LLMs, focusing on efficient cross-modal information transfer. The study is guided by hypotheses derived from mechanistic interpretability studies and the operational principles of LLMs.

CosyVoice 3 · 2025-05

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

FunAudioLLM Team, Tongyi Lab, Alibaba Group

Speech Synthesis · TTS Model · Audio In · Audio Out · Multilingual · ★ 21,370

CosyVoice 3 scales the CosyVoice TTS stack with significantly larger pre-training data and a dedicated post-training stage, targeting in-the-wild speech generation across more languages, accents, and acoustic conditions.

LALM-Temporal-Bench · 2025-05

Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning

Indian Institute of Science (IISc), Bangalore

Benchmark · Audio In · English

An INTERSPEECH 2025 benchmark for evaluating Large Audio-Language Models (LALMs) on temporal reasoning over audio, with an additional analysis of model confidence calibration on these tasks.

MMAR · 2025-05

MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

Shanghai Jiao Tong University

Benchmark · Multilingual

MMAR is a challenging benchmark of 1,000 real-world audio QA triplets designed to evaluate deep, multi-layer reasoning in Audio-Language Models across diverse sound, music, and speech tasks, with hierarchical annotations and Chain-of-Thought rationales to drive progress in audio reasoning research.

Kimi-Audio · 2025-04

Kimi-Audio Technical Report

Moonshot AI

Model and Methods · Audio LLM · Audio In · Audio Out · Multilingual · ★ 4,646

Kimi-Audio is Moonshot AI's open-source audio foundation model unifying speech understanding, audio understanding, and speech generation in a single LLM, trained on ~13M hours of audio with strong performance on ASR, audio captioning, audio QA, and speech dialogue.

Audio Flamingo 2 · 2025-03

Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities

NVIDIA, University of Maryland

Model and Methods · Model · Audio In · English · ★ 1,137

Audio Flamingo 2 (AF2) is the successor to Audio Flamingo, designed for long-audio understanding (up to 5 minutes) and expert reasoning over non-speech sounds and music. The authors also introduce AudioSkills, LongAudio, and LongAudioBench to support training and evaluation.

Audio-Reasoner · 2025-03

Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models

Nanyang Technological University, Skywork AI

Model and Methods · Model · Audio In · Multilingual · ★ 297

Audio-Reasoner is a large audio language model designed for deep reasoning over audio. The authors construct CoTA, a 1.2M-sample chain-of-thought dataset for audio tasks, and fine-tune the model to perform structured reasoning on audio understanding benchmarks.

FireRedTTS · 2025-03

FireRedTTS-1S: An Upgraded Streamable Foundation Text-to-Speech System

FireRed Team, Xiaohongshu

Speech Synthesis · TTS Model · Audio In · Audio Out · Multilingual · ★ 912

FireRedTTS-1S is Xiaohongshu's streamable foundation TTS, improving streaming latency and prosody control over its predecessor with chunk-wise generation suitable for live voice products.

Full-Duplex-Bench · 2025-03

Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities

National Taiwan University, UC Berkeley, MIT

Benchmark · Audio In · Audio Out · English · ★ 188

Full-Duplex-Bench is a benchmark for evaluating full-duplex spoken dialogue models on real-time interaction phenomena such as turn-taking, pauses, interruptions, and backchanneling — capabilities that traditional half-duplex evaluation cannot cover.

Phi-4-Mini · 2025-03

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

Microsoft

Model and Methods · Multimodal Language Model · Audio In · Multilingual

Phi-4-Mini and Phi-4-Multimodal are compact language and multimodal models from Microsoft. Phi-4-Mini is a 3.8B-parameter LLM; the multimodal variant extends it to vision and speech/audio via a Mixture-of-LoRAs design, achieving competitive results while remaining lightweight.

Qwen2.5-Omni · 2025-03

Qwen2.5-Omni Technical Report

Qwen Team, Alibaba Group

Model and Methods · Omni-Modal LLM · Audio In · Audio Out · Multilingual · ★ 4,014

Qwen2.5-Omni is Alibaba's end-to-end omni-modal LLM handling text, image, audio, and video as inputs and producing both text and streaming speech outputs, built on a Thinker-Talker dual-track architecture that decouples reasoning and speech generation.

Audio-FLAN · 2025-02

Audio-FLAN: A Preliminary Release

The Hong Kong University of Science and Technology

Dataset Resource · English · ★ 160

Audio-FLAN is a large-scale instruction-tuning dataset with over 100 million instances across 80 tasks in speech, music, and sound, designed to unify audio understanding and generation for developing generalist audio-language models.

IndexTTS · 2025-02

IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

Bilibili

Speech Synthesis · TTS Model · Audio In · Audio Out · Multilingual · ★ 20,884

IndexTTS is Bilibili's industrial-grade zero-shot TTS system optimised for production scenarios — controllable prosody, low-latency inference, and strong Chinese and English voice cloning from short reference audio.

OSUM · 2025-02

OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia

ASLP@NPU

Model and Methods · Model · Audio In · Multilingual · ★ 495

Large Language Models (LLMs) have made significant progress in various downstream tasks, inspiring the development of Speech Understanding Language Models (SULMs) to enable comprehensive speech-based interactions. However, most advanced SULMs are developed by the industry, leveraging large-scale datasets and computational resources that are not readily available to the academic community. Moreover, the lack of transparency in training details creates additional barriers to further innovation. In this study, we present OSUM, an Open Speech Understanding Model designed to explore the potential of training SULMs under constrained academic resources. The OSUM model combines a Whisper encoder with a Qwen2 LLM and supports a wide range of speech tasks, including speech recognition (ASR), speech recognition with timestamps (SRWT), vocal event detection (VED), speech emotion recognition (SER), speaking style recognition (SSR), speaker gender classification (SGC), speaker age prediction (SAP), and speech-to-text chat (STTC). By employing an ASR+X training strategy, OSUM achieves efficient and stable multi-task training by simultaneously optimizing ASR alongside target tasks. Beyond delivering strong performance, OSUM emphasizes transparency by providing openly available data preparation and training methodologies, offering valuable insights and practical guidance for the academic community. By doing so, we aim to accelerate research and innovation in advanced SULM technologies.

OWLS · 2025-02

OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models

Carnegie Mellon University, NVIDIA

Model and Methods · Model · Audio In · Multilingual

OWLS systematically studies neural scaling laws for multilingual speech recognition and translation models, training a suite of models from 0.25B to 18B parameters on up to 360K hours of public speech data across 150+ languages to characterise how performance scales with data, compute, and parameter count.

Step-Audio · 2025-02

Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

Step-Audio Team, StepFun

Model and Methods · Model · Audio In · Audio Out · Multilingual

Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks such as LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies.

Audio-CoT · 2025-01

Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model

Nanyang Technological University, Singapore

Model and Methods · Model · Audio In · English

Large Audio-Language Models (LALMs) have demonstrated remarkable performance in tasks involving audio perception and understanding, such as speech recognition and audio captioning. However, their reasoning capabilities - critical for solving complex real-world problems - remain underexplored. In this work, we conduct the first exploration into integrating Chain-of-Thought (CoT) reasoning into LALMs to enhance their reasoning ability across auditory modalities. We evaluate representative CoT methods, analyzing their performance in both information extraction and reasoning tasks across sound, music, and speech domains. Our findings reveal that CoT methods significantly improve performance on easy and medium tasks but encounter challenges with hard tasks, where reasoning chains can confuse the model rather than improve accuracy. Additionally, we identify a positive correlation between reasoning path length and accuracy, demonstrating the potential of scaling inference for advanced instruction-following and reasoning. This study not only highlights the promise of CoT in enhancing LALM reasoning capabilities but also identifies key limitations and provides actionable directions for future research.

LUCY · 2025-01

LUCY: Linguistic Understanding and Control Yielding Early Stage of Her

Tencent

Model and Methods · Model · Audio In · Audio Out · English · ★ 60

The film Her features Samantha, a sophisticated AI audio agent who is capable of understanding both linguistic and paralinguistic information in human speech and delivering real-time responses that are natural, informative and sensitive to emotional subtleties. Moving one step toward a more sophisticated audio agent, and building on recent advances in end-to-end (E2E) speech systems, we propose LUCY, an E2E speech model that (1) senses and responds to the user's emotion, (2) delivers responses in a succinct and natural style, and (3) uses external tools to answer real-time inquiries. Experimental results show that LUCY is better at emotion control than peer models, generating emotional responses based on linguistic emotional instructions and responding to paralinguistic emotional cues. LUCY is also able to generate responses in a more natural style, as judged by external language models, without sacrificing much performance on general question answering. Finally, LUCY can leverage function calls to answer questions that are out of its knowledge scope.

MinMo · 2025-01

MinMo: A Multimodal Large Language Model for Seamless Voice Interaction

FunAudioLLM Team, Tongyi Lab, Alibaba Group

Chatbot · Multimodal Large Language Model · Audio In · Audio Out · Multilingual

MinMo is a multimodal large language model with approximately 8 billion parameters, designed for seamless voice interaction. It facilitates real-time, natural, and human-like voice conversations by integrating speech and text processing. Trained on 1.4 million hours of diverse speech data, MinMo supports full-duplex communication, enabling simultaneous two-way interactions between the user and the system. It also offers enhanced instruction-following capabilities, allowing control over speech generation with nuances such as emotions, dialects, speaking rates, and voice mimicry. The model achieves state-of-the-art performance across various benchmarks for voice comprehension and generation while maintaining the capabilities of text-based large language models.

Sayna · 2025-01

Sayna: Voice Infrastructure for Audio LLM Applications

SaynaAI

Model and Methods · Infrastructure · Audio In · Audio Out · Multilingual · ★ 171

Sayna is a real-time voice infrastructure platform for building production voice-enabled LLM agents. It provides a unified API layer for STT/TTS with real-time streaming, multi-provider support, VAD, and voice analytics. Built with Rust and LiveKit, it offers low-latency WebSocket connections and REST endpoints for seamless voice-first experiences. Self-hostable with Docker and Kubernetes support.

UltraEval-Audio · 2025-01

UltraEval-Audio

OpenBMB

Benchmark · Multilingual · ★ 301

ADU-Bench · 2024-12

Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models

Tsinghua University, University of Oxford

Benchmark · Audio In

ADU-Bench is a comprehensive evaluation benchmark designed to assess the open-ended audio dialogue understanding capabilities of Large Audio-Language Models (LALMs). It comprises over 20,000 open-ended audio dialogues across various scenarios, skills, languages, and ambiguity categories, providing a robust framework for evaluating and advancing LALMs in real-world audio dialogue applications.

CosyVoice 2 · 2024-12

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

FunAudioLLM Team, Tongyi Lab, Alibaba Group

Speech Synthesis · TTS Model · Audio In · Audio Out · Multilingual · ★ 21,370

CosyVoice 2 is Alibaba's streaming TTS LLM, combining a unified speech tokenizer with a streaming-friendly LLM backbone to enable bidirectional streaming with sub-150 ms latency and improved cross-lingual zero-shot voice cloning.

GLM-4-Voice · 2024-12

GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

Zhipu AI, Tsinghua University

Chatbot · Spoken Dialogue Model · Audio In · Audio Out · Bilingual (Chinese and English) · ★ 3,181

GLM-4-Voice is an end-to-end spoken chatbot from Zhipu/Tsinghua that takes speech in and produces speech out directly, supporting low-latency streaming and natural Chinese / English conversation with controllable emotion, pitch, and speaking rate.

MERaLiON-AudioLLM · 2024-12

MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models

I2R, A*STAR, Singapore

Model and Methods · Model · Audio In · Multilingual

TalkArena · 2024-12

TalkArena: Interactive Evaluation of Large Audio Models

Stanford University, SCB 10X

Benchmark · Interactive Benchmarking Tool · Audio In · English · ★ 4

TalkArena is an interactive platform designed to benchmark Large Audio Models (AudioLLMs) through real-world user interactions. Similar to Chatbot Arena for text-based models, TalkArena allows users to input audio prompts and receive text-based responses from various state-of-the-art models, facilitating pairwise comparisons and user preference evaluations. The platform supports models such as GPT-4o, Gemini, Qwen2-Audio, DiVA-Llama 3, and Typhoon-Audio, enabling comprehensive assessments of their performance in natural, conversational settings.
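
Pairwise preference votes of this kind are often summarised as per-model win rates. The sketch below shows one simple way to tally them; the vote format, the placeholder model names, and the tie handling are assumptions, not TalkArena's actual scoring.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# One vote: (model_a, model_b, winner), where winner is None for a tie.
Vote = Tuple[str, str, Optional[str]]


def win_rates(votes: List[Vote]) -> Dict[str, float]:
    """Fraction of comparisons each model won; ties count as a game but not a win."""
    wins: Counter = Counter()
    games: Counter = Counter()
    for model_a, model_b, winner in votes:
        if winner not in (model_a, model_b, None):
            raise ValueError(f"winner {winner!r} did not take part in this comparison")
        games[model_a] += 1
        games[model_b] += 1
        if winner is not None:
            wins[winner] += 1
    return {model: wins[model] / games[model] for model in games}


# Placeholder model names, not real systems.
votes = [
    ("model-x", "model-y", "model-x"),
    ("model-x", "model-z", "model-z"),
    ("model-y", "model-z", None),
    ("model-x", "model-y", "model-x"),
]
for model, rate in sorted(win_rates(votes).items()):
    print(model, round(rate, 2))
```

Raw win rates ignore opponent strength; arena-style leaderboards commonly fit a rating model on top of such votes instead.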

Typhoon2-Audio · 2024-12

Typhoon2-Audio: A Thai Multimodal Language Model for Speech and Text Processing

SCB 10X

Model and Methods · Multimodal Language Model · Audio In · Audio Out · Thai, English · ★ 35

Typhoon2-Audio is a multimodal language model designed for Thai and English speech and text processing. It supports speech/audio input and both speech and text output, integrating components from SALMONN and Llama-Omni architectures. The model is trained on curated datasets to enhance instruction-following abilities and Thai language performance.

Dynamic-SUPERB Phase-2 · 2024-11

Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks

National Taiwan University, University of Texas at Austin, Carnegie Mellon University, Nanyang Technological University, Toyota Technological Institute of Chicago, Université du Québec (INRS-EMT), NVIDIA, ASAPP, Renmin University of China

Benchmark · Evaluation Framework · Audio In · Multilingual · ★ 200

Dynamic-SUPERB Phase-2 is an open and evolving benchmark designed for the comprehensive evaluation of instruction-based universal speech models. Building upon its first generation, this second phase incorporates 125 new tasks contributed collaboratively by the global research community, expanding the benchmark to a total of 180 tasks. It broadens evaluation capabilities by introducing a wide array of novel and diverse tasks, including regression and sequence generation, across speech, music, and environmental audio domains. The benchmark aims to guide the development of universal spoken language models by providing a diverse and comprehensive evaluation platform.

Taiwanese AudioLLM · 2024-11

Building a Taiwanese Mandarin Spoken Language Model: A First Attempt

National Taiwan University

Model and Methods · Model · Audio In · Audio Out · Taiwanese Mandarin

This technical report presents an initial attempt to develop a spoken large language model (LLM) for Taiwanese Mandarin, tailored for real-time, speech-to-speech interactions in multi-turn conversations. The end-to-end model employs a decoder-only transformer architecture, aiming for seamless interaction with full-duplex capabilities that allow simultaneous speaking and listening. The report details the training process, including data preparation with synthesized dialogues and adjustments for real-time interaction, and introduces a platform to evaluate conversational fluency and response coherence in multi-turn dialogues.

WavChat-Survey · 2024-11

WavChat: A Survey of Spoken Dialogue Models

Zhejiang University

Survey

DiVA · 2024-10

Distilling an End-to-End Voice Assistant Without Instruction Training Data

Georgia Tech, Stanford

Model and Methods · Model · Audio In · Audio Out

DiVA (Distilled Voice Assistant) is an end-to-end voice assistant model that integrates speech and text processing without relying on instruction training data. By utilizing self-supervision from a text-only large language model's responses to transcripts, DiVA generalizes to tasks such as spoken question answering, classification, and translation. Notably, it achieves a 72% user preference win rate compared to state-of-the-art models like Qwen 2 Audio, despite using significantly less training compute.

F5-TTS · 2024-10

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Shanghai Jiao Tong University

Speech Synthesis · TTS Model · Audio In · Audio Out · Multilingual · ★ 14,646

F5-TTS is a fully non-autoregressive TTS system based on flow matching with Diffusion Transformer, producing high-fidelity zero-shot voice cloning faster than autoregressive codec-LM TTS systems.

MMAU · 2024-10

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

University of Maryland

Benchmark · English · ★ 152

SPIRIT LM · 2024-10

SPIRIT LM: Interleaved Spoken and Written Language Model

Meta

Model and Methods · Model · Audio In · Audio Out · ★ 930

SPIRIT LM is a foundational multimodal language model developed by Meta that seamlessly integrates text and speech modalities. By extending a pretrained text language model to the speech domain through continuous training on both text and speech units, SPIRIT LM can process interleaved speech and text sequences. It comes in two versions: BASE, utilizing speech phonetic units (HuBERT), and EXPRESSIVE, which incorporates pitch and style units to model expressivity. The model demonstrates capabilities in tasks such as ASR, TTS, and speech classification, leveraging few-shot learning across modalities.

SpeechEmotionLlama · 2024-10

Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech

MIT, Meta

Model and Methods · Model · Audio In

This paper explores the capability of large language models (LLMs) to understand paralinguistic aspects of speech, such as emotions and speaking styles, without fine-tuning their weights. By training a speech encoder to produce token embeddings that align the LLM's responses to expressive speech prompts with semantically matching text prompts specifying the speaker's emotion, the system effectively conveys both semantic and paralinguistic information to the LLM. Experiments demonstrate that this approach enables LLMs to generate higher quality and more empathetic responses to expressive speech inputs.

SpeechLLM-Survey · 2024-10

A Survey on Speech Large Language Models

SJTU, AISpeech

Survey

SpeechLM-Survey · 2024-10

Recent Advances in Speech Language Models: A Survey

CUHK, Tencent

Survey

VoiceBench · 2024-10

VoiceBench: Benchmarking LLM-Based Voice Assistants

National University of Singapore

Benchmark · Audio In · ★ 367

VoiceBench is a comprehensive evaluation framework designed to assess the capabilities of LLM-based voice assistants. It evaluates various aspects, including general knowledge, instruction-following abilities, and safety measures, using both synthetic and real spoken instruction data that reflect real-world variations such as speaker characteristics, environmental factors, and content complexities.

ASRCompare · 2024-09

Comparing Discrete and Continuous Space LLMs for Speech Recognition

Tsinghua University, Tencent AI Lab

Model and Methods · Model · Audio In · ★ 3

This paper investigates discrete and continuous speech representations in Large Language Model (LLM)-based Automatic Speech Recognition (ASR). It organizes these representations by feature continuity and training approach into four categories: supervised and unsupervised for both discrete and continuous types. The study further classifies LLMs based on their input and autoregressive feedback into continuous and discrete-space models. Using specialized encoders and comparative analysis with a Joint-Training-From-Scratch Language Model (JTFS LM) and pre-trained LLaMA2-7b, it provides a detailed examination of their effectiveness. Notably, the work presents an open-sourced achievement of a state-of-the-art Word Error Rate (WER) of 1.69% on LibriSpeech using a HuBERT encoder, offering valuable insights for advancing ASR and natural language processing research.
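
For readers unfamiliar with the metric, WER is the word-level edit distance between hypothesis and reference divided by the number of reference words. A minimal sketch, assuming whitespace tokenization and no text normalization (real evaluations usually normalize case and punctuation first):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(f"{word_error_rate('the cat sat on the mat', 'the cat sat on mat'):.3f}")
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many insertions.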

AudioBERT · 2024-09

AudioBERT: Audio Knowledge Augmented Language Model

POSTECH, Inha University

Model and Methods · Model · Audio In · ★ 40

AudioBERT is a language model augmented with auditory knowledge to enhance its performance on tasks requiring an understanding of sounds. It employs a retrieval-based approach, utilizing an Auditory Knowledge Span Detector to identify text spans necessitating auditory knowledge. Relevant audio embeddings are retrieved using CLAP (Contrastive Language-Audio Pretraining) and integrated into the language model. This method enables AudioBERT to effectively handle tasks such as animal sound recognition and sound pitch comparison, as demonstrated on the AuditoryBench dataset.
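
The retrieval step can be pictured as a nearest-neighbour lookup in a shared text-audio embedding space. The following sketch uses hand-made 3-dimensional vectors in place of real CLAP embeddings, and the names `audio_bank` and `retrieve` are hypothetical; it shows only the cosine-similarity lookup, not AudioBERT's span detector or how the embeddings are injected into the model.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec: List[float], audio_bank: Dict[str, List[float]]) -> str:
    """Return the key of the audio embedding most similar to the query."""
    return max(audio_bank, key=lambda name: cosine(query_vec, audio_bank[name]))


# Toy stand-ins for audio embeddings (assumption: real ones would come from CLAP).
audio_bank = {
    "dog_bark": [0.9, 0.1, 0.0],
    "cat_meow": [0.1, 0.9, 0.0],
    "rain": [0.0, 0.1, 0.9],
}
# Toy stand-in for the text embedding of a detected span such as "barking".
span_vec = [0.8, 0.2, 0.1]
print(retrieve(span_vec, audio_bank))
```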

DeSTA2 · 2024-09

Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data

National Taiwan University, NVIDIA

Model and Methods · Model · Audio In · ★ 126

DeSTA2 is a speech-language model that integrates pre-trained speech models with large language models to interpret and generate comprehensive natural language descriptions. It enhances the model's speech comprehension capabilities without extensive speech instruction-tuning, thereby preserving the inherent language understanding of the text-based LLM. DeSTA2 demonstrates impressive performance on benchmarks like Dynamic-SUPERB and AIR-Bench-Chat, showcasing its ability to follow complex instructions derived from LLMs, such as specific output formatting and chain-of-thought reasoning.

EMOVA2024-09

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

HKUST

MultimodalModelAudio InAudio OutEnglish

LLaMA-Omni2024-09

LLaMA-Omni: Seamless Speech Interaction with Large Language Models

Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)

Model and MethodsModelAudio InAudio Out★ 3,142

LLaMA-Omni is a low-latency, high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct. It enables seamless speech interactions with large language models, simultaneously generating both text and speech responses based on speech instructions. The model integrates a pretrained speech encoder, a speech adaptor, an LLM, and a streaming speech decoder, eliminating the need for intermediate speech transcription. Experimental results demonstrate that LLaMA-Omni provides superior responses in both content and style, with response latency as low as 226ms. Training LLaMA-Omni requires less than 3 days on 4 GPUs, facilitating efficient development of speech-language models.

MaskGCT2024-09

MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer

The Chinese University of Hong Kong (Shenzhen), Amphion

Speech SynthesisTTS ModelAudio InAudio OutMultilingual★ 9,824

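MaskGCT is a fully non-autoregressive TTS system that predicts masked tokens with transformers in two stages: text to semantic tokens, then semantic tokens to acoustic codec tokens. It needs neither explicit text-speech alignment nor phone-level duration prediction. Released as part of the Amphion toolkit, it offers strong zero-shot voice cloning.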

MoWE-Audio2024-09

MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders

A*STAR

Model and MethodsModelAudio In

MoWE-Audio introduces a novel approach to enhance Audio Large Language Models (AudioLLMs) by incorporating a mixture of 'weak' encoders. This method supplements a base encoder with a pool of lightweight encoders, selectively activated based on the audio input, to improve feature extraction without significantly increasing model size. Empirical results demonstrate that MoWE effectively enhances multi-task performance, broadening the applicability of AudioLLMs to more diverse audio tasks.

Moshi2024-09

Moshi: a speech-text foundation model for real-time dialogue

Kyutai

Model and MethodsModelAudio InAudio Out★ 10,309

Moshi is a speech-text foundation model and full-duplex spoken dialogue framework that addresses limitations in current spoken dialogue systems by integrating speech recognition and generation into a single model. It enables real-time, natural conversations by reducing latency and preserving non-linguistic information such as emotion and accent. Moshi models multiple audio streams in parallel, allowing for seamless handling of overlapping speech and interruptions, thereby enhancing the naturalness of human-computer interactions.

SALMon2024-09

A Suite for Acoustic Language Model Evaluation

Hebrew University of Jerusalem

BenchmarkAudio In

SALMon is a novel evaluation suite encompassing background noise, emotion, speaker identity, and room impulse response. It evaluates both the consistency of the inspected element and its alignment with the spoken text, providing a comprehensive benchmark for speech language models.

Ultravox2024-09

Ultravox: A Fast Multimodal LLM for Real-Time Voice

Fixie.ai

Model and MethodsModelAudio InMultilingual★ 4,435

Ultravox is an open-source multimodal large language model (LLM) designed for real-time voice interactions. It extends any open-weight LLM with a multimodal projector that converts audio directly into the high-dimensional space used by LLMs, eliminating the need for a separate Automatic Speech Recognition (ASR) stage. This direct coupling allows Ultravox to respond more quickly than systems that combine separate ASR and LLM components. The current version (v0.4) supports multiple languages, including Arabic, Chinese, Dutch, English, French, German, Hindi, Italian, Japanese, Portuguese, Russian, Spanish, Swedish, Turkish, and Ukrainian. Ultravox is capable of understanding both text and human speech, making it suitable for applications such as voice agents, speech-to-speech translation, and analysis of spoken audio.

Mini-Omni2024-08

Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming

Tsinghua University

Model and MethodsModelAudio InAudio Out★ 3,551

Mini-Omni is an open-source multimodal large language model designed for real-time speech interaction. It features end-to-end speech input and streaming audio output capabilities, enabling seamless voice conversations without the need for separate ASR or TTS systems. The model employs a text-instructed speech generation method and batch-parallel strategies during inference to enhance performance. Additionally, the VoiceAssistant-400K dataset is introduced to fine-tune models optimized for speech output. Mini-Omni aims to facilitate real-time human-computer interaction by integrating speech processing directly into the language model framework.

MooER2024-08

MooER: LLM-based Speech Recognition and Translation Models from Moore Threads

Moore Threads

Model and MethodsModelAudio InMultilingual★ 219

MooER is a Large Language Model (LLM)-based system developed by Moore Threads for automatic speech recognition (ASR) and automatic speech translation (AST). Trained on a 5,000-hour pseudo-labeled dataset comprising open-source and self-collected speech data, MooER achieves performance comparable to other open-source models trained on significantly larger datasets. Notably, it attains a BLEU score of 25.2 on the CoVoST2 Zh2en test set. The model architecture integrates an encoder, an adapter, and a decoder (LLM), and training is optimized with DeepSpeed, data loader acceleration, gradient checkpointing, gradient accumulation, and BF16 precision. MooER supports multiple languages and is designed for end-to-end speech interaction, translation, and recognition tasks.

MuChoMusic2024-08

MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models

UPF, QMUL, UMG

BenchmarkAudio In★ 45

Typhoon-Audio2024-08

Typhoon-Audio: Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models

SCB 10X

Model and MethodsMultimodal Language ModelAudio InThai, English

Typhoon-Audio is a multimodal language model supporting speech/audio input and text output. Based on the SALMONN architecture, it is trained on curated datasets to enhance general instruction-following abilities and performance in the Thai language, addressing challenges in low-resource language processing.

VITA2024-08

VITA: Towards Open-Source Interactive Omni Multimodal LLM

Tencent Youtu Lab, Nanjing University, Xiamen University

MultimodalOmni-Modal LLMAudio InAudio OutMultilingual★ 2,512

VITA is one of the first fully open-source omni-modal LLMs supporting interactive video, image, audio, and text inputs with non-awakening interaction and audio interrupt — establishing a recipe later adopted by many Chinese omni models.

AudioEntailment2024-07

Audio Entailment: Assessing Deductive Reasoning for Audio Understanding

CMU, Microsoft

BenchmarkAudio In★ 17

CompA2024-07

CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models

University of Maryland, College Park; Adobe, USA; NVIDIA, Bangalore, India

Model and MethodsModelAudio In★ 23

CompA introduces two expert-annotated benchmarks, CompA-order and CompA-attribute, designed to evaluate compositional reasoning in audio-language models (ALMs). CompA-order assesses an ALM's understanding of the order of acoustic events, while CompA-attribute evaluates how well it binds attributes to those events. The study finds that current ALMs perform only marginally better than random chance on compositional reasoning tasks. To address this, the authors propose CompA-CLAP, a fine-tuned model trained with composition-aware hard negatives and a modular contrastive loss, which improves fine-grained compositional understanding without relying on extensive compositional audio datasets. CompA-CLAP shows significant improvements over baseline models on the CompA benchmark.

Decoder-only LLMs for STT2024-07

Investigating Decoder-only Large Language Models for Speech-to-text Translation

NTU-Taiwan, Meta

Model and MethodsResearchAudio InMultilingual

This research paper explores the application of decoder-only large language models for speech-to-text translation, analyzing their effectiveness and potential advantages in multilingual translation tasks.

FunAudioLLM2024-07

FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

Alibaba

Model and MethodsModelAudio InAudio OutMultilingual

FunAudioLLM is a foundation model developed by Alibaba for voice understanding and generation, facilitating natural interaction between humans and large language models. It supports multilingual audio input and output, enabling seamless voice-based communication and interaction.

GAMA2024-07

GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities

University of Maryland, College Park

Model and MethodsModelAudio In★ 153

GAMA is a General-purpose Large Audio-Language Model (LALM) designed to enhance audio understanding and complex reasoning abilities. It integrates a Large Language Model (LLM) with multiple types of audio representations, including features from a custom Audio Q-Former and a multi-layer aggregator that processes features from various layers of an audio encoder. Fine-tuned on a large-scale audio-language dataset, GAMA is equipped with advanced audio understanding capabilities. Additionally, it employs CompA-R, a synthetically generated instruction-tuning dataset, to endow the model with complex reasoning abilities, particularly for open-ended audio question-answering tasks. GAMA outperforms existing LALMs across diverse audio understanding tasks, demonstrating superior performance in both automated and expert human evaluations.

LLaST2024-07

LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models

The Chinese University of Hong Kong, Shenzhen; Shanghai AI Laboratory; Nara Institute of Science and Technology, Japan

Model and MethodsModelAudio InMultilingual★ 26

LLaST is a framework designed to enhance end-to-end speech-to-text translation systems by leveraging Large Language Models (LLMs). It addresses limitations in traditional E2E ST models through innovative architecture design and optimization techniques, including ASR-augmented training, multilingual data augmentation, and dual-LoRA optimization. Evaluations on the CoVoST-2 benchmark demonstrate LLaST's superior performance and scalability, making it a strong baseline for future speech translation research.

Qwen2-Audio2024-07

Qwen2-Audio Technical Report

Alibaba Group

Model and MethodsModelAudio InMultilingual★ 2,073

Qwen2-Audio is a large-scale audio-language model developed by Alibaba Group, capable of accepting various audio signal inputs and performing audio analysis or generating textual responses based on speech instructions. It introduces two distinct audio interaction modes: voice chat, allowing users to engage in voice interactions without text input, and audio analysis, enabling users to provide audio and text instructions for analysis during interaction. The model has been enhanced with instruction-following capabilities and optimized using Direct Preference Optimization (DPO) to improve performance in terms of factuality and adherence to desired behavior. Evaluations indicate that Qwen2-Audio outperforms previous state-of-the-art models in audio-centric instruction-following tasks.

SenseVoice2024-07

FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

Tongyi SpeechTeam, Alibaba Group

Speech RecognitionSpeech Understanding ModelAudio InMultilingual

SenseVoice is a multilingual speech understanding foundation model (part of the FunAudioLLM report) that jointly performs speech recognition, spoken emotion recognition, and audio event detection. The model family supports 50+ languages, and the non-autoregressive SenseVoice-Small (234M params) is reported to run more than 5x faster than Whisper-small and more than 15x faster than Whisper-large.

Stable Audio Open2024-07

Stable Audio Open

Stability AI

Audio GenerationAudio Generation ModelAudio OutEnglish

Stable Audio Open is Stability AI's open-weight text-to-audio diffusion model for generating short stereo audio clips (up to ~47s) including sound effects and music samples, trained only on Creative Commons audio.

Audio Hallucination2024-06

Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models

NTU-Taiwan

StudyResearchAudio In★ 34

AudioBench2024-06

AudioBench: A Universal Benchmark for Audio Large Language Models

A*STAR, Singapore

BenchmarkAudio In★ 310

AudioBench is a universal benchmark designed to evaluate Audio Large Language Models (AudioLLMs). It encompasses 8 distinct tasks and 26 datasets, including 7 newly proposed datasets, targeting speech understanding, audio scene understanding, and voice understanding (paralinguistic).

CodecFake2024-06

CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems

National Taiwan University

SafetyEnglish★ 21

DeSTA2024-06

DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment

NTU-Taiwan, Nvidia

Model and MethodsModelAudio InMultilingual★ 1

DeSTA is a model that enhances speech language models by aligning descriptive speech and text, improving the model's ability to understand and generate accurate transcriptions across multiple languages.

E2 TTS2024-06

E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

Microsoft

Speech SynthesisTTS ModelAudio InAudio OutEnglish

E2 TTS removes nearly all of the usual TTS pipeline complexity — no phoneme aligner, no duration predictor, no explicit grapheme-to-phoneme model — and trains a flow-matching transformer end-to-end on text + audio. Foundational influence on F5-TTS and follow-on systems.

MusiLingo2024-06

MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response

University of Pennsylvania

Model and MethodsModelAudio InEnglish★ 50

MusiLingo is a novel system that bridges music audio and language by aligning MERT and a frozen LLM via a single projection layer, enabling high-quality music captioning and question answering, supported by the newly introduced MusicInstruct dataset.

SD-Eval2024-06

SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

CUHK, Bytedance

BenchmarkAudio In★ 56

Speech ReaLLM2024-06

Speech ReaLLM – Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time

Meta

Model and MethodsModelAudio InMultilingual

Speech ReaLLM is a real-time streaming speech recognition model developed by Meta, utilizing multimodal large language models to understand the temporal flow of speech for accurate and efficient transcription.

AIR-Bench2024-05

AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension

ZJU, Alibaba

BenchmarkAudio InAudio Out★ 132

Audio Flamingo2024-05

Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities

Nvidia

Model and MethodsModelAudio InAudio OutMultilingual★ 1,137

Audio Flamingo is an audio language model developed by Nvidia, featuring strong audio understanding, few-shot learning, and multi-turn dialogue capabilities. It takes audio and text as input and produces text responses, adapting to unseen tasks through in-context learning and retrieval.

SpeechVerse2024-05

SpeechVerse: A Large-scale Generalizable Audio Language Model

Amazon AGI

Model and MethodsModelAudio InMultilingual

SpeechVerse is a large-scale multitask audio language model from Amazon AGI. It pairs a frozen speech foundation model with an LLM through a small trainable adapter and is supervised on a broad mixture of speech tasks via natural-language instructions, achieving strong zero- and few-shot generalisation across 11 speech understanding tasks.

VoiceJailbreak2024-05

Voice Jailbreak Attacks Against GPT-4o

CISPA

SafetyMethodAudio InMultilingual★ 38

LibriSQA2024-04

LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models

Shanghai Jiao Tong University

Dataset ResourceEnglish★ 39

LibriSQA is a free-form, open-ended Spoken Question Answering (SQA) dataset built from LibriSpeech, addressing a task that requires precise alignment and deep interaction between speech and text features. Part I uses natural conversational formats, and Part II consists of multiple-choice questions followed by answers and analytical segments; together the two parts contain 107k SQA pairs covering various topics. The authors also propose a lightweight, end-to-end framework for SQA on LibriSQA and, by recasting ASR in the SQA format, show that the same framework can handle ASR tasks. The results support the ability of LLMs to align and comprehend multimodal information.

SALMONN2024-04

SALMONN: Towards Generic Hearing Abilities for Large Language Models

Tsinghua

Model and MethodsModelAudio InMultilingual★ 1,443

SALMONN is a model developed by Tsinghua University aiming to equip large language models with generic hearing abilities, enhancing their capacity to process and understand diverse audio inputs.

SpokenWOZ2024-03

SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents

Tencent

BenchmarkEnglish★ 1,556

SpokenWOZ is a large-scale speech-text dataset for spoken task-oriented dialogue (TOD), containing 8 domains, 203k turns, 5.7k dialogues, and 249 hours of audio from human-to-human spoken conversations. It targets the gap between academic research, which has largely relied on datasets written by annotators, and real-world spoken conversation; earlier small-scale spoken TOD datasets address robustness issues such as ASR errors but ignore challenges unique to spoken conversation. SpokenWOZ incorporates common spoken characteristics such as word-by-word processing and reasoning in spoken language, and on that basis introduces cross-turn slot detection and reasoning slot detection as new challenges. Experiments cover text-modal models, newly proposed dual-modal models, and LLMs such as ChatGPT. Current models leave substantial room for improvement: the most advanced dialogue state tracker reaches only 25.65% joint goal accuracy, and the state-of-the-art end-to-end model correctly completes the user request in only 52.1% of dialogues. The dataset, code, and leaderboard are publicly available.
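Joint goal accuracy, the dialogue state tracking metric cited for SpokenWOZ, is conventionally the fraction of turns in which every slot value in the predicted state matches the gold state exactly. A minimal sketch follows; the slot names and values are made up for illustration, and this is not the SpokenWOZ evaluation script.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose full predicted state equals the gold state."""
    if len(predicted_states) != len(gold_states):
        raise ValueError("need one predicted state per gold state")
    if not gold_states:
        raise ValueError("need at least one turn")
    # A turn counts as correct only if all slots match; one wrong or
    # missing slot makes the whole turn incorrect.
    correct = sum(pred == gold for pred, gold in zip(predicted_states, gold_states))
    return correct / len(gold_states)


if __name__ == "__main__":
    # Hypothetical two-turn dialogue with illustrative slot names.
    gold = [
        {"hotel-area": "north"},
        {"hotel-area": "north", "hotel-stars": "4"},
    ]
    pred = [
        {"hotel-area": "north"},
        {"hotel-area": "north", "hotel-stars": "3"},
    ]
    print(joint_goal_accuracy(pred, gold))
```

Because a single wrong slot invalidates the turn, the metric is much stricter than per-slot accuracy, which helps explain why scores on spoken data can be low.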

WavLLM2024-03

WavLLM: Towards Robust and Adaptive Speech Large Language Model

CUHK

Model and MethodsModelAudio InMultilingual★ 1,442

WavLLM is a speech large language model developed by CUHK, designed to be robust and adaptive across various speech processing tasks, supporting multilingual audio inputs for comprehensive language understanding.

AudioLM-Survey2024-02

Towards audio language modeling -- an overview

National Taiwan University, MIT

Survey

SLAM-LLM2024-02

An Embarrassingly Simple Approach for LLM with Strong ASR Capacity

Shanghai Jiao Tong University (SJTU)

Model and MethodsModelAudio InMultilingual★ 1,034

SLAM-LLM is a model developed by SJTU that integrates large language models with strong automatic speech recognition (ASR) capabilities, providing a simple yet effective approach for speech-to-text tasks across multiple languages.

Pengi2024-01

Pengi: An Audio Language Model for Audio Tasks

Microsoft

Model and MethodsModelAudio InAudio OutMultilingual★ 322

Pengi is an audio language model developed by Microsoft that frames all audio tasks as text-generation tasks: it takes an audio recording and a text prompt as input and generates free-form text as output.

Qwen-Audio2023-12

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Alibaba

Model and MethodsModelAudio InAudio OutMultilingual★ 1,895

Qwen-Audio is a large-scale audio-language model developed by Alibaba, aiming to advance universal audio understanding by integrating audio and language processing capabilities in a unified framework.

CoDi-22023-11

CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation

UC Berkeley

MultimodalModelAudio InAudio OutEnglish★ 1,706

OpenJMLA2023-10

Joint Music and Language Attention Models for Zero-shot Music Tagging

ByteDance

Model and MethodsModelAudio InEnglish

JMLA (Joint Music and Language Attention) introduces an open-set music tagging model that combines a pretrained music encoder with a language model via attention, enabling zero-shot tagging on arbitrary tag vocabularies rather than fixed closed-set labels.

UniAudio2023-10

An Audio Foundation Model Toward Universal Audio Generation

Chinese University of Hong Kong (CUHK)

Model and MethodsModelAudio InAudio OutMultilingual★ 604

UniAudio is an audio foundation model developed by CUHK, aiming toward universal audio generation by supporting various audio generation tasks, including speech, sound, music, and singing voice, based on diverse input conditions.

Dynamic-SUPERB2023-09

Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech

NTU-Taiwan, etc.

BenchmarkAudio InAudio Out★ 200

LLaSM2023-09

LLaSM: Large Language and Speech Model

LinkSoul.AI

Model and MethodsModelAudio InBilingual (Chinese and English)★ 559

LLaSM is a large language and speech model developed by LinkSoul.AI, supporting bilingual (Chinese and English) speech-text multimodal dialogues. It offers convenient speech input, enhancing user experience by avoiding the complexities and potential errors associated with ASR-based solutions.

LTU-AS2023-09

Joint Audio and Speech Understanding

MIT, IBM Research

Model and MethodsModelAudio InEnglish★ 474

LTU-AS (Listen, Think, and Understand — Audio and Speech) extends the LTU model to jointly handle non-speech audio and speech understanding by combining a Whisper-style speech encoder with an audio event encoder, feeding both into an LLM for unified reasoning over audio inputs.

Segment-level Q-Former2023-09

Connecting Speech Encoder and Large Language Model for ASR

Tsinghua University, ByteDance

Model and MethodsModelAudio In

This paper presents a comparative study of three connector structures—fully connected layers, multi-head cross-attention, and Q-Former—for integrating speech encoders with large language models (LLMs) in automatic speech recognition (ASR) systems. The study finds that LLMs with Q-Formers achieve consistent and significant word error rate reductions over other connector structures. Additionally, a novel segment-level Q-Former is proposed to enable LLMs to recognize longer speech segments, resulting in further performance improvements.

AudioLDM 22023-08

AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining

University of Surrey, Imperial College London

Audio GenerationAudio Generation ModelAudio OutEnglish★ 2,627

AudioLDM 2 unifies speech, sound effects, and music generation in a single latent diffusion framework by introducing a shared 'language of audio' learnt from self-supervised pretraining, enabling holistic high-quality audio generation from text.

SeamlessM4T2023-08

SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

Meta AI

Speech RecognitionSpeech Translation ModelAudio InAudio OutMultilingual★ 11,785

SeamlessM4T is Meta's unified multilingual multimodal translation model covering ASR, speech-to-text translation, speech-to-speech translation, text-to-text and text-to-speech across nearly 100 input and 35+ output languages in a single system.

Prompting LLMs with Speech Recognition2023-07

Prompting Large Language Models with Speech Recognition Abilities

Meta

Model and MethodsModelAudio In

This paper presents a method to extend large language models (LLMs) with speech recognition capabilities by integrating a small audio encoder. By prepending audio embeddings to text token embeddings, the LLM can function as an automatic speech recognition (ASR) system. Experiments demonstrate that incorporating a conformer encoder into the LLaMA-7B model enables it to outperform monolingual baselines and perform multilingual speech recognition, despite being predominantly trained on English text.
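The core idea of prepending audio embeddings to text token embeddings can be sketched as a simple sequence concatenation. The toy embedding table, the two-dimensional vectors, and the function name `build_llm_input` below are hypothetical; the paper's encoder, dimensions, and training setup are not reproduced here.

```python
def build_llm_input(audio_embeddings, text_token_ids, embedding_table):
    """Concatenate audio embeddings in front of the text token embeddings.

    audio_embeddings: list of vectors produced by an audio encoder,
        already projected to the LLM's embedding dimension.
    text_token_ids: token ids of the text prompt or transcript prefix.
    embedding_table: the LLM's token embedding matrix (list of vectors).
    """
    dim = len(embedding_table[0])
    for vec in audio_embeddings:
        if len(vec) != dim:
            raise ValueError("audio embeddings must match the LLM embedding size")
    text_embeddings = [embedding_table[token_id] for token_id in text_token_ids]
    # The decoder then attends over audio positions first, text positions after.
    return list(audio_embeddings) + text_embeddings


if __name__ == "__main__":
    table = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]  # toy 3-token vocabulary
    audio = [[0.5, 0.5], [0.25, 0.75]]            # two toy audio frames
    sequence = build_llm_input(audio, [2, 1], table)
    print(len(sequence), sequence)
```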

DAC2023-06

High-Fidelity Audio Compression with Improved RVQGAN

Descript

Audio GenerationNeural Audio CodecAudio InAudio OutMultilingual★ 1,807

Descript Audio Codec (DAC) is a high-fidelity universal 44.1 kHz neural audio codec achieving ~90x compression with substantially better quality than EnCodec, widely used as the discrete tokenizer for downstream audio generation models.

Macaw-LLM2023-06

Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration

Tencent

MultimodalModelAudio InMultilingual★ 1,591

MusicGen2023-06

Simple and Controllable Music Generation

Meta AI

Audio GenerationMusic Generation ModelAudio InAudio OutEnglish★ 23,329

MusicGen is Meta's single-stage autoregressive transformer for controllable text-conditioned music generation, operating over discrete EnCodec tokens with optional melody conditioning. Part of the AudioCraft suite.

StyleTTS 22023-06

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Columbia University

Speech SynthesisTTS ModelAudio InAudio OutEnglish★ 6,272

StyleTTS 2 models speech styles as a latent random variable through diffusion and adversarial training with large speech language models, achieving human-level naturalness on LJSpeech and strong zero-shot speaker cloning.

FunASR2023-05

FunASR: A Fundamental End-to-End Speech Recognition Toolkit

Speech Lab, Alibaba DAMO Academy

Speech RecognitionSpeech Recognition ToolkitAudio InMultilingual

FunASR is an industrial-grade open-source speech recognition toolkit from Alibaba's Speech Lab that bridges academic research and production deployment. It ships pretrained models including the non-autoregressive Paraformer (SOTA CER on many Mandarin benchmarks), FSMN-VAD, punctuation restoration, CAM++ speaker diarization, timestamp prediction, and streaming recognition across 50+ languages.

MMS2023-05

Scaling Speech Technology to 1,000+ Languages

Meta AI

Speech RecognitionSpeech Recognition ModelAudio InAudio OutMultilingual★ 32,230

MMS (Massively Multilingual Speech) extends speech foundation models (wav2vec 2.0, pre-trained on 1,406 languages) to ASR and TTS for 1,107 languages and language identification for 4,017 languages, dramatically expanding speech coverage beyond the previously dominant ~100 languages.

SpeechGPT2023-05

SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities

Fudan University

Model and MethodsModelAudio InAudio Out★ 1,403

SpeechGPT is a multimodal large language model developed by Fudan University, capable of perceiving and generating multimodal content following human instructions. It integrates cross-modal conversational abilities, enabling it to handle tasks involving speech and text seamlessly.

AudioGPT2023-04

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

Zhejiang University

Model and MethodsModelAudio InAudio Out★ 10,179

AudioGPT is a multimodal AI system that integrates Large Language Models (LLMs) with foundation models to process complex audio information, enabling tasks such as understanding and generating speech, music, sound, and talking head. It supports spoken dialogue through ASR and TTS interfaces, facilitating human-like interactions and content creation.

VALL-E2023-01

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Microsoft

Speech SynthesisTTS ModelAudio InAudio OutEnglish★ 22,138

VALL-E reframes text-to-speech as a conditional language modeling task over discrete audio codec tokens (EnCodec), enabling zero-shot voice cloning from a 3-second enrollment recording with strong speaker similarity and prosody.

Whisper2022-12

Robust Speech Recognition via Large-Scale Weak Supervision

OpenAI

Speech RecognitionSpeech Recognition ModelAudio InMultilingual★ 101,110

Whisper is OpenAI's open-source speech recognition model trained on 680K hours of multilingual and multitask supervised data from the web. It performs robust transcription, translation to English, and language identification across 99 languages.

EnCodec2022-10

High Fidelity Neural Audio Compression

Meta AI

Audio GenerationNeural Audio CodecAudio InAudio OutMultilingual★ 3,966

EnCodec is Meta's streaming neural audio codec that compresses 24/48 kHz audio with high perceptual fidelity using a residual vector quantizer. Its discrete tokens are the foundation for MusicGen, AudioGen, and VALL-E.
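Residual vector quantization, the mechanism named above, quantizes a vector in stages: each stage picks the nearest codeword for whatever the previous stages left unexplained. The sketch below uses tiny hand-written codebooks purely for illustration; EnCodec's codebooks are learned and operate on encoder latents, so this is not its implementation.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - x) ** 2 for c, x in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Return one code index per stage, each quantizing the running residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen codeword; the next stage refines what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Sum the selected codewords from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical two-stage codebooks: a coarse grid and a finer correction grid.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [0.25, 0.25]],
]

if __name__ == "__main__":
    codes = rvq_encode([1.2, 0.3], CODEBOOKS)
    print(codes, rvq_decode(codes, CODEBOOKS))
```

The list of per-stage indices is the kind of discrete token sequence that downstream generative models consume; adding stages raises fidelity at the cost of more tokens per frame.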

AudioGen2022-09

AudioGen: Textually Guided Audio Generation

Meta AI, Hebrew University of Jerusalem

Audio GenerationAudio Generation ModelAudio OutEnglish★ 23,329

AudioGen is a transformer-based autoregressive model for text-to-environmental-sound generation, trained on discrete audio tokens. It established the recipe later used by MusicGen and is part of Meta's AudioCraft.