audio-ai-hub

Featured — most-starred foundational works

Whisper | 2022-12

Robust Speech Recognition via Large-Scale Weak Supervision

OpenAI

Speech Recognition | Speech Recognition Model | Audio In | Multilingual | ★ 105,272

Whisper is OpenAI's open-source speech recognition model trained on 680K hours of multilingual and multitask supervised data from the web. It performs robust transcription, translation to English, and language identification across 99 languages.

VoxCPM2 | 2026-06

VoxCPM2 Technical Report

OpenBMB, ModelBest, Tsinghua University (THUHCSI)

Speech Synthesis | TTS Model | Audio In | Audio Out | Multilingual | ★ 33,829

VoxCPM2 is a fully open-source 2B-parameter multilingual, controllable speech generation foundation model extending VoxCPM's hierarchical diffusion-autoregressive paradigm. It unifies 30 languages, 9 Chinese dialects, natural-language voice design, style-controllable voice cloning, and high-fidelity continuation cloning in a single backbone, using an asymmetric AudioVAE that encodes at 16 kHz and reconstructs at 48 kHz, trained on over 2 million hours of speech.

MMS | 2023-05

Scaling Speech Technology to 1,000+ Languages

Meta AI

Speech Recognition | Speech Recognition Model | Audio In | Audio Out | Multilingual | ★ 32,244

MMS (Massively Multilingual Speech) builds on wav2vec 2.0 models pretrained on 1,406 languages to provide ASR and TTS for 1,107 languages and language identification for 4,017 languages, expanding speech coverage well beyond the ~100 languages that previously dominated.

MiniCPM-o | 2026-04

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

ModelBest (OpenBMB), Tsinghua University

Multimodal | Omni-Modal LLM | Audio In | Audio Out | Multilingual | ★ 25,937

MiniCPM-o 4.5 is OpenBMB's compact (8B-class) full-duplex omni-modal LLM supporting real-time vision, speech, and text interaction with low-latency streaming TTS, designed for on-device and edge deployment.

MusicGen | 2023-06

Simple and Controllable Music Generation

Meta AI

Audio Generation | Music Generation Model | Audio In | Audio Out | English | ★ 23,503

MusicGen is Meta's single-stage autoregressive transformer for controllable text-conditioned music generation, operating over discrete EnCodec tokens with optional melody conditioning. It is part of Meta's AudioCraft suite.

AudioGen | 2022-09

AudioGen: Textually Guided Audio Generation

Meta AI, Hebrew University of Jerusalem

Audio Generation | Audio Generation Model | Audio Out | English | ★ 23,503

AudioGen is a transformer-based autoregressive model for text-to-environmental-sound generation, trained on discrete audio tokens. It established the recipe later used by MusicGen and is part of Meta's AudioCraft.

CosyVoice 3 | 2025-05

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

FunAudioLLM Team, Tongyi Lab, Alibaba Group

Speech Synthesis | TTS Model | Audio In | Audio Out | Multilingual | ★ 22,282

CosyVoice 3 scales the CosyVoice TTS stack with significantly larger pre-training data and a dedicated post-training stage, targeting in-the-wild speech generation across more languages, accents, and acoustic conditions.

CosyVoice 2 | 2024-12

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

FunAudioLLM Team, Tongyi Lab, Alibaba Group

Speech Synthesis | TTS Model | Audio In | Audio Out | Multilingual | ★ 22,282

CosyVoice 2 is Alibaba's streaming TTS LLM, combining a unified speech tokenizer with a streaming-friendly LLM backbone to enable bidirectional streaming with sub-150 ms latency and improved cross-lingual zero-shot voice cloning.

All entries

ACA-SER | 2026-06

Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition

University of Augsburg, Technical University of Munich, Imperial College London

Study | Audio In | English

A probing study testing whether instruction-following audio language models use explicit acoustic concept tokens (six interpretable cues derived from the eGeMAPS feature set: energy, pitch, dynamics, brightness, formants, and voice quality) in a grounded way for speech emotion recognition. On FAU-Aibo and IEMOCAP, aligned tokens improve unweighted average recall while shuffled, conflicting, or corrupted tokens degrade it.
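Unweighted average recall (UAR), the metric named in this entry, is the plain mean of per-class recalls, so rare emotion classes weigh as much as frequent ones. A minimal sketch of that standard definition follows; the function name and toy labels are illustrative and are not taken from the paper.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; every class counts equally, whatever its size."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")

    total = defaultdict(int)    # number of reference items per class
    correct = defaultdict(int)  # correctly predicted items per class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Toy labels (hypothetical): recall is 2/3 for "angry" and 1/1 for "neutral",
# so UAR is 5/6 even though plain accuracy is 3/4.
y_true = ["angry", "angry", "angry", "neutral"]
y_pred = ["angry", "angry", "neutral", "neutral"]
uar = unweighted_average_recall(y_true, y_pred)
```

Because each class contributes equally, UAR is a common choice for imbalanced emotion corpora, where accuracy alone can hide poor recall on minority classes.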

AVSR-Gen | 2026-06

Assessing True Generalisability of Audio-Visual Speech Recognisers

Trinity College Dublin, Imperial College London

Study | Audio In | English | ★ 2

Introduces MV2LRS3, a controlled unseen test set subsampled from MultiVSR to strictly match the acoustic, visual, and demographic distribution of LRS3, and shows that five state-of-the-art audio-visual speech recognition models suffer a universal performance collapse under these matched conditions. A fine-grained attribute analysis isolates the drivers of degradation, indicating current systems fail to truly generalise.

Audio-Oscar | 2026-06

Audio-Oscar: A Multi-Agent System for Complex Audio Scene Generation, Orchestration, and Refinement

Shanghai Jiao Tong University (X-LANCE Lab), Shanghai Innovation Institute, Shanghai AI Laboratory, Xiamen University

Audio Generation | Framework | Audio Out | ★ 58

Audio-Oscar is a multi-agent framework that coordinates specialist agents (character and voice design, speech generation, fine-grained timeline planning, model selection, non-speech generation, and audio post-production) to produce long-form, controllable audio from complex scene descriptions. The authors also introduce ASG-Bench for evaluating complex audio scene generation.

CogAudio-LLM | 2026-06

Beyond Semantic Dominance: Cognitive Affective Reasoning and Empathetic Response Alignment in Audio Language Models

Audio, Speech and Language Processing Group (ASLP), Northwestern Polytechnical University

Model and Methods | Model | Audio In | Multilingual | ★ 4

CogAudio-LLM is a cognitive affective reasoning framework for audio language models that counters textual semantic dominance over acoustic nuance. It introduces LIME-440K, a lexically-identical multi-emotion dataset for acoustic-semantic decoupling, and EIPS, a four-step chain-of-thought psychological reasoning mechanism, established via multi-stage training to produce empathetic, emotion-aware responses.

DSFA | 2026-06

Mitigating Proxy-to-Wild Domain Gap in Deepfake Speech

National Taiwan University

Safety | Method | Audio In | English

Proposes Domain-Shift Feature Augmentation (DSFA), which turns deterministic feature statistics into stochastic distributions during fine-tuning to simulate in-the-wild variation and improve the generalization of codec-based deepfake speech countermeasures. Also introduces CoSG ExtEval, a more challenging evaluation set covering 40 unseen generative models and long-form audio.
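The idea of replacing deterministic feature statistics with sampled ones can be illustrated with a small sketch. This is not the paper's formulation, which this entry does not spell out: it assumes a generic scheme in which each utterance's feature mean and standard deviation are redrawn from Gaussians whose width is the spread of those statistics across the batch. The function name and toy data are hypothetical.

```python
import random
import statistics


def perturb_feature_statistics(batch, rng, eps=1e-6):
    """Resample each utterance's feature mean/std from a Gaussian.

    batch: list of utterances, each a list of floats (one feature channel).
    rng:   a random.Random instance, so runs are reproducible.
    """
    means = [statistics.fmean(x) for x in batch]
    stds = [max(statistics.pstdev(x), eps) for x in batch]

    # Assumption: the spread of the statistics across the batch sets the
    # noise scale used when sampling new statistics.
    mean_spread = statistics.pstdev(means)
    std_spread = statistics.pstdev(stds)

    augmented = []
    for x, mu, sigma in zip(batch, means, stds):
        new_mu = rng.gauss(mu, mean_spread)
        new_sigma = max(abs(rng.gauss(sigma, std_spread)), eps)
        # Normalise with the original statistics, then rescale with sampled ones.
        augmented.append([(v - mu) / sigma * new_sigma + new_mu for v in x])
    return augmented


# Toy batch of two "utterances" with very different statistics (hypothetical).
batch = [[0.0, 1.0, 2.0], [10.0, 12.0, 14.0]]
augmented = perturb_feature_statistics(batch, random.Random(0))
```

The transform is affine per utterance, so the shape of each feature trajectory is preserved while its offset and scale vary from step to step, which is what simulates a shifted domain during fine-tuning.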

KIT-IWSLT2026 | 2026-06

KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026

Karlsruhe Institute of Technology (KIT)

Speech Synthesis | TTS System | Audio In | Audio Out | Multilingual

KIT's cross-lingual voice cloning system for the IWSLT 2026 track, built on the multilingual TTS model FishAudio-S2-Pro. It adds language-tag prompting to improve language control and reduce accent leakage, applies reinforcement-learning fine-tuning for task adaptation, and proposes a reference-conditioned lexical matching method to improve pronunciation of domain-specific terms.

VoxCPM2 | 2026-06

VoxCPM2 Technical Report

OpenBMB, ModelBest, Tsinghua University (THUHCSI)

Speech Synthesis | TTS Model | Audio In | Audio Out | Multilingual | ★ 33,829

dots.tts | 2026-06

dots.tts Technical Report

rednote-hilab (RedNote / Xiaohongshu)

Speech Synthesis | TTS Model | Audio In | Audio Out | Multilingual | ★ 931

dots.tts is a 2B-parameter continuous autoregressive TTS foundation model that models speech in a continuous latent space, combining an AudioVAE trained with multiple objectives, full-history conditioning in a flow-matching head for long-range consistency, and reward-free self-corrective post-training. Trained on a large multilingual corpus, it supports zero-shot voice cloning and achieves the best average performance on Seed-TTS-Eval.

BEA-Dialogue+ | 2026-05

Scaling Conversational Hungarian ASR: The BEA-Dialogue+ Corpus

Dataset Resource | Dataset | Audio In | Hungarian

BEA-Dialogue+ is an expanded conversational Hungarian ASR corpus that relaxes the strictly speaker-disjoint split of BEA-Dialogue while preserving separation of the primary speakers, yielding 200 hours of transcribed natural conversation (up from 85). It enables a controlled study of the trade-off between additional training data and speaker overlap, evaluated with Whisper- and FastConformer-based models.

Chatterbox-Flash | 2026-05

Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS

Speech Synthesis | Model | Audio In | Audio Out

Chatterbox-Flash is a zero-shot TTS model created by fine-tuning a pretrained autoregressive TTS decoder into a block-diffusion decoder, enabling parallel token generation within each block while retaining block-by-block streaming. It introduces two inference-time techniques—prior-calibrated scoring and an early-decoding schedule—to counter the long-tail token bias that otherwise degrades parallel decoding quality.

MindVoice | 2026-05

MindVoice: Reconstructing Intelligible Speech from Non-invasive Neural Signals with Pretrained Priors

Multimodal | Model | Audio Out

MindVoice is a neuro-to-speech reconstruction framework that recovers intelligible speech from noisy, spatially-blurred non-invasive neural recordings by leveraging pretrained models to compensate for incomplete semantic and acoustic information. It targets safe, scalable speech brain-computer interfaces, moving past prior methods that produced spectrally-similar but unintelligible output.

SURE | 2026-05

A Unified and Reproducible Experimentation Framework for Speech Understanding

Benchmark | Audio In

SURE is a unified experimentation framework for speech understanding that standardizes prediction formats, normalization, and scoring to make evaluations comparable across paradigms, from conventional pipelines to Speech LLMs. It adds an agent-assisted training-conversion flow that maps papers and code into versioned, runnable training pipelines on matched open-data subsets.

SwanSphere | 2026-05

Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer

Audio Generation | Model | Audio Out

SwanSphere is a unified streaming framework for high-fidelity spatial audio generation from panoramic videos and text prompts. It uses a causal autoregressive diffusion transformer for low-latency streaming synthesis and a Spatial Video-Audio Contrastive (SVAC) learning strategy to align the video encoder with acoustic spatial cues.

UNISON | 2026-05

UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion

Audio Generation | Model | Audio In | Audio Out

UNISON is a latent diffusion framework that unifies speech generation, sound generation, and audio editing in a single set of weights, covering text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound generation, and scene-level/timed editing. It uses layer-wise deep LLM fusion, injecting hidden states from a frozen MLLM into corresponding MM-DiT blocks for depth-matched semantic conditioning.

UniAudio-Token | 2026-05

UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception

Model and Methods | Speech Tokenizer | Audio In

UniAudio-Token augments single-codebook semantic speech tokenizers with general audio perception without sacrificing speech ability, addressing the 'acoustic blindness' of linguistically-focused tokenizers. It introduces Semantic-Acoustic Primitives (SAP) that decompose audio into linguistic content, vocal attributes, and auditory-scene primitives, plus a content-aware Semantic-Acoustic Equilibrium (SAE) gating mechanism.

MiniCPM-o | 2026-04

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

ModelBest (OpenBMB), Tsinghua University

Multimodal | Omni-Modal LLM | Audio In | Audio Out | Multilingual | ★ 25,937

Fun-ASR-Nano | 2025-12

Fun-ASR Technical Report

FunAudioLLM Team, Tongyi Lab, Alibaba Group

Model and Methods | Speech Recognition Model | Audio In | Multilingual | ★ 1,410

Fun-ASR-Nano is an end-to-end LLM-based ASR model from the FunAudioLLM team, composed of a SenseVoice encoder, a Transformer adaptor, a Qwen3-0.6B LLM, and a CTC decoder. Trained on tens of millions of hours of real speech, it supports 31 languages, 7 Chinese dialects, and 26 regional accents, along with lyrics recognition, hotwords, timestamps, and speaker diarization. Streaming inference is accelerated via vLLM (up to 393x realtime).
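To put the "393x realtime" figure in concrete terms, the arithmetic below assumes the usual convention that an N-times-realtime system processes N seconds of audio per wall-clock second; the entry itself does not define the term, and the helper names are illustrative.

```python
def processing_seconds(audio_seconds, speedup):
    """Wall-clock time needed to process audio at a given realtime speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def real_time_factor(speedup):
    """RTF is the inverse of the speedup: compute seconds per audio second."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


one_hour = 3600.0
print(round(processing_seconds(one_hour, 393), 2))  # 9.16
```

Under that assumption, an hour of audio takes roughly nine seconds at the peak quoted rate; sustained throughput would depend on hardware and batch size, which the entry does not state.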

Qwen3-Omni | 2025-09

Qwen3-Omni Technical Report

Qwen Team, Alibaba Group

Model and Methods | Omni-Modal LLM | Audio In | Audio Out | Multilingual | ★ 3,898

Qwen3-Omni is the third-generation omni-modal LLM from Alibaba, scaling up the Thinker-Talker design with stronger multilingual ASR, audio understanding, and real-time speech generation across 100+ input and 30+ output languages.

ACORN | 2025-07

Teaching Physical Awareness to LLMs through Sounds

NIO

Model and Methods | Model | Audio In

ACORN explores and validates the feasibility of teaching LLMs to understand the physical world through sounds.

Audio Flamingo 3 | 2025-07

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

NVIDIA

Model and Methods | Audio LLM | Audio In | Multilingual | ★ 1,154

Audio Flamingo 3 (AF3) is the third generation of NVIDIA's fully-open audio LLM, supporting longer audio context (up to ~10 min), think-then-answer reasoning, and stronger multilingual coverage. Training data, weights, and recipes are all released.

DIFFA | 2025-07

DIFFA: Large Language Diffusion Models Can Listen and Understand

Nankai University (NKU-HLT)

Model and Methods | Model | Audio In | Multilingual | ★ 83

DIFFA explores whether large language diffusion models (rather than autoregressive LLMs) can be adapted to listen to and understand audio, building an audio-conditioned diffusion language model and showing it can match autoregressive counterparts on audio understanding tasks.

OpenS2S | 2025-07

OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model

CASIA

Chatbot | Spoken Dialogue Model | Audio In | Audio Out | Multilingual | ★ 119

OpenS2S is a fully open-source end-to-end empathetic speech-to-speech LLM, releasing training data, training recipe, and model weights, with explicit attention to paralinguistic empathy in spoken dialogue.

Step-Audio 2 | 2025-07

Step-Audio 2 Technical Report

Step-Audio Team, StepFun

Model and Methods | Audio LLM | Audio In | Audio Out | Multilingual | ★ 1,483

Step-Audio 2 is the successor to Step-Audio, scaling the unified speech understanding-and-generation LLM with stronger emotion, paralinguistics, and real-time interaction. It supports both bilingual (Chinese / English) and multilingual end-to-end speech dialogue.

Voxtral | 2025-07

Voxtral

Mistral AI

Speech Recognition | Audio Understanding Model | Audio In | Multilingual

Voxtral is Mistral AI's open audio LLM family (3B and 24B) for speech transcription, multilingual understanding, and Q&A over long-form audio — released with permissive weights and competitive performance against closed-source ASR systems.

CMI-Bench | 2025-06

CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following

Queen Mary University of London

Benchmark | Audio In | ★ 18

This work presents CMI-Bench, a benchmark that evaluates audio-text LLMs on diverse music tasks by reformatting traditional MIR annotations into instruction-following formats. It highlights performance gaps and biases, offering a foundation for improving music-aware LLMs.

PAL | 2025-06

PAL: Probing Audio Encoders via LLMs - A Study of Information Transfer from Audio Encoders to LLMs

CVSSP, PAI @ University of Surrey, UK; MBZUAI, Abu Dhabi

Model and Methods | Model | Audio In | Multilingual | ★ 12

PAL investigates strategies for integrating audio encoders with LLMs, focusing on efficient cross-modal information transfer. The study is guided by hypotheses derived from mechanistic interpretability studies and the operational principles of LLMs.

CosyVoice 3 | 2025-05

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

FunAudioLLM Team, Tongyi Lab, Alibaba Group

Speech Synthesis | TTS Model | Audio In | Audio Out | Multilingual | ★ 22,282

LALM-Temporal-Bench | 2025-05

Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning

Indian Institute of Science (IISc), Bangalore

Benchmark | Audio In | English

An INTERSPEECH 2025 benchmark for evaluating Large Audio-Language Models (LALMs) on temporal reasoning over audio, with an additional analysis of model confidence calibration on these tasks.

MMAR | 2025-05

MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

Shanghai Jiao Tong University

Benchmark | Multilingual

MMAR is a challenging benchmark of 1,000 real-world audio QA triplets designed to evaluate deep, multi-layer reasoning in Audio-Language Models across diverse sound, music, and speech tasks, with hierarchical annotations and Chain-of-Thought rationales to drive progress in audio reasoning research.

Kimi-Audio | 2025-04

Kimi-Audio Technical Report

Moonshot AI

Model and Methods | Audio LLM | Audio In | Audio Out | Multilingual | ★ 4,677

Kimi-Audio is Moonshot AI's open-source audio foundation model unifying speech understanding, audio understanding, and speech generation in a single LLM, trained on ~13M hours of audio with strong performance on ASR, audio captioning, audio QA, and speech dialogue.

Audio Flamingo 2 | 2025-03

Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities

NVIDIA, University of Maryland

Model and Methods | Model | Audio In | English | ★ 1,154

Audio Flamingo 2 (AF2) is the successor to Audio Flamingo, designed for long-audio understanding (up to 5 minutes) and expert reasoning over non-speech sounds and music. The authors also introduce AudioSkills, LongAudio, and LongAudioBench to support training and evaluation.

Audio-Reasoner | 2025-03

Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models

Nanyang Technological University, Skywork AI

Model and Methods | Model | Audio In | Multilingual | ★ 297

Audio-Reasoner is a large audio language model designed for deep reasoning over audio. The authors construct CoTA, a 1.2M-sample chain-of-thought dataset for audio tasks, and fine-tune the model to perform structured reasoning on audio understanding benchmarks.

FireRedTTS | 2025-03

FireRedTTS-1S: An Upgraded Streamable Foundation Text-to-Speech System

FireRed Team, Xiaohongshu

Speech Synthesis | TTS Model | Audio In | Audio Out | Multilingual | ★ 908

FireRedTTS-1S is Xiaohongshu's streamable foundation TTS, improving streaming latency and prosody control over its predecessor with chunk-wise generation suitable for live voice products.

Full-Duplex-Bench | 2025-03

Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities

National Taiwan University, UC Berkeley, MIT

Benchmark | Audio In | Audio Out | English | ★ 235

Full-Duplex-Bench is a benchmark for evaluating full-duplex spoken dialogue models on real-time interaction phenomena such as turn-taking, pauses, interruptions, and backchanneling — capabilities that traditional half-duplex evaluation cannot cover.

Phi-4-Mini | 2025-03

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

Microsoft

Model and Methods | Multimodal Language Model | Audio In | Multilingual

Phi-4-Mini and Phi-4-Multimodal are compact language and multimodal models from Microsoft. Phi-4-Mini is a 3.8B-parameter LLM; the multimodal variant extends it to vision and speech/audio via a Mixture-of-LoRAs design, achieving competitive results while remaining lightweight.

Qwen2.5-Omni | 2025-03

Qwen2.5-Omni Technical Report

Qwen Team, Alibaba Group

Model and Methods | Omni-Modal LLM | Audio In | Audio Out | Multilingual | ★ 4,039

Qwen2.5-Omni is Alibaba's end-to-end omni-modal LLM handling text, image, audio, and video as inputs and producing both text and streaming speech outputs, built on a Thinker-Talker dual-track architecture that decouples reasoning and speech generation.

Audio-FLAN | 2025-02

Audio-FLAN: A Preliminary Release

The Hong Kong University of Science and Technology

Dataset Resource | English | ★ 161

Audio-FLAN is a large-scale instruction-tuning dataset with over 100 million instances across 80 tasks in speech, music, and sound, designed to unify audio understanding and generation for developing generalist audio-language models.

IndexTTS | 2025-02

IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

Bilibili

Speech Synthesis | TTS Model | Audio In | Audio Out | Multilingual | ★ 21,999

IndexTTS is Bilibili's industrial-grade zero-shot TTS system optimised for production scenarios — controllable prosody, low-latency inference, and strong Chinese and English voice cloning from short reference audio.

OSUM | 2025-02

OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia

ASLP@NPU

Model and Methods | Model | Audio In | Multilingual | ★ 494

Large Language Models (LLMs) have made significant progress in various downstream tasks, inspiring the development of Speech Understanding Language Models (SULMs) to enable comprehensive speech-based interactions. However, most advanced SULMs are developed by industry, leveraging large-scale datasets and computational resources that are not readily available to the academic community. Moreover, the lack of transparency in training details creates additional barriers to further innovation. In this study, we present OSUM, an Open Speech Understanding Model designed to explore the potential of training SULMs under constrained academic resources. The OSUM model combines a Whisper encoder with a Qwen2 LLM and supports a wide range of speech tasks, including speech recognition (ASR), speech recognition with timestamps (SRWT), vocal event detection (VED), speech emotion recognition (SER), speaking style recognition (SSR), speaker gender classification (SGC), speaker age prediction (SAP), and speech-to-text chat (STTC). By employing an ASR+X training strategy, OSUM achieves efficient and stable multi-task training by simultaneously optimizing ASR alongside target tasks. Beyond delivering strong performance, OSUM emphasizes transparency by providing openly available data preparation and training methodologies, offering valuable insights and practical guidance for the academic community. By doing so, we aim to accelerate research and innovation in advanced SULM technologies.

OWLS | 2025-02

OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models

Carnegie Mellon University, NVIDIA

Model and Methods | Model | Audio In | Multilingual

OWLS systematically studies neural scaling laws for multilingual speech recognition and translation models, training a suite of models from 0.25B to 18B parameters on up to 360K hours of public speech data across 150+ languages to characterise how performance scales with data, compute, and parameter count.

Step-Audio | 2025-02

Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

Step-Audio Team, StepFun

Model and Methods | Model | Audio In | Audio Out | Multilingual | ★ 34

Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and rap; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks like LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies.

Audio-CoT | 2025-01

Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model

Nanyang Technological University, Singapore

Model and Methods | Model | Audio In | English

Large Audio-Language Models (LALMs) have demonstrated remarkable performance in tasks involving audio perception and understanding, such as speech recognition and audio captioning. However, their reasoning capabilities - critical for solving complex real-world problems - remain underexplored. In this work, we conduct the first exploration into integrating Chain-of-Thought (CoT) reasoning into LALMs to enhance their reasoning ability across auditory modalities. We evaluate representative CoT methods, analyzing their performance in both information extraction and reasoning tasks across sound, music, and speech domains. Our findings reveal that CoT methods significantly improve performance on easy and medium tasks but encounter challenges with hard tasks, where reasoning chains can confuse the model rather than improve accuracy. Additionally, we identify a positive correlation between reasoning path length and accuracy, demonstrating the potential of scaling inference for advanced instruction-following and reasoning. This study not only highlights the promise of CoT in enhancing LALM reasoning capabilities but also identifies key limitations and provides actionable directions for future research.

LUCY | 2025-01

LUCY: Linguistic Understanding and Control Yielding Early Stage of Her

Tencent

Model and Methods | Model | Audio In | Audio Out | English | ★ 61

The film Her features Samantha, a sophisticated AI audio agent who is capable of understanding both linguistic and paralinguistic information in human speech and delivering real-time responses that are natural, informative, and sensitive to emotional subtleties. Taking a step toward a more sophisticated audio agent by building on recent advances in end-to-end (E2E) speech systems, we propose LUCY, an E2E speech model that (1) senses and responds to the user's emotion, (2) delivers responses in a succinct and natural style, and (3) uses external tools to answer real-time inquiries. Experimental results show that LUCY is better at emotion control than peer models, generating emotional responses based on linguistic emotional instructions and responding to paralinguistic emotional cues. LUCY is also able to generate responses in a more natural style, as judged by external language models, without sacrificing much performance on general question answering. Finally, LUCY can leverage function calls to answer questions that are outside its knowledge scope.

MinMo | 2025-01

MinMo: A Multimodal Large Language Model for Seamless Voice Interaction

FunAudioLLM Team, Tongyi Lab, Alibaba Group

Chatbot | Multimodal Large Language Model | Audio In | Audio Out | Multilingual

MinMo is a multimodal large language model with approximately 8 billion parameters, designed for seamless voice interaction. It facilitates real-time, natural, and human-like voice conversations by integrating speech and text processing. Trained on 1.4 million hours of diverse speech data, MinMo supports full-duplex communication, enabling simultaneous two-way interactions between the user and the system. It also offers enhanced instruction-following capabilities, allowing control over speech generation with nuances such as emotions, dialects, speaking rates, and voice mimicry. The model achieves state-of-the-art performance across various benchmarks for voice comprehension and generation while maintaining the capabilities of text-based large language models.

Sayna | 2025-01

Sayna: Voice Infrastructure for Audio LLM Applications

SaynaAI

Model and Methods | Infrastructure | Audio In | Audio Out | Multilingual | ★ 232

Sayna is a real-time voice infrastructure platform for building production voice-enabled LLM agents. It provides a unified API layer for STT/TTS with real-time streaming, multi-provider support, VAD, and voice analytics. Built with Rust and LiveKit, it offers low-latency WebSocket connections and REST endpoints for seamless voice-first experiences. Self-hostable with Docker and Kubernetes support.

UltraEval-Audio | 2025-01

UltraEval-Audio

OpenBMB

Benchmark | Multilingual | ★ 308

ADU-Bench | 2024-12

Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models

Tsinghua University, University of Oxford

Benchmark | Audio In

ADU-Bench is a comprehensive evaluation benchmark designed to assess the open-ended audio dialogue understanding capabilities of Large Audio-Language Models (LALMs). It comprises over 20,000 open-ended audio dialogues across various scenarios, skills, languages, and ambiguity categories, providing a robust framework for evaluating and advancing LALMs in real-world audio dialogue applications.

CosyVoice 2 | 2024-12

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

FunAudioLLM Team, Tongyi Lab, Alibaba Group

Speech Synthesis | TTS Model | Audio In | Audio Out | Multilingual | ★ 22,282

GLM-4-Voice | 2024-12

GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

Zhipu AI, Tsinghua University

Chatbot | Spoken Dialogue Model | Audio In | Audio Out | Bilingual (Chinese and English) | ★ 3,204

GLM-4-Voice is an end-to-end spoken chatbot from Zhipu/Tsinghua that takes speech in and produces speech out directly, supporting low-latency streaming and natural Chinese / English conversation with controllable emotion, pitch, and speaking rate.

MERaLiON-AudioLLM | 2024-12

MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models

I2R, A*STAR, Singapore

Model and Methods | Model | Audio In | Multilingual

TalkArena | 2024-12

TalkArena: Interactive Evaluation of Large Audio Models

Stanford University, SCB 10X

Benchmark | Interactive Benchmarking Tool | Audio In | English | ★ 4

TalkArena is an interactive platform designed to benchmark Large Audio Models (AudioLLMs) through real-world user interactions. Similar to Chatbot Arena for text-based models, TalkArena allows users to input audio prompts and receive text-based responses from various state-of-the-art models, facilitating pairwise comparisons and user preference evaluations. The platform supports models such as GPT-4o, Gemini, Qwen2-Audio, DiVA-Llama 3, and Typhoon-Audio, enabling comprehensive assessments of their performance in natural, conversational settings.

Typhoon2-Audio | 2024-12

Typhoon2-Audio: A Thai Multimodal Language Model for Speech and Text Processing

SCB 10X

Model and Methods | Multimodal Language Model | Audio In | Audio Out | Thai, English | ★ 36

Typhoon2-Audio is a multimodal language model designed for Thai and English speech and text processing. It supports speech/audio input and both speech and text output, integrating components from SALMONN and Llama-Omni architectures. The model is trained on curated datasets to enhance instruction-following abilities and Thai language performance.

Dynamic-SUPERB Phase-2 | 2024-11

Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks

National Taiwan University, University of Texas at Austin, Carnegie Mellon University, Nanyang Technological University, Toyota Technological Institute at Chicago, Université du Québec (INRS-EMT), NVIDIA, ASAPP, Renmin University of China

Benchmark | Evaluation Framework | Audio In | Multilingual | ★ 200

Dynamic-SUPERB Phase-2 is an open and evolving benchmark designed for the comprehensive evaluation of instruction-based universal speech models. Building upon its first generation, this second phase incorporates 125 new tasks contributed collaboratively by the global research community, expanding the benchmark to a total of 180 tasks. It broadens evaluation capabilities by introducing a wide array of novel and diverse tasks, including regression and sequence generation, across speech, music, and environmental audio domains. The benchmark aims to guide the development of universal spoken language models by providing a diverse and comprehensive evaluation platform.

Taiwanese AudioLLM2024-11

Building a Taiwanese Mandarin Spoken Language Model: A First Attempt

National Taiwan University

Model and MethodsModelAudio InAudio OutTaiwanese Mandarin

This technical report presents an initial attempt to develop a spoken large language model (LLM) for Taiwanese Mandarin, tailored for real-time, speech-to-speech interactions in multi-turn conversations. The end-to-end model employs a decoder-only transformer architecture, aiming for seamless interaction with full-duplex capabilities that allow simultaneous speaking and listening. The report details the training process, including data preparation with synthesized dialogues and adjustments for real-time interaction, and introduces a platform to evaluate conversational fluency and response coherence in multi-turn dialogues.

WavChat-Survey2024-11

WavChat: A Survey of Spoken Dialogue Models

Zhejiang University

Survey

DiVA2024-10

Distilling an End-to-End Voice Assistant Without Instruction Training Data

Georgia Tech, Stanford

Model and MethodsModelAudio InAudio Out

DiVA (Distilled Voice Assistant) is an end-to-end voice assistant model that integrates speech and text processing without relying on instruction training data. By utilizing self-supervision from a text-only large language model's responses to transcripts, DiVA generalizes to tasks such as spoken question answering, classification, and translation. Notably, it achieves a 72% user preference win rate compared to state-of-the-art models like Qwen 2 Audio, despite using significantly less training compute.

F5-TTS2024-10

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Shanghai Jiao Tong University

Speech SynthesisTTS ModelAudio InAudio OutMultilingual★ 14,981

F5-TTS is a fully non-autoregressive TTS system based on flow matching with a Diffusion Transformer (DiT). It performs high-fidelity zero-shot voice cloning and runs inference faster than autoregressive codec-LM TTS systems.

MMAU2024-10

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

University of Maryland

BenchmarkEnglish★ 156

SPIRIT LM2024-10

SPIRIT LM: Interleaved Spoken and Written Language Model

Meta

Model and MethodsModelAudio InAudio Out★ 928

SPIRIT LM is a foundational multimodal language model developed by Meta that seamlessly integrates text and speech modalities. By extending a pretrained text language model to the speech domain through continuous training on both text and speech units, SPIRIT LM can process interleaved speech and text sequences. It comes in two versions: BASE, utilizing speech phonetic units (HuBERT), and EXPRESSIVE, which incorporates pitch and style units to model expressivity. The model demonstrates capabilities in tasks such as ASR, TTS, and speech classification, leveraging few-shot learning across modalities.

SpeechEmotionLlama2024-10

Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech

MIT, Meta

Model and MethodsModelAudio In

This paper explores the capability of large language models (LLMs) to understand paralinguistic aspects of speech, such as emotions and speaking styles, without fine-tuning their weights. By training a speech encoder to produce token embeddings that align the LLM's responses to expressive speech prompts with semantically matching text prompts specifying the speaker's emotion, the system effectively conveys both semantic and paralinguistic information to the LLM. Experiments demonstrate that this approach enables LLMs to generate higher quality and more empathetic responses to expressive speech inputs.

SpeechLLM-Survey2024-10

A Survey on Speech Large Language Models

SJTU, AISpeech

Survey

SpeechLM-Survey2024-10

Recent Advances in Speech Language Models: A Survey

CUHK, Tencent

Survey

VoiceBench2024-10

VoiceBench: Benchmarking LLM-Based Voice Assistants

National University of Singapore

BenchmarkAudio In★ 378

VoiceBench is a comprehensive evaluation framework designed to assess the capabilities of LLM-based voice assistants. It evaluates various aspects, including general knowledge, instruction-following abilities, and safety measures, using both synthetic and real spoken instruction data that reflect real-world variations such as speaker characteristics, environmental factors, and content complexities.

ASRCompare2024-09

Comparing Discrete and Continuous Space LLMs for Speech Recognition

Tsinghua University, Tencent AI Lab

Model and MethodsModelAudio In★ 3

This paper investigates discrete and continuous speech representations in Large Language Model (LLM)-based Automatic Speech Recognition (ASR). It organizes these representations by feature continuity and training approach into four categories: supervised and unsupervised for both discrete and continuous types. The study further classifies LLMs by their input and autoregressive feedback into continuous-space and discrete-space models. Using specialized encoders, it compares a Joint-Training-From-Scratch Language Model (JTFS LM) with a pre-trained LLaMA2-7b and examines the effectiveness of each configuration. Notably, the open-sourced system achieves a state-of-the-art Word Error Rate (WER) of 1.69% on LibriSpeech using a HuBERT encoder.
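
For readers unfamiliar with the metric quoted above, WER is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is a generic textbook implementation, not the paper's evaluation script; the function name and the example sentences are made up for illustration, and it does no text normalization.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Before processing reference word i, prev[j] holds the edit distance
    # between the first i-1 reference words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Published WER figures typically apply casing and punctuation normalization before scoring, so numbers from a bare function like this are not directly comparable.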

AudioBERT2024-09

AudioBERT: Audio Knowledge Augmented Language Model

POSTECH, Inha University

Model and MethodsModelAudio In★ 40

AudioBERT is a language model augmented with auditory knowledge to enhance its performance on tasks requiring an understanding of sounds. It employs a retrieval-based approach, utilizing an Auditory Knowledge Span Detector to identify text spans necessitating auditory knowledge. Relevant audio embeddings are retrieved using CLAP (Contrastive Language-Audio Pretraining) and integrated into the language model. This method enables AudioBERT to effectively handle tasks such as animal sound recognition and sound pitch comparison, as demonstrated on the AuditoryBench dataset.
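
The retrieval step described above can be illustrated with a nearest-neighbour lookup by cosine similarity. This is only a toy sketch: the 3-dimensional vectors stand in for real CLAP embeddings, and `retrieve`, `audio_bank`, and the file names are hypothetical, not part of AudioBERT's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, audio_bank, k=1):
    """Return the names of the k audio entries most similar to query_vec."""
    ranked = sorted(
        audio_bank.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Toy embeddings (assumption: a real system would use CLAP audio embeddings).
audio_bank = {
    "dog_bark.wav": [0.9, 0.1, 0.0],
    "cat_meow.wav": [0.1, 0.9, 0.0],
    "door_slam.wav": [0.0, 0.1, 0.9],
}
span_vec = [0.8, 0.2, 0.1]  # pretend text embedding of a detected span
top = retrieve(span_vec, audio_bank)
```

In the described system the retrieved embedding is then injected into the language model; that integration step is not shown here.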

DeSTA22024-09

Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data

National Taiwan University, NVIDIA

Model and MethodsModelAudio In★ 127

DeSTA2 is a speech-language model that integrates pre-trained speech models with large language models to interpret and generate comprehensive natural language descriptions. It enhances the model's speech comprehension capabilities without extensive speech instruction-tuning, thereby preserving the inherent language understanding of the text-based LLM. DeSTA2 demonstrates impressive performance on benchmarks like Dynamic-SUPERB and AIR-Bench-Chat, showcasing its ability to follow complex instructions derived from LLMs, such as specific output formatting and chain-of-thought reasoning.

EMOVA2024-09

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

HKUST

MultimodalModelAudio InAudio OutEnglish

LLaMA-Omni2024-09

LLaMA-Omni: Seamless Speech Interaction with Large Language Models

Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)

Model and MethodsModelAudio InAudio Out★ 3,140

LLaMA-Omni is a low-latency, high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct. It enables seamless speech interactions with large language models, simultaneously generating both text and speech responses based on speech instructions. The model integrates a pretrained speech encoder, a speech adaptor, an LLM, and a streaming speech decoder, eliminating the need for intermediate speech transcription. Experimental results demonstrate that LLaMA-Omni provides superior responses in both content and style, with response latency as low as 226ms. Training LLaMA-Omni requires less than 3 days on 4 GPUs, facilitating efficient development of speech-language models.

MaskGCT2024-09

MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer

The Chinese University of Hong Kong (Shenzhen), Amphion

Speech SynthesisTTS ModelAudio InAudio OutMultilingual★ 9,958

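MaskGCT is a fully non-autoregressive TTS system that predicts masked tokens with transformers in two stages: text to semantic tokens, then semantic tokens to acoustic codec tokens, with no explicit text-speech alignment or phone-level duration predictor. It is released as part of the Amphion toolkit and shows strong zero-shot voice cloning.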

MoWE-Audio2024-09

MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders

A*STAR

Model and MethodsModelAudio In

MoWE-Audio introduces a novel approach to enhance Audio Large Language Models (AudioLLMs) by incorporating a mixture of 'weak' encoders. This method supplements a base encoder with a pool of lightweight encoders, selectively activated based on the audio input, to improve feature extraction without significantly increasing model size. Empirical results demonstrate that MoWE effectively enhances multi-task performance, broadening the applicability of AudioLLMs to more diverse audio tasks.
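
One way to picture input-dependent activation of lightweight encoders is a top-k gate over a pool, with the base encoder always on. The sketch below is an assumption-laden toy, not the MoWE-Audio implementation: the softmax router, the top-k rule, the concatenation of base and mixed features, and every function name here are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def encode(frame, base_encoder, weak_encoders, router, top_k=1):
    """Concatenate base features with a gate-weighted mix of top_k weak encoders.

    Only the selected weak encoders are evaluated, so the extra compute
    scales with top_k rather than with the size of the pool.
    """
    gates = softmax(router(frame))
    chosen = sorted(range(len(weak_encoders)), key=lambda i: gates[i], reverse=True)[:top_k]
    weight_sum = sum(gates[i] for i in chosen)
    weak_outputs = [weak_encoders[i](frame) for i in chosen]
    mixed = [
        sum(gates[i] / weight_sum * out[d] for i, out in zip(chosen, weak_outputs))
        for d in range(len(weak_outputs[0]))
    ]
    return base_encoder(frame) + mixed, chosen

# Toy stand-ins for real encoders and router (illustrative only).
base_encoder = lambda f: [sum(f) / len(f)]
weak_encoders = [lambda f: [max(f)], lambda f: [min(f)]]
router = lambda f: [f[0], -f[0]]

features, chosen = encode([2.0, -1.0, 0.5], base_encoder, weak_encoders, router)
```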

Moshi2024-09

Moshi: a speech-text foundation model for real-time dialogue

Kyutai

Model and MethodsModelAudio InAudio Out★ 10,625

Moshi is a speech-text foundation model and full-duplex spoken dialogue framework that addresses limitations in current spoken dialogue systems by integrating speech recognition and generation into a single model. It enables real-time, natural conversations by reducing latency and preserving non-linguistic information such as emotion and accent. Moshi models multiple audio streams in parallel, allowing for seamless handling of overlapping speech and interruptions, thereby enhancing the naturalness of human-computer interactions.

SALMon2024-09

A Suite for Acoustic Language Model Evaluation

Hebrew University of Jerusalem

BenchmarkAudio In

SALMon is a novel evaluation suite encompassing background noise, emotion, speaker identity, and room impulse response. It evaluates both the consistency of the inspected element and its alignment with the spoken text, providing a comprehensive benchmark for speech language models.

Ultravox2024-09

Ultravox: A Fast Multimodal LLM for Real-Time Voice

Fixie.ai

Model and MethodsModelAudio InMultilingual★ 4,476

Ultravox is an open-source multimodal large language model (LLM) designed for real-time voice interactions. It extends any open-weight LLM with a multimodal projector that converts audio directly into the high-dimensional space used by LLMs, eliminating the need for a separate Automatic Speech Recognition (ASR) stage. This direct coupling allows Ultravox to respond more quickly than systems that combine separate ASR and LLM components. The current version (v0.4) supports multiple languages, including Arabic, Chinese, Dutch, English, French, German, Hindi, Italian, Japanese, Portuguese, Russian, Spanish, Swedish, Turkish, and Ukrainian. Ultravox is capable of understanding both text and human speech, making it suitable for applications such as voice agents, speech-to-speech translation, and analysis of spoken audio.

Mini-Omni2024-08

Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming

Tsinghua University

Model and MethodsModelAudio InAudio Out★ 3,562

Mini-Omni is an open-source multimodal large language model designed for real-time speech interaction. It features end-to-end speech input and streaming audio output capabilities, enabling seamless voice conversations without the need for separate ASR or TTS systems. The model employs a text-instructed speech generation method and batch-parallel strategies during inference to enhance performance. Additionally, the VoiceAssistant-400K dataset is introduced to fine-tune models optimized for speech output. Mini-Omni aims to facilitate real-time human-computer interaction by integrating speech processing directly into the language model framework.

MooER2024-08

MooER: LLM-based Speech Recognition and Translation Models from Moore Threads

Moore Threads

Model and MethodsModelAudio InMultilingual★ 219

MooER is a Large Language Model (LLM)-based system developed by Moore Threads for automatic speech recognition (ASR) and automatic speech translation (AST). Trained on a 5,000-hour pseudo-labeled dataset comprising open-source and self-collected speech data, MooER achieves performance comparable to other open-source models trained on significantly larger datasets. Notably, it attains a BLEU score of 25.2 on the CoVoST-2 Zh2en test set. The model architecture integrates an encoder, adapter, and decoder (LLM), and training is optimized with DeepSpeed, data loader acceleration, gradient checkpointing, gradient accumulation, and BF16. MooER supports multiple languages and is designed to facilitate end-to-end speech interaction, translation, and recognition tasks.

MuChoMusic2024-08

MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models

UPF, QMUL, UMG

BenchmarkAudio In★ 46

Typhoon-Audio2024-08

Typhoon-Audio: Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models

SCB 10X

Model and MethodsMultimodal Language ModelAudio InThai, English

Typhoon-Audio is a multimodal language model supporting speech/audio input and text output. Based on the SALMONN architecture, it is trained on curated datasets to enhance general instruction-following abilities and performance in the Thai language, addressing challenges in low-resource language processing.

VITA2024-08

VITA: Towards Open-Source Interactive Omni Multimodal LLM

Tencent Youtu Lab, Nanjing University, Xiamen University

MultimodalOmni-Modal LLMAudio InAudio OutMultilingual★ 2,519

VITA is one of the first fully open-source omni-modal LLMs supporting interactive video, image, audio, and text inputs. It offers non-awakening interaction (responding without a wake word) and audio interrupt, establishing a recipe later adopted by many Chinese omni models.

AudioEntailment2024-07

Audio Entailment: Assessing Deductive Reasoning for Audio Understanding

CMU, Microsoft

BenchmarkAudio In★ 17

CompA2024-07

CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models

University of Maryland, College Park; Adobe, USA; NVIDIA, Bangalore, India

Model and MethodsModelAudio In★ 23

CompA introduces two expert-annotated benchmarks, CompA-order and CompA-attribute, designed to evaluate compositional reasoning in audio-language models (ALMs). CompA-order assesses an ALM's understanding of the sequence of acoustic events, while CompA-attribute evaluates attribute-binding of these events. The study reveals that current ALMs perform marginally better than random chance in compositional reasoning tasks. To address this, the authors propose CompA-CLAP, a fine-tuned model employing a novel learning method with composition-aware hard negatives and a modular contrastive loss, enhancing fine-grained compositional understanding without relying on extensive compositional audio datasets. CompA-CLAP demonstrates significant improvements over baseline models on the CompA benchmark, indicating its superior compositional reasoning capabilities.

Decoder-only LLMs for STT2024-07

Investigating Decoder-only Large Language Models for Speech-to-text Translation

NTU-Taiwan, Meta

Model and MethodsResearchAudio InMultilingual

This paper investigates decoder-only large language models for speech-to-text translation, analyzing their effectiveness on multilingual translation tasks.

FunAudioLLM2024-07

FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

Alibaba

Model and MethodsModelAudio InAudio OutMultilingual

FunAudioLLM is a family of foundation models developed by Alibaba for voice understanding and generation, facilitating natural interaction between humans and large language models. It comprises SenseVoice for speech understanding and CosyVoice for speech generation, and supports multilingual audio input and output.

GAMA2024-07

GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities

University of Maryland, College Park

Model and MethodsModelAudio In★ 153

GAMA is a General-purpose Large Audio-Language Model (LALM) designed to enhance audio understanding and complex reasoning abilities. It integrates a Large Language Model (LLM) with multiple types of audio representations, including features from a custom Audio Q-Former and a multi-layer aggregator that processes features from various layers of an audio encoder. Fine-tuned on a large-scale audio-language dataset, GAMA is equipped with advanced audio understanding capabilities. Additionally, it employs CompA-R, a synthetically generated instruction-tuning dataset, to endow the model with complex reasoning abilities, particularly for open-ended audio question-answering tasks. GAMA outperforms existing LALMs across diverse audio understanding tasks, demonstrating superior performance in both automated and expert human evaluations.

LLaST2024-07

LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models

The Chinese University of Hong Kong, Shenzhen; Shanghai AI Laboratory; Nara Institute of Science and Technology, Japan

Model and MethodsModelAudio InMultilingual★ 26

LLaST is a framework designed to enhance end-to-end speech-to-text translation systems by leveraging Large Language Models (LLMs). It addresses limitations in traditional E2E ST models through innovative architecture design and optimization techniques, including ASR-augmented training, multilingual data augmentation, and dual-LoRA optimization. Evaluations on the CoVoST-2 benchmark demonstrate LLaST's superior performance and scalability, making it a strong baseline for future speech translation research.

Qwen2-Audio2024-07

Qwen2-Audio Technical Report

Alibaba Group

Model and MethodsModelAudio InMultilingual★ 2,088

Qwen2-Audio is a large-scale audio-language model developed by Alibaba Group, capable of accepting various audio signal inputs and performing audio analysis or generating textual responses based on speech instructions. It introduces two distinct audio interaction modes: voice chat, allowing users to engage in voice interactions without text input, and audio analysis, enabling users to provide audio and text instructions for analysis during interaction. The model has been enhanced with instruction-following capabilities and optimized using Direct Preference Optimization (DPO) to improve performance in terms of factuality and adherence to desired behavior. Evaluations indicate that Qwen2-Audio outperforms previous state-of-the-art models in audio-centric instruction-following tasks.

SenseVoice2024-07

FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

Tongyi SpeechTeam, Alibaba Group

Speech RecognitionSpeech Understanding ModelAudio InMultilingual★ 8,895

SenseVoice is a multilingual speech understanding foundation model (part of the FunAudioLLM report) that jointly performs speech recognition, spoken emotion recognition, and audio event detection. The non-autoregressive SenseVoice-Small (234M params) supports 50+ languages and is reported to run more than 5x faster than Whisper-Small and roughly 15x faster than Whisper-Large at comparable accuracy.

Stable Audio Open2024-07

Stable Audio Open

Stability AI

Audio GenerationAudio Generation ModelAudio OutEnglish

Stable Audio Open is Stability AI's open-weight text-to-audio diffusion model for generating short stereo audio clips (up to ~47s) including sound effects and music samples, trained only on Creative Commons audio.

Audio Hallucination2024-06

Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models

NTU-Taiwan

StudyResearchAudio In★ 34

AudioBench2024-06

AudioBench: A Universal Benchmark for Audio Large Language Models

A*STAR, Singapore

BenchmarkAudio In★ 319

AudioBench is a universal benchmark designed to evaluate Audio Large Language Models (AudioLLMs). It encompasses 8 distinct tasks and 26 datasets, including 7 newly proposed datasets, targeting speech understanding, audio scene understanding, and voice understanding (paralinguistic).

CodecFake2024-06

CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems

National Taiwan University

SafetyEnglish★ 22

DeSTA2024-06

DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment

NTU-Taiwan, NVIDIA

Model and MethodsModelAudio InMultilingual★ 1

DeSTA is a model that enhances speech language models by aligning descriptive speech and text, improving the model's ability to understand and generate accurate transcriptions across multiple languages.

E2 TTS2024-06

E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

Microsoft

Speech SynthesisTTS ModelAudio InAudio OutEnglish

E2 TTS removes nearly all of the usual TTS pipeline complexity (no phoneme aligner, no duration predictor, no explicit grapheme-to-phoneme model) and trains a flow-matching transformer end-to-end on text and audio. It was a foundational influence on F5-TTS and follow-on systems.

MusiLingo2024-06

MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response

University of Pennsylvania

Model and MethodsModelAudio InEnglish★ 50

MusiLingo is a novel system that bridges music audio and language by aligning MERT and a frozen LLM via a single projection layer, enabling high-quality music captioning and question answering, supported by the newly introduced MusicInstruct dataset.

SD-Eval2024-06

SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

CUHK, Bytedance

BenchmarkAudio In★ 57

Speech ReaLLM2024-06

Speech ReaLLM – Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time