Conversational AI Reading Group

Every Thursday from 11am to 12pm EDT

Join us via Zoom

Follow us on X (Twitter) @convAI2024
Follow us on Bluesky @convai-rg.bsky.social
Visit our YouTube channel for past recordings
Join the Conversational AI Slack to discuss with the community: here.
Contact us here if there are any issues with the invite link.
Sign up here to receive email communications about the reading group

Upcoming Talks

[Feb 27th, 2025]

  • Open Whisper-Style Speech Models: Transparency, Scalability, and Advancing Explainability
    Presenter: Shinji Watanabe, Carnegie Mellon University

    Shinji Watanabe is an Associate Professor at Carnegie Mellon University, Pittsburgh, PA. He received his B.S., M.S., and Ph.D. (Dr. Eng.) degrees from Waseda University, Tokyo, Japan. He was a research scientist at NTT Communication Science Laboratories, Kyoto, Japan, from 2001 to 2011, a visiting scholar at the Georgia Institute of Technology, Atlanta, GA, in 2009, and a senior principal research scientist at Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA, from 2012 to 2017. Before Carnegie Mellon University, he was an associate research professor at Johns Hopkins University, Baltimore, MD, USA, from 2017 to 2020. His research interests include automatic speech recognition, speech enhancement, spoken language understanding, and machine learning for speech and language processing. He has published over 500 papers in peer-reviewed journals and conferences and received several awards, including the best paper award from ISCA Interspeech in 2024. He is a Senior Area Editor of the IEEE Transactions on Audio, Speech, and Language Processing. He has served on several technical committees, including the APSIPA Speech, Language, and Audio Technical Committee (SLA), the IEEE Signal Processing Society Speech and Language Technical Committee (SLTC), which he chaired, and the Machine Learning for Signal Processing Technical Committee (MLSP). He is an IEEE and ISCA Fellow.


    Speech foundation models are transforming research by unifying speech-processing tasks through scaling data, model size, and task diversity. This shift has divided research roles, with large tech companies building foundational models and smaller entities focusing on refinement and analysis, raising concerns about explainability due to limited transparency. To address this, our group has developed Open Whisper-style Speech Models (OWSM) at Carnegie Mellon University, replicating OpenAI Whisper-style training using public data and our open-source toolkit ESPnet. Our models exhibit explainable behaviors due to their transparent development. We also investigate scaling laws and emergent capabilities in speech foundation models by studying model and data size impacts within the OWSM suite. This presentation will discuss these advancements and the research challenges they present to the speech and audio community, emphasizing open collaboration and transparency to enhance accessibility and interpretability in speech processing technologies.

[March 13th, 2025]

  • TBA
    Presenter: Alexandre Defossez, Kyutai

    TBA.

Past Talks, Winter 2025

[Feb 20th, 2025]

  • Singing Voice Synthesis: Data Curation, Modeling, and Evaluation
    Presenter: Jiatong Shi, Carnegie Mellon University

    Jiatong Shi is a Ph.D. candidate in the Language Technologies Institute at Carnegie Mellon University, advised by Dr. Shinji Watanabe. His research focuses on speech representation learning and its applications across various speech processing tasks. He has authored over 70 publications in leading speech and machine learning conferences and has received multiple prestigious honors, including the Best Paper Award at ISCA Interspeech 2024, the Best Paper Award at EMNLP 2024, and the CMU Presidential Fellowship. Jiatong is also a strong advocate for open-source research, making significant contributions to major toolkits such as ESPnet, Muskits, and VERSA. He has played a key role in curating and releasing influential open datasets, including ML-SUPERB, SingMOS, KiSing, and several endangered language corpora, which have driven advancements in speech and music processing.


    Singing voice synthesis (SVS) has emerged as a rapidly evolving research area. However, achieving high-quality and expressive singing synthesis remains a challenging task, requiring large-scale curated datasets, effective modeling strategies, and robust evaluation frameworks. This talk will provide a comprehensive overview of the key components of SVS research, covering data curation, model development, and evaluation methodologies. We will introduce ACE-Opencpop and ACE-KiSing, two large-scale singing voice datasets designed to support diverse SVS applications. On the modeling side, we will explore TokSing, a discrete token-based SVS approach, and SingOMD, which leverages multi-resolution discrete representations to enhance synthesis quality. In terms of evaluation, we will discuss the Interspeech 2024 Challenge on Speech Processing Using Discrete Units, the SingMOS dataset for MOS prediction, and VERSA, a versatile evaluation toolkit for speech, audio, and music processing. By bridging data, modeling, and evaluation, this talk aims to provide insights into the current advancements and challenges in SVS research, highlighting emerging directions for improving naturalness, expressiveness, and overall synthesis quality.

[Feb 13th, 2025]

  • Scalable and Efficient Speech Enhancement
    Presenter: Minje Kim, University of Illinois at Urbana-Champaign

    Minje Kim is an Associate Professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign and a Visiting Academic at Amazon Lab126. Prior to that, he was an Associate Professor at Indiana University. He earned his Ph.D. in CS from UIUC after working as a researcher at ETRI, a national lab in Korea (2006–2011). His research focuses on developing machine learning models for speech and audio problems. He has been recognized with various awards, including the NSF CAREER Award (2021), the Indiana University Trustees Teaching Award (2021), and the IEEE SPS Best Paper Award (2020), among others. He is the Chair of the IEEE SPS Audio and Acoustic Signal Processing Technical Committee, a Senior Area Editor for IEEE SPL and IEEE/ACM TASLP, an Associate Editor for EURASIP JASMP, and a Consulting Associate Editor for IEEE OJSP. He is also on the program committees of many machine learning and audio/speech conferences, including NeurIPS, ICLR, AAAI, ICASSP, Interspeech, and ISMIR. He holds over 50 patents as an inventor.


    Recent advances in single-channel speech enhancement have yielded substantial performance gains but often at the cost of prohibitive model sizes or inference complexity. In this talk, I present a three-part framework addressing these challenges by unifying speaker-agnostic compression, personalized modeling, and scalable architectures. First, we discuss speaker-agnostic model compression via low-bit quantization. Drawing on Bitwise Neural Networks (BNN) and Incremental Binarization on RNNs, we show how feedforward and recurrent networks can be quantized to binary parameters, thereby drastically reducing computational complexity with only a minor degradation in enhancement quality. We also explore discriminative hashing approaches (e.g., Boosted Locality Sensitive Hashing) to represent audio spectra as highly compact binary codes—enabling efficient lookups for source separation in resource-constrained devices. Next, we shift toward personalized speech enhancement—a speaker-aware model compression paradigm. Here, zero-shot approaches that exploit speaker embeddings or knowledge distillation serve to adapt compact models on-the-fly without requiring any clean speech data from the user. By selecting or distilling knowledge from large teacher networks, these personalized systems can adapt to new speakers or recurring noise conditions at test time, thus bridging the performance gap between large, general-purpose models and lightweight, specialized models. Finally, we introduce scalable and efficient enhancement architectures. Building on blockwise optimization (BLOOM-Net) and modified cold diffusion, we design flexible models whose internal blocks or iterative steps can be selectively engaged based on real-time resource constraints. This scalability lets the same network accommodate diverse deployment scenarios—ranging from low-power embedded devices to higher-capacity servers—while maintaining strong enhancement performance.
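
    To make the low-bit quantization idea concrete, here is a minimal sketch (PyTorch, illustrative only) of binarizing a linear layer's weights with a straight-through estimator. It is a generic example in the spirit of bitwise/binarized networks, not the speaker's implementation, and the layer sizes are arbitrary assumptions.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class BinarizedLinear(nn.Linear):
            """Linear layer whose forward pass uses {-1, +1} weights (illustrative sketch)."""
            def forward(self, x):
                # Scale the weight signs by the mean magnitude to limit quantization error.
                scale = self.weight.abs().mean()
                w_bin = torch.sign(self.weight) * scale
                # Straight-through estimator: the forward pass sees w_bin,
                # while gradients flow to the underlying real-valued weights.
                w_ste = self.weight + (w_bin - self.weight).detach()
                return F.linear(x, w_ste, self.bias)

        # Usage: a drop-in replacement for nn.Linear inside an enhancement network,
        # e.g., mapping 257-bin magnitude spectra to enhancement masks.
        layer = BinarizedLinear(257, 257)
        out = layer(torch.randn(8, 257))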

[Feb 6th, 2025]

  • Teaching Foundation Models New Skills: Insights and Experiences
    Presenter: Hung-yi Lee, National Taiwan University

    Hung-yi Lee is a professor of the Department of Electrical Engineering at National Taiwan University (NTU), with a joint appointment at the Department of Computer Science & Information Engineering. His recent research focuses on developing technology that can reduce the requirement of annotated data for speech processing (including voice conversion and speech recognition) and natural language processing (including abstractive summarization and question answering). He won the Salesforce Research Deep Learning Grant in 2019, the AWS ML Research Award in 2020, the Outstanding Young Engineer Award from The Chinese Institute of Electrical Engineering in 2018, the Young Scholar Innovation Award from Foundation for the Advancement of Outstanding Scholarship in 2019, Ta-You Wu Memorial Award from Ministry of Science and Technology of Taiwan in 2019, and The 59th Ten Outstanding Young Person Award in Science and Technology Research & Development of Taiwan.


    In today's landscape of natural language processing (NLP) and speech processing, developing applications often begins with fine-tuning a foundation model. However, teaching a foundation model new skills is not as straightforward as it seems. Despite the sophistication of current models, introducing new capabilities can often impair their original functions, a phenomenon known as catastrophic forgetting. While experience replay is a common solution, the lack of open-source training data for models like LLaMA poses challenges for continuous training. This talk will delve into recent research on fine-tuning language models, including their spoken counterparts, focusing on preserving their initial capabilities. This talk will also share some benchmarks related to the ongoing fine-tuning of foundation models.
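
    As a concrete illustration of experience replay, here is a minimal, hypothetical sketch that mixes a small fraction of old-task data into every fine-tuning batch so the model keeps seeing its original distribution. It is a generic example; the function name and ratios are assumptions and do not reflect any specific system from the talk.

        import random

        def replay_batches(new_data, old_data, batch_size=16, replay_ratio=0.25):
            """Yield batches mixing new-task examples with replayed old-task examples."""
            n_replay = int(batch_size * replay_ratio)
            n_new = batch_size - n_replay
            random.shuffle(new_data)
            for i in range(0, len(new_data), n_new):
                batch = new_data[i:i + n_new]
                batch += random.sample(old_data, k=min(n_replay, len(old_data)))
                random.shuffle(batch)
                yield batch

        # Usage: run the usual training step on each mixed batch, e.g.
        # for batch in replay_batches(task_b_examples, task_a_examples): train_step(batch)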

[Jan 23rd, 2025]

  • Foundational Speech Models and Their Efficient Training with NVIDIA NeMo
    Presenter: Piotr Żelasko, NVIDIA

    Piotr Żelasko is a principal research scientist at NVIDIA NeMo. He received his PhD at AGH-UST in Cracow, Poland, and held a research scientist position at JHU’s Center for Language and Speech Processing. Piotr is a co-author of the next-generation Kaldi framework known as k2. His current interests are multi-task, multilingual, and multimodal models involving speech, and training and inference optimization.


    This talk gives an overview of recent developments by the NVIDIA NeMo team. We introduce Canary-1B, an open state-of-the-art speech recognition and translation model, and discuss the details of its training: synthetic data generation and an efficient dataloading approach that scales to arbitrarily sized datasets. We demonstrate how Canary-1B training was further optimized to decrease the required number of GPUs by 4x with 2D bucketing and batch size optimizer techniques. Finally, we provide a brief overview of the SALM and BESTOW architectures for SpeechLLMs and highlight our progress on efficient multimodal SpeechLLM training (EMMETT).

[Jan 16th, 2025]

  • Improving Universal Access to Modern Speech Technology
    Presenter: Martijn Bartelds, Stanford University

    Martijn Bartelds is a Postdoctoral Scholar at Stanford University, advised by Dan Jurafsky. His research focuses on multilingual speech and language processing, with a particular interest in understanding where language variety and dialect information is encoded in neural speech models, benchmarking, and model training. He received his PhD with the highest distinction from the University of Groningen, where his thesis was nominated for the university's best thesis award. He also received a prestigious NWO Rubicon fellowship and was a visiting researcher at Delft University of Technology and the University of Pennsylvania.


    State-of-the-art speech recognition systems do not work well for many languages, limiting the digital participation of many speakers worldwide. To address this challenge, we need both better ways to reliably measure speech model performance, and new algorithms for bridging this performance gap. In this talk, I propose solutions to both these problems, beginning with ML-SUPERB 2.0, a new benchmark to evaluate multilingual speech models on language identification and automatic speech recognition (ASR) across languages and datasets. Indeed, our benchmark reveals large differences in ASR performance between languages, regardless of the modeling approach used. To mitigate this, I introduce a new model training objective based on distributionally robust optimization. Our new method reduces ASR performance differences between languages by minimizing the training loss of the worst-performing language. This work paves the way for more equal access to speech technology for speakers of all languages.
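
    To illustrate the robust training objective, here is a minimal sketch (PyTorch, illustrative only) of a group-DRO-style loss that optimizes the worst-performing language in a batch instead of the average. It is a generic sketch under that assumption, not the exact objective proposed in the talk.

        import torch

        def worst_language_loss(per_utterance_loss, language_ids):
            """per_utterance_loss: (N,) losses; language_ids: (N,) integer language labels."""
            group_losses = [per_utterance_loss[language_ids == lang].mean()
                            for lang in language_ids.unique()]
            # Minimizing the maximum over languages pushes down the worst-case training loss.
            return torch.stack(group_losses).max()

        # Usage (hypothetical names): loss = worst_language_loss(ctc_loss_per_utt, lang_ids)
        #                             loss.backward()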

[Jan 9th, 2025]

  • Neural Audio Codecs in the Era of Speech LMs
    Presenter: Haibin Wu, Microsoft

    Haibin Wu is a senior researcher at Microsoft, focusing on speech processing. He completed his Ph.D. at National Taiwan University under Prof. Hung-yi Lee. He is a recipient of the Google PhD Fellowship, awarded to only 75 scholars worldwide every year. Haibin has published more than 20 first-author papers in top conferences and journals like ICASSP, Interspeech, TASLP, ACL, ASRU, and SLT. He is also a key contributor to S3prl, an open-source speech toolkit with 2.2k GitHub stars. He gained industry experience through internships at Microsoft, Meta, Amazon, and Tencent, working on speech generation, enhancement, and model compression. Haibin also conducted research as a visiting student at Tsinghua University and the Chinese University of Hong Kong. In addition, Haibin co-organizes the SUPERB and Codec-SUPERB challenges, helping set benchmarks for speech SSL and codec model evaluation.


    Neural audio codecs (NACs) have gained significant attention as essential technologies for audio compression and as foundational components for speech language models. In the era of speech LMs, there are both challenges and opportunities in the codec domain. This talk covers three aspects of NACs: modeling, evaluation, and security. It introduces TS3-Codec, a Transformer-based Simple Streaming Single Codec that offers key benefits, including streaming capability, low computational demands, low bitrate, and a single-codebook design, all while delivering high audio quality. It also presents Codec-SUPERB, the first benchmark designed to evaluate codec models in terms of reconstruction quality from both signal-level and application-level perspectives. Finally, it presents CodecFake, the first deepfake audio dataset based on codecs, which equips models to effectively counter codec-based speech generation systems.

Past Talks, Fall 2024

[Dec 19th, 2024]

  • Discrete Audio Tokens for Multimodal LLMs
    Presenter: Mirco Ravanelli, Concordia University - Mila

    Mirco Ravanelli received his Ph.D. (with cum laude distinction) from the University of Trento, Trento, Italy, in December 2017. He is currently an Assistant Professor at Concordia University, Montreal, QC, Canada, an Adjunct Professor at the Université de Montréal, and a Mila Associate Member. He is the founder and leader of the SpeechBrain project, which aims to build an open-source toolkit for conversational AI and speech processing. He has authored or co-authored more than 80 papers; his research interests include deep learning and conversational AI. He is also an active member of the speech and machine learning communities.


    Discrete audio tokens have recently gained considerable attention for their potential to connect audio and language processing, enabling the creation of modern multimodal large language models. Ideal audio tokens must effectively preserve phonetic and semantic content along with paralinguistic information, speaker identity, and other details. While several types of audio tokens have been recently proposed, identifying the optimal tokenizer for various tasks is challenging due to the inconsistent evaluation settings in existing studies. To address this gap, we release the Discrete Audio and Speech Benchmark (DASB), a comprehensive leaderboard for benchmarking discrete audio tokens across a wide range of discriminative tasks, including speech recognition, speaker identification and verification, emotion recognition, keyword spotting, and intent classification, as well as generative tasks such as speech enhancement, separation, and text-to-speech. Our results show that, on average, semantic tokens outperform compression tokens across most discriminative and generative tasks. However, the performance gap between semantic tokens and standard continuous representations remains substantial, highlighting the need for further research in this field.

[Dec 5th, 2024]

  • Posthoc Explanations for Audio Models
    Presenter: Cem Subakan, Université Laval - Mila

    Cem Subakan is an assistant professor in the computer science department at Université Laval, an affiliate assistant professor at Concordia University, and an associate academic member at Mila. His research is on machine learning for speech and audio, recently focusing more on explainable machine learning. He co-organized the workshop on explainable AI for speech and audio at ICASSP 2024 and will be a general chair of IEEE MLSP 2025.


    He will discuss his recent work on generating explanations for audio models. While deep learning models excel at achieving high performance, they often function as black boxes, offering little transparency into their decision-making processes. His aim in this line of work is to develop methods that produce listenable explanations for these black-box audio models without compromising their original performance. Through several metrics, he demonstrates that the explanations generated by his approach remain faithful to the original model and are both listenable and understandable.

[Nov 21st, 2024]

  • Parameter Averaging Is All You Need to Prevent Forgetting
    Presenter: Peter Plantinga, McGill University

    Peter Plantinga is a Postdoctoral Researcher at McGill University’s Department of Neurology and Neurosurgery, where his research leverages speech and audio data to develop biomarkers for neurodegenerative diseases. With a long-standing passion for applying AI to assistive technologies, Peter has published extensively on enhancing speech intelligibility in noisy environments for both human listeners and automated systems. He is a core developer of the open-source SpeechBrain toolkit, widely used in the speech processing and conversational AI communities, and previously led speech AI projects at JPMorganChase’s Machine Learning Center of Excellence, contributing to several patents in conversational AI technologies. Peter’s current work sits at the intersection of neuroscience and AI, aiming to advance the understanding and treatment of different neurological disorders through innovations in interpretable machine learning for voice analysis.


    Continual learning in end-to-end automatic speech recognition (E2E-ASR) often suffers from catastrophic forgetting, where fine-tuning leads to significant performance degradation on previously seen data. While adapters offer a way to switch between fine-tuned models, they still underperform in unseen domains—a challenge when the input domain is unknown. We propose a method that reduces forgetting to just 3.4%, significantly outperforming fine-tuning strategies like LoRA, which exhibits a 49% forgetting rate. By linearly interpolating the parameters of multiple models fine-tuned from the same generalist model, we achieve a unified model that excels across diverse datasets. Moreover, this model can be iteratively fine-tuned and averaged while maintaining low forgetting rates. Our experiments demonstrate the robustness of this approach across various datasets and models, presenting a promising solution for continual learning in E2E-ASR.
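
    As a concrete picture of the averaging step, here is a minimal sketch (PyTorch, illustrative only) that linearly interpolates the parameters of several models fine-tuned from the same generalist checkpoint. Function and variable names are assumptions, not the authors' code.

        import torch

        def average_state_dicts(state_dicts, weights=None):
            """Weighted average of parameter dicts from models sharing one architecture."""
            weights = weights or [1.0 / len(state_dicts)] * len(state_dicts)
            averaged = {}
            for key in state_dicts[0]:
                averaged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
            return averaged

        # Usage (hypothetical paths): merged = average_state_dicts([torch.load(p) for p in ckpt_paths])
        #                             model.load_state_dict(merged)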

Organizers

    Pooneh Mousavi (she/her) is a computer science PhD student at Mila and Concordia University, supervised by Professor Mirco Ravanelli. She has a broad interest in deep learning for Conversational AI. Her research focuses on discrete self-supervised learning for speech and audio, exploring its potential to bridge audio and language models. She is also one of the main contributors to the SpeechBrain project, a popular open-source conversational AI toolkit.
    Website, Google Scholar, LinkedIn

    Hiba Akhaddar (she/her) is a master's student majoring in Computer Science at Concordia University and Mila. She is supervised by Prof. Tristan Glatard and Prof. Mirco Ravanelli. Her interests revolve around applications of deep learning in the medical field. She works on detecting Parkinson's disease and tracking its progression from speech.
    Website, LinkedIn