Sound 12
☆ Diffusion Buffer for Online Generative Speech Enhancement
                                          Online Speech Enhancement was mainly reserved for predictive models. A key
advantage of these models is that for an incoming signal frame from a stream of
data, the model is called only once for enhancement. In contrast, generative
Speech Enhancement models often require multiple calls, resulting in a
computational complexity that is too high for many online speech enhancement
applications. This work presents the Diffusion Buffer, a generative
diffusion-based Speech Enhancement model which only requires one neural network
call per incoming signal frame from a stream of data and performs enhancement
in an online fashion on a consumer-grade GPU. The key idea of the Diffusion
Buffer is to align physical time with Diffusion time-steps. The approach
progressively denoises frames through physical time, where past frames have
more noise removed. Consequently, an enhanced frame is output to the listener
with a delay defined by the Diffusion Buffer, and the output frame has a
corresponding look-ahead. In this work, we extend our previous work by
carefully designing a 2D convolutional UNet architecture that specifically
aligns with the Diffusion Buffer's look-ahead. We observe that the proposed
UNet improves performance, particularly when the algorithmic latency is low.
Moreover, we show that using a Data Prediction loss instead of Denoising Score
Matching loss enables flexible control over the trade-off between algorithmic
latency and quality during inference. The extended Diffusion Buffer, equipped
with a novel neural network and loss function, reduces the algorithmic latency
from 320-960 ms to 32-176 ms while also improving performance. While it
has been shown before that offline generative diffusion models outperform
predictive approaches on unseen noisy speech data, we confirm that the online
Diffusion Buffer also outperforms its predictive counterpart on unseen noisy
speech data.
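
The buffer idea described above can be illustrated with a toy streaming loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `run_diffusion_buffer`, the linear time schedule, and the `denoise_step` callable (a stand-in for the neural network) are all hypothetical names chosen for illustration.

```python
import numpy as np

def run_diffusion_buffer(frames, buffer_len, denoise_step):
    """Toy streaming loop: exactly one denoise_step call per incoming frame.

    denoise_step(stacked_frames, t) is a hypothetical stand-in for the neural
    network; it receives all buffered frames plus their diffusion times.
    """
    # Assumed linear schedule: the oldest slot has the smallest diffusion time,
    # the newest slot has t = 1 (most noise still to be removed).
    schedule = np.linspace(1.0 / buffer_len, 1.0, buffer_len)
    buffer, outputs, calls = [], [], 0
    for frame in frames:
        buffer.append(np.asarray(frame, dtype=float))
        t = schedule[-len(buffer):]          # newest frame always sits at t = 1
        buffer = list(denoise_step(np.stack(buffer), t))
        calls += 1                           # one network call per frame
        if len(buffer) == buffer_len:
            outputs.append(buffer.pop(0))    # oldest frame leaves the buffer
    return outputs, calls
```

In this sketch every emitted frame has been refined `buffer_len` times, and the output lags the input by `buffer_len - 1` frames, which mirrors the delay and look-ahead trade-off mentioned in the abstract.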
                                    
                                ☆ Adapting Language Balance in Code-Switching Speech ICASSP 2026
                                          Despite achieving impressive results on standard benchmarks, large
foundational models still struggle against code-switching test cases. When data
scarcity cannot be used as the usual justification for poor performance, the
reason may lie in the infrequent occurrence of code-switched moments, where the
embedding of the second language appears subtly. Instead of expecting the
models to learn this infrequency on their own, it might be beneficial to
provide the training process with labels. Evaluating model performance on
code-switching data requires careful localization of code-switching points
where recognition errors are most consequential, so that the analysis
emphasizes mistakes occurring at those moments. Building on this observation,
we leverage the difference between the embedded and the main language to
highlight those code-switching points and thereby emphasize learning at those
locations. This simple yet effective differentiable surrogate mitigates context
bias during generation -- the central challenge in code-switching -- thereby
improving the model's robustness. Our experiments with Arabic and
Chinese-English show that the models predict the switching points
more accurately, as reflected by the reduced substitution error.
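
One simple way to emphasize learning at code-switching points is a token-weighted cross-entropy. The sketch below is an assumption for illustration only: it uses per-token language-ID labels and a fixed `boost` factor, whereas the abstract describes a differentiable surrogate whose exact form is not given here. The function names are hypothetical.

```python
import numpy as np

def switch_point_weights(lang_ids, boost=2.0):
    """Weight 1.0 everywhere, `boost` at tokens whose language differs from the previous token."""
    lang_ids = np.asarray(lang_ids)
    weights = np.ones(len(lang_ids))
    changed = lang_ids[1:] != lang_ids[:-1]
    weights[1:] = np.where(changed, boost, 1.0)
    return weights

def weighted_cross_entropy(log_probs, targets, weights):
    """Weighted mean of per-token negative log-likelihoods.

    log_probs: (T, V) array of log-probabilities; targets: length-T token ids.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    targets = np.asarray(targets)
    nll = -log_probs[np.arange(len(targets)), targets]
    return float(np.sum(weights * nll) / np.sum(weights))
```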
                                    
                                        
                                            comment: Submitted to ICASSP 2026
                                        
                                ☆ Bayesian Low-Rank Factorization for Robust Model Adaptation ICASSP 2026
                                          Large speech foundation models achieve strong performance across many
domains, but they often require adaptation to handle local needs such as
code-switching, where speakers mix languages within the same utterance. Direct
fine-tuning of these models risks overfitting to the target domain and
overwriting the broad capabilities of the base model. To address this
challenge, we explore Bayesian factorized adapters for speech foundation
models, which place priors near zero to achieve sparser adaptation matrices and
thereby retain general performance while adapting to specific domains. We apply
our approach to the Whisper model and evaluate on different multilingual
code-switching scenarios. Our results show only minimal adaptation loss while
significantly reducing catastrophic forgetting of the base model. Compared to
LoRA, our method achieves a backward gain of 54% with only a 4% drop on the new
domain. These findings highlight the effectiveness of Bayesian adaptation for
fine-tuning speech foundation models without sacrificing generalization.
                                    
                                        
                                            comment: Submitted to ICASSP 2026
                                        
                                ☆ MLMA: Towards Multilingual with Mamba Based Architectures ICASSP 2026
                                          Multilingual automatic speech recognition (ASR) remains a challenging task,
especially when balancing performance across high- and low-resource languages.
Recent advances in sequence modeling suggest that architectures beyond
Transformers may offer better scalability and efficiency. In this work, we
introduce MLMA (Multilingual Language Modeling with Mamba for ASR), a new
approach that leverages the Mamba architecture--an efficient state-space model
optimized for long-context sequence processing--for multilingual ASR. Using
Mamba, MLMA implicitly incorporates language-aware conditioning and shared
representations to support robust recognition across diverse languages.
Experiments on standard multilingual benchmarks show that MLMA achieves
competitive performance compared to Transformer-based architectures. These
results highlight Mamba's potential as a strong backbone for scalable,
efficient, and accurate multilingual speech recognition.
                                    
                                        
                                            comment: The paper is under review at ICASSP 2026
                                        
                                ☆ Noise-Conditioned Mixture-of-Experts Framework for Robust Speaker Verification
                                          Robust speaker verification under noisy conditions remains an open challenge.
Conventional deep learning methods learn a robust unified speaker
representation space against diverse background noise and achieve significant
improvement. In contrast, this paper presents a noise-conditioned
mixture-of-experts framework that decomposes the feature space into specialized
noise-aware subspaces for speaker verification. Specifically, we propose a
noise-conditioned expert routing mechanism, a universal-model-based expert
specialization strategy, and an SNR-decaying curriculum learning protocol,
collectively improving model robustness and generalization under diverse noise
conditions. The proposed method can automatically route inputs to expert
networks based on noise information derived from the inputs, where each expert
targets distinct noise characteristics while preserving speaker identity
information. Comprehensive experiments demonstrate consistent superiority over
baselines, confirming that explicit noise-dependent feature modeling
significantly enhances robustness without sacrificing verification accuracy.
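
Noise-conditioned routing can be pictured as a softmax gate driven by a noise embedding rather than by the speaker features themselves. The following is a minimal sketch under assumptions: linear experts, a linear gate, and the names `route_by_noise`, `gate_weights`, and `expert_weights` are hypothetical and not taken from the paper.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def route_by_noise(features, noise_embedding, gate_weights, expert_weights):
    """Mix expert outputs with gates computed from the noise embedding only.

    features: (D_in,), noise_embedding: (D_noise,),
    gate_weights: (D_noise, E), expert_weights: list of E arrays (D_in, D_out).
    """
    gates = softmax(noise_embedding @ gate_weights)                  # (E,)
    expert_outputs = np.stack([features @ w for w in expert_weights])  # (E, D_out)
    return gates @ expert_outputs, gates
```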
                                    
                                ☆ A Stage-Wise Learning Strategy with Fixed Anchors for Robust Speaker Verification
                                          Learning robust speaker representations under noisy conditions presents
significant challenges and requires careful handling of both discriminative
and noise-invariant properties. In this work, we propose an anchor-based
stage-wise learning strategy for robust speaker representation learning.
Specifically, our approach begins by training a base model to establish
discriminative speaker boundaries and then extracts anchor embeddings from this
model as stable references. Finally, a copy of the base model is fine-tuned on
noisy inputs, regularized by enforcing proximity to their corresponding fixed
anchor embeddings to preserve speaker identity under distortion. Experimental
results suggest that this strategy offers advantages over conventional joint
optimization, particularly in maintaining discrimination while improving noise
robustness. The proposed method demonstrates consistent improvements across
various noise conditions, potentially due to its ability to handle boundary
stabilization and variation suppression separately.
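
The anchor regularizer in the final stage can be sketched as a proximity penalty between embeddings of noisy inputs and their fixed anchors. The abstract does not state the distance used, so the cosine distance below is an assumption, and `anchor_proximity_loss` is a hypothetical name.

```python
import numpy as np

def anchor_proximity_loss(noisy_embeddings, anchor_embeddings, eps=1e-8):
    """Mean cosine distance between noisy-input embeddings and fixed anchors.

    Both arguments are (N, D) arrays; row i of each refers to the same utterance.
    The anchors are treated as constants (no gradient would flow into them).
    """
    x = np.asarray(noisy_embeddings, dtype=float)
    a = np.asarray(anchor_embeddings, dtype=float)
    norms = np.linalg.norm(x, axis=-1) * np.linalg.norm(a, axis=-1) + eps
    cosine = np.sum(x * a, axis=-1) / norms
    return float(np.mean(1.0 - cosine))
```

In training, a term like this would be added to the usual speaker classification loss with some weighting factor, which is likewise not specified in the abstract.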
                                    
                                ☆ ProLAP: Probabilistic Language-Audio Pre-Training
                                          Language-audio joint representation learning frameworks typically depend on
deterministic embeddings, assuming a one-to-one correspondence between audio
and text. In real-world settings, however, the language-audio relationship is
inherently many-to-many: one audio segment can be described by multiple
captions and vice versa. To address this, we propose Probabilistic
Language-Audio Pre-training (ProLAP), which models multiplicity as the spread
of probability distributions in a joint language-audio embedding space. To
train the intra-modal hierarchical relationship effectively, we also introduce
two objectives: (i) hierarchical inclusion loss to promote semantic
hierarchical understanding of inputs and (ii) mask repulsive loss to improve
the efficiency of learning when optimizing the hierarchical inclusion loss.
With this training strategy, our model can learn the hierarchical structure
inherent in the data even from small datasets, in contrast to prior
probabilistic approaches that rely on large-scale datasets. In our experiments,
ProLAP outperforms existing deterministic approaches on audio-text retrieval
tasks. Moreover, through experiments on the audio traversal task introduced in
this paper, we demonstrate that ProLAP captures the plausible semantic
hierarchy.
                                    
                                        
                                            comment: Under review
                                        
                                ☆ SegTune: Structured and Fine-Grained Control for Song Generation
                                        Pengfei Cai, Joanna Wang, Haorui Zheng, Xu Li, Zihao Ji, Teng Ma, Zhongliang Liu, Chen Zhang, Pengfei Wan
                                    
                                    
                                          Recent advancements in song generation have shown promising results in
generating songs from lyrics and/or global text prompts. However, most existing
systems lack the ability to model the temporally varying attributes of songs,
limiting fine-grained control over musical structure and dynamics. In this
paper, we propose SegTune, a non-autoregressive framework for structured and
controllable song generation. SegTune enables segment-level control by allowing
users or large language models to specify local musical descriptions aligned to
song sections. The segmental prompts are injected into the model by temporally
broadcasting them to corresponding time windows, while global prompts influence
the whole song to ensure stylistic coherence. To obtain accurate segment
durations and enable precise lyric-to-music alignment, we introduce an
LLM-based duration predictor that autoregressively generates sentence-level
timestamped lyrics in LRC format. We further construct a large-scale data
pipeline for collecting high-quality songs with aligned lyrics and prompts, and
propose new evaluation metrics to assess segment-level alignment and vocal
attribute consistency. Experimental results show that SegTune achieves superior
controllability and musical coherence compared to existing baselines. See
https://cai525.github.io/SegTune_demo for demos of our work.
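
To make the two mechanisms above concrete, the sketch below parses LRC-style timestamped lyrics (`[mm:ss.xx] text`) and broadcasts segment prompts onto a frame grid. It is an illustration under assumptions, not SegTune's code: the function names, the `(start, end, prompt)` segment tuples, and the frame rate are hypothetical.

```python
import re

# One LRC line: "[mm:ss.xx] lyric text"
LRC_LINE = re.compile(r"\[(\d+):(\d+(?:\.\d+)?)\](.*)")

def parse_lrc(text):
    """Return a list of (start_time_in_seconds, lyric) pairs."""
    entries = []
    for line in text.splitlines():
        match = LRC_LINE.match(line.strip())
        if match:
            minutes, seconds, lyric = match.groups()
            entries.append((int(minutes) * 60 + float(seconds), lyric.strip()))
    return entries

def broadcast_prompts(segments, num_frames, frames_per_second):
    """Assign each frame the prompt of the segment that covers it.

    segments: list of (start_s, end_s, prompt); uncovered frames stay None.
    """
    per_frame = [None] * num_frames
    for start, end, prompt in segments:
        first = int(round(start * frames_per_second))
        last = min(num_frames, int(round(end * frames_per_second)))
        for i in range(first, last):
            per_frame[i] = prompt
    return per_frame
```

In a real model the per-frame prompts would be embeddings added to or concatenated with the frame features, with the global prompt applied to every frame.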
                                    
                                ☆ MVDR Beamforming for Cyclostationary Processes
                                          Conventional acoustic beamformers assume that noise is stationary within
short time frames. This assumption prevents them from exploiting correlations
between frequencies in almost-periodic noise sources such as musical
instruments, fans, and engines. These signals exhibit periodically varying
statistics and are better modeled as cyclostationary processes. This paper
introduces the cyclic MVDR (cMVDR) beamformer, an extension of the conventional
MVDR that leverages both spatial and spectral correlations to improve noise
reduction, particularly in low-SNR scenarios. The method builds on
frequency-shifted (FRESH) filtering, where shifted versions of the input are
combined to attenuate or amplify components that are coherent across frequency.
To address inharmonicity, where harmonic partials deviate from exact integer
multiples of the fundamental frequency, we propose a data-driven strategy that
estimates resonant frequencies via periodogram analysis and computes the
frequency shifts from their spacing. Analytical and experimental results
demonstrate that performance improves with increasing spectral correlation. On
real recordings, the cMVDR achieves up to 5 dB gain in scale-invariant
signal-to-distortion ratio (SI-SDR) over the MVDR and remains effective even
with a single microphone. Code is available at
https://github.com/Screeen/cMVDR.
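
For reference, the conventional MVDR weights that the cMVDR extends are w = R^{-1} d / (d^H R^{-1} d), where R is the noise covariance matrix and d the steering vector. The sketch below implements only this standard spatial MVDR for a single frequency bin; it does not include the frequency-shifted (FRESH) extension, and the function name is hypothetical.

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """Conventional MVDR weights w = R^{-1} d / (d^H R^{-1} d) for one frequency bin.

    noise_cov: (M, M) Hermitian noise covariance; steering: (M,) steering vector.
    """
    rinv_d = np.linalg.solve(noise_cov, steering)   # R^{-1} d without forming the inverse
    return rinv_d / (steering.conj() @ rinv_d)
```

The weights satisfy the distortionless constraint w^H d = 1, and the beamformer output for a multichannel observation `x` is `w.conj() @ x`.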
                                    
                                        
                                            comment: Under review for publication from September 2025
                                        
                                ☆ ParaStyleTTS: Toward Efficient and Robust Paralinguistic Style Control for Expressive Text-to-Speech Generation
                                          Controlling speaking style in text-to-speech (TTS) systems has become a
growing focus in both academia and industry. While many existing approaches
rely on reference audio to guide style generation, such methods are often
impractical due to privacy concerns and limited accessibility. More recently,
large language models (LLMs) have been used to control speaking style through
natural language prompts; however, their high computational cost, lack of
interpretability, and sensitivity to prompt phrasing limit their applicability
in real-time and resource-constrained environments. In this work, we propose
ParaStyleTTS, a lightweight and interpretable TTS framework that enables
expressive style control from text prompts alone. ParaStyleTTS features a novel
two-level style adaptation architecture that separates prosodic and
paralinguistic speech style modeling. It allows fine-grained and robust control
over factors such as emotion, gender, and age. Unlike LLM-based methods,
ParaStyleTTS maintains consistent style realization across varied prompt
formulations and is well-suited for real-world applications, including
on-device and low-resource deployment. Experimental results show that
ParaStyleTTS generates high-quality speech with performance comparable to
state-of-the-art LLM-based systems while being 30x faster, using 8x fewer
parameters, and requiring 2.5x less CUDA memory. Moreover, ParaStyleTTS
exhibits superior robustness and controllability over paralinguistic speaking
styles, providing a practical and efficient solution for style-controllable
text-to-speech generation. Demo can be found at
https://parastyletts.github.io/ParaStyleTTS_Demo/. Code can be found at
https://github.com/haoweilou/ParaStyleTTS.
                                    
                                ☆ Adaptive Per-Channel Energy Normalization Front-end for Robust Audio Signal Processing ICASSP 2026
                                          In audio signal processing, learnable front-ends have shown strong
performance across diverse tasks by optimizing task-specific representation.
However, their parameters remain fixed once trained, lacking flexibility during
inference and limiting robustness under dynamic complex acoustic environments.
In this paper, we introduce a novel adaptive paradigm for audio front-ends that
replaces static parameterization with a closed-loop neural controller.
Specifically, we simplify the learnable front-end LEAF architecture and
integrate a neural controller for adaptive representation via dynamically
tuning Per-Channel Energy Normalization. The neural controller leverages both
the current and the buffered past subband energies to enable input-dependent
adaptation during inference. Experimental results on multiple audio
classification tasks demonstrate that the proposed adaptive front-end
consistently outperforms prior fixed and learnable front-ends under both clean
and complex acoustic conditions. These results highlight neural adaptability as
a promising direction for the next generation of audio front-ends.
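
As background, standard Per-Channel Energy Normalization with fixed parameters can be written in a few lines. This sketch shows only the static form that the proposed controller would tune; the parameter values are illustrative defaults, the smoother initialization is an assumption, and the neural controller itself is not modeled.

```python
import numpy as np

def pcen(energy, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Static PCEN on a (time, channels) array of non-negative subband energies.

    M[t] = (1 - s) * M[t-1] + s * E[t]              (per-channel smoother)
    PCEN = (E / (eps + M)**alpha + delta)**r - delta**r
    """
    energy = np.asarray(energy, dtype=float)
    smooth = np.empty_like(energy)
    smooth[0] = energy[0]                 # assumed initialization of the smoother
    for t in range(1, len(energy)):
        smooth[t] = (1 - s) * smooth[t - 1] + s * energy[t]
    return (energy / (eps + smooth) ** alpha + delta) ** r - delta ** r
```

An adaptive variant in the spirit of the abstract would replace the scalar arguments `s`, `alpha`, `delta`, and `r` with values predicted per frame from current and past subband energies.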
                                    
                                        
                                            comment: Submitted to ICASSP2026
                                        
                                ☆ Joint Estimation of Piano Dynamics and Metrical Structure with a Multi-task Multi-Scale Network ICASSP 2026
                                          Estimating piano dynamics from audio recordings is a fundamental challenge in
computational music analysis. In this paper, we propose an efficient multi-task
network that jointly predicts dynamic levels, change points, beats, and
downbeats from a shared latent representation. These four targets form the
metrical structure of dynamics in the music score. Inspired by recent vocal
dynamic research, we use a multi-scale network as the backbone, which takes
Bark-scale specific loudness as the input feature. Compared to log-Mel
input, this reduces the model size from 14.7 M to 0.5 M, enabling long sequential
input. We segment audio into 60-second excerpts, twice the length commonly
used in beat tracking. Evaluated on the public MazurkaBL
dataset, our model achieves state-of-the-art results across all tasks. This
work sets a new benchmark for piano dynamics estimation and delivers a powerful
and compact tool, paving the way for large-scale, resource-efficient analysis
of musical expression.
                                    
                                        
                                            comment: Paper submitted to ICASSP2026