Intrusive Lyric Intelligibility (ICASSP Cadenza)

Hybrid Attention, Ensemble Learning, Audio Flamingo, ICASSP 2026, Python

The Challenge

The Cadenza ICASSP 2026 Challenge tasked systems with predicting lyric intelligibility rates obtained from perceptual experiments on accompanied singing, including conditions with simulated hearing loss. The main difficulty was building an intrusive model (one that compares the processed signal against a reference) that accounts for acoustic quality, lexical alignment, and contextual cues in music mixtures.

My Role & Collaborators

I was responsible for orchestrating and planning the project and for delivering its results. Our team from Aalto University placed 2nd in the final evaluation rankings.

Our Solution

We built a system (T071a) that fuses a hybrid neural mixture of experts with tree-based regressors (LightGBM, XGBoost, CatBoost). The neural experts (WavLM, Wav2Vec2, Whisper) applied attention pooling over time-series embeddings. Scalar feature blocks summarized intrusive perceptual metrics (STOI, PESQ, Zimtohrli), ASR stability, and linguistic complexity. A score from a multi-modal LLM (Audio Flamingo 3) provided human-like rating priors.
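To make the attention pooling step concrete, here is a minimal NumPy sketch of how a variable-length sequence of frame embeddings can be collapsed into one fixed-size vector. It is an illustration under stated assumptions, not the T071a implementation: the function name `attention_pool`, the parameters `w` and `v`, and the toy dimensions are all hypothetical, and in a trained system the parameters would be learned rather than random.

```python
import numpy as np

def attention_pool(frames, w, v):
    """Collapse (T, D) frame embeddings into a single (D,) vector.

    frames: (T, D) time-series embeddings from a speech encoder.
    w: (D, H) projection and v: (H,) scoring vector. Both are
       hypothetical parameters that would be learned in practice.
    """
    scores = np.tanh(frames @ w) @ v      # (T,) one relevance score per frame
    scores = scores - scores.max()        # shift for a numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()     # (T,) non-negative, sums to 1
    pooled = weights @ frames             # attention-weighted average over time
    return pooled, weights

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 8))         # toy input: 50 frames, 8 dimensions
w = rng.normal(size=(8, 4))
v = rng.normal(size=4)

pooled, weights = attention_pool(frames, w, v)
print(pooled.shape, weights.shape)
```

When the scoring vector is all zeros, every frame gets the same weight and the result reduces to plain mean pooling, which is a useful sanity check.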

Key Techniques

  • Hybrid Neural Mixture of Experts (WavLM/Wav2Vec2/Whisper)
  • Tree-based Regressors (LightGBM, XGBoost, CatBoost)
  • Intrusive Perceptual Metrics (STOI, PESQ, Zimtohrli)
  • Multi-modal LLM scoring (Audio Flamingo 3)
  • Attention Pooling
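
The techniques listed above each yield their own prediction, and these have to be combined into one intelligibility score. The page does not say how the fusion was done, so the following is only one plausible sketch: a linear blend whose weights are fitted by least squares on held-out predictions. The helper names `fit_blend_weights` and `blend`, the synthetic data, and the assumption that scores lie on a 0 to 1 scale are all illustrative.

```python
import numpy as np

def fit_blend_weights(preds, target):
    """Least-squares weights for combining model predictions.

    preds: (N, M) held-out predictions from M models.
    target: (N,) ground-truth intelligibility scores.
    """
    weights, *_ = np.linalg.lstsq(preds, target, rcond=None)
    return weights

def blend(preds, weights):
    # Assumes scores are rates in [0, 1], so the blend is clipped to that range.
    return np.clip(preds @ weights, 0.0, 1.0)

rng = np.random.default_rng(1)
target = rng.uniform(0.0, 1.0, size=200)

# Synthetic stand-ins for the outputs of a neural model and three
# tree-based regressors: the true score plus noise of varying size.
noise_levels = (0.30, 0.20, 0.25, 0.22)
preds = np.stack(
    [target + rng.normal(0.0, s, size=200) for s in noise_levels], axis=1
)

weights = fit_blend_weights(preds, target)
fused = blend(preds, weights)
print(weights.round(3), fused.shape)
```

In a real pipeline the weights would be fitted on out-of-fold predictions and applied to a separate evaluation set, since fitting and scoring on the same data overstates the benefit of blending.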

Results

The system achieved a root mean squared error (RMSE) of 0.265 and a normalized cross-correlation (NCC) of 0.69 on the official evaluation set, placing 2nd.
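
For reference, the two metrics can be computed as in the sketch below. It assumes NCC means the correlation of the mean-removed prediction and target vectors (equivalent to Pearson correlation); the challenge's official scoring code may differ in detail, and the function names and sample values here are illustrative only.

```python
import numpy as np

def rmse(pred, target):
    """Root mean squared error between predictions and targets."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def ncc(pred, target):
    """Normalized cross-correlation of the mean-removed vectors."""
    p = np.asarray(pred, dtype=float)
    t = np.asarray(target, dtype=float)
    p = p - p.mean()
    t = t - t.mean()
    return float((p @ t) / (np.linalg.norm(p) * np.linalg.norm(t)))

# Made-up scores on a 0 to 1 scale, purely for illustration.
predicted = [0.20, 0.55, 0.70, 0.90, 0.35]
measured = [0.10, 0.60, 0.80, 0.85, 0.50]
print(rmse(predicted, measured), ncc(predicted, measured))
```

RMSE penalizes absolute error in the predicted rate, while NCC only measures how well predictions track the ordering and relative spread of the measured scores, so the two can disagree.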

System Architecture

[Figure: Intrusive Lyric Intelligibility (ICASSP Cadenza) architecture]