Can Emotion Fool Anti-spoofing?

(Accepted at Interspeech 2025)

Aurosweta Mahapatra 1, Ismail Rasim Ulgen 1, Abinay Reddy Naini 2, Carlos Busso 2, Berrak Sisman 1,3
amahapa2@jhu.edu   iulgen1@jhu.edu   anaini@andrew.cmu.edu   busso@cmu.edu   sisman@jhu.edu
1 Center for Language and Speech Processing, Johns Hopkins University, USA
2 Language Technologies Institute, Carnegie Mellon University, USA
3 Data Science and AI Institute (DSAI), Johns Hopkins University, USA


Dataset Link: https://zenodo.org/records/15557842

Abstract: Traditional anti-spoofing focuses on models and datasets built on synthetic speech that is mostly neutral in emotion, neglecting diverse emotional variations. As a result, their robustness against high-quality, emotionally expressive synthetic speech is uncertain. We address this by introducing EmoSpoof-TTS, a corpus of emotional text-to-speech samples. Our analysis shows that existing anti-spoofing models struggle with emotional synthetic speech, exposing the risk of emotion-targeted attacks. Even when trained on emotional data, the models underperform due to their limited focus on emotional aspects, and they show performance disparities across emotions. This highlights the need for an emotion-focused anti-spoofing paradigm in both datasets and methodology. We propose GEM, a gated ensemble of emotion-specialized models with a speech emotion recognition (SER) gating network. GEM performs effectively across all emotions and the neutral state, improving defenses against spoofing attacks. We release the EmoSpoof-TTS dataset.
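The performance disparities mentioned in the abstract are typically measured by breaking the equal error rate (EER) down per emotion. Below is a minimal sketch of such a breakdown, assuming higher scores mean "more likely bona fide"; the helper names are ours, not from the paper:

```python
import numpy as np
from sklearn.metrics import roc_curve

def eer(scores, labels):
    """Equal error rate: operating point where false-accept and false-reject rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0

def per_emotion_eer(scores, labels, emotions):
    """EER computed separately on each emotion subset.

    scores: countermeasure scores; labels: 1 = bona fide, 0 = spoof;
    emotions: per-utterance tags, e.g. "happy", "angry", "sad", "neutral".
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    results = {}
    for emo in sorted(set(emotions)):
        mask = np.array([e == emo for e in emotions])
        results[emo] = eer(scores[mask], labels[mask])
    return results
```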

Model Architecture of Proposed Method


[Model architecture diagram]

Figure: Proposed Gated Ensemble Method (GEM)
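The figure depicts the core idea: an SER gating network produces emotion posteriors that weight the scores of emotion-specialized anti-spoofing experts. Here is a minimal PyTorch sketch of that gating logic; the module names and the soft-mixture formulation are our assumptions for illustration, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class GatedEnsemble(nn.Module):
    """Soft mixture of emotion-specialized countermeasures (CMs).

    ser_gate(x) returns logits over K emotion classes; each expert in
    `experts` returns one bona-fide/spoof score per utterance, shape (B,).
    Both modules are placeholders, not the paper's actual backbones.
    """

    def __init__(self, ser_gate: nn.Module, experts: nn.ModuleList):
        super().__init__()
        self.ser_gate = ser_gate   # e.g., K = 4 for Happy / Angry / Sad / Neutral
        self.experts = experts     # one anti-spoofing CM per emotion

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.ser_gate(x), dim=-1)              # (B, K) emotion posteriors
        scores = torch.stack([cm(x) for cm in self.experts], dim=-1)   # (B, K) per-expert scores
        return (weights * scores).sum(dim=-1)                          # (B,) gated ensemble score
```

Weighting expert scores by SER posteriors, rather than hard-routing to a single expert, lets the ensemble degrade gracefully when the gate is uncertain about the emotion.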


Details of EmoSpoof-TTS Dataset

We introduce and release EmoSpoof-TTS, a corpus of emotionally expressive synthetic speech generated with recent text-to-speech (TTS) models, to facilitate research on the impact of emotion on anti-spoofing models. It contains a total of 36,000 synthesized speech samples spanning four emotions (Happiness, Anger, Sadness, and the Neutral state), 10 speakers (5 male, 5 female), and 3 TTS models (StyleTTS2 [1], F5-TTS [2], CosyVoice [3]), i.e., 300 utterances per emotion-speaker-model combination. The bona-fide samples are drawn from the Emotional Speech Database (ESD) [4].
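For convenience, here is one way the corpus could be indexed after downloading from Zenodo. The directory layout assumed below (tts_model/speaker/emotion/*.wav) is ours for illustration; check the archive's actual structure before relying on it:

```python
from pathlib import Path
from collections import Counter

def index_emospoof_tts(root: str):
    """Index wav files under an assumed <root>/<tts_model>/<speaker>/<emotion>/ layout."""
    index = []
    for wav in Path(root).rglob("*.wav"):
        tts_model, speaker, emotion = wav.parts[-4:-1]
        index.append({"path": wav, "tts": tts_model,
                      "speaker": speaker, "emotion": emotion})
    return index

samples = index_emospoof_tts("EmoSpoof-TTS")
print(len(samples))                            # expected: 36,000 in total
print(Counter(s["emotion"] for s in samples))  # expected: 9,000 per emotion
```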

Speech Samples

[Audio demo: for speakers 0012, 0014, and 0018, the project page provides Happy, Angry, Sad, and Neutral samples from the bona-fide ESD recordings and from StyleTTS2, CosyVoice, and F5-TTS synthesis.]

References

  1. Li, Yinghao Aaron, et al. "StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models." Advances in Neural Information Processing Systems 36 (2023): 19594-19621.
  2. Chen, Yushen, et al. "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching." arXiv preprint arXiv:2410.06885 (2024).
  3. Du, Zhihao, et al. "CosyVoice: A Scalable Multilingual Zero-Shot Text-to-Speech Synthesizer Based on Supervised Semantic Tokens." arXiv preprint arXiv:2407.05407 (2024).
  4. Zhou, Kun, et al. "Seen and Unseen Emotional Style Transfer for Voice Conversion with a New Emotional Speech Dataset." ICASSP 2021 - IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2021.