arxiv:2505.20756

REWIND: Speech Time Reversal for Enhancing Speaker Representations in Diffusion-based Voice Conversion

Published on May 27

Authors:

Pankaj Wasnik ,

Abstract

Using speaker representations from time-reversed speech as an augmentation strategy enhances speaker similarity in diffusion-based voice conversion models while maintaining speech quality.

AI-generated summary

Speech time reversal refers to the process of reversing the entire speech signal in time, causing it to play backward. Such signals are completely unintelligible since the fundamental structures of phonemes and syllables are destroyed. However, they still retain tonal patterns that enable perceptual speaker identification despite losing linguistic content. In this paper, we propose leveraging speaker representations learned from time reversed speech as an augmentation strategy to enhance speaker representation. Notably, speaker and language disentanglement in voice conversion (VC) is essential to accurately preserve a speaker's unique vocal traits while minimizing interference from linguistic content. The effectiveness of the proposed approach is evaluated in the context of state-of-the-art diffusion-based VC models. Experimental results indicate that the proposed approach significantly improves speaker similarity-related scores while maintaining high speech quality.

View arXiv page View PDF Add to collection