Abstract

Automatic speaker recognition is an important biometric authentication approach with emerging applications. However, recent research has shown its vulnerability to adversarial attacks. In this paper, we propose a new type of adversarial example: imperceptible adversarial samples for targeted attacks on black-box automatic speaker recognition systems. Waveform samples are crafted directly by solving an optimization problem whose inputs and outputs are waveforms, which is more realistic in real-life scenarios. Inspired by auditory masking, we propose a regularization term that adapts to the energy of the speech waveform so that the generated adversarial perturbations remain imperceptible. The optimization problems are then solved with a differential evolution algorithm in a black-box manner that requires no knowledge of the inner configuration of the recognition systems. Experiments on the commonly used LibriSpeech and VoxCeleb datasets show that the proposed methods successfully perform targeted attacks on state-of-the-art speaker recognition systems while remaining imperceptible to human listeners. The high SNR and PESQ scores of the resulting adversarial samples indicate that the proposed methods degrade the quality of the original signals less than several recently proposed methods, which supports the imperceptibility of the adversarial samples.
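The imperceptibility claim above is quantified with SNR and PESQ. Purely as a point of reference, the sketch below shows one way such scores can be computed for an (original, adversarial) waveform pair; it assumes 16 kHz mono waveforms stored as NumPy arrays and the third-party pesq package, and the function names are illustrative rather than the paper's evaluation code.

```python
# Hedged sketch: SNR and wide-band PESQ between a clean waveform and its
# adversarial counterpart. Assumes 16 kHz mono float arrays in [-1, 1] and
# the `pesq` package (pip install pesq). Not the authors' evaluation code.
import numpy as np
from pesq import pesq

def snr_db(clean, adversarial):
    # Signal-to-noise ratio in dB, treating the perturbation as noise.
    noise = adversarial - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / (np.sum(noise ** 2) + 1e-12))

def pesq_wb(clean, adversarial, fs=16000):
    # Wide-band PESQ (ITU-T P.862.2); higher scores mean the adversarial
    # sample sounds closer to the original.
    return pesq(fs, clean, adversarial, 'wb')
```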


Proposed Method

In this paper, we consider the attacking scenario depicted in Fig. 1. The attack proceeds as follows. An attacker first obtains a source voice. An adversarial learning algorithm then generates a perturbation signal based on the source voice and adds it to the source voice to obtain an adversarial voice. A human listener cannot tell the difference between the source voice and the adversarial one, yet the speaker recognition model is fooled into recognizing the adversarial voice as coming from speaker A across various tasks. This property of adversarial examples can be used to protect privacy from being identified by unknown deployed speaker recognition systems, and it also prevents our voiceprint features from being collected maliciously.

A feasible way to craft an adversarial example is to add a small, well-tuned additive perturbation to the source voice so that speaker recognition is fooled. There are generally two kinds of attacks according to how the recognizer is fooled. In an untargeted attack, speaker recognition fails to identify the correct identity of the modified voice. In a targeted attack, speaker recognition identifies the adversarial sample as a specific speaker. A key property of a successful adversarial attack is that the difference between the adversarial sample and the source one should be imperceptible to human perception. Inspired by auditory masking, we improve the imperceptibility of adversarial samples by constraining both the number and the amplitudes of the adversarial perturbations.

The less prior knowledge an attack requires, the easier it is to mount in practice. Assuming that an attacker has no knowledge of the inner configuration of the recognition systems, we focus on black-box adversarial attacks, where an attacker can at most access the decision results or the prediction scores. We craft adversarial audio samples by modifying only a subset of the points of an utterance. Since excessively large amplitudes in audio samples would produce harsh noise, our methods also constrain the amplitudes of the adversarial perturbations. We generate adversarial perturbations directly at the waveform level (rather than on the spectrogram) to yield high-quality samples for attacking, which is more realistic in real-life scenarios.

There are three typical tasks in automatic speaker recognition, namely open-set identification (OSI) [1], close-set identification (CSI) [2], and automatic speaker verification (ASV) [3]. In this paper, we comprehensively study targeted adversarial attacks on all three tasks within the proposed framework.
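To make the pipeline concrete, the sketch below illustrates a black-box targeted attack of this kind: only a few waveform points are perturbed, the perturbation at each point is bounded in proportion to the local signal energy (a simple stand-in for the paper's energy-adaptive regularization term), and the search is carried out with differential evolution using only score-level access. The score_target function and the parameters k, frame, and alpha are illustrative assumptions, not the exact formulation used in the paper.

```python
# Minimal sketch of a black-box, waveform-level targeted attack, assuming:
#   - x: 1-D NumPy array holding the source waveform in [-1, 1];
#   - score_target(waveform) -> float: the recognizer's score for the target
#     speaker (score-level black-box access); this name is hypothetical.
# Only k sample points are modified, and each perturbation is bounded by a
# fraction of the RMS energy of its surrounding frame.
import numpy as np
from scipy.optimize import differential_evolution

def attack(x, score_target, k=200, frame=400, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(x), size=k, replace=False)   # points to perturb

    def local_rms(i):
        # RMS energy of the frame around sample i; louder regions can
        # mask larger perturbations.
        lo, hi = max(0, i - frame // 2), min(len(x), i + frame // 2)
        return np.sqrt(np.mean(x[lo:hi] ** 2) + 1e-12)

    bound = alpha * np.array([local_rms(i) for i in idx])
    bounds = list(zip(-bound, bound))                 # per-point box constraints

    def objective(delta):
        adv = x.copy()
        adv[idx] = np.clip(adv[idx] + delta, -1.0, 1.0)
        return -score_target(adv)                     # maximize target score

    res = differential_evolution(objective, bounds, maxiter=50,
                                 popsize=15, tol=1e-6, polish=False)
    adv = x.copy()
    adv[idx] = np.clip(adv[idx] + res.x, -1.0, 1.0)
    return adv
```

Bounding each perturbation by the local energy is one simple way to encode the auditory-masking intuition; the paper instead folds this into a regularization term of the optimization objective.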

Figure 1: The attacking scenario of our work (generation of adversarial examples based on frequency masking).

Speech Samples

Each numbered sample pairs an original utterance with its adversarial counterpart. Labels and predictions are speaker IDs, and the predicted label of each adversarial sample is the attack target.

Sample | Original (label / predicted) | Adversarial (label / predicted)
1      | 61 / 61                      | 61 / 5105
2      | 61 / 61                      | 61 / 3729
3      | 61 / 61                      | 61 / 4446
4      | 61 / 61                      | 61 / 7021
5      | 237 / 237                    | 61 / 5105
6      | 237 / 237                    | 61 / 3729
7      | 237 / 237                    | 61 / 3729
8      | 237 / 237                    | 61 / 4446
9      | 237 / 237                    | 61 / 7021
10     | 237 / 237                    | 61 / 7729
11     | 260 / 260                    | 260 / 5105
12     | 260 / 260                    | 260 / 3729
13     | 260 / 260                    | 260 / 4446
14     | 260 / 260                    | 260 / 7021
15     | 260 / 260                    | 260 / 7729
16     | 1580 / 1580                  | 1580 / 5105
17     | 1580 / 1580                  | 1580 / 3729
18     | 1580 / 1580                  | 1580 / 4446
19     | 1580 / 1580                  | 1580 / 7021
20     | 1580 / 1580                  | 1580 / 7729
21     | 2830 / 2830                  | 2830 / 5105
22     | 2830 / 2830                  | 2830 / 3729
23     | 2830 / 2830                  | 2830 / 4446
24     | 2830 / 2830                  | 2830 / 7021
25     | 2830 / 2830                  | 2830 / 7729

References

[1] K. Wilkinghoff, "On Open-Set Speaker Identification with I-Vectors," in Proc. Odyssey 2020: The Speaker and Language Recognition Workshop, Tokyo, Japan, May 2020, pp. 408-414.

[2] T. Liu and S. Guan, "Factor Analysis Method for Text-Independent Speaker Identification," Journal of Software (JSW), vol. 9, no. 11, pp. 2851-2860, Nov. 2014.

[3] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, "Deep Neural Network Embeddings for Text-Independent Speaker Verification," in Proc. 18th Annual Conference of the International Speech Communication Association (INTERSPEECH), Stockholm, Sweden, Aug. 2017, pp. 999-1003.