| Citation: | LING Xiaoming, ZHANG Zhen, LIU Yanshan, MA Chenlong, CHEN Qixiang. Multi-Scale Correlated iAFF-Res2Net Voiceprint Recognition Model[J].Journal of Southwest Jiaotong University, 2025, 60(6): 1499-1507, 1518.doi:10.3969/j.issn.0258-2724.20240317 |
An improved network architecture named iAFF-Res2Net was proposed to address the problem of capturing both local and global features effectively in feature fusion under noisy conditions for voiceprint recognition. Firstly, a lightweight Res2Net was adopted as the backbone, and an Attention Feature Fusion (AFF) module with multi-scale channel attention was integrated to enhance feature representation. Secondly, an Iterative Attention Feature Fusion (iAFF) mechanism was introduced to reduce the bias caused by initial feature weighting, thereby refining feature weight distributions. Furthermore, during the generation of fixed speaker embeddings, traditional average pooling was replaced by Attentive Statistics Pooling (ASTP) with global context attention to improve frame-level feature aggregation. Finally, Angular Additive Margin Softmax (ArcMarginLoss) was used for speaker verification. Experimental results on the VoxCeleb dataset show that the iAFF-Res2Net models achieve equal error rates (EER) of 0.60%, and minimum detection cost functions (MinDCF) of 0.053. Both models outperform classical architectures such as ResNet34, ResNet50, Res2Net, and ECAPA-TDNN. Moreover, the improved models exhibit faster convergence and stronger feature discriminability.

| [1] |
李平, 高清源, 夏宇, 等. 基于SE-DR-Res2Block的声纹识别方法[J]. 工程科学学报, 2023, 45(11): 1962-1969.
LI Ping, GAO Qingyuan, XIA Yu, et al. Voiceprint recognition method based on SE-DR-Res2Block[J]. Chinese Journal of Engineering, 2023, 45(11): 1962-1969.
|
| [2] |
李如玮, 潘冬梅, 张爽, 等. 基于Gammatone滤波器分解的HRTF和GMM的双耳声源定位算法[J]. 北京工业大学学报, 2018, 44(11): 1385-1390.
doi:10.11936/bjutxb2017090015
LI Ruwei, PAN Dongmei, ZHANG Shuang, et al. Binaural sound source localization algorithm based on HRTF and GMM under gammatone filter decomposition[J]. Journal of Beijing University of Technology, 2018, 44(11): 1385-1390.
doi:10.11936/bjutxb2017090015
|
| [3] |
牛晓可, 黄伊鑫, 徐华兴, 等. 基于听皮层神经元感受野的强噪声环境下说话人识别[J]. 计算机应用, 2020, 40(10): 3034-3040.
NIU Xiaoke, HUANG Yixin, XU Huaxing, et al. Speaker recognition in strong noise environment based on auditory cortical neuronal receptive field[J]. Journal of Computer Applications, 2020, 40(10): 3034-3040.
|
| [4] |
许东星, 戴蓓蒨, 刘青松, 等. 基于超音段韵律特征和GMM-UBM的文本无关的说话人识别[J]. 中国科学技术大学学报, 2010, 40(2): 157-162.
doi:10.3969/j.issn.0253-2778.2010.02.009
XU Dongxing, DAI Beiqian, LIU Qingsong, et al. Text-independent speaker recognition based on super-segment prosodic feature and GMM-UBM[J]. Journal of University of Science and Technology of China, 2010, 40(2): 157-162.
doi:10.3969/j.issn.0253-2778.2010.02.009
|
| [5] |
张庆芳, 赵鹤鸣, 龚呈卉. 基于因子分析和特征映射的耳语说话人识别[J]. 数据采集与处理, 2016, 31(2): 362-369.
ZHANG Qingfang, ZHAO Heming, GONG Chenghui. Whispered speaker identification based on factor analysis and feature mapping[J]. Journal of Data Acquisition and Processing, 2016, 31(2): 362-369.
|
| [6] |
王明合, 唐振民, 张二华. 基于I-Vector局部加权线性判别分析的说话人识别[J]. 仪器仪表学报, 2015, 36(12): 2842-2848.
doi:10.3969/j.issn.0254-3087.2015.12.026
WANG Minghe, TANG Zhenmin, ZHANG Erhua. I-Vector based speaker recognition using local weighted linear discriminant analysis[J]. Chinese Journal of Scientific Instrument, 2015, 36(12): 2842-2848.
doi:10.3969/j.issn.0254-3087.2015.12.026
|
| [7] |
韩佳俊, 马志强, 王洪彬, 等. 基于I-Vector特征融合的蒙古语说话人特征提取方法[J]. 中文信息学报, 2023, 37(1): 71-78.
doi:10.3969/j.issn.1003-0077.2023.01.007
HAN Jiajun, MA Zhiqiang, WANG Hongbin, et al. A speaker feature extraction method based on I-Vector resource fusion[J]. Journal of Chinese Information Processing, 2023, 37(1): 71-78.
doi:10.3969/j.issn.1003-0077.2023.01.007
|
| [8] |
刘晓璇, 季怡, 刘纯平. 基于LSTM神经网络的声纹识别[J]. 计算机科学, 2021, 48(增2): 270-274.
LIU Xiaoxuan, JI Yi, LIU Chunping. Voiceprint recognition based on LSTM neural network[J]. Computer Science, 2021, 48(S2): 270-274.
|
| [9] |
HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas: IEEE, 2016: 770-778.
|
| [10] |
GAO S H, CHENG M M, ZHAO K, et al. Res2Net: a new multi-scale backbone architecture[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(2): 652-662.
doi:10.1109/TPAMI.2019.2938758
|
| [11] |
陈志高, 李鹏, 肖润秋, 等. 文本无关说话人识别的一种多尺度特征提取方法[J]. 电子与信息学报, 2021, 43(11): 3266-3271.
doi:10.11999/JEIT200917
CHEN Zhigao, LI Peng, XIAO Runqiu, et al. A multiscale feature extraction method for text-independent speaker recognition[J]. Journal of Electronics & Information Technology, 2021, 43(11): 3266-3271.
doi:10.11999/JEIT200917
|
| [12] |
YADAV S, RAI A. Frequency and temporal convolutional attention for text-independent speaker recognition[C]//ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Barcelona: IEEE, 2020: 6794-6798.
|
| [13] |
DAWALATABAD N, RAVANELLI M, GRONDIN F, et al. ECAPA-TDNN embeddings for speaker diarization[C]//Interspeech 2021. ISCA: [s.n.], 2021: 3560-3564.
|
| [14] |
DAI Y M, GIESEKE F, OEHMCKE S, et al. Attentional feature fusion[C]//2021 IEEE Winter Conference on Applications of Computer Vision (WACV). Waikoloa: IEEE, 2021: 3559-3568.
|
| [15] |
邓力洪, 邓飞, 张葛祥, 等. 改进Res2Net的多尺度端到端说话人识别系统[J]. 计算机工程与应用, 2023, 59(24): 110-120.
doi:10.3778/j.issn.1002-8331.2208-0085
DENG Lihong, DENG Fei, ZHANG Gexiang, et al. Multi-scale end-to-end speaker recognition system based on improved Res2Net[J]. Computer Engineering and Applications, 2023, 59(24): 110-120.
doi:10.3778/j.issn.1002-8331.2208-0085
|
| [16] |
李泽琛, 李恒超, 胡文帅, 等. 多尺度注意力学习的Faster R-CNN口罩人脸检测模型[J]. 江南娱乐网页版入口官网下载安装学报, 2021, 56(5): 1002-1010.
doi:10.3969/j.issn.0258-2724.20210017
LI Zechen, LI Hengchao, HU Wenshuai, et al. Masked face detection model based on multi-scale attention-driven faster R-CNN[J]. Journal of Southwest Jiaotong University, 2021, 56(5): 1002-1010.
doi:10.3969/j.issn.0258-2724.20210017
|
| [17] |
THEVAGUMARAN R, SIVANESWARAN T, KARUNARATHNE B. Enhanced feature aggregation for deep neural network based speaker embedding[C]// 2022 Moratuwa Engineering Research Conference (MERCon). Moratuwa: IEEE, 2022: 1-5.
|
| [18] |
宋昕洋, 阎志远, 孙沐毅, 等. 说话人生成研究现状与发展趋势[J]. 计算机科学, 2023, 50(8): 68-78.
doi:10.11896/jsjkx.221000031
SONG Xinyang, YAN Zhiyuan, SUN Muyi, et al. Review of talking face generation[J]. Computer Science, 2023, 50(8): 68-78.
doi:10.11896/jsjkx.221000031
|