FedVCPL-Diff: A federated convolutional prototype learning framework with a diffusion model for speech emotion recognition

Ruobing Li; Yifan Feng; Lin Shen; Liuxian Ma; Haojie Zhang; Kun Qian; Bin Hu; Yoshiharu Yamamoto; Björn W. Schuller

doi:10.1016/j.inffus.2025.103745

Ruobing Li, Yifan Feng, Lin Shen, Liuxian Ma, Haojie Zhang, Kun Qian*, Bin Hu, Yoshiharu Yamamoto, Björn W. Schuller

*Corresponding author for this work

School of Medical Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Speech Emotion Recognition (SER), a key emotion analysis technology, has shown significant value in various research areas. Previous SER models have achieved good emotion recognition accuracy, but typical centrally-based training requires centralised processing of speech data, which has a serious risk of privacy leakage. Federated learning (FL) can avoid centralised data processing through distributed learning, providing a solution for privacy protection in SER. However, FL faces several challenges in practical applications, including imbalanced data distribution and inconsistent labelling. Furthermore, typical FL frameworks focus on client-side enhancement and ignore server-side aggregation strategy optimisation, which can increase the computational load on the client side. To address the aforementioned problems, we propose a novel approach, FedVCPL-Diff. Firstly, regarding information fusion, we introduce a diffusion model on the server side to generate Valence-Arousal-Dominance emotion space features, which replaces the typical aggregation framework and effectively promotes global information fusion. In addition, in terms of information exchange, we propose a lightweight and personalised FL transmission framework based on the exchange of VAD features. FedVCPL-Diff optimises the local model by updating the data distribution anchors, which not only avoids the privacy risk but also reduces the communication cost. Experimental results show that the framework significantly improves emotion recognition performance compared to four commonly used FL frameworks. The overall performance of our framework also shows a significant advantage compared to locally independent models.

Original language	English
Article number	103745
Journal	Information Fusion
Volume	127
DOI	http://doi.org/10.1016/j.inffus.2025.103745
Publication status	Published - Mar 2026

Cite this

@article{8d2255cd4f084f56a5ab841f550e455d,

title = "FedVCPL-Diff: A federated convolutional prototype learning framework with a diffusion model for speech emotion recognition",

abstract = "Speech Emotion Recognition (SER), a key emotion analysis technology, has shown significant value in various research areas. Previous SER models have achieved good emotion recognition accuracy, but typical centrally-based training requires centralised processing of speech data, which has a serious risk of privacy leakage. Federated learning (FL) can avoid centralised data processing through distributed learning, providing a solution for privacy protection in SER. However, FL faces several challenges in practical applications, including imbalanced data distribution and inconsistent labelling. Furthermore, typical FL frameworks focus on client-side enhancement and ignore server-side aggregation strategy optimisation, which can increase the computational load on the client side. To address the aforementioned problems, we propose a novel approach, FedVCPL-Diff. Firstly, regarding information fusion, we introduce a diffusion model on the server side to generate Valence-Arousal-Dominance emotion space features, which replaces the typical aggregation framework and effectively promotes global information fusion. In addition, in terms of information exchange, we propose a lightweight and personalised FL transmission framework based on the exchange of VAD features. FedVCPL-Diff optimises the local model by updating the data distribution anchors, which not only avoids the privacy risk but also reduces the communication cost. Experimental results show that the framework significantly improves emotion recognition performance compared to four commonly used FL frameworks. The overall performance of our framework also shows a significant advantage compared to locally independent models.",

keywords = "Diffusion model, Federated learning, Information fusion, Prototype learning, Speech emotion recognition",

author = "Ruobing Li and Yifan Feng and Lin Shen and Liuxian Ma and Haojie Zhang and Kun Qian and Bin Hu and Yoshiharu Yamamoto and Bj{\"o}rn W. Schuller",

note = "Publisher Copyright: {\textcopyright} 2025 Elsevier B.V.",

year = "2026",

month = mar,

doi = "10.1016/j.inffus.2025.103745",

language = "English",

volume = "127",

journal = "Information Fusion",

issn = "1566-2535",

publisher = "Elsevier B.V.",

}

TY - JOUR

T1 - FedVCPL-Diff

T2 - A federated convolutional prototype learning framework with a diffusion model for speech emotion recognition

AU - Li, Ruobing

AU - Feng, Yifan

AU - Shen, Lin

AU - Ma, Liuxian

AU - Zhang, Haojie

AU - Qian, Kun

AU - Hu, Bin

AU - Yamamoto, Yoshiharu

AU - Schuller, Björn W.

PY - 2026/3

Y1 - 2026/3

N2 - Speech Emotion Recognition (SER), a key emotion analysis technology, has shown significant value in various research areas. Previous SER models have achieved good emotion recognition accuracy, but typical centrally-based training requires centralised processing of speech data, which has a serious risk of privacy leakage. Federated learning (FL) can avoid centralised data processing through distributed learning, providing a solution for privacy protection in SER. However, FL faces several challenges in practical applications, including imbalanced data distribution and inconsistent labelling. Furthermore, typical FL frameworks focus on client-side enhancement and ignore server-side aggregation strategy optimisation, which can increase the computational load on the client side. To address the aforementioned problems, we propose a novel approach, FedVCPL-Diff. Firstly, regarding information fusion, we introduce a diffusion model on the server side to generate Valence-Arousal-Dominance emotion space features, which replaces the typical aggregation framework and effectively promotes global information fusion. In addition, in terms of information exchange, we propose a lightweight and personalised FL transmission framework based on the exchange of VAD features. FedVCPL-Diff optimises the local model by updating the data distribution anchors, which not only avoids the privacy risk but also reduces the communication cost. Experimental results show that the framework significantly improves emotion recognition performance compared to four commonly used FL frameworks. The overall performance of our framework also shows a significant advantage compared to locally independent models.

KW - Diffusion model

KW - Federated learning

KW - Information fusion

KW - Prototype learning

KW - Speech emotion recognition

UR - http://www.scopus.com/pages/publications/105016778608

U2 - 10.1016/j.inffus.2025.103745

DO - 10.1016/j.inffus.2025.103745

M3 - Article

AN - SCOPUS:105016778608

SN - 1566-2535

VL - 127

JO - Information Fusion

JF - Information Fusion

M1 - 103745

ER -
