Abstract

Background:

This study addresses three key challenges in subject-independent electroencephalography (EEG) emotion recognition: limited data availability, restricted cross-domain knowledge transfer, and suboptimal feature extraction. The aim is to develop an innovative framework that enhances recognition performance while preserving data privacy.

Methods:

This study introduces a novel multi-teacher knowledge distillation framework that incorporates data privacy considerations. The framework comprises n subnets, each sequentially trained on distinct EEG datasets without data sharing. The subnets, excluding the initial one, acquire knowledge through the weights and features of all preceding subnets, enabling access to more EEG signals during training while maintaining privacy. To enhance cross-domain knowledge transfer, a multi-teacher knowledge distillation strategy was designed, featuring knowledge filters and an adaptive multi-teacher knowledge distillation loss. The knowledge filter integrates cross-domain information using a multi-head attention module with a gate mechanism, ensuring effective inheritance of knowledge from all previous subnets. Simultaneously, the adaptive multi-teacher knowledge distillation loss dynamically adjusts the direction of knowledge transfer based on filtered feature similarity, preventing the knowledge loss that arises in single-teacher models. Furthermore, a spatio-temporal gate module is proposed to eliminate unnecessary frame-level information from different channels and extract important channels for improved feature representation without requiring expert knowledge.

Results:

Experimental results demonstrate the superiority of the proposed method over the current state of the art, achieving a 2% performance improvement on the DEAP dataset.

Conclusions:

The proposed multi-teacher distillation framework with data privacy addresses the challenges of insufficient data availability, limited cross-domain knowledge transfer, and suboptimal feature extraction in subject-independent EEG emotion recognition, demonstrating strong potential for scalable and privacy-preserving emotion recognition applications.

1. Introduction

In recent years, there has been extensive exploration of biological signals to uncover potential relationships between the nervous system and physiological functions in humans [1, 2, 3]. This investigation has revealed new avenues in fields such as disease monitoring [4], exercise therapy [5, 6], and exercise rehabilitation [7, 8]. In the realm of implantable signal processing, electromyography (EMG) signals have been used to control bionic hands for amputees [9] with remarkable success. However, the implantable collection mode entails open wounds and the associated hazards, rendering the non-invasive mode a safer alternative. Electroencephalography (EEG) signals, as a standard non-invasive neural signal, have garnered considerable interest due to their robust temporal resolution and resistance to interference [10]. EEG emotion recognition [11, 12], in particular, has emerged as a valuable method for detecting fear and anxiety disorders, contributing to the improvement of mental health. Meanwhile, publicly available EEG emotion datasets, including SEED (https://bcmi.sjtu.edu.cn/home/seed/) [13], DEAP (http://www.eecs.qmul.ac.uk/mmv/datasets/deap/) [14] and DREAMER (https://zenodo.org/record/546113) [15], have provided a foundation for the rapid development of this field.

Currently, artificial intelligence (AI)-based approaches, encompassing both traditional machine learning techniques and deep learning (DL) methods [16, 17], have become the predominant paradigm in EEG-based emotion recognition. Approaches such as Support Vector Machines (SVM) [18], Long Short-Term Memory (LSTM) [19], and Convolutional Neural Networks (CNN) [20] have achieved notable success, contributing to substantial improvements in overall performance. However, most of these methods operate in a subject-dependent mode, which restricts their generalizability and practical applicability. In contrast, the subject-independent mode offers broader application prospects but faces three main challenges in EEG emotion recognition. (1) Insufficient available data [21]: a large volume of cross-subject data is essential for model training; otherwise, it becomes difficult to capture subject-invariant features [22]. Because EEG signals are sensitive and privacy-preserving by nature, they are hard to collect in large quantities, which hinders performance improvement in subject-independent settings. (2) Insufficient cross-domain knowledge transfer [23]: the EEG signals of different subjects come from different domains, so obtaining universal knowledge from a subset of subjects and transferring it across domains to recognize emotions in other subjects is critically important [24, 25, 26]. However, current cross-domain transfer methods remain too limited to further improve recognition precision. (3) Insufficient feature extraction: most EEG feature extraction methods, such as differential entropy [27], asymmetric spatial pattern [28], and high-order zero crossing count [29], rely on manual design, which demands deep domain expertise and can cause information loss. Moreover, existing automatic feature extraction methods generally focus on spatial information [30] and ignore the screening of time segments, so a large amount of unnecessary information interferes with the recognition results.

In response to the above issues, this paper designs the framework from three corresponding aspects. First, to obtain more data for model training, the paper proposes a multi-teacher knowledge distillation (KD) framework, which includes n subnets under serial mode. All subnets, except the initial one, are trained sequentially on distinct EEG datasets, incorporating feature representations from all previously trained subnets to consolidate accumulated knowledge and facilitate cross-domain knowledge transfer [31]. During this process, each subnet relies solely on the pretrained weights of preceding subnets without sharing raw EEG data, thereby preserving data privacy [32] while enhancing performance in EEG-based emotion recognition.

Second, to enhance cross-domain knowledge transfer, the study introduces a knowledge filter (KF) and an adaptive multi-teacher KDloss. The KF integrates features from multiple domains using a gated multi-head attention mechanism, selectively inheriting knowledge relevant to the current domain and thereby improving cross-domain knowledge transfer. Meanwhile, the adaptive multi-teacher KDloss regulates the direction of knowledge transfer by evaluating feature similarities across all preceding subnets, mitigating the potential knowledge loss associated with reliance on a single teacher model.

Third, to improve the effectiveness of automatic feature extraction, the study proposes a spatio-temporal gate (STG) module to handle varying reaction periods of the same emotion across different EEG electrodes. The module identifies salient frame-level segments for each channel by computing similarities between frames and applying a sigmoid function, and subsequently adjusts channel-wise weights through self-attention to optimize frame-level representations. This feature extraction process aligns with the characteristics of EEG signals, effectively filtering out irrelevant information.

In summary, a multi-teacher knowledge distillation framework with data privacy (MTKDDP) is proposed for EEG emotion recognition. The framework employs adaptive knowledge distillation, feature filtering, and spatio-temporal gating to tackle the challenges of limited cross-domain knowledge transfer, inadequate data, and suboptimal feature extraction, thereby enhancing generalization, robustness, and automatic feature learning in subject-independent contexts. Experimental results demonstrate a 2% performance improvement over current methods on the DEAP dataset.

2. Related Work
2.1 Knowledge Distillation

KD [33] has emerged as a crucial technique for model compression and transfer learning. By leveraging a loss function that incorporates both soft and hard targets, KD enables the effective transfer of knowledge from a teacher network to a student network. As a widely adopted approach, it facilitates transferring knowledge from one or multiple large teacher models to a more compact student model [33, 34]. However, conventional KD approaches generally depend on a single teacher network, which limits knowledge diversity and fails to effectively address cross-domain variability in EEG emotion recognition [23, 26]. To overcome this limitation, multi-teacher knowledge distillation (MTKD) has been introduced, allowing a student model to learn from multiple teachers and thereby enhancing the diversity and robustness of knowledge transfer [35, 36, 37, 38]. For example, Ye et al. [39] propose a multi-teacher feature-ensemble strategy that aggregates representations from multiple teachers to enhance transfer effectiveness. Nevertheless, many existing MTKD methods aggregate teacher outputs using simple averaging or other static rules, which can degrade performance when teacher predictions are of low quality [40, 41]. Despite these advances, MTKD still faces limitations in facilitating cross-domain knowledge transfer, as naive feature aggregation may result in information loss and fail to fully capture heterogeneous domain knowledge. To address this issue, this paper proposes an MTKD method that combines KF with an adaptive multi-teacher KDloss, selectively integrating complementary cross-domain information while minimizing redundancy and noise, thereby improving generalization and robustness in subject-independent EEG emotion recognition.

2.2 EEG-Based Emotion Recognition

AI-based technologies for EEG emotion recognition have achieved significant progress, encompassing both traditional machine learning and DL approaches. Early methods, such as SVM and K-Nearest Neighbors (KNN), demonstrate feasibility but depend on manual feature engineering and struggle to scale to high-dimensional EEG data [42, 43]. Subsequent variants like the Universal Support Vector Machine (USVM) aim to reduce complexity yet still show generalization limitations [44]. DL frameworks emerge as powerful alternatives, learning complex representations directly from EEG signals. For instance, regional-asymmetric convolution enhances fine-grained pattern capture, while generative modeling supports improved training efficiency [30, 45]. Despite these advances, the subject-independent setting continues to face inadequate data availability. Robust, generalizable features require large and diverse datasets, which remain difficult to obtain in practice [11, 13]. Data augmentation and transfer learning provide partial alleviation but frequently require cross-subject or cross-institutional sharing of raw EEG data, thereby introducing privacy and compliance concerns [16, 17, 46]. KD addresses these concerns by enabling knowledge transfer without exposing raw data [32]; nevertheless, most KD approaches in EEG emotion recognition rely on single-teacher frameworks, constraining knowledge diversity and limiting robustness under conditions of data scarcity [47, 48]. Recent work on multi-teacher KD introduces adaptive teacher selection and weighting to improve stability and transfer quality, although its application to EEG emotion recognition remains underexplored [36, 39].

Beyond data limitations, inadequate feature extraction constitutes a second major bottleneck. EEG signals exhibit rich spatiotemporal dynamics, necessitating models that capture spatial–temporal dependencies while effectively suppressing noise. Handcrafted features, such as differential entropy and asymmetric spatial patterns, provide partial solutions but remain susceptible to information loss and rely heavily on expert design [27, 28]. DL models improve automatic representation learning, yet many lack dynamic mechanisms to filter redundant or irrelevant frame-level segments. For instance, models designed to learn spatiotemporal dependencies may still omit adaptive cross-channel gating, leading to residual noise and reduced accuracy [49]. Addressing these gaps requires frameworks that simultaneously extract fine-grained spatiotemporal representations and perform dynamic filtering. To this end, the proposed MTKDDP framework sequentially trains multiple subnets on heterogeneous EEG datasets without sharing raw data and distills complementary knowledge from multiple teachers to enhance generalization under conditions of limited data [36, 39]. In parallel, an STG module adaptively selects salient frame-level features across channels using attention-based similarity, thereby reducing redundancy and strengthening representation quality [50]. In combination, knowledge filtering and adaptive distillation loss further improve cross-domain generalization while maintaining privacy protection.

3. Method

This section presents a comprehensive overview of the four components of the MTKDDP framework: EEG signal preprocessing, the MTKDDP architecture, the STG module, and the MTKDDP algorithm.

3.1 EEG Signal Preprocessing

The original EEG signals are commonly represented as multi-dimensional tensors that include baseline signals, which do not directly satisfy the matrix-like input format that neural networks generally require. Consequently, preprocessing is necessary to reshape the EEG signals into suitable dimensions for network input. Based on the design of the STG module, the original EEG signal is processed into a signal of size (C, T, N) without the baseline signals, where C is the number of channels, T is the number of time steps, and N is the number of samples. The detailed preprocessing steps are as follows:

First, the baseline in the original EEG signals is removed using signals recorded in a calm state. Specifically, the per-second mean of the calm-state signals is computed as the baseline, which is then subtracted from the emotional signals. Second, similar to Wang’s method [48], the processed signal is divided into multiple frame-level signals of 3 seconds each [50]; this length is chosen to ensure sufficient emotional information while avoiding excessive redundancy. Therefore, if the total length of the original signal is T_sum, then T = T_sum/3 and the size of the signal is (T, C, N). Afterwards, the signal is transposed to size (C, T, N), which meets the input requirements of our framework. Additionally, to reduce model complexity, downsampling is applied, proportionally scaling the data with respect to N to improve computational efficiency.
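As an illustration, the following NumPy sketch reproduces these steps under stated assumptions: a 128 Hz sampling rate, a trial/baseline layout chosen for the example, and a hypothetical function name; the subsequent downsampling of N is omitted.

```python
import numpy as np

def preprocess_eeg(trial, baseline, fs=128, seg_seconds=3):
    """Hypothetical sketch of the Section 3.1 pipeline.

    trial:    emotional recording, shape (C, T_sum * fs)
    baseline: calm-state recording, shape (C, B * fs)
    Returns a (C, T, N) tensor with T = T_sum // seg_seconds frames
    of N = seg_seconds * fs samples per channel.
    """
    C = trial.shape[0]
    # Per-second mean of the calm-state signal serves as the baseline.
    base_sec = baseline.reshape(C, -1, fs).mean(axis=1)        # (C, fs)
    # Subtract the baseline second from every second of the trial.
    sec = trial.reshape(C, -1, fs) - base_sec[:, None, :]      # (C, T_sum, fs)
    # Group 3-second windows into frame-level segments: (C, T, N).
    T = sec.shape[1] // seg_seconds
    return sec[:, : T * seg_seconds].reshape(C, T, seg_seconds * fs)

# Example: a 60 s trial with a 3 s baseline on 32 channels -> (32, 20, 384);
# the paper then downsamples along N (e.g., toward the 32 x 20 x 60 tensors of Section 4.1).
x = preprocess_eeg(np.random.randn(32, 60 * 128), np.random.randn(32, 3 * 128))
```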

3.2 MTKDDP Architecture

To optimize performance in subject-independent EEG emotion recognition, this work proposes a multi-teacher KD framework that preserves data privacy [46, 51, 52], as shown in Fig. 1, enhancing both knowledge transfer and automatic feature extraction. The framework includes n subnets and (n-2) KFs, which are detailed in the subsequent subsections. Except for the initial subnet, each subnet contains only 1.3 M parameters, demonstrating the framework’s efficiency. The framework is optimized by sequentially training subnets on different EEG signals to ensure privacy protection. Each subnet comprises two LSTM layers, a gate module, and a multi-dense module; a minimal sketch follows Fig. 1. The first LSTM layer extracts temporal features, while the second LSTM layer maps these features to labels. The gate module, positioned between the two LSTM layers and detailed in the following subsection, selectively eliminates unnecessary features. Finally, the multi-dense module, located at the end, performs label prediction. Notably, the first LSTM layer and gate module of the first subnet differ from those of subsequent subnets: the first LSTM layer of the first subnet is trained jointly on the single-channel EEG signals to obtain feature information, and its gate module is the STG module, which screens emotional response periods across electrode positions and removes irrelevant information, while subsequent gate modules operate on temporal features. During training, the weights of each trained subnet are frozen, and only the filtered features are provided as input to the following subnet to enhance knowledge transfer. When more than two subnets are employed, the KF selects features generated by all preceding subnets to ensure effective knowledge transfer. In addition, except for the first LSTM layer of each subnet, which uses the ReLU activation function, and the output layer, which uses the Sigmoid activation function, all other activation functions are Tanh.

Fig. 1.

The architecture of our MTKDDP for EEG emotion recognition. MTKDDP, multi-teacher knowledge distillation framework with data privacy; EEG, electroencephalography; KD, knowledge distillation; SSIM, Structural Similarity; LSTM, Long Short-Term Memory.
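To make the per-subnet layout concrete, here is a minimal PyTorch sketch of one subnet under the assumptions above (two LSTM layers, a gate between them, a multi-dense head, and the stated activations); the layer sizes and the simplified linear gate are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class Subnet(nn.Module):
    """Illustrative sketch of one MTKDDP subnet (cf. Fig. 1); sizes are assumptions."""
    def __init__(self, in_dim, hid=64):
        super().__init__()
        self.lstm1 = nn.LSTM(in_dim, hid, batch_first=True)  # temporal features (ReLU)
        self.gate = nn.Linear(hid, hid)                       # simplified stand-in gate
        self.lstm2 = nn.LSTM(hid, hid, batch_first=True)      # maps features to labels (Tanh)
        self.dense = nn.Sequential(                           # multi-dense prediction head
            nn.Linear(hid, hid // 2), nn.Tanh(), nn.Linear(hid // 2, 1))

    def forward(self, x):                                     # x: (batch, T, in_dim)
        h, _ = self.lstm1(x)
        h = torch.relu(h)                                     # ReLU after the first LSTM
        h = h * torch.sigmoid(self.gate(h))                   # gate drops unneeded features
        h, _ = self.lstm2(h)
        h = torch.tanh(h)
        return torch.sigmoid(self.dense(h[:, -1]))            # Sigmoid output layer
```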

3.3 Spatio-Temporal Gate Module

Capturing emotional information across the entire EEG signal presents challenges, as the response times at different electrode positions are often inconsistent. To overcome the limitations of existing feature extraction methods that focus solely on spatial information, this work introduces the STG module, illustrated in Fig. 2, which extracts the most relevant features between electrode positions and emotional responses. The STG module effectively filters out unnecessary features from each electrode, thereby enhancing automatic feature extraction. The module comprises three components: a temporal-based gate module set, an LSTM layer, and a spatio-based gate module. In the temporal-based gate module set, gate modules filter features C times, with shared weights allowing the temporal-based gate modules to be optimized more effectively. Subsequently, after the feature size is adjusted, features with dimensions (T, C × N) undergo feature extraction in the LSTM layer. Following another adjustment of the feature size to (C, T × N), feature filtering is performed by the spatio-based gate module. The gate unit used in both the temporal-based and spatio-based gate modules is detailed as follows.

Fig. 2.

The details of spatio-temporal gate module.

Assume the input of the gate module is H = <h_1, h_2, …, h_T>, where T is the number of time steps. First, the similarity between extracted features at different time steps is computed as follows:

(1) $g_{t,t'} = \tanh(W_g h_t + W_{g'} h_{t'} + b_g)$

(2) $e_{t,t'} = \sigma(W_e g_{t,t'} + b_e)$

where $W_g$, $W_{g'}$ and $W_e$ are the weights of the linear layers, $b_g$ and $b_e$ are the biases of the linear layers, $t$ and $t'$ index the time steps, and $\sigma$ denotes the Sigmoid activation function.

The $\sigma$ activation function assigns near-zero weights to irrelevant features in the correlation map, effectively eliminating unnecessary elements and streamlining feature extraction. According to Eqns. 1 and 2, e_t = <e_{t,1}, e_{t,2}, …, e_{t,T}> can be computed. Meanwhile, the weight vector a_t in A_{T×T} = <a_1, a_2, a_3, …, a_T> is obtained as follows:

(3) $a_t = \mathrm{softmax}(e_t)$

Hence, the weight matrix A_{T×T} for filtering features is obtained, and the filtered feature O follows as:

(4) $O = A_{T \times T} H_{T \times 1}$

Through Eqns. 1–4, the gate unit effectively filters out unnecessary features. In addition, all subnets except the first employ these gate units as their gate modules to enhance feature extraction. A sketch of the unit follows.
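The following PyTorch sketch implements Eqns. 1–4 for a single channel; the module name and feature dimension are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GateUnit(nn.Module):
    """Sketch of the gate unit of Eqns. 1-4; the feature dimension is an assumption."""
    def __init__(self, dim):
        super().__init__()
        self.Wg = nn.Linear(dim, dim, bias=False)   # W_g
        self.Wg2 = nn.Linear(dim, dim)              # W_g' (carries the bias b_g)
        self.We = nn.Linear(dim, 1)                 # W_e and b_e

    def forward(self, H):                           # H: (T, dim), rows h_1 .. h_T
        # Eqn. 1: g_{t,t'} = tanh(W_g h_t + W_g' h_t' + b_g) for every pair (t, t')
        g = torch.tanh(self.Wg(H)[:, None, :] + self.Wg2(H)[None, :, :])  # (T, T, dim)
        # Eqn. 2: e_{t,t'} = sigmoid(W_e g_{t,t'} + b_e); near-zero entries are dropped
        e = torch.sigmoid(self.We(g)).squeeze(-1)   # (T, T)
        A = torch.softmax(e, dim=-1)                # Eqn. 3: row-wise softmax -> A_{TxT}
        return A @ H                                # Eqn. 4: filtered feature O

# Example: GateUnit(64)(torch.randn(20, 64)) returns (20, 64) filtered features.
```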

3.4 MTKDDP Algorithm

EEG signals involve personal privacy and are challenging to collect publicly in large quantities, which limits the optimization of models for EEG emotion recognition. In particular, in the subject-independent mode, cross-domain knowledge transfer remains insufficient. Therefore, the paper proposes a multi-teacher KD framework with data privacy, which includes n subnets trained sequentially. The framework enables knowledge transfer to enhance performance without sharing EEG signals, while leveraging multi-teacher KD to facilitate cross-domain knowledge transfer, using the adaptive multi-teacher KDloss to guide the direction of knowledge transfer. The training process is divided into three stages.

First, the preprocessed EEG signals are utilized to train the initial subnet, wherein the network’s output is aligned directly with the ground truth distribution through the cross-entropy loss function, expressed as follows:

(5) $L(p, q) = -\left(p \cdot \log(q) + (1-p) \cdot \log(1-q)\right)$

where p denotes the prediction from the first subnet and q represents the ground truth.

Second, the features filtered by the initial subnet serve as inputs for the subsequent subnet. The KDloss then guides its outputs toward both the soft targets predicted by the initial subnet and the hard targets derived from the actual labels, achieving effective knowledge transfer and completing this training stage. The KDloss is defined as follows:

(6) $\mathrm{KDloss} = \lambda \cdot L(q, \mathrm{softmax}(z)) + (1-\lambda) \cdot L(\mathrm{softmax}(p/T^*), \mathrm{softmax}(z/T^*))$

where $z$ is the prediction of the second subnet before the final Sigmoid function, $T^*$ is the temperature, and $\lambda$ is the trade-off ratio between the real labels and the soft targets.
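A compact sketch of Eqn. 6 for the binary labels used here, with sigmoid standing in for the two-class softmax; λ = 0.9 and T* = 0.9 follow Section 4.1, and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def kd_loss(z, p, q, lam=0.9, t_star=0.9):
    """Eqn. 6 sketch. z: student logits, p: frozen-teacher logits,
    q: ground-truth labels in {0, 1} (float tensor)."""
    hard = F.binary_cross_entropy(torch.sigmoid(z), q)        # L(q, softmax(z))
    soft = F.binary_cross_entropy(torch.sigmoid(z / t_star),  # temperature-softened targets
                                  torch.sigmoid(p / t_star).detach())
    return lam * hard + (1 - lam) * soft
```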

Third, the multi-teacher KD strategy is employed to train the n-th subnet (n > 2), diverging from the preceding two training stages. This phase involves the KF and the adaptive multi-teacher KDloss, as illustrated in Fig. 3. Notably, the features filtered by the preceding (n-1) subnets are collectively transferred to the KF. The KF, shown in Fig. 3 and inspired by multi-head attention [49], selects the important features to better describe all knowledge from previous subnets. In this process, the filtered features O_i are concatenated into a feature of size ((n-1) × C, T, N). Then, Q, K and V are used to compute the similarity between the filtered features and obtain the weight tensor A. Unlike traditional multi-head attention, a Sigmoid function is introduced as a gate to eliminate unnecessary features and improve knowledge transfer, turning A into A* through the Sigmoid function. Afterwards, the features are filtered to select the knowledge from the previous subnets. Finally, the processed features are reduced to tensors of the original dimensions by a linear layer or principal component analysis. In addition, the adaptive multi-teacher KDloss is applied during model optimization to avoid losses in knowledge transfer; here, the similarities between the filtered features O_i are first obtained via Structural Similarity (SSIM) [53]. A sketch of the KF follows Fig. 3.

Fig. 3.

The details of multi-teacher knowledge distillation strategy. GT, ground truth.
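The sketch below captures the KF's gated attention under simplifying assumptions: a single attention head instead of several, scaled dot-product similarity, and a linear layer for the final dimensionality reduction; all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class KnowledgeFilter(nn.Module):
    """Single-head sketch of the KF (cf. Fig. 3); sizes are assumptions."""
    def __init__(self, dim, n_teachers):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.reduce = nn.Linear(n_teachers * dim, dim)   # back to original dims

    def forward(self, feats):                # feats: list of (T, dim) features O_i
        X = torch.cat(feats, dim=0)          # stack the (n-1) teacher features
        A = self.q(X) @ self.k(X).T / X.size(-1) ** 0.5
        A_star = torch.sigmoid(A)            # gate: near-zero weights drop features
        O = A_star @ self.v(X)               # filtered cross-domain knowledge
        n, T = len(feats), feats[0].size(0)
        O = O.reshape(n, T, -1).permute(1, 0, 2).reshape(T, -1)
        return self.reduce(O)                # linear layer restores the original size
```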

The adaptive loss is defined as follows:

(7) $\mathrm{KDloss} = \lambda \cdot L(q, \mathrm{softmax}(z)) + (1-\lambda) \cdot \big(k \cdot L(\mathrm{softmax}(p_1/T^*), \mathrm{softmax}(z/T^*)) + (1-k) \cdot L(\mathrm{softmax}(p_2/T^*), \mathrm{softmax}(z/T^*))\big)$

where $k$ is the ratio balancing the two soft-target directions, and $z$, $p_1$ and $p_2$ denote the predictions of the current subnet, its immediately preceding subnet, and the $i$-th subnet selected by Eqn. 8, respectively.

Here, to retain maximum knowledge, the subnet whose output exhibits the lowest similarity to the feature $O_n$ is employed as an additional teacher network to optimize the current subnet. Thus, $i$ is obtained as follows:

(8) $i = \mathrm{argmin}_i \, \mathrm{SSIM}(O_i, O_n), \quad i = 1, 2, \ldots, n$

During the training process, k adjusts the balance between training cost and final performance. As k increases, the subnet mainly inherits the knowledge of its immediately preceding subnet, so the model converges quickly but generalizes weakly. Conversely, as k decreases, the model achieves improved generalization, although convergence is slower. Increasing the number of subnets can enhance both performance and generalization, but this incurs substantial computational cost. To address this issue, a pruning method is employed, in which the model is pruned and training is halted once the feature similarity of three consecutive subnets exceeds a predefined threshold.
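Putting Eqns. 7 and 8 together, the sketch below selects the extra teacher by SSIM and forms the adaptive loss. It assumes the filtered features are 2D arrays larger than scikit-image's default 7 × 7 SSIM window, reuses the sigmoid-for-softmax simplification from the Eqn. 6 sketch, and all names are illustrative.

```python
import numpy as np
import torch
import torch.nn.functional as F
from skimage.metrics import structural_similarity as ssim

def pick_extra_teacher(features):
    """Eqn. 8: among earlier subnets, pick the one whose filtered feature
    is least similar (by SSIM) to the latest teacher's feature O_n."""
    O_n = features[-1]
    rng = float(O_n.max() - O_n.min()) or 1.0
    sims = [ssim(O_i, O_n, data_range=rng) for O_i in features[:-1]]
    return int(np.argmin(sims))

def adaptive_kd_loss(z, q, p1, p2, lam=0.9, k=0.9, t_star=0.9):
    """Eqn. 7 sketch: z = current subnet logits, p1 = previous subnet,
    p2 = SSIM-selected subnet, q = hard labels in {0, 1}."""
    hard = F.binary_cross_entropy(torch.sigmoid(z), q)
    soften = lambda t: torch.sigmoid(t / t_star)
    s1 = F.binary_cross_entropy(soften(z), soften(p1).detach())
    s2 = F.binary_cross_entropy(soften(z), soften(p2).detach())
    return lam * hard + (1 - lam) * (k * s1 + (1 - k) * s2)
```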

4. Experiments

This section presents the implementation details used to demonstrate the performance of the framework. Then, the performance of the MTKDDP framework is analyzed through comparative experiments on the public DEAP dataset [14] and DREAMER dataset [15] in subject-independent mode. Subsequently, the effectiveness of the proposed methods is evaluated sequentially through ablation experiments and visualization analyses.

4.1 Implementation Details

In comparative experiments, the selected frameworks include classical models such as Convolutional Long Short-Term Memory (CLSTM) [54] and Attention-LSTM [55], as well as state-of-the-art frameworks including Recurrent Attention Convolutional Neural Network (RACNN) [30], Attention-based Temporal Dense Dual (ATDD)-LSTM [56], Frame-Level Teacher-Student Learning With Data Privacy (FLTSDP) [47], Frame-Level Distilling Neural Network (FLDnet) [48], Cascaded Gated Recurrent Unit-Multi-Channel Dynamic Graph Network (CGRU-MDGN) [42], and Multi-Task Learning Fusion Network (MTLFuseNet) [57]. A dual strategy is employed for selecting comparison algorithms, incorporating both classical and state-of-the-art (SOTA) approaches to ensure a comprehensive and fair evaluation. Classical models such as CLSTM and Attention-LSTM serve as widely recognized baselines in EEG-based emotion recognition due to their established capability in feature extraction and temporal sequence modeling, providing reliable benchmarks for assessing the overall effectiveness of the proposed method. In addition, recent SOTA methods—including FLTSDP, which integrates data privacy protection mechanisms, as well as other advanced approaches based on knowledge distillation or transfer learning—are incorporated to evaluate performance against the latest developments in data privacy, cross-domain knowledge transfer, and feature extraction. This combination ensures that the evaluation reflects both competitiveness with established baselines and superiority in cutting-edge scenarios.

Here, the EEG signals are first preprocessed into 32 × 20 × 60 tensors and 14 × 20 × 48 tensors, respectively, for training. Second, considering the number of participants in the public datasets, where the DEAP dataset [14] and DREAMER dataset [15] have 32 and 23 subjects respectively, the number of subnets in the framework is set to 3 so that each subnet attains adequate generalization performance within 120 training epochs. In addition, the multi-teacher KD parameters k, T* and the learning rate are set to 0.9, 0.9 and 0.001, respectively. The emotional dimensions (valence, arousal, dominance, and liking) are annotated based on participants’ self-reported ratings in the DEAP and DREAMER datasets. For each stimulus, participants provide evaluations on a standardized 9-point Likert scale. For binary classification, a threshold of 5 is applied, where scores equal to or greater than 5 are labeled as “high” and scores below 5 as “low”, in accordance with widely adopted practice in EEG-based emotion recognition [58]; a one-line example follows. Although these labels are self-reported and conventional inter-rater reliability metrics are not applicable, both datasets serve as well-established benchmarks with validated rating procedures. This labeling protocol ensures a clear separation between high and low emotional states while maintaining a balanced distribution of samples for model training, thereby providing a reliable foundation for the subsequent comparative and ablation experiments.
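The thresholding itself is a one-liner; the function name is illustrative.

```python
import numpy as np

def binarize(ratings, threshold=5):
    """Map 9-point Likert ratings to binary labels: >= 5 -> 1 (high), < 5 -> 0 (low)."""
    return (np.asarray(ratings, dtype=float) >= threshold).astype(int)

binarize([3.0, 5.0, 7.5])  # -> array([0, 1, 1])
```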

To statistically validate the differences among multiple algorithms, the Friedman test is employed as a nonparametric approach suitable for comparing the rankings of several models across multiple datasets or experimental conditions [59]. This test first ranks the results of different algorithms across multiple datasets or experimental conditions and then computes the Friedman statistic τF based on the average ranks. The statistic is subsequently compared with the critical value from the chi-square distribution at a given confidence level (e.g., α = 0.05) to assess whether the differences among algorithms are statistically significant. The key parameters include the number of algorithms k, the number of datasets or experimental cases N, and the corresponding critical value. When the null hypothesis—that all algorithms perform equally—is rejected, post hoc procedures such as the Nemenyi test are applied to conduct pairwise comparisons and identify which algorithms differ significantly.
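As a sketch of this procedure, the snippet below runs the Friedman test with SciPy on toy accuracies and computes the Nemenyi critical distance from Demšar's formula; the numbers are illustrative, not results from this paper.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Toy accuracies: k = 3 algorithms (columns) on N = 4 conditions (rows).
acc = np.array([[0.92, 0.84, 0.88],
                [0.95, 0.80, 0.90],
                [0.93, 0.82, 0.89],
                [0.96, 0.85, 0.91]])
stat, p = friedmanchisquare(*acc.T)   # ranks algorithms within each condition
print(f"Friedman statistic = {stat:.3f}, p = {p:.4f}")

# Nemenyi critical distance: CD = q_alpha * sqrt(k * (k + 1) / (6 * N)),
# with q_alpha from the studentized-range table (2.343 for k = 3, alpha = 0.05).
k, N, q_alpha = 3, 4, 2.343
cd = q_alpha * np.sqrt(k * (k + 1) / (6 * N))
print(f"Nemenyi CD = {cd:.3f}; average-rank gaps larger than CD are significant")
```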

4.2 Experiments on DEAP Dataset

First, to effectively demonstrate the performance of the framework, the larger DEAP dataset is employed for comparative experiments without data privacy in the subject-independent mode. Experiments utilize 5-fold cross-validation to evaluate the effectiveness and robustness of all methods. Consequently, 32 subjects are randomly divided into five groups for training and testing. The results, presented in Table 1, indicate that the framework outperforms all comparison methods across the four emotional dimensions, achieving an improvement exceeding 2%. These results further demonstrate the effectiveness of the framework in automatic feature extraction and cross-domain knowledge transfer. Compared with the FLTSDP [47] and FLDnet [48] frameworks, which also perform frame-level feature filtering, the proposed method demonstrates substantially enhanced performance. This improvement indicates that the STG module accurately captures distinct reaction times across electrode positions, thereby more effectively avoiding interference from unnecessary features. Furthermore, MTKDDP outperforms these cross-domain integrated knowledge methods in terms of overall generalization and knowledge transfer. The CGRU-MDGN proposed in Guo and Wang [49] effectively processes both the temporal dynamics and spatial features of EEG signals, including their non-Euclidean relationships, thereby enhancing the accuracy of emotion recognition. However, this model has limitations in cross-domain knowledge transfer, while MTKDDP adopts a multi-teacher KD strategy to integrate knowledge from different subnets more effectively and enhance cross-domain knowledge transfer. In addition, MTLFuseNet is an emotion recognition model based on EEG deep latent feature fusion and multi-task learning [57]. MTKDDP addresses challenges such as multi-task feature interference, significant knowledge forgetting, and low computational efficiency. It employs multi-teacher hierarchical distillation to decouple task-specific features, dynamic knowledge solidification to mitigate forgetting, and combines parameter sharing with asynchronous updates to improve efficiency, thereby substantially enhancing accuracy and robustness. Furthermore, whereas conventional multi-teacher approaches often incur prohibitive computational overhead, MTKDDP achieves comparable time complexity to SOTA single-teacher frameworks through optimized parameter sharing and asynchronous gradient updates. This dual advantage of increased robustness against forgetting and preserved operational efficiency highlights the distinctive applicability of the framework in real-world scenarios.

Table 1. Comparisons of accuracies and standard deviations on the DEAP dataset.
Prediction target Attention-LSTM RACNN ATDD-LSTM FLDNet FLTSDP CGRU-MDGN MTLFuseNet MTKDDP
Valence 76.77% ± 10.71 80.55% ± 12.50 74.73% ± 13.07 83.85% ± 11.34 92.40% ± 5.20 69.92% ± 9.42 71.33% ± 5.24 94.24% ± 4.50
Arousal 70.71% ± 8.88 74.64% ± 8.72 67.44% ± 8.03 78.22% ± 10.14 82.51% ± 7.28 70.08% ± 10.28 73.28% ± 7.74 92.58% ± 4.98
Dominance 72.06% ± 9.79 74.64% ± 11.01 68.41% ± 9.98 77.52% ± 10.30 89.95% ± 5.63 71.01% ± 13.43 - 93.27% ± 4.00
Liking 76.48% ± 9.06 79.30% ± 11.80 75.47% ± 9.38 82.42% ± 9.56 91.20% ± 4.40 - - 94.64% ± 3.51
Valence - dp 65.31% ± 11.98 80.88% ± 12.19 56.62% ± 31.69 - 91.45% ± 6.04 - - 94.89% ± 4.46
Arousal - dp 62.67% ± 7.76 67.69% ± 10.23 58.94% ± 19.06 - 85.03% ± 7.90 - - 91.92% ± 4.27
Dominance - dp 66.07% ± 6.93 69.07% ± 11.99 62.22% ± 19.61 - 87.49% ± 4.50 - - 91.90% ± 4.65
Liking - dp 66.61% ± 10.43 77.59% ± 11.28 66.61% ± 10.43 - 89.14% ± 6.60 - - 93.87% ± 5.07
Training time (s/epoch) 18.21 85.06 123.21 33.67 36.01 - - 38.10

Note: dp indicates that only a few subjects in the training dataset are used to train with data privacy. Attention-LSTM, Attention-based Long Short-Term Memory; RACNN, Recurrent Attention Convolutional Neural Network; ATDD-LSTM, Attention-based Temporal Dense Dual-Long Short-Term Memory; FLDNet, Frame-Level Distilling Neural Network; FLTSDP, Frame-Level Teacher-Student Learning With Data Privacy; CGRU-MDGN, Cascaded Gated Recurrent Unit - Multi-Channel Dynamic Graph Network; MTLFuseNet, Multi-Task Learning Fusion Network; MTKDDP, Multi-Teacher Knowledge Distillation Framework with data privacy.

4.3 Learning Mechanism With Data Privacy

Second, to evaluate the knowledge transfer capability of the framework, comparative experiments with private data are conducted on the DEAP dataset. In these experiments, the training set under 5-fold cross-validation is divided into three groups, each containing non-overlapping EEG data from 9, 8, and 8 subjects, respectively. The results, presented in Table 1, demonstrate the optimal performance achieved by the framework. Compared with classical methods, the framework exhibits a clear performance advantage, particularly in the stability and accuracy of predictions, which also reflects the effectiveness of the privacy protection mechanisms. In comparison with the FLTSDP framework [47], which similarly incorporates privacy protection mechanisms, the superior performance of the framework highlights the contribution of the multi-teacher KD strategy in facilitating cross-domain knowledge transfer.

4.4 Ablation Study

Third, to assess the effectiveness of the proposed modules, two types of ablation studies are conducted on the DEAP dataset. In the first type of ablation experiment, the STG module, teacher-student (TS) framework, KF module, and adaptive multi-teacher knowledge distillation loss (AMTKDL) are sequentially integrated into the base model, which corresponds to the Attention-based Long Short-Term Memory (attLSTM) model from Wang’s framework [48]. In the second type of ablation experiment, individual removal of the STG or KF modules is tested, further confirming the essential contribution of each module to feature selection and knowledge integration. The results of these ablation experiments are presented in Table 2. As shown, the proposed modules successfully enhance the performance of EEG emotion recognition, particularly the STG module, indicating its key role in eliminating unnecessary features and effectively improving overall performance. Furthermore, compared with a single network, multi-classifiers in the TS framework effectively integrate knowledge from previous subnets, enhancing network robustness. The KF module optimizes knowledge composition to further improve feature extraction, as reflected in the experimental results. Finally, the AMTKDL refines the robustness of knowledge transfer and further enhances the precision of the framework.

Table 2. Ablation study results on the DEAP dataset.
Prediction target Valence Arousal Dominance Liking
attLSTM (A1) 87.68% ± 7.94 80.83% ± 6.45 80.70% ± 8.29 85.52% ± 7.26
A1+STG (A2) 87.21% ± 8.40 87.26% ± 14.60 87.98% ± 6.31 87.53% ± 6.12
A2+TS (A3) 91.05% ± 8.40 89.26% ± 5.12 90.27% ± 4.35 89.70% ± 3.62
A3+KF (A4) 93.45% ± 4.85 90.47% ± 4.58 92.64% ± 4.24 93.93% ± 3.96
A4+AMTKDL (A5) 94.24% ± 4.51 92.58% ± 4.98 93.27% ± 4.00 94.64% ± 3.51
A5-STG (A6) 93.82% ± 6.35 87.35% ± 6.83 88.67% ± 8.44 92.88% ± 4.39
A5-KF (A7) 93.20% ± 7.45 91.64% ± 7.43 92.06% ± 5.88 93.21% ± 7.33
A5+DP 94.89% ± 4.46 91.92% ± 4.27 91.90% ± 4.65 93.87% ± 5.07

Note: attLSTM is the base model.

To quantitatively assess the findings from the ablation experiments, the Friedman test is conducted, followed by the Nemenyi post-hoc test. The Friedman statistic τF is 21.00, exceeding the critical value of 2.488 at a significance level of α = 0.05, indicating that the performance differences among the ablation configurations are statistically significant. Subsequently, the Nemenyi post-hoc test is applied, with the critical distance (CD) calculated as 5.2498 based on q_0.05, allowing for a precise distinction of pairwise differences between the algorithms. As illustrated by the Friedman test results (Fig. 4), the complete model (A5+DP) achieves the highest performance, whereas removal of either the STG or KF module leads to notable performance degradation. These results underscore the critical role of the STG and KF components in ensuring the robustness and effectiveness of the proposed framework.

Fig. 4.

Friedman test of ablation study. DP, data privacy; KF, knowledge filter; STG, spatiotemporal gate; AMTKDL, adaptive multi-teacher knowledge distillation loss; TS, teacher-student; attLSTM, Attention-based Long Short-Term Memory.

4.5 Ablation Study in Knowledge Transfer and Visualization of KF

Then, to evaluate the capability of knowledge transfer among multiple subnets, ablation experiments on knowledge transfer methods and feature screening visualization experiments are conducted. First, several knowledge transfer methods—including the TS framework with Student-1 and Student-2, integrated networks, KF, and AMTKDL—are presented in Table 3. Here, the base model corresponds to the attLSTM model from Wang’s framework [48] with the STG module. From the results in Table 3, the TS network clearly enhances EEG emotion recognition; however, deeper student networks do not substantially improve recognition accuracy. Similarly, the integrated network from FLTSDP [47] does not further improve performance and even underperforms compared with the deeper student networks. The KF mechanism effectively reorganizes knowledge, capturing information more aligned with the current domain, better integrating knowledge from previous subnets, and thereby facilitating EEG emotion recognition. To further assess the contributions of different components in the knowledge transfer process, the Friedman test accompanied by the Nemenyi post-hoc analysis was conducted on the results reported in Table 3. The computed Friedman statistic τF reached 57.0, which is above the critical threshold of 2.488 at a significance level of α = 0.05, confirming that the performance differences across the ablation settings are statistically meaningful. The subsequent Nemenyi test was performed to identify specific pairwise differences, with the critical distance (CD) calculated as 3.7702 based on q0.05. As depicted in the Friedman test (Fig. 5), the configuration including all knowledge transfer components (A5+AMTKDL) achieves the top ranking, whereas omitting student networks or the KF module leads to noticeable reductions in performance. These results highlight the importance of sequential knowledge integration and feature refinement for enhancing the effectiveness and stability of the proposed framework.

Table 3. Ablation study results of knowledge transfer on the DEAP dataset.
Prediction target Valence Arousal Dominance Liking
Base (A1) 87.21% ± 8.40 87.26% ± 5.17 87.98% ± 6.31 87.53% ± 6.12
A1+S1 (A2) 90.09% ± 6.26 87.89% ± 5.59 89.18% ± 6.03 89.04% ± 6.10
A2+S2 (A3) 91.05% ± 8.40 89.26% ± 5.12 90.27% ± 4.35 89.70% ± 3.62
A3+IN (A4) 91.08% ± 5.33 88.01% ± 5.65 88.98% ± 5.27 89.45% ± 6.75
A3+KF (A5) 93.45% ± 4.85 90.47% ± 4.58 92.64% ± 4.24 93.93% ± 3.96
A5+AMTKDL 94.24% ± 4.51 92.58% ± 4.98 93.27% ± 4.00 94.64% ± 3.51

Note: attLSTM with STG module is the base model. S1, S2, and IN are defined as student-1, student-2, and integrated network, respectively. IN, integrated networks.

Fig. 5.

Friedman test of ablation study in knowledge transfer.

Additionally, the visualization results of the KF mechanism are presented in Fig. 6, illustrating feature extraction outcomes for the four emotional dimensions derived from EEG signals across 32 channels. The “Original signals” row visualizes raw EEG activity, with widespread red regions indicating noisy and unfocused signals with no clear alignment to emotion-specific responses. Rows corresponding to Subnet 1–3 illustrate features extracted by successive subnets. Subnet 1 shows scattered and inaccurate activation, capturing noisy or irrelevant activity. Subnet 2 displays more targeted activations, partially aligning with known emotion-related regions, but still inconsistent. By Subnet 3, activations are highly precise, with red regions sharply highlighting channels critical for each emotional dimension, as the KF module fuses outputs from previous subnets, filtering out noise and domain-irrelevant features while retaining emotion-relevant ones. This visualization demonstrates that early subnets do not fully capture target-domain emotion features; however, as subnets progress under KF guidance, features become increasingly domain-specific and accurately aligned with known neural correlates of emotion. Overall, Fig. 6 illustrates how KF and adaptive subnets collaboratively refine raw, noisy signals into precise, domain-relevant, and neurophysiologically plausible features, validating the effectiveness of adaptive knowledge transfer and the framework’s capability to enhance EEG emotion recognition.

Fig. 6.

Visualization results of KF.

4.6 Visualization of Temporal-Based Gates

To evaluate the adaptive feature extraction capability of the STG module, temporal gate weights from multiple EEG channels are visualized, as shown in Fig. 7. The vertical axis represents electrode channels from AF4 to O2, covering frontal, central, parietal, and occipital regions, while the horizontal axis denotes 20 discrete time steps. Color intensity corresponds to gating weights, with red regions indicating higher weights. The visualization reveals that high-weight time intervals vary across channels. Frontal channels exhibit pronounced activation during early time steps, whereas central and parietal channels show sustained or delayed activation in later periods. This distribution indicates that frame-level features are both channel-specific and time-specific, with different electrodes carrying emotion-relevant information within distinct temporal windows.

Fig. 7.

Visualization results of temporal-based gates.

The STG mechanism applies gating across both temporal and spatial dimensions. For each channel, a temporal gate assigns weights to frame-level features across time steps, followed by spatial gating that integrates contributions from all channels. This process prioritizes informative intervals and suppresses irrelevant or noisy segments, generating spatio-temporal representations optimized for emotion recognition. The strategy enhances signal quality by emphasizing high-weight intervals and reducing noise. Activation patterns in frontal and central-parietal regions align with established stages of emotional processing, improving physiological interpretability. By focusing on emotion-relevant channel-time features, the mechanism strengthens discriminative capability. Overall, the STG module functions as a spatio-temporal filter that automatically identifies critical EEG features for emotion recognition, ensuring effective and interpretable feature extraction.

4.7 Privacy-Preserving Performance

To evaluate the privacy-preserving capability of the proposed framework, differential privacy [60] is incorporated using the Differentially Private Stochastic Gradient Descent (DP-SGD) mechanism, which limits the sensitivity of individual samples through gradient clipping (threshold C = 1.0) and Gaussian noise injection. During training, the privacy budget is set to ε, and four different δ values (10^-2, 10^-3, 10^-4, 10^-5) are tested to investigate the trade-off between privacy and accuracy, with smaller δ values providing stronger privacy at the cost of increased noise.
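For illustration, a minimal DP-SGD step with per-sample clipping at C = 1.0 and Gaussian noise might look as follows in PyTorch; the noise multiplier sigma, which a privacy accountant would derive from the (ε, δ) budget, and all names are assumptions.

```python
import torch

def dp_sgd_step(model, loss_fn, xb, yb, lr=1e-3, C=1.0, sigma=1.0):
    """One DP-SGD update: clip each per-sample gradient to norm C,
    sum, add Gaussian noise of std sigma * C, then average and step."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = [torch.zeros_like(p) for p in params]
    for x, y in zip(xb, yb):                         # per-sample gradients
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in params))
        scale = min(1.0, C / (norm.item() + 1e-12))  # clipping factor
        for g, p in zip(grads, params):
            g.add_(p.grad, alpha=scale)
    with torch.no_grad():
        for g, p in zip(grads, params):
            g.add_(sigma * C * torch.randn_like(g))  # calibrated Gaussian noise
            p.add_(g, alpha=-lr / len(xb))           # averaged noisy update
    return model
```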

As shown in Table 4, accuracy decreases slightly as δ decreases, for example from approximately 91.4% to 89.96%, with performance reduction within 2% and standard deviations remaining relatively stable. These results indicate that the framework maintains robust performance while providing effective privacy protection, confirming the successful integration of DP-SGD without substantial loss of model utility.

Table 4. Accuracy comparison under varying privacy budgets.
δ 0.01 0.001 0.0001 0.00001
Valence 93.49% ± 6.57 92.76% ± 6.77 92.50% ± 6.70 91.98% ± 6.57
Arousal 89.29% ± 8.62 89.00% ± 8.39 88.47% ± 8.69 88.34% ± 8.79
Dominance 89.76% ± 8.20 89.74% ± 7.84 89.39% ± 7.91 89.45% ± 8.28
Liking 92.16% ± 7.32 91.50% ± 7.44 90.64% ± 6.86 90.06% ± 7.64

4.8 Experiments on DREAMER Dataset

Experiments on the DREAMER dataset (Table 5) demonstrate that the MTKDDP framework consistently outperforms SOTA baselines across all three emotional dimensions. The framework achieves marked improvements over single-model approaches, illustrating the effectiveness of leveraging complementary knowledge from multiple teachers to mitigate dataset bias and enhance robustness. Reduced variability further reflects superior stability across subjects and sessions.

Table 5. Comparisons of accuracies on the DREAMER dataset.
Prediction target Attention-LSTM CLSTM ATDD-LSTM FLDNet FLTSDP RACNN MTLFuseNet MTKDDP
Valence 83.19% ± 14.97 83.72% ± 14.60 86.00% ± 13.00 89.91% ± 12.51 91.54% ± 8.14 83.69% ± 14.41 80.03% ± 8.01 96.61% ± 3.73
Arousal 82.17% ± 11.45 81.41% ± 12.92 82.09% ± 11.03 87.67% ± 10.02 90.61% ± 8.13 83.04% ± 12.05 83.33% ± 11.24 96.39% ± 3.32
Dominance 86.67% ± 11.29 85.32% ± 9.58 86.37% ± 9.69 90.28% ± 6.06 91.00% ± 8.02 83.93% ± 12.55 - 97.11% ± 2.92
Valence - dp 81.74% ± 15.46 80.66% ± 18.04 85.37% ± 10.61 - 93.17% ± 7.72 83.87% ± 14.89 - 96.56% ± 2.96
Arousal - dp 81.43% ± 10.7 81.13% ± 12.47 82.95% ± 8.27 - 91.43% ± 7.69 80.82% ± 12.85 - 96.78% ± 3.26
Dominance - dp 84.15% ± 9.97 83.17% ± 11.96 83.54% ± 12.57 - 92.13% ± 6.71 81.59% ± 13.99 - 97.39% ± 2.59

Note: dp indicates that only a few subjects in the training dataset are used to train with data privacy. CLSTM, Convolutional Long Short-Term Memory.

This strong performance stems from the synergistic integration of the STG module’s noise-invariant feature extraction and the MTKD framework’s ability to transfer robust, generalized representations. These results highlight the framework’s potential for practical EEG-based emotion recognition applications under substantial inter-subject variability and diverse recording conditions.

4.9 Friedman Test and Nemenyi Test

To demonstrate the superiority, effectiveness, and robustness of the proposed MTKDDP framework, the Friedman test and Nemenyi post-hoc test were conducted. The null hypothesis of the Friedman test is that all algorithms perform equally well. Comparisons are conducted across four experimental conditions drawn from the two public EEG datasets, involving the MTKDDP framework and seven other representative algorithms (Attention-LSTM, RACNN, CLSTM, ATDD-LSTM, MTLFuseNet, FLDNet, and FLTSDP), for a total of eight algorithms. The Friedman statistic τF is computed as 4.5224, exceeding the critical value of 2.488 at a significance level of α = 0.05 and indicating significant differences among the algorithms. Subsequently, the Nemenyi post-hoc test is conducted, and the Friedman ranking diagram is shown in Fig. 8. Specifically, Fig. 8 is constructed from the results on the DEAP and DREAMER datasets, where the statistical analyses are conducted on the Valence and Arousal dimensions, and the outcomes from these two dimensions are combined to generate the comprehensive ranking diagram. In this figure, the critical distance is calculated as 5.2498, and the central point represents the average ranking. The ranking diagram indicates that the MTKDDP framework consistently outperformed all other methods with statistically significant improvements. These results demonstrate that integrating multiple subnets and extracting robust spatio-temporal features is crucial for enhancing EEG-based emotion recognition performance.

Fig. 8.

Friedman test rankings on the DEAP and DREAMER datasets.

5. Conclusions

For EEG emotion recognition in subject-independent mode, the paper proposes a multi-teacher KD framework with data privacy. The framework sequentially trains multiple subnets with different EEG signals and employs a multi-teacher KD strategy that integrates KF and adaptive KDloss to enhance knowledge transfer. Moreover, the STG module alleviates response time variations across electrodes, thereby improving feature extraction and eliminating irrelevant information. Experimental results demonstrate that the MTKDDP framework consistently outperforms SOTA methods, achieving higher accuracy on the DEAP and DREAMER datasets. While the current work remains at the experimental stage and has not yet been clinically deployed, these findings indicate strong potential for future applications in real-world clinical scenarios.

However, MTKDDP still requires further verification on diverse cross-domain EEG datasets. As more EEG signals become available, the framework can add subnets to improve performance and generalization, though at higher computational cost. To address this, the pruning strategy halts training once the feature similarity among three consecutive subnets exceeds a threshold. Even with more subnets, each maintains a small number of parameters, keeping the overall model size manageable for practical deployment.

Availability of Data and Materials

The datasets used and analyzed in the present study are available upon request from the corresponding author.

Author Contributions

JQY, THG, and CL designed the research study. JQY and THG performed the research. CL and JZX provided help and advice on the experiments, JZX analyzed the data, and JQY drafted the manuscript. THG and CL guided the experimental design and revised the article. All authors contributed to critical revision of the manuscript for important intellectual content. All authors read and approved the final manuscript. All authors have participated sufficiently in the work and agreed to be accountable for all aspects of the work.

Ethics Approval and Consent to Participate

Not applicable.

Acknowledgment

Not applicable.

Funding

This work is supported by the National Natural Science Foundation of China under Grant No. 62403264, the China Postdoctoral Science Foundation under Grant No. 2024M761556, the Qingdao Natural Science Foundation under Grant No. 24-4-4-zrjj-94-jch, the Postdoctoral Innovation Project of Shandong Province under Grant No. SDCX-ZG 202400312, the Qingdao Postdoctoral Applied Foundation under Grant No. QDBSH20240102029, the Natural Science Foundation of Shandong Province under Grant No. ZR2025ZD15, and the Systems Science Plus Joint Research Program of Qingdao University under Grant No. XT2024202.

Conflict of Interest

The authors declare no conflict of interest.

References

Publisher’s Note: IMR Press stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
