1 Key Laboratory of Numerical Simulation of Sichuan Provincial Universities, School of Mathematics and Information Sciences, Neijiang Normal University, 641000 Neijiang, Sichuan, China
2 School of Artificial Intelligence, Neijiang Normal University, 641004 Neijiang, Sichuan, China
3 School of Computer Science, Guangdong Polytechnic Normal University, 510665 Guangzhou, Guangdong, China
4 ZUMRI-LYG Joint Lab, Zhuhai UM Science and Technology Research Institute, 519031 Zhuhai, Guangdong, China
5 School of Mathematics and Computer Science, Northwest Minzu University, 730030 Lanzhou, Gansu, China
6 Department of Electrical and Computer Engineering, University of Macau, 999078 Macau, China
7 Guangdong Provincial Key Laboratory of Intellectual Property and Big Data, Guangdong Polytechnic Normal University, 510665 Guangzhou, Guangdong, China
†These authors contributed equally.
Abstract
Emotion recognition from electroencephalography (EEG) can play a pivotal role in the advancement of brain-computer interfaces (BCIs). Recent developments in deep learning, particularly convolutional neural networks (CNNs) and hybrid models, have significantly enhanced interest in this field. However, standard convolutional layers often conflate characteristics across various brain rhythms, complicating the identification of distinctive features vital for emotion recognition. Furthermore, emotions are inherently dynamic, and neglecting their temporal variability can lead to redundant or noisy data, thus reducing recognition performance. Complicating matters further, individuals may exhibit varied emotional responses to identical stimuli due to differences in experience, culture, and background, emphasizing the necessity for subject-independent classification models.
To address these challenges, we propose a novel network model based on depthwise parallel CNNs. Power spectral densities (PSDs) from various rhythms are extracted and projected as 2D images to comprehensively encode channel, rhythm, and temporal properties. These rhythmic image representations are then fed into a newly designed network, EEG-ERnet (Emotion Recognition Network), for emotion recognition.
Experiments conducted on the database for emotion analysis using physiological signals (DEAP) using 10-fold cross-validation demonstrate that emotion-specific rhythms within 5-second time intervals can effectively support emotion classification. The model achieves average classification accuracies of 93.27 ± 3.05%, 92.16 ± 2.73%, 90.56 ± 4.44%, and 86.68 ± 5.66% for valence, arousal, dominance, and liking, respectively.
These findings provide valuable insights into the rhythmic characteristics of emotional EEG signals. Furthermore, the EEG-ERnet model offers a promising pathway for the development of efficient, subject-independent, and portable emotion-aware systems for real-world applications.
Keywords
- electroencephalography
- emotions
- deep learning
- convolutional neural networks
- brain waves
- cross-validation studies
Emotions are key aspects of human behavior, communication, and decision-making, making accurate recognition vital for enhancing user experiences and advancing the development of more effective brain-computer interfaces (BCIs) [1]. Recently, detecting emotional states from brain waves on the scalp has provided a noninvasive method to understand human emotions. Thus, emotion recognition through electroencephalography (EEG) has emerged as a key component of affective computing [2, 3, 4]. EEG signals, generated by the brain’s electrical activity, reflect the affective processes underlying different emotional states. EEG emotion recognition can detect subtle changes in brain activity that may not be discernible through other methods, such as facial expressions, body gestures, text, or speech [5, 6, 7, 8], since these methods could fail to represent emotional nuances, especially in ambiguous external expressions [9]. EEG, in contrast, directly measures brain activity, providing a fundamental understanding of emotional processes and serving as a valuable tool for investigating the neural basis of emotions.
However, EEG emotion recognition presents several challenges due to the inherent complexity and noise in the system. First, EEG signals are highly susceptible to various types of noise and artifacts, including muscle artifacts, eye blinks, and external interference, all of which can degrade recognition accuracy [10]. Second, the high dimensionality and variability of EEG signals make it challenging to extract valuable features that can effectively distinguish between different emotions. Therefore, feature extraction and classification models are necessary. Based on biomedical signal processing techniques, feature extraction involves identifying and selecting relevant features from EEG signals indicative of specific emotions [11, 12, 13]. This process is key for reducing data dimensionality and enhancing the ability to discern subtle emotional cues. Classification models, such as convolutional neural networks (CNNs), long short-term memory (LSTM) networks, cross-modal learning hybrid frameworks, and attention modules [14, 15, 16, 17], have demonstrated considerable potential in overcoming the limitations of traditional approaches. In particular, CNNs are renowned for their ability to learn hierarchical features, making them well-suited to capturing spatial and temporal properties of EEG signals and thereby improving emotion recognition accuracy [18].
Although CNNs show advances, three main limitations require attention. The five brain rhythms, delta (δ), theta (θ), alpha (α), beta (β), and gamma (γ), are associated with distinct mental states, yet standard convolutional layers often conflate characteristics across these rhythms, complicating the identification of distinctive features vital for emotion recognition. In addition, emotions are inherently dynamic, and neglecting their temporal variability introduces redundant or noisy data that reduces recognition performance. Finally, individuals may exhibit varied emotional responses to identical stimuli due to differences in experience, culture, and background, which calls for subject-independent classification models.
To this end, this work proposes a subject-independent EEG emotion recognition model based on the depthwise parallel CNN. Its first step involves extracting power spectral densities (PSDs) from various rhythms and projecting them as 2D images, which preserves a comprehensive representation of channel, rhythm, and temporal properties. Subsequently, a depthwise parallel CNN architecture, denoted as EEG-ERnet, is employed to train and test the rhythmic image features, enhancing its ability to distinguish diverse emotions. Experiments have been conducted on the database for emotion analysis using physiological signals (DEAP) dataset, a widely used benchmark for EEG emotion recognition, evaluating the effectiveness of the proposed model for arousal, valence, dominance, and liking tasks in a subject-independent approach. Hence, with the help of neural networks and rhythmic image features, the EEG-ERnet provides an innovative solution in this field.
The rest of this work is organized as follows: Section 2 reviews related work. Section 3 introduces the DEAP dataset and describes the EEG-ERnet, including the feature extraction for rhythmic-based 2D images and the network model design. Section 4 presents the experimental results, comparative study, and discussion. Finally, Section 5 shows the conclusion.
Early approaches to EEG emotion recognition primarily relied on traditional machine learning techniques, which typically involved extracting features from EEG signals and employing classifiers such as support vector machine (SVM), k-nearest neighbors (k-NN), random forest (RF), linear discriminant analysis (LDA), decision tree, and rotation forest ensemble (RFE). For example, Subasi et al. [26] designed a framework that includes signal denoising using multi-scale principal component analysis (MSPCA), feature extraction through tunable Q wavelet transform (TQWT), dimension reduction via statistical methods, and RFE and SVM classifiers. Experiments on the DEAP dataset demonstrated that their framework achieved about 93% classification accuracy. Tuncer et al. [27] employed a fractal pattern feature generation function, termed the fractal Firat pattern (FFP), for emotion recognition. Their method involved decomposing EEG signals using TQWT and extracting fractal geometry features from the decomposed signals through FFP. After that, an iterative chi-square selector (IChi2) was utilized for feature selection, followed by SVM, k-NN, and LDA classifiers. Experiments on the games-based emotion recognition system (GAMEEMO) dataset showed a maximum accuracy of 99.82% with SVM. Salankar et al. [28] used an approach based on empirical mode decomposition (EMD) and second-order difference plot (SODP). Their method involved decomposing EEG signals into intrinsic mode functions (IMFs) using the EMD, followed by feature extraction from the SODP of these IMFs. The classifiers employed were SVM and a two-hidden-layer multilayer perceptron (MLP). The experimental results from the DEAP dataset indicated 93.8% accuracy in the classification of arousal and valence. Sarma and Barma [29] selected appropriate EEG segments based on random matrix theory (RMT) to achieve emotion recognition, utilizing PSDs as features. Their experiments were conducted on the Shanghai Jiao Tong University emotion EEG dataset (SEED) and DEAP datasets, yielding classification accuracies of 82.21% for valence and 86.03% for arousal, with the k-NN classifier offering the best performance. Although the above methods achieved good results, they struggle to capture the complex and dynamic characteristics of emotions across subjects. Hence, deep learning techniques have been considered.
Deep learning techniques have demonstrated remarkable capabilities in learning hierarchical features, making them well-suited for handling complex and high-dimensional EEG signals. In particular, CNNs are beneficial for processing spatial and temporal patterns related to emotion recognition [30]. In a CNN, the convolutional layers extract local spatial features from multiple channels, and the pooling operations reduce dimensionality and maintain translation invariance. Moreover, the hierarchical architecture facilitates learning increasingly abstract representations, from low-level signal characteristics to high-level features. Recent studies have further enhanced CNN performance by incorporating attention modules [31, 32, 33], which allow the network to concentrate on the most relevant spatial-temporal characteristics, particularly for cross-subject emotion recognition.
For instance, Yang et al. [34] designed a multi-column CNN for EEG emotion recognition, consisting of multiple modules that process temporal snapshots of EEG signals from the DEAP dataset. The final decision is generated through a weighted voting strategy, which combines the outputs of individual modules to enhance robustness and accuracy. They achieved accuracies of 90.01% and 90.65% for valence and arousal, respectively, demonstrating that the multi-column structure helps mitigate the impact of EEG variations. Hwang et al. [35] employed a CNN with topology-preserving differential entropy (DE) features to represent spatial information and enhance the resolution of EEG for emotion classification (positive, neutral, negative). Experiments on the SEED dataset yielded an accuracy of 90.41%, outperforming SVM with the radial basis function (RBF) kernel. Cui et al. [36] developed a DE-CNN-BiLSTM model, integrating DE, CNN, and bidirectional LSTM (BiLSTM) to process EEG signals. The DE features were extracted from different frequency bands and time slices, mapped into 4D tensors to represent brain spatial structure, and fed into the CNN for spatial feature learning. The BiLSTM was subsequently employed to capture the past and future temporal information. This model achieved accuracies of 94.86% for arousal and 94.02% for valence on the DEAP dataset, as well as 94.82% for the SEED dataset. Wang et al. [37] enhanced the resource efficiency of the CNN-based model by constructing six tasks through signal transformations to generate labels for the unlabeled EEG. A multi-task CNN was then trained to recognize these transformations. Next, the convolutional layers were frozen, and the fully connected layers were reconstructed for emotion recognition. Experiments on the SEED and DEAP datasets revealed that self-supervised learning can improve classification accuracy. On the SEED dataset, the average accuracy was 84.54% for the preprocessed data and 98.65% for the data with extracted DE features. For the DEAP dataset, the network acquired high F1-scores, with valence and arousal metrics attaining approximately 96% when trained on 20% of the data. Yao et al. [38] integrated a transformer and a CNN to extract spatial-temporal features for emotion recognition, which employed position encoding and multi-head attention mechanisms to represent channel positions and timing information from EEG. Two parallel transformer encoders extracted spatial and temporal features, which were then aggregated by a CNN and classified using softmax. Experiments conducted on the SEED and DEAP datasets showed that the model achieved accuracies of 96.67% on the SEED dataset and 95.73%, 96.95%, and 96.34% for the arousal-valence, arousal, and valence tasks on the DEAP dataset, respectively. Lu et al. [39] developed a convolution-multilayer perceptron network (CMLP-Net), where its architecture contained a temporal-stream shared convolution to extract shared features across consecutive time steps and reduce redundancy, a time-refinement temporal-spatial convolution to extract compelling temporal-spatial features, and a spatial interaction MLP to enhance the global spatial dependency of the features. Hence, CMLP-Net transformed 1D EEG signals into a 2D representation to better express spatial information. Experiments from the DEAP dataset revealed average accuracies of 98.65%, 98.70%, and 98.63% for valence, arousal, and dominance tasks. Qiao et al. 
[40] incorporated a temporal convolutional attention network to represent both local and global features of EEG signals. DE features were extracted and processed through a CNN to obtain local features. Subsequently, the self-attention mechanism was applied to enhance global feature extraction, followed by a BiLSTM network to investigate temporal dependencies. The experiments were performed on a self-collected dataset and the DEAP dataset, achieving average accuracies of 93.45% and 96.36% for valence and arousal, respectively. In addition to software-based deep learning models, recent research has explored hardware-efficient approaches for emotion recognition. For example, Ezilarasan and Leung [41] proposed a field programmable gate array (FPGA)-based architecture that extracts EEG features and classifies emotions using a lightweight approach, aiming to provide low-latency, energy-efficient processing suitable for embedded systems and illustrating the potential of real-time emotion-aware applications.
According to the related works discussed above, deep learning models, such as CNNs, offer potential for robust EEG emotion recognition. Nonetheless, they are black-box models that rarely provide insights grounded in neuroscience knowledge. Therefore, understanding how and why rhythmic features influence emotion recognition remains a challenging task. Additionally, reducing the amount of input temporal data is beneficial for deploying the model on resource-limited embedded devices. Meanwhile, a model that does not rely on individual calibration data but generalizes to all subjects is more desirable. Hence, it is meaningful to develop a subject-independent CNN model employing brain rhythms, which is the motivation of this work.
For clarity, the overall workflow of the proposed method is illustrated in Fig. 1. First, the EEG signals are acquired from the DEAP dataset, which adopts music videos as stimuli. The multi-channel recordings are collected using a 32-channel system. Next, the short-time Fourier transform (STFT) is applied to convert each EEG from the time domain into the frequency domain, and PSDs are extracted based on various brain rhythms. Subsequently, rhythmic-based 2D images are projected by those extracted PSDs with spatial information, providing a comprehensive representation of channel, rhythm, and temporal properties. After that, the EEG-ERnet is designed using the depthwise parallel CNN architecture, which is then employed for training and testing the rhythmic image features through 10-fold cross-validation. Finally, the most distinguishable brain rhythms associated with specific time intervals are investigated in detail for diverse emotion recognition tasks, offering subject-independent solutions in this field.
Fig. 1. The overall workflow of the proposed method. EEG, electroencephalography.
Data acquisition is the first step in EEG studies. As the primary objective is to design a subject-independent deep learning model, a publicly available dataset is a suitable choice. Therefore, the DEAP dataset, developed by Koelstra et al. [42], was adopted. This dataset was designed to analyze emotional states, incorporating EEG recordings and comprehensive subjective assessments. Such information makes it valuable for cross-subject evaluation.
In detail, DEAP contains EEG recordings from 32 subjects (17 males and 15 females, with a mean age of 27.19 years). Each subject watched 40 one-minute music video excerpts while 32-channel EEG was recorded according to the international 10–20 system, and the preprocessed recordings were downsampled to 128 Hz. Alongside the EEG, each trial is accompanied by the subject's self-reported ratings, enabling emotion analysis across several dimensions.
Concerning the emotional scenarios, the DEAP dataset contains ratings for three fundamental dimensions: valence, arousal, and dominance, based on a 9-point scale provided by the subjects, i.e., self-assessment manikin (SAM), as illustrated in Fig. 2. Valence denotes the degree of pleasantness of an emotional state, from negative (sadness, fear) to positive (happiness, excitement). Arousal reflects the physiological activation level, from calm (relaxation, boredom) to excitement (stress, enthusiasm). Dominance refers to the sense of control over emotional experiences, ranging from submissive (fear, helplessness) to dominant (confidence, empowerment). These dimensions form a framework for describing emotional states, where specific emotions can be mapped to different regions within this space. For instance, high valence and high arousal correspond to emotions such as joy or excitement, while low valence and high arousal denote emotions like anger or fear. Meanwhile, the liking ratings are provided, offering an additional aspect to assess subjective preferences in emotion. Therefore, this work focuses on valence, arousal, dominance, and liking tasks, where the binary classification is based on a threshold of 5, following the common practice [43] with the DEAP dataset. After binarization, the class distributions for valence, arousal, dominance, and liking are found to be reasonably balanced, with high/low class ratios of approximately 54:46, 52:48, 51:49, and 53:47, respectively. While these ratios do not represent a perfect balance, they are sufficiently close to ensure fair model training without the need for class-weighting or data balancing techniques.
Fig. 2. Self-assessment manikin (SAM) in the DEAP dataset. DEAP, database for emotion analysis using physiological signals.
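As a small illustration of this labeling rule, the following Python snippet binarizes SAM ratings at the threshold of 5; the rating values are made up, and whether a rating of exactly 5 counts as "high" or "low" is an assumption not specified above.

```python
import numpy as np

# Illustrative SAM ratings on the 1-9 scale for one dimension (e.g., valence); values are made up.
ratings = np.array([3.0, 7.5, 5.0, 8.0, 4.5])

# Binarize at the threshold of 5 used in the text; treating a rating of exactly 5
# as "low" is an assumption.
labels = (ratings > 5).astype(int)   # -> array([0, 1, 0, 1, 0])
```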
The initial step in feature extraction involves obtaining the PSDs, a prevalent process in EEG emotion recognition using CNN, as incorporating PSDs as features into a CNN helps the model discern valuable emotion-associated patterns [18, 32]. To this end, the STFT is applied. The EEG x(t) is divided into overlapping segments utilizing a sliding window function to reduce spectral leakage. In this work, the window function employs the Hamming window of 128 samples with 50% overlap, ensuring a balance between time and frequency resolution. The STFT is then acquired by:
$$X(t,f)=\sum_{n} x(n)\, w(n-t)\, e^{-j 2\pi f n},$$

where $w(n-t)$ denotes the Hamming window function centered at time $t$, and $e^{-j 2\pi f n}$ represents the complex exponential term of the Fourier basis functions, which performs the transformation from the time domain to frequency $f$.
The squared magnitude of the STFT provides the spectrogram from which the PSD is derived. To link with the brain rhythms, the frequency bins corresponding to each range are summed. Thus, the PSD for each brain rhythm is calculated by:
$$\mathrm{PSD}_{k}=\sum_{f \in B_{k}} \left| X(t,f) \right|^{2},$$

where the PSD is integrated over the predefined brain rhythm bands $B_{k}$, i.e., the spectrogram bins whose frequencies fall within the range of the $k$-th rhythm are summed.
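For illustration, the following Python sketch (using SciPy rather than the authors' MATLAB implementation) computes the STFT with a 128-sample Hamming window and 50% overlap and sums the spectrogram bins within each rhythm's range; the band limits and the averaging over time frames are assumptions, not values taken from the paper.

```python
import numpy as np
from scipy.signal import stft

fs = 128                                    # DEAP EEG sampling rate (Hz)
x = np.random.randn(fs * 5)                 # one 5-second single-channel segment (placeholder data)

# STFT with a 128-sample Hamming window and 50% overlap, as described in the text.
f, t, Z = stft(x, fs=fs, window="hamming", nperseg=128, noverlap=64)
spectrogram = np.abs(Z) ** 2                # squared magnitude -> spectrogram

# Sum the spectrogram bins falling inside each rhythm's frequency range.
# These band limits are assumed for illustration; the paper does not list them here.
bands = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30), "gamma": (30, 45)}
psd = {}
for name, (lo, hi) in bands.items():
    mask = (f >= lo) & (f < hi)
    # Sum over the band, then average over time frames (one simple aggregation choice).
    psd[name] = spectrogram[mask, :].sum(axis=0).mean()
```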
Once the PSDs are extracted, they are organized into spectral features for each channel. To further enhance the ability to present spatial information, these spectral features are projected onto a 2D image that preserves the spatial arrangement of the high-dimensional EEG channels, i.e., these features are arranged into 2D matrices whose entries follow the electrode locations on the scalp, forming a 9 × 9 grid, as shown in Fig. 3.
Fig. 3. The 9 × 9 spatial projection of the EEG channels onto a 2D image.
Additionally, the temporal dynamics indicate the changes in emotional responses. As a result, the inputs are generated from smaller segments, each representing a specific 5-second interval, which helps analyze the EEG signals over time and removes redundant data. For each 5-second EEG segment, the PSDs of the various brain rhythms are extracted and projected onto a 9 × 9 image, which is then rescaled to the [0, 1] range via min-max normalization, as illustrated in Fig. 4.
Fig. 4. A normalization sample for a rhythmic image (DEAP, subject S12).
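As a minimal illustration of this projection step, the following sketch places per-channel PSD values of one rhythm onto a 9 × 9 grid and applies min-max normalization; the electrode-to-pixel mapping shown here is hypothetical and covers only a subset of channels, whereas the paper's actual layout is given in Fig. 3.

```python
import numpy as np

# Hypothetical mapping from a few DEAP channel names to (row, col) positions on a 9 x 9 grid.
channel_to_grid = {"Fp1": (0, 3), "Fp2": (0, 5), "F3": (2, 2), "F4": (2, 6),
                   "C3": (4, 2), "Cz": (4, 4), "C4": (4, 6),
                   "P3": (6, 2), "P4": (6, 6), "O1": (8, 3), "O2": (8, 5)}

def rhythm_image(band_psd, mapping, size=9):
    """Project per-channel PSD values of one rhythm onto a 9 x 9 image and min-max normalize."""
    img = np.zeros((size, size))
    for ch, value in band_psd.items():
        r, c = mapping[ch]
        img[r, c] = value
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo + 1e-12)    # rescale to [0, 1]

# Usage with made-up PSD values:
band_psd = {ch: np.random.rand() for ch in channel_to_grid}
image = rhythm_image(band_psd, channel_to_grid)
```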
Concerning EEG-ERnet, a depthwise parallel CNN architecture is applied to address the computational efficiency and accuracy requirements for EEG emotion recognition. Its kernel design phase involves a pre-test to determine parameters for the convolutional layers. The initialization method of the convolutional kernels has been evaluated through comparative analysis, where Gaussian distribution initialization improves accuracy over zero initialization by approximately 45 percentage points (94.58% vs. 49.88%) in a preliminary investigation. Subsequently, the kernel size has been pre-tested with candidate sizes starting from 2 × 2, and a 4 × 4 kernel is adopted for the convolutional layers.
The traditional CNN structure consists of three continuous convolutional layers without max pooling layers. The convolution filter numbers are set at 64, 128, and 256, respectively, with a 1 × 1 convolution layer (64 filters) appended for channel-wise fusion, followed by a fully connected layer and a softmax classifier, as summarized in Table 1.
In EEG-ERnet, by contrast, the first layer utilizes a depthwise convolution layer with 16 filters, generating 64 separate feature maps. A split layer then divides these feature maps into four sub-batches. Four parallel continuous CNNs are designed to process each sub-batch independently. Each continuous CNN consists of two convolution layers without pooling layers, with filter numbers set at 32 and 64, respectively. A concatenate layer is subsequently employed to stack the feature maps, followed by a 1 × 1 convolution layer for feature fusion, a dropout layer, flattening, a fully connected layer, and a softmax classifier that produces the final prediction, as depicted in Fig. 5.
Algorithm 1: Parallel EEG-ERnet Algorithm
Input: Four image views {I1, I2, I3, I4}
Output: Predicted label y
1. In parallel, compute each branch output.
2. Concatenate the outputs: F = Concat(F1, F2, F3, F4).
3. Apply dropout, flattening, and a fully connected layer.
4. Compute the prediction y.
Fig. 5. The EEG-ERnet model based on depthwise parallel CNN. CNN, convolutional neural network.
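For readers who prefer code to diagrams, a minimal PyTorch sketch of the depthwise parallel architecture described above is given below. It is not the authors' implementation (which was written in MATLAB); the 4 × 4 kernels with "same" padding, ReLU activations, 64-filter 1 × 1 fusion layer, dropout rate, and binary output head are assumptions, and the batch normalization and pooling mentioned in Algorithm 2 are omitted for brevity.

```python
import torch
import torch.nn as nn

class EEGERNetSketch(nn.Module):
    """Illustrative sketch of the depthwise parallel CNN described in the text."""

    def __init__(self, in_channels=4, num_classes=2, grid=9):
        super().__init__()
        # Depthwise convolution: 16 filters per input channel -> 4 * 16 = 64 feature maps.
        self.depthwise = nn.Conv2d(in_channels, in_channels * 16,
                                   kernel_size=4, padding="same", groups=in_channels)
        # Four parallel branches, each with two convolution layers (32 and 64 filters).
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(16, 32, kernel_size=4, padding="same"), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, padding="same"), nn.ReLU(),
            ) for _ in range(4)
        ])
        # 1 x 1 convolution to fuse the concatenated feature maps (output width assumed).
        self.fuse = nn.Conv2d(4 * 64, 64, kernel_size=1)
        self.dropout = nn.Dropout(0.5)            # dropout rate assumed
        self.fc = nn.Linear(64 * grid * grid, num_classes)

    def forward(self, x):                         # x: (batch, 4, 9, 9)
        x = self.depthwise(x)                     # (batch, 64, 9, 9)
        chunks = torch.chunk(x, 4, dim=1)         # split into four 16-map sub-batches
        feats = [branch(c) for branch, c in zip(self.branches, chunks)]
        x = torch.cat(feats, dim=1)               # (batch, 256, 9, 9)
        x = self.fuse(x)                          # (batch, 64, 9, 9)
        x = self.dropout(torch.flatten(x, 1))
        return self.fc(x)                         # logits; softmax applied in the loss

# Example forward pass on a dummy batch of rhythmic images:
logits = EEGERNetSketch()(torch.randn(8, 4, 9, 9))   # -> shape (8, 2)
```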
Table 1. Structure of the traditional CNN.
| Component | Filter shape | Input size |
|---|---|---|
| cov_1 | 4 × 4 × 64 | 9 × 9 × 4 |
| cov_2 | 4 × 4 × 128 | 9 × 9 × 64 |
| cov_3 | 4 × 4 × 256 | 9 × 9 × 128 |
| cov_4 | 1 × 1 × 64 | 9 × 9 × 256 |
| fully connected | 5184 × 1024 | 9 × 9 × 64 |
| softmax | / | 1 × 1024 |
Beyond kernel size and initialization, several hyperparameters, including the learning rate, are optimized through empirical evaluation using a grid search on the training folds.
Finally, to ensure subject independence, all preprocessing steps, including STFT, PSD computation, and min-max normalization, are applied independently to each trial. STFT and PSD are computed on a per-trial basis, utilizing predefined frequency ranges without any data outside the trial. The min-max normalization is performed on each rhythmic-based 2D image, with the minimum and maximum values calculated from that image alone. No statistics are shared across training and testing folds, providing strict separation and eliminating any potential information leakage in the subject-independent evaluation protocol. Then, a subject-wise 10-fold cross-validation is adopted. Specifically, the DEAP dataset, including 32 subjects, is partitioned such that each fold contains data from approximately 3–4 individual subjects, which are entirely withheld for testing, while the model is trained on data from the remaining subjects. This process is repeated ten times, so that every subject appears in the test set exactly once (an illustrative sketch of this subject-wise partitioning is provided after Algorithm 2). The strict separation of subjects between training and testing sets prevents subject-specific feature leakage, supporting the development of generalizable models suitable for cross-subject emotion recognition scenarios. The performance is evaluated based on the average accuracy across all ten folds, which provides a more reliable estimate of generalization than performance on the training data alone. Please note that, since brain rhythms and time intervals are key aspects assessed in this work, the training and testing are based on specific rhythmic image features extracted at the same 5-second interval across 32 subjects. To clarify the sequence of operations in EEG-ERnet, a pseudocode-style algorithm (Algorithm 2) is provided that summarizes the input, processing stages, and output. It begins with 2D input images and proceeds through the multi-branch CNNs with depthwise separable convolution layers, eventually producing the final classification output.
Algorithm 2: EEG-ERnet Classification Procedure
Input: rhythmic-based 2D image
Output: Predicted label y
1. Apply depthwise separable convolutions in 4 parallel branches.
2. Perform ReLU activation, batch normalization, and max pooling on each branch.
3. Concatenate the outputs of all branches.
4. Apply dropout, flattening, and a fully connected layer.
5. Use a softmax function to obtain the final prediction.
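To make the subject-wise partitioning described above concrete, the following sketch uses scikit-learn's GroupKFold on placeholder arrays; the sample layout of 32 subjects × 40 trials × 12 intervals and the random data are assumptions for bookkeeping only, not the authors' code, and the per-fold training loop is elided.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Placeholder data: one rhythmic 9 x 9 image per 5-second segment.
X = np.random.rand(32 * 40 * 12, 9, 9)            # 32 subjects x 40 trials x 12 intervals (assumed layout)
y = np.random.randint(0, 2, size=len(X))           # binary high/low labels
subject_ids = np.repeat(np.arange(32), 40 * 12)    # subject index for every sample

gkf = GroupKFold(n_splits=10)                      # ~3-4 subjects held out per fold
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups=subject_ids)):
    # Subjects in the test fold never appear in the training fold (no subject leakage).
    assert set(subject_ids[train_idx]).isdisjoint(subject_ids[test_idx])
    # Train EEG-ERnet on X[train_idx], y[train_idx]; evaluate accuracy on X[test_idx] ...
```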
Here, the computational complexity per branch is approximately O(K² · Cin · Cout · H · W) for each convolution layer, where K denotes the kernel size, Cin and Cout the numbers of input and output feature maps, and H × W the spatial size of the feature maps. Because the four branches process smaller sub-batches of feature maps in parallel, the overall cost remains modest, which benefits deployment on resource-limited devices.
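As a rough worked instance of this estimate (taking the assumed 4 × 4 kernel, the first branch convolution with 16 input and 32 output feature maps, and the 9 × 9 feature-map size used above), the per-layer cost is on the order of

$$K^{2} \cdot C_{\text{in}} \cdot C_{\text{out}} \cdot H \cdot W = 4^{2} \times 16 \times 32 \times 9 \times 9 \approx 6.6 \times 10^{5}$$

multiply-accumulate operations, i.e., well under one million operations per branch layer for a single 9 × 9 input.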
In this work, MATLAB R2023b (The MathWorks Inc., Natick, MA, USA) was used for programming, and the random seed was set to 42. No learning rate scheduler was applied. Training was performed on an NVIDIA ray tracing eXtreme (RTX) 3090 graphics processing unit (GPU) using compute unified device architecture (CUDA) 11.6 (NVIDIA Corp., Santa Clara, CA, USA) on Ubuntu 20.04 (Canonical Ltd., London, UK), with early stopping applied after 10 epochs of no improvement. The same configuration was employed across all cross-validation folds and rhythm-interval evaluations. Extensive experiments were conducted based on four tasks: valence, arousal, dominance, and liking. Consequently, it is meaningful to identify distinguishable rhythms and appropriate 5-second intervals to recognize different emotions, offering insights into the characteristics of emotion recognition. To this end, all classification results were evaluated using the mean and standard deviation of accuracy across the 10-fold cross-validation for the valence, arousal, dominance, and liking tasks, as detailed in Tables 2,3,4,5. No inferential statistical tests were applied, as the primary goal of this work is to assess network model performance across rhythm-specific image inputs rather than test specific hypotheses. Also, please note that no demographic covariates were included due to the limited metadata available in the dataset.
Table 2. Classification accuracy (%) for the valence task with rhythm-specific image features across 5-second time intervals.
| Time interval | Delta | Theta | Alpha | Beta | Gamma |
|---|---|---|---|---|---|
| 0–5 s | 86.31 | 88.54 | 90.60 | 86.65 | 82.04 |
| 5–10 s | 89.88 | 87.59 | 90.63 | 84.11 | 82.76 |
| 10–15 s | 89.15 | 87.47 | 88.50 | 80.65 | 88.79 |
| 15–20 s | 89.18 | 89.99 | 85.15 | 82.23 | 86.18 |
| 20–25 s | 87.63 | 83.82 | 88.81 | 80.36 | 88.61 |
| 25–30 s | 88.96 | 82.26 | 87.76 | 84.85 | 90.60 |
| 30–35 s | 85.38 | 83.56 | 91.07 | 82.01 | 88.11 |
| 35–40 s | 83.60 | 82.83 | 87.44 | 85.05 | 89.68 |
| 40–45 s | 83.02 | 87.46 | 88.67 | 87.66 | 90.63 |
| 45–50 s | 88.93 | 90.16 | 89.90 | 84.54 | 88.22 |
| 50–55 s | 89.82 | 90.97 | 90.94 | 88.89 | 87.72 |
| 55–60 s | 86.20 | 90.83 | 93.27 | 91.99 | 87.46 |
Table 3. Classification accuracy (%) for the arousal task with rhythm-specific image features across 5-second time intervals.
| Time interval | Delta | Theta | Alpha | Beta | Gamma |
|---|---|---|---|---|---|
| 0–5 s | 87.23 | 84.76 | 86.43 | 87.14 | 87.79 |
| 5–10 s | 84.56 | 85.44 | 87.21 | 89.16 | 85.18 |
| 10–15 s | 88.91 | 88.57 | 84.73 | 90.51 | 85.60 |
| 15–20 s | 85.34 | 87.05 | 88.93 | 90.75 | 86.13 |
| 20–25 s | 85.67 | 81.63 | 82.35 | 86.75 | 88.58 |
| 25–30 s | 83.78 | 86.03 | 82.85 | 86.17 | 84.54 |
| 30–35 s | 86.45 | 83.69 | 86.28 | 88.76 | 85.74 |
| 35–40 s | 88.12 | 88.33 | 87.91 | 84.05 | 89.60 |
| 40–45 s | 84.89 | 90.18 | 88.21 | 86.14 | 88.90 |
| 45–50 s | 88.34 | 87.86 | 86.21 | 90.87 | 89.34 |
| 50–55 s | 87.56 | 88.52 | 86.62 | 92.16 | 88.69 |
| 55–60 s | 89.23 | 90.41 | 89.72 | 91.40 | 90.77 |
Table 4. Classification accuracy (%) for the dominance task with rhythm-specific image features across 5-second time intervals.
| Time interval | Delta | Theta | Alpha | Beta | Gamma |
|---|---|---|---|---|---|
| 0–5 s | 82.34 | 82.04 | 84.97 | 82.67 | 85.45 |
| 5–10 s | 81.12 | 82.78 | 82.27 | 86.45 | 88.23 |
| 10–15 s | 84.56 | 88.06 | 82.80 | 84.89 | 82.89 |
| 15–20 s | 82.23 | 86.90 | 82.92 | 85.67 | 82.67 |
| 20–25 s | 85.78 | 88.32 | 83.73 | 86.34 | 81.67 |
| 25–30 s | 84.12 | 89.55 | 83.26 | 87.91 | 81.23 |
| 30–35 s | 80.12 | 87.21 | 86.55 | 87.78 | 85.89 |
| 35–40 s | 80.45 | 89.48 | 87.17 | 88.45 | 86.23 |
| 40–45 s | 82.34 | 88.32 | 85.69 | 90.01 | 85.89 |
| 45–50 s | 82.45 | 89.88 | 87.17 | 86.45 | 84.23 |
| 50–55 s | 80.78 | 89.62 | 87.98 | 89.67 | 84.78 |
| 55–60 s | 85.89 | 90.56 | 88.81 | 90.09 | 84.89 |
Table 5. Classification accuracy (%) for the liking task with rhythm-specific image features across 5-second time intervals.
| Time interval | Delta | Theta | Alpha | Beta | Gamma |
|---|---|---|---|---|---|
| 0–5 s | 82.45 | 84.78 | 80.67 | 85.34 | 86.08 |
| 5–10 s | 80.34 | 83.56 | 83.45 | 85.82 | 86.68 |
| 10–15 s | 80.67 | 84.12 | 82.89 | 85.92 | 84.45 |
| 15–20 s | 81.34 | 83.78 | 80.67 | 85.23 | 85.78 |
| 20–25 s | 81.89 | 82.12 | 81.34 | 84.78 | 84.12 |
| 25–30 s | 82.56 | 80.78 | 77.01 | 84.45 | 84.45 |
| 30–35 s | 80.23 | 80.45 | 79.78 | 83.12 | 82.78 |
| 35–40 s | 80.89 | 79.12 | 78.45 | 84.89 | 82.12 |
| 40–45 s | 80.56 | 77.78 | 79.01 | 81.34 | 81.18 |
| 45–50 s | 80.23 | 79.45 | 80.45 | 80.00 | 82.12 |
| 50–55 s | 80.89 | 81.12 | 83.89 | 77.78 | 83.56 |
| 55–60 s | 81.56 | 79.78 | 83.67 | 78.45 | 82.78 |
Table 2 presents the classification accuracies for valence, showing that the highest accuracy of 93.27 ± 3.05% is achieved in the 55–60 s interval.
Table 3 focuses on the arousal task and reveals that the highest accuracy of 92.16 ± 2.73% occurs in the 50–55 s interval.
Table 4 indicates that the best dominance accuracy of 90.56 ± 4.44% is obtained in the 55–60 s interval.
Table 5 shows that the highest liking accuracy of 86.68 ± 5.66% appears in the 5–10 s interval.
Finally, the varying accuracies across classification tasks may be attributed to inherent emotional complexity. Valence and arousal are two fundamental dimensions widely recognized for understanding human emotions, whereas dominance and liking may involve more subjective and context-dependent interpretations. In this regard, the proposed EEG-ERnet demonstrates its ability to maintain impressive performance across individuals and tasks, since its architecture incorporates depthwise parallel CNNs that can effectively analyze both local and global features of rhythmic-based 2D images. The explainability of EEG-ERnet further provides insights into the brain rhythms and time intervals that are most beneficial for emotion recognition, improving its decision-making process across subjects and cases. This is vital for real-world applications, where emotion recognition systems should be adaptable to different users and contexts.
A comprehensive comparative study was conducted to evaluate the proposed EEG-ERnet. First, in the ablation experiment, an initial baseline model consisting of three cascading depthwise convolution layers, a 1 × 1 convolution layer, and a classification head was evaluated under the same 10-fold cross-validation protocol; Table 6 compares its accuracies with those of EEG-ERnet across the four tasks.
Table 6. Ablation comparison of classification accuracy (%) between the baseline model and EEG-ERnet.
| Model | Valence | Arousal | Dominance | Liking |
|---|---|---|---|---|
| Baseline | 64.68 | 69.34 | 64.06 | 59.36 |
| EEG-ERnet | 93.27 | 92.16 | 90.56 | 86.68 |
Table 6 shows that the baseline model, which utilizes cascading depthwise convolution layers, suffers from a limitation in its architecture. Depthwise convolution, while computationally efficient, processes each channel independently and therefore lacks cross-channel feature integration. Hence, the baseline cannot fully exploit the multi-dimensional nature of the data to make accurate classifications. In contrast, EEG-ERnet employs a parallel architecture that enhances feature integration, resulting in improved performance across all dimensions. Such an architectural advantage makes EEG-ERnet more appropriate for cross-subject emotion recognition tasks.
A comprehensive overview of various EEG emotion recognition methods utilizing the DEAP dataset is offered by the comparative study presented in Table 7 (Ref. [14, 18, 29, 34, 51, 52, 53, 54, 55]). Different approaches have been assessed, including traditional machine learning algorithms such as k-NN, SVM, and RF, as well as more advanced deep learning techniques like CNN and its variants. Mahmoud et al. [18] employed a 2D-CNN with PSD features, achieving the highest valence and arousal recognition accuracy of 94.23% and 93.78%, respectively. However, their method did not cover all dimensions. The proposed EEG-ERnet has demonstrated impressive performance across all dimensions, achieving accuracies of 93.27% for valence, 92.16% for arousal, 90.56% for dominance, and 86.68% for liking. It is advantageous over previous works, particularly in terms of stable performance across subjects and emotional factors, which are key concerns.
Table 7. Comparative study of EEG emotion recognition methods on the DEAP dataset (classification accuracy, %).
| Work | Classifier | Feature | Valence | Arousal | Dominance | Liking |
|---|---|---|---|---|---|---|
| Wang et al. [14] | 2D-CNN-LSTM | DEFM | 91.92 | 92.31 | / | / |
| Mahmoud et al. [18] | 2D-CNN | PSD | 94.23 | 93.78 | 89.54 | / |
| Sarma and Barma [29] | k-NN, SVM, RF | PSD, CWT | 82.21 | 86.03 | / | / |
| Yang et al. [34] | Multi-column CNN with weighted sum | Temporal snapshots of EEG signals | 90.01 | 90.65 | / | / |
| Farokhah et al. [51] | Simplified 2D-CNN | Spectrogram images generated from ten selected EEG channels | 89.31 | 91.28 | / | / |
| Lin et al. [52] | Channel selection graph neural network | DE, PLI | 90.74 | 91.00 | / | / |
| Al-Asadi et al. [53] | Semi-supervised EEG-based emotion classifier by appropriate regularization terms | Raw EEG signals with two types of augmentations | 88.44 | 91.77 | / | / |
| Yilmaz et al. [54] | k-NN, SVM | AAG, SIFT | 90.94 | 92.44 | / | / |
| Wan et al. [55] | Light gradient boosting machine | PSD, DE, SASI, wavelet energy, entropy | 84.03 | 84.37 | / | / |
| This work | EEG-ERnet based on depthwise parallel CNN | Rhythmic-based 2D image | 93.27 | 92.16 | 90.56 | 86.68 |
CNN, convolutional neural network; LSTM, long short-term memory; DEFM, differential entropy feature matrix; PSD, power spectral density; k-NN, k-nearest neighbors; SVM, support vector machine; RF, random forest; CWT, continuous wavelet transform; PLI, phase lag index; AAG, angle amplitude graphs; SIFT, scale-invariant feature transform; SASI, spectral asymmetry index; DE, differential entropy.
Compared to methods that rely on temporal snapshots or spectrogram images, the EEG-ERnet enables the analysis of spatial and temporal information from rhythmic image features. It exhibits several advantages in EEG emotion recognition. First, by utilizing depthwise convolution layers, the model analyzes the spatial information of EEG signals more effectively than traditional CNNs. This architecture reduces computational costs and enhances the ability to identify spatial patterns associated with specific emotions. Second, the parallel processing of sub-batches offers more comprehensive feature integration, allowing the model to handle high-dimensional and complex data. As demonstrated in the experiments, such design choices collectively contribute to strong performance across all four dimensions. Additionally, the results provide valuable insights into the characteristics of EEG emotion recognition, as they identify specific brain rhythms and time intervals associated with recognizing emotions. Such findings align with the known roles of these rhythms in emotional processing. Thus, the model identifies the key factors relevant to particular dimensions. Overall, the accuracies acquired across the four dimensions demonstrate that the model is well-suited to recognizing the complex patterns in EEG signals associated with different emotions. In short, EEG-ERnet provides a compelling solution that addresses the incomplete dimensional coverage of existing methods, operating in a subject-independent manner with only 5-second data segments.
First, this work focuses on the five brain rhythms due to their strong neurophysiological grounding in the context of emotion recognition. While emerging techniques such as EMD can extract non-standard or adaptive frequency components, the proposed method prioritizes this choice for its explainability and computational feasibility. Nevertheless, it is recognized that such methods, as well as high-frequency bands beyond 60 Hz, may uncover additional emotion-relevant information. Future work will explore the integration of EMD-derived IMFs into the 2D image framework to enhance the discriminative ability for multi-level classification tasks.
Second, the choice of a 5-second interval in this work reflects a balance between temporal resolution and spectral stability. This duration has also been used in emotion recognition and aligns with previous works [3, 4] on the DEAP dataset. A shorter interval, such as 2 seconds, may offer finer temporal granularity but could suffer from reduced frequency resolution, particularly for low-frequency bands. In contrast, a longer interval, such as 10 seconds, may average out dynamic changes and reduce the number of training samples. Although this work adopts fixed non-overlapping 5-second intervals, future work will investigate the effects of different segmentations, including overlapping and multi-scale windows, to improve real-time responsiveness and model performance.
Third, min-max normalization is applied to each rhythmic image to rescale PSD values to the [0, 1] range, which provides a consistent input scale across samples and reduces the impact of extreme values. While this method preserves the relative topographic and spectral structure of each input, it does not standardize inter-subject statistics. Therefore, to account for inter-subject variability, subject-independent cross-validation is adopted, in which training and testing involve entirely different sets of individuals. Although no explicit domain adaptation is employed, this protocol promotes cross-subject generalization. Future work will incorporate advanced inter-subject normalization or domain-adaptive learning techniques, such as Riemannian alignment, statistical matching, or adversarial adaptation, to further enhance model robustness in highly diverse populations.
Next, the DEAP dataset assessed in this work includes EEG recordings from 32 subjects, spanning a reasonable demographic range, but it does not represent specific clinical populations. It is also relatively small compared to datasets in other EEG-based studies, which may constrain the model's generalizability to broader or clinical populations. Moreover, the emotional responses in DEAP may be influenced by cultural or psychological factors that are not accounted for, and no inferential statistical tests or covariate analyses were performed, which limits the investigation of group differences and of personalized affective modeling. Future work will involve larger, more diverse studies and the introduction of covariate-aware modeling to enhance personalization.
Furthermore, EEG signals in the DEAP dataset may reflect overlapping cognitive phenomena beyond core emotional responses, including subjective perceptions like luck, expectation, or decision uncertainty. These non-emotional components can confound emotion recognition. The proposed method mitigates this issue by using band-specific rhythmic representations and parallel CNN branches, which help isolate emotion-relevant frequency dynamics. Meanwhile, the use of subject-independent training emphasizes generalizable emotion-related features while suppressing subject-specific cognitive noise. Future work will incorporate explicit component separation techniques, such as adversarial learning, to disentangle emotional signals from co-occurring cognitive influences.
Finally, this work identifies the brain rhythm and temporal interval combinations that are most informative for each emotional dimension. To this end, the experiments have thoroughly evaluated the model's performance across 60 rhythm-interval configurations in a controlled and consistent cross-validation setting. All evaluations are performed using subject-independent cross-validation, guaranteeing that performance is not inflated due to subject-specific overfitting. Therefore, the results carry a degree of generalization that mitigates the risk associated with testing multiple configurations. Such an approach is suitable for portable emotion-aware devices, as it utilizes fewer data sources with only 5-second segments. In the future, a multi-configuration approach will be considered to enhance performance across various cases. Trivial baselines and nested cross-subject validation will also be incorporated into related applications of the framework, such as depression detection, to further substantiate the model's advantages.
This work proposes the EEG-ERnet model, which employs a depthwise parallel CNN to classify the spatial and temporal features of emotional EEG signals. Using rhythmic-based 2D images extracted from multi-channel EEG recordings of specific 5-second intervals, the model can identify the particular brain rhythms and time intervals most appropriate for recognizing different emotions. The experimental results from the DEAP dataset demonstrated that emotion-specific rhythms within 5-second intervals can effectively support classification, achieving average accuracies of 93.27 ± 3.05%, 92.16 ± 2.73%, 90.56 ± 4.44%, and 86.68 ± 5.66% for valence, arousal, dominance, and liking, respectively. These results suggest that EEG-ERnet offers a promising pathway toward efficient, subject-independent, and portable emotion-aware systems for real-world applications.
The EEG data analyzed during the current study are from the public DEAP dataset (http://www.eecs.qmul.ac.uk/mmv/datasets/deap). The code will be made available from the corresponding author upon reasonable request.
SZ, CL, JRW, and JLi designed the research. SZ, JLi, and JLv performed the research. CL, JJW, YY, XL, and JLv analyzed the data. SZ, JLi, MIV, and RC interpreted the results. SZ, CL, and JLi revised the paper. JRW, XL, and JLv conceptualized the method. SZ, JLi, JLv, MIV, and RC administrated the project and made some contributions to the figures. SZ, CL, JLi, and RC investigated the dataset and supported funding acquisition. JJW, YY, XL, and MIV provided computing resource and supervised the research. SZ, CL, JLi, and JLv wrote the manuscript. All authors contributed to editorial changes in the manuscript. All authors read and approved the final manuscript. All authors have participated sufficiently in the work and agreed to be accountable for all aspects of the work.
Not applicable.
The authors would like to express their appreciation for the support from the Key Laboratory of Numerical Simulation of Sichuan Provincial Universities, the ZUMRI-LYG Joint Lab, and the Guangzhou Key Laboratory of Digital Content Processing and Security Technology.
This work was supported in part by the Guangzhou Science and Technology Plan Project under Grants 2024B03J1361 and 2023B03J1327, in part by the Research Fund of Key Laboratory of Numerical Simulation of Sichuan Provincial Universities under Grant 2024SZFZ007, in part by the Sichuan Science and Technology Program under Grant 2025ZNSFSC0780, in part by the Foundation of the 2023 Higher Education Science Research Plan of the China Association of Higher Education under Grant 23XXK0402, in part by the Foundation of the Sichuan Research Center of Applied Psychology (Chengdu Medical College) under Grant CSXL-25102, in part by the Neijiang Philosophy and Social Science Planning Project under Grant NJ2024ZD014, in part by the Guangdong Province Ordinary Colleges and Universities Young Innovative Talents Project under Grant 2023KQNCX036, in part by the Scientific Research Capacity Improvement Project of the Doctoral Program Construction Unit of Guangdong Polytechnic Normal University under Grant 22GPNUZDJS17, in part by the Graduate Education Demonstration Base Project of Guangdong Polytechnic Normal University under Grant 2023YJSY04002, in part by the Open Research Fund of State Key Laboratory of Digital Medical Engineering under Grant 2025-M10, and in part by the Research Fund of Guangdong Polytechnic Normal University under Grant 2022SDKYA015.
The authors declare no conflict of interest.
The manuscript was written entirely by the authors. AI-based tools (Grammarly, ChatGPT-4.0) were used only for minor English language correction and grammar checking. All intellectual content, experiments, analysis, and interpretations were conceived, designed, and executed solely by the authors.
References
Publisher’s Note: IMR Press stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.





