Speech emotion recognition (SER) is a pivotal research area with significant importance in a variety of real-time applications, such as assessing human behavior and analyzing speakers' emotional states in emergency situations. This paper assesses the capabilities of deep convolutional neural networks (CNNs) in this context: both CNN- and Long Short-Term Memory (LSTM)-based deep neural networks are evaluated for voice emotion identification. In our empirical evaluation, we utilize the Toronto Emotional Speech Set (TESS) database, which comprises speech samples from both young and old individuals, encompassing seven distinct emotions: anger, happiness, sadness, fear, surprise, disgust, and neutrality. To augment the dataset, voice variations are introduced along with additive white noise. The empirical findings indicate that the CNN model outperforms existing studies on SER using the TESS corpus, yielding a 21% improvement in average recognition accuracy. This work underscores the significance of SER and highlights the transformative potential of deep CNNs for enhancing its effectiveness in real-time applications, particularly in high-stakes emergency situations.
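The white-noise augmentation mentioned above can be sketched as follows; this is a minimal illustrative implementation using NumPy, and the `noise_factor` value is an assumed parameter, not the setting used in the paper.

```python
import numpy as np

def add_white_noise(signal: np.ndarray, noise_factor: float = 0.005) -> np.ndarray:
    """Augment an audio waveform with Gaussian white noise.

    noise_factor scales the noise amplitude; 0.005 is an
    illustrative value, not the paper's reported setting.
    """
    noise = np.random.randn(len(signal))
    return signal + noise_factor * noise

# Example: augment a 1-second 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
clean = 0.5 * np.sin(2 * np.pi * 440 * t)
noisy = add_white_noise(clean)
```

Each augmented copy keeps the original emotion label, so the training set grows without additional recording effort.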