A Systematic Framework for Fault Management in Embedded Systems for Mobile Networks: Probabilistic Modeling, Cost-Aware Mitigation, and Board-Level Design
2026 (English). Doctoral thesis, monograph (Other academic)
Abstract [en]
The evolution of telecommunications networks continues without interruption, transforming society and redefining many of the services we have become familiar with, often improving quality of life. Connectivity, mobility, and the network's ability to support new services have changed the way we perform daily activities: reading a newspaper, managing a bank account, playing games, or communicating. Technologies such as IoT, artificial intelligence, virtualization, and the widespread availability of data are paving the way for services that are difficult even to imagine today.
As services become increasingly central and critical, their availability over time cannot be compromised without a significant impact. A disruption can cause a severe setback for those who rely on those services. Network resilience therefore becomes a key requirement for next-generation mobile networks. But what does it mean to have a resilient network? What makes a system resilient, and how can we improve it? Our research starts with these key questions. While using the classic definition of resilience (the ability of a system to respond rapidly to internal or external stresses that compromise performance), we argue that improving it requires identifying the system functions that contribute to resilience and quantifying their contributions.
There is also a connection to reliability. Reliability describes a system's ability to maintain expected performance over time under normal conditions. In contrast, resilience is the ability to withstand abnormal conditions and restore proper operation. Both involve managing faults. Increasing service availability means reducing the probability of faults and the time and cost needed to resolve them.
In this sense, if resilience is the system's ability to respond to an internal or external event that compromises its performance, then fault management (the process of detecting, isolating, and correcting faults) is the system function through which we address faults and seek to resolve them.
Our research, as presented in this thesis, produced several key outcomes. We defined a resilience model for mobile network nodes, explored the concept of failure, and developed a flexible method for calculating failure probability. We investigated the probability of failure and the cost of mitigating its impact. Based on these results, we proposed a framework for fault management and identified an optimal hardware design to improve reliability and resilience. Maintaining a holistic approach was necessary to develop a fault management model for all system levels: hardware, firmware, and software. Each layer works autonomously when possible, or together with the others, to achieve fault management's essential goal: recovery from a faulty condition. This leads to board-level design requirements for high-quality service and experience.
Place, publisher, year, edition, pages
Mälardalens universitet, 2026, p. 188
Series
Mälardalen University Press Dissertations, ISSN 1651-4238 ; 466
Keywords [en]
Hardware Fault Management, Embedded Systems, System Resiliency, Dependability
National Category
Embedded Systems
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:mdh:diva-76637
ISBN: 978-91-7485-757-3 (print)
OAI: oai:DiVA.org:mdh-76637
DiVA, id: diva2:2055715
Public defence
2026-06-09, Lambda and online, Mälardalens universitet, Västerås, 13:15 (English)
Funder
Knowledge Foundation
2026-04-29, 2026-04-27, 2026-05-19. Bibliographically approved