https://www.mdu.se/

A Reliability-oriented Faults Taxonomy and a Recovery-oriented Methodological Approach for Systems Resilience
Technology Management Ericsson Ab, Stockholm, Sweden.
Mälardalen University, School of Innovation, Design and Engineering, Embedded Systems. ORCID iD: 0000-0002-5032-2310
Sys Compute Dimensioning Ericsson Ab, Stockholm, Sweden.
Sys Architecture Ericsson Ab, Stockholm, Sweden.
2022 (English). In: Proceedings - 2022 IEEE 46th Annual Computers, Software, and Applications Conference, COMPSAC 2022, IEEE, 2022, p. 48-55. Conference paper, Published paper (Refereed)
Abstract [en]

Fault management is an important function that impacts the design of any digital system, from the simple kiosk in a shop to a complex 6G network. It is common to classify fault conditions into different taxonomies using terms like fault or error. Fault taxonomies are often suitable for managing fault detection, fault reporting, and fault localization but often neglect to support all the different functions required by a fault management process. A correctly implemented fault management process must be able to distinguish between defects and faults, decide upon appropriate actions to recover the system to an ideal state, and avoid an error condition. Fault management is a multi-disciplinary process where recovery actions are deployed promptly by combined hardware, firmware, and software orchestration. The importance of fault management processes significantly increases with modern nanometer technologies, which suffer the risk of so-called soft errors: the corruption of bit cells caused by spurious disturbances such as cosmic radiation. Modern fault management implementations must support recovery actions for soft errors to ensure a steady system. This paper describes an extended fault classification model that emphasizes fault management and recovery actions. We aim to show how the reliability-based fault taxonomy definition is more suitable for the overall fault management process.

Place, publisher, year, edition, pages
IEEE, 2022. p. 48-55
Keywords [en]
Fault management, Fault taxonomy, Fault topology, Reliability
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:mdh:diva-59894
DOI: 10.1109/COMPSAC54236.2022.00016
ISI: 000855983300008
Scopus ID: 2-s2.0-85136988154
ISBN: 9781665488105 (print)
OAI: oai:DiVA.org:mdh-59894
DiVA, id: diva2:1694026
Conference
2022 IEEE 46th Annual Computers, Software, and Applications Conference, Online, 27/6-1/7 2022
Note

Export Date: 8 September 2022; Conference Paper

Available from: 2022-09-08 Created: 2022-09-08 Last updated: 2024-03-11. Bibliographically approved
In thesis
1. The role of fault management in the embedded system design
2024 (English). Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

In the last decade, the world of telecommunications has seen the value of services definitively affirmed and the loss of the connectivity value. This change of pace in the use of the network (and available hardware resources) has led to continuous, unlimited growth in data traffic, increased incomes for service providers, and a constant erosion of operators' incomes for voice and Short Message Service (SMS) traffic.

The change in mobile service consumption is evident to operators. The market today is in the hands of over-the-top (OTT) media content delivery companies (Google, Meta, Netflix, Amazon, etc.), and the fifth generation of mobile networks (5G), the latest generation of mobile architecture, is nothing other than how operators can invest in system infrastructure to participate in the prosperous service business.

With the advent of 5G, the worlds of cloud and telecommunications have found their meeting point, paving the way for new infrastructures and services, such as smart cities, industry 4.0, industry 5.0, and Augmented Reality (AR)/Virtual Reality (VR). People, infrastructures, and devices are connected to provide services that we even struggle to imagine today, but a highly interconnected system requires high levels of reliability and resilience.

Hardware reliability has increased since the 1990s. However, it is equally correct to mention that the introduction of new technologies in the nanometer domain and the growing complexity of on-chip systems have made fault management critical to guarantee the quality of the service offered to the customer and the sustainability of the network infrastructure.

In this thesis, our first contribution is a review of the fault management implementation framework for the radio access network (RAN) domain. Our approach introduces a holistic vision in fault management in which increasingly significant attention is paid to the recovery action, the crucial target of the proposed framework. A new contribution underlines the attention toward the recovery target: we revisited the taxonomy of faults in mobile systems to enhance the result of the recovery action, which, in our opinion, must be propagated between the different layers of an embedded system (hardware, firmware, middleware, and software). The practical adoption of the new framework and the new taxonomy allowed us to make a unique contribution to the thesis: the proposal of a new algorithm for managing system memory errors, both temporary (soft) and permanent (hard).

The holistic vision of error management we introduced in this thesis involves hardware that proactively manages faults. An efficient implementation of fault management is only possible if the hardware design considers error-handling techniques and methodologies. Another contribution of this thesis is the definition of the fault management requirements for the RAN embedded system hardware design.

Another primary function of the proposed fault management framework is fault prediction. Recognizing error patterns means allowing the system to react in time, even before the error condition occurs, or identifying the topology of the error to implement more targeted and, therefore, more efficient recovery actions. The operating temperature is always a critical characteristic of embedded radio access network systems. Base stations must be able to work in very different temperature conditions. However, the working temperature also directly affects the probability of error for the system. In this thesis, we have also contributed a machine-learning algorithm for predicting the working temperature of base stations in radio access networks, a first step towards a more sophisticated implementation of error prevention and prediction.

Place, publisher, year, edition, pages
Västerås: Mälardalens universitet, 2024
Series
Mälardalen University Press Licentiate Theses, ISSN 1651-9256 ; 357
Keywords
Fault Management, Resilient system, Recovery methodology.
National Category
Telecommunications
Research subject
Computer Science
Identifiers
urn:nbn:se:mdh:diva-66227 (URN)
978-91-7485-639-2 (ISBN)
Presentation
2024-04-18, Milos, Mälardalens universitet, Västerås, 13:15 (English)
Opponent
Supervisors
Available from: 2024-03-11 Created: 2024-03-11 Last updated: 2024-03-28. Bibliographically approved

Open Access in DiVA

No full text in DiVA

Authority records

Sundmark, Daniel; Nolte, Thomas
