https://www.mdu.se/

The role of fault management in the embedded system design
Mälardalen University, School of Innovation, Design and Engineering, Embedded Systems. ERICSSON AB. (ARRAY++) ORCID iD: 0000-0003-2598-6796
2024 (English) Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

In the last decade, the world of telecommunications has seen the value of services definitively affirmed and the loss of the connectivity value. This change of pace in the use of the network (and available hardware resources) has led to continuous, unlimited growth in data traffic, increased incomes for service providers, and a constant erosion of operators’ incomes for voice and Short Message Service (SMS) traffic.

The change in mobile service consumption is evident to operators. The market today is in the hands of over the top (OTT) media content delivery companies (Google, Meta, Netflix, Amazon, etc.), and the fifth generation of mobile networks (5G), the latest generation of mobile architecture, is nothing other than how operators can invest in system infrastructure to participate in the prosperous service business.

With the advent of 5G, the worlds of cloud and telecommunications have found their meeting point, paving the way for new infrastructures and services, such as smart cities, industry 4.0, industry 5.0, and Augmented Reality (AR)/Virtual Reality (VR). People, infrastructures, and devices are connected to provide services that we even struggle to imagine today, but a highly interconnected system requires high levels of reliability and resilience.

Hardware reliability has increased since the 1990s. However, it is equally correct to mention that the introduction of new technologies in the nanometer domain and the growing complexity of on-chip systems have made fault management critical to guarantee the quality of the service offered to the customer and the sustainability of the network infrastructure.

In this thesis, our first contribution is a review of the fault management implementation framework for the radio access network domain. Our approach introduces a holistic vision in fault management where there is increasingly more significant attention to the recovery action, the crucial target of the proposed framework. A new contribution underlines the attention toward the recovery target: we revisited the taxonomy of faults in mobile systems to enhance the result of the recovery action, which, in our opinion, must be propagated between the different layers of an embedded system (hardware, firmware, middleware, and software). The practical adoption of the new framework and the new taxonomy allowed us to make a unique contribution to the thesis: the proposal of a new algorithm for managing system memory errors, both temporary (soft) and permanent (hard).

The holistic vision of error management we introduced in this thesis involves hardware that proactively manages faults. An efficient implementation of fault management is only possible if the hardware design considers error-handling techniques and methodologies. Another contribution of this thesis is the definition of the fault management requirements for the RAN embedded system hardware design.

Another primary function of the proposed fault management framework is fault prediction. Recognizing error patterns means allowing the system to react in time, even before the error condition occurs, or identifying the topology of the error to implement more targeted and, therefore, more efficient recovery actions. The operating temperature is always a critical characteristic of embedded radio access network systems. Base stations must be able to work in very different temperature conditions. However, the working temperature also directly affects the probability of error for the system. In this thesis, we have also contributed in terms of a machine-learning algorithm for predicting the working temperature of base stations in radio access networks, a first step towards a more sophisticated implementation of error prevention and prediction.

Place, publisher, year, edition, pages
Västerås: Mälardalens universitet, 2024.
Series
Mälardalen University Press Licentiate Theses, ISSN 1651-9256 ; 357
Keywords [en]
Fault Management, Resilient system, Recovery methodology.
National Category
Telecommunications
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:mdh:diva-66227
ISBN: 978-91-7485-639-2 (print)
OAI: oai:DiVA.org:mdh-66227
DiVA, id: diva2:1843504
Presentation
2024-04-18, Milos, Mälardalens universitet, Västerås, 13:15 (English)
Opponent
Supervisors
Available from: 2024-03-11 Created: 2024-03-11 Last updated: 2024-03-28 Bibliographically approved
List of papers
1. Ambient Temperature Prediction for Embedded Systems using Machine Learning
2023 (English) In: International Conference on Engineering of Computer-Based Systems / [ed] Springer, Västerås, Sweden: Springer Nature, 2023, p. 12-25
Conference paper, Published paper (Refereed)
Abstract [en]

In this work, we use two well-established machine learning algorithms, Random Forest (RF) and XGBoost, to predict the ambient temperature for a baseband board. After providing an overview of the related work, we describe how we train the two ML models and identify the optimal training and test datasets to avoid the problems of under- and over-fitting. Given this train/test split, the trained RF and XGBoost models provide temperature predictions with an error below one degree Celsius, i.e., far better than any other approach that we used in the past. Our feature importance assessments reveal that the temperature sensors contribute significantly more towards predicting the ambient temperature than the power and voltage readings. Furthermore, the RF model appears less volatile than XGBoost on our training data. As the results demonstrate, our predictive temperature models allow for an accurate error prediction as a function of baseband board sensors.
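The evaluation idea in this abstract (train on one part of the data, then check the prediction error on a held-out part against a one-degree target) can be illustrated with a small sketch. It is not the paper's pipeline: the data is synthetic, the 80/20 split ratio is an assumption, and a one-variable least-squares line stands in for the RF and XGBoost models so that the example needs only the Python standard library.

```python
import random
from statistics import mean

def make_samples(n, seed=0):
    """Synthetic (board_sensor_temp, ambient_temp) pairs; not real base-station data."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        ambient = rng.uniform(-20.0, 50.0)
        # Assumed relation: the board sensor reads ambient plus self-heating plus noise.
        sensor = ambient + 25.0 + rng.gauss(0.0, 0.5)
        samples.append((sensor, ambient))
    return samples

def fit_line(pairs):
    """Ordinary least squares for y = a * x + b (stand-in for RF/XGBoost)."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    a = sum((x - mx) * (y - my) for x, y in pairs) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def mean_absolute_error(model, pairs):
    """Average absolute difference between predicted and true ambient temperature."""
    a, b = model
    return mean(abs((a * x + b) - y) for x, y in pairs)

samples = make_samples(500)
split = int(0.8 * len(samples))      # 80/20 hold-out split (an assumed ratio)
train, test = samples[:split], samples[split:]
model = fit_line(train)
mae = mean_absolute_error(model, test)
print(f"held-out MAE: {mae:.2f} degrees Celsius")
```

Measuring the error only on samples the model never saw during fitting is what makes over-fitting visible: a model that memorises the training set would score well there and poorly on the held-out part.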

Place, publisher, year, edition, pages
Västerås, Sweden: Springer Nature, 2023
Keywords
Predictive Maintenance, Temperature prediction, Radio Access Network
National Category
Telecommunications
Identifiers
urn:nbn:se:mdh:diva-66226 (URN)
10.1007/978-3-031-49252-5_3 (DOI)
2-s2.0-85180147389 (Scopus ID)
978-3-031-49251-8 (ISBN)
Conference
ECBS2023
Available from: 2024-03-11 Created: 2024-03-11 Last updated: 2024-03-11
2. Run Time Memory Error Recovery Process in Networking System.
2023 (English) In: 7th IEEE International Conference on System Reliability and Safety / [ed] IEEE, Bologna, Italy: IEEE conference proceedings, 2023, p. 590-597
Conference paper, Published paper (Refereed)
Abstract [en]

System memory errors have always been problematic; today, they cause more than forty percent of confirmed hardware errors in repair centers for both data centers and telecommunications network nodes. Therefore, it is somewhat expected that, in recent years, device manufacturers have improved the hardware features to support hardware-assisted fault management implementation. For example, the new standard, DDR5, includes both data redundancy, the so-called Error Correcting Code (ECC), and physical redundancy, the post-package repair (PPR), as mandatory features. Production and repair centers mainly use physical redundancy to replace faulty memory rows. In contrast, field use still needs to be improved, mainly due to a lack of integrated system solutions for network nodes. This paper aims to compensate for this shortcoming and presents a system solution for handling memory errors. It is a multi-technology proposition (mixed use of ECC and PPR) based on multi-layer (hardware, firmware, and software) error information exchange.
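One way to picture a mixed use of ECC and PPR is a per-row counter that treats isolated corrected errors as soft and repeated errors on the same row as hard. The sketch below is only an illustration under stated assumptions, not the algorithm from the paper: the class name, the threshold of three repeats, the single spare row, and the action names `scrub`, `ppr`, and `escalate` are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class MemoryErrorPolicy:
    hard_threshold: int = 3   # assumed: repeats on one row before it counts as a hard fault
    spare_rows: int = 1       # assumed number of spare rows still available for PPR
    counts: Counter = field(default_factory=Counter)

    def on_corrected_error(self, row: int) -> str:
        """Handle one ECC-corrected error on `row` and return the chosen action."""
        self.counts[row] += 1
        if self.counts[row] < self.hard_threshold:
            return "scrub"            # likely soft: rewrite the corrected data
        if self.spare_rows > 0:
            self.spare_rows -= 1
            del self.counts[row]      # the address now maps to a fresh spare row
            return "ppr"              # likely hard: replace the row with a spare
        return "escalate"             # no spare left: hand over to a higher layer

policy = MemoryErrorPolicy()
actions = [policy.on_corrected_error(row) for row in (7, 7, 9, 7, 9, 9)]
print(actions)
```

In this run row 7 reaches the threshold first and consumes the only spare row, so the later repeated errors on row 9 can no longer be repaired locally and are escalated.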

Place, publisher, year, edition, pages
Bologna, Italy: IEEE conference proceedings, 2023
Keywords
Memory Faults, Fault Management, Post-Package Repair, Error Correcting Code, Run Time Fault Recovering
National Category
Telecommunications
Identifiers
urn:nbn:se:mdh:diva-66225 (URN)
10.1109/ICSRS59833.2023.10381346 (DOI)
2-s2.0-85183463653 (Scopus ID)
979-8-3503-0606-4 (ISBN)
Conference
ICSRS2023
Available from: 2024-03-11 Created: 2024-03-11 Last updated: 2024-04-10 Bibliographically approved
3. A Reliability-oriented Faults Taxonomy and a Recovery-oriented Methodological Approach for Systems Resilience
2022 (English) In: Proceedings - 2022 IEEE 46th Annual Computers, Software, and Applications Conference, COMPSAC 2022, IEEE, 2022, p. 48-55
Conference paper, Published paper (Refereed)
Abstract [en]

Fault management is an important function that impacts the design of any digital system, from the simple kiosk in a shop to a complex 6G network. It is common to classify fault conditions into different taxonomies using terms like fault or error. Fault taxonomies are often suitable for managing fault detection, fault reporting, and fault localization but often neglect to support all the different functions required by a fault management process. A correctly implemented fault management process must be able to distinguish between defects and faults, decide upon appropriate actions to recover the system to an ideal state, and avoid an error condition. Fault management is a multi-disciplinary process where recovery actions are deployed promptly by combined hardware, firmware, and software orchestration. The importance of fault management processes significantly increases with modern nanometer technologies, which suffer the risk of so-called soft errors, corruptions of bit cells that can happen due to spurious disturbances, like cosmic radiation. Modern fault management implementations must support recovery actions for soft errors to ensure a steady system. This paper describes an extended fault classification model that emphasizes fault management and recovery actions. We aim to show how the reliability-based fault taxonomy definition is more suitable for the overall fault management process.

Place, publisher, year, edition, pages
IEEE, 2022
Keywords
Fault management, Fault taxonomy, Fault topology, Reliability
National Category
Computer Sciences
Identifiers
urn:nbn:se:mdh:diva-59894 (URN)
10.1109/COMPSAC54236.2022.00016 (DOI)
000855983300008 ()
2-s2.0-85136988154 (Scopus ID)
9781665488105 (ISBN)
Conference
2022 IEEE 46th Annual Computers, Software, and Applications Conference, Online, 27/6-1/7 2022
Available from: 2022-09-08 Created: 2022-09-08 Last updated: 2024-03-11 Bibliographically approved
4. Fault Management Framework and Multi-layer Recovery Methodology for Resilient System
2022 (English) In: 2022 6th International Conference on System Reliability and Safety, ICSRS 2022, Institute of Electrical and Electronics Engineers Inc., 2022, p. 32-39
Conference paper, Published paper (Refereed)
Abstract [en]

Fault management is a key function to guarantee the quality of the service. Research has done a lot to improve fault supervision, and investigation is ongoing in fault prediction, thanks to the potential of artificial intelligence and machine learning. In this study, we propose a fault management framework that puts an emphasis on fault recovery: a framework developed on multi-layer functions and a fault recovery methodology distributed over several technological layers. The basic principle of our proposal is that the system's complexity exposes it to a higher probability of temporary error. Newfound attention to the fault recovery phase is the key to keeping the service's quality high and saving maintenance costs by decreasing the return rate.
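A recovery methodology distributed over several layers can be pictured as an escalation chain: the lowest layer tries first, and a fault it cannot handle is passed upward. The sketch below assumes that ordering and uses hypothetical fault names and handlers; it illustrates the general pattern, not the framework proposed in the paper.

```python
from typing import Callable, List, Optional, Tuple

# A handler returns True when its layer managed to recover from the fault.
Handler = Callable[[str], bool]

def recover(fault: str, layers: List[Tuple[str, Handler]]) -> Optional[str]:
    """Try each layer in order; return the name of the first layer that recovers."""
    for name, handler in layers:
        if handler(fault):
            return name
    return None  # no layer could recover: the unit needs external maintenance

# Hypothetical handlers, ordered from the lowest layer to the highest.
layers: List[Tuple[str, Handler]] = [
    ("hardware", lambda f: f == "single_bit_flip"),
    ("firmware", lambda f: f == "link_retrain_needed"),
    ("software", lambda f: f in {"process_crash", "link_retrain_needed"}),
]

for fault in ("single_bit_flip", "process_crash", "board_failure"):
    print(fault, "->", recover(fault, layers))
```

A fault that no layer can recover from (here `board_failure`) is the case that ends in a unit being returned, which is why recovering more faults in the lower layers reduces the return rate.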

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers Inc., 2022
Keywords
Fault Management, Recovery methodology, Resilient system, Artificial intelligence, Artificial intelligence learning, Fault prediction, Fault recovery, Machine-learning, Management frameworks, Management IS, Multi-layers, Resilient systems, Failure analysis
National Category
Computer Sciences
Identifiers
urn:nbn:se:mdh:diva-62284 (URN)
10.1109/ICSRS56243.2022.10067849 (DOI)
000981836500005 ()
2-s2.0-85151674429 (Scopus ID)
9781665470926 (ISBN)
Conference
6th International Conference on System Reliability and Safety, ICSRS 2022, Venice 23 November 2022 through 25 November 2022
Available from: 2023-04-19 Created: 2023-04-19 Last updated: 2024-03-11 Bibliographically approved
5. Fault Management Impacts on the Networking Systems Hardware Design
2023 (English) In: IECON Proceedings (Industrial Electronics Conference), IEEE Computer Society, 2023
Conference paper, Published paper (Refereed)
Abstract [en]

Processing capacity distribution has become widespread in the fog computing era. End-user services have multiplied, from consumer products to Industry 5.0. In this scenario, the services must have a very high reliability level. But in a system with such displacement of hardware, the reliability of the service necessarily passes through the hardware design. Devices shall have a high quality, but they shall also efficiently support fault management. Hardware design must take into account all fault management functions and participate in creating a fault management policy to ensure that the ultimate goal of fault management is fulfilled, namely to increase a system's reliability efficiently and sustainably, in terms of both the system's performance and the product's cost. This paper analyzes the hardware design techniques that efficiently contribute to the realization of fault management and, consequently, guarantee a high level of reliability and availability for the services offered to the end customer. We describe hardware requirements and how they affect the choice of devices in the hardware design of networking systems.
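The link between fault management and availability can be made concrete with the standard steady-state formula, availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR the mean time to repair. The figures below are hypothetical and chosen only to show the effect of shortening recovery time while the failure rate stays the same.

```python
HOURS_PER_YEAR = 8760

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is operational."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def annual_downtime_hours(avail: float) -> float:
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR

# Hypothetical figures: same failure rate, different recovery times.
manual_repair = availability(mtbf_hours=10_000, mttr_hours=4.0)
automatic_recovery = availability(mtbf_hours=10_000, mttr_hours=0.1)

for label, a in (("manual repair", manual_repair),
                 ("automatic recovery", automatic_recovery)):
    print(f"{label}: availability {a:.6f}, "
          f"about {annual_downtime_hours(a):.2f} h/year of downtime")
```

Under these assumed numbers the hardware fails equally often in both cases; the availability gain comes entirely from faster recovery, which is the part that hardware support for fault management can influence.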

Place, publisher, year, edition, pages
IEEE Computer Society, 2023
Keywords
availability, fault management, hardware design, networking system, reliability, requirements, serviceability, Design, Failure analysis, Fog computing, Capacity distribution, End-users, Networking systems, Processing capacities, Requirement, System hardware, User services, Consumer products
National Category
Other Mechanical Engineering
Identifiers
urn:nbn:se:mdh:diva-65181 (URN)
10.1109/IECON51785.2023.10312698 (DOI)
2-s2.0-85179525819 (Scopus ID)
9798350331820 (ISBN)
Conference
49th Annual Conference of the IEEE Industrial Electronics Society, IECON 2023, Singapore, 16 October through 19 October 2023
Available from: 2023-12-21 Created: 2023-12-21 Last updated: 2024-03-11 Bibliographically approved

Open Access in DiVA

fulltext (2102 kB), 100 downloads
File information
File name: FULLTEXT02.pdf
File size: 2102 kB
Checksum SHA-512:
a1990af2dce7bb5dd1574491803d28b34ac6151960fb0181b2a90d48acdb63324655487d9504f4841f8c8e6742cc10c76f0826a2eb372b331e8a3787aa8f47e5
Type: fulltext
Mimetype: application/pdf

Authority records

Vitucci, Carlo
