Run Time Memory Error Recovery Process in Networking System.
Mälardalen University, School of Innovation, Design and Engineering, Embedded Systems. ERICSSON AB, Sweden. (ARRAY++) ORCID iD: 0000-0003-2598-6796
ERICSSON AB, Sweden. ORCID iD: 0000-0002-3755-562X
ERICSSON AB, Sweden. (ACICS) ORCID iD: 0000-0003-2612-4135
2023 (English)
In: 7th IEEE International Conference on System Reliability and Safety / [ed] IEEE, Bologna, Italy: IEEE conference proceedings, 2023, p. 590-597
Conference paper, Published paper (Refereed)
Abstract [en]

System memory errors have always been problematic; today, they cause more than forty percent of confirmed hardware errors in repair centers for both data centers and telecommunications network nodes. Therefore, it is somewhat expected that, in recent years, device manufacturers improved the hardware features to support hardware-assisted fault management implementation. For example, the new standard, DDR5, includes both data redundancy, the so-called Error Correcting Code (ECC), and physical redundancy, the post-package repair (PPR), as mandatory features. Production and repair centers mainly use physical redundancy to replace faulty memory rows. In contrast, field use still needs to be improved, mainly due to a need for integrated system solutions for network nodes. This paper aims to compensate for this shortcoming and presents a system solution for handling memory errors. It is a multi-technology proposition (mixed use of ECC and PPR) based on multi-layer (hardware, firmware, and software) error information exchange.
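
As a rough illustration of how ECC and PPR could be combined at run time, a minimal sketch follows. It is not the algorithm from the paper, whose details are not given in this abstract. The class `MemoryErrorManager`, the parameters `ce_threshold` and `spare_rows`, and the threshold policy itself are hypothetical assumptions made for this example only.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    LOG_ONLY = "log_only"      # ECC corrected the error; just record it
    REPAIR_ROW = "repair_row"  # ask the lower layers to run PPR on the row
    ESCALATE = "escalate"      # no spare row left; report to upper layers


@dataclass
class MemoryErrorManager:
    """Hypothetical policy: tolerate a few corrected errors, then repair."""
    ce_threshold: int = 3   # assumed limit of corrected errors per row
    spare_rows: int = 1     # assumed number of spare rows available for PPR
    ce_counts: Counter = field(default_factory=Counter)

    def report(self, row: int, correctable: bool) -> Action:
        if correctable:
            self.ce_counts[row] += 1
            if self.ce_counts[row] < self.ce_threshold:
                return Action.LOG_ONLY
        # Uncorrectable error, or repeated corrected errors on the same row:
        # treat the row as a suspected permanent (hard) fault.
        if self.spare_rows > 0:
            self.spare_rows -= 1
            self.ce_counts.pop(row, None)
            return Action.REPAIR_ROW
        return Action.ESCALATE


manager = MemoryErrorManager(ce_threshold=3, spare_rows=1)
events = [(7, True), (7, True), (7, True), (9, False)]
print([manager.report(row, ce).value for row, ce in events])
# → ['log_only', 'log_only', 'repair_row', 'escalate']
```

In this sketch the software layer only counts errors reported from below and decides when to request a repair; how hardware and firmware would actually report errors and execute PPR is outside its scope.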

Place, publisher, year, edition, pages
Bologna, Italy: IEEE conference proceedings, 2023. p. 590-597
Keywords [en]
Memory Faults, Fault Management, Post-Package Repair, Error Correcting Code, Run Time Fault Recovering
National Category
Telecommunications
Identifiers
URN: urn:nbn:se:mdh:diva-66225
DOI: 10.1109/ICSRS59833.2023.10381346
Scopus ID: 2-s2.0-85183463653
ISBN: 979-8-3503-0606-4 (print)
OAI: oai:DiVA.org:mdh-66225
DiVA, id: diva2:1843491
Conference
ICSRS2023
Available from: 2024-03-11 Created: 2024-03-11 Last updated: 2024-04-10 Bibliographically approved
In thesis
1. The role of fault management in the embedded system design
2024 (English) Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

In the last decade, the world of telecommunications has seen the value of services definitively affirmed and the loss of the connectivity value. This change of pace in the use of the network (and available hardware resources) has led to continuous, unlimited growth in data traffic, increased incomes for service providers, and a constant erosion of operators’ incomes for voice and Short Message Service (SMS) traffic.

The change in mobile service consumption is evident to operators. The market today is in the hands of over the top (OTT) media content delivery companies (Google, Meta, Netflix, Amazon, etc.), and the fifth generation of mobile networks (5G), the latest generation of mobile architecture, is nothing other than how operators can invest in system infrastructure to participate in the prosperous service business.

With the advent of 5G, the worlds of cloud and telecommunications have found their meeting point, paving the way for new infrastructures and services, such as smart cities, industry 4.0, industry 5.0, and Augmented Reality (AR)/Virtual Reality (VR). People, infrastructures, and devices are connected to provide services that we even struggle to imagine today, but a highly interconnected system requires high levels of reliability and resilience.

Hardware reliability has increased since the 1990s. However, it is equally correct to mention that the introduction of new technologies in the nanometer domain and the growing complexity of on-chip systems have made fault management critical to guarantee the quality of the service offered to the customer and the sustainability of the network infrastructure.

In this thesis, our first contribution is a review of the fault management implementation framework for the radio access network domain. Our approach introduces a holistic vision in fault management where there is increasingly more significant attention to the recovery action, the crucial target of the proposed framework. A new contribution underlines the attention toward the recovery target: we revisited the taxonomy of faults in mobile systems to enhance the result of the recovery action, which, in our opinion, must be propagated between the different layers of an embedded system (hardware, firmware, middleware, and software). The practical adoption of the new framework and the new taxonomy allowed us to make a unique contribution to the thesis: the proposal of a new algorithm for managing system memory errors, both temporary (soft) and permanent (hard).

The holistic vision of error management we introduced in this thesis involves hardware that proactively manages faults. An efficient implementation of fault management is only possible if the hardware design considers error-handling techniques and methodologies. Another contribution of this thesis is the definition of the fault management requirements for the RAN embedded system hardware design.

Another primary function of the proposed fault management framework is fault prediction. Recognizing error patterns means allowing the system to react in time, even before the error condition occurs, or identifying the topology of the error to implement more targeted and, therefore, more efficient recovery actions. The operating temperature is always a critical characteristic of embedded radio access network systems. Base stations must be able to work in very different temperature conditions. However, the working temperature also directly affects the probability of error for the system. In this thesis, we have also contributed in terms of a machine-learning algorithm for predicting the working temperature of base stations in radio access networks, a first step towards a more sophisticated implementation of error prevention and prediction.

Place, publisher, year, edition, pages
Västerås: Mälardalens universitet, 2024
Series
Mälardalen University Press Licentiate Theses, ISSN 1651-9256 ; 357
Keywords
Fault Management, Resilient system, Recovery methodology.
National Category
Telecommunications
Research subject
Computer Science
Identifiers
urn:nbn:se:mdh:diva-66227 (URN)
978-91-7485-639-2 (ISBN)
Presentation
2024-04-18, Milos, Mälardalens universitet, Västerås, 13:15 (English)
Available from: 2024-03-11 Created: 2024-03-11 Last updated: 2024-03-28 Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus
https://ieeexplore.ieee.org/document/10381346

Authority records

Vitucci, Carlo
Nolte, Thomas

Search in DiVA

By author/editor
Vitucci, Carlo
Danielsson, Jakob
Jägemar, Marcus
Larsson, Alf
Nolte, Thomas
By organisation
Embedded Systems
Telecommunications
