Although NoC-based systems with many cores are commercially available, their multi-hop nature has become a bottleneck for scaling performance and energy consumption. Hybrid wireless NoCs offer a way out by exploiting single-hop express links for long-distance communication. There is also a common wisdom that the grid-like mesh is the most stable topology in conventional designs, which is why almost all emerging architectures have relied on it as well. In this paper, we first challenge the efficiency of the grid-like mesh in emerging systems. We then propose HoneyWiN, a hybrid reconfigurable wireless NoC architecture based on the honeycomb topology. The simulation results show that, on average, HoneyWiN reduces energy consumption by 17% while increasing network throughput by 10% compared to its wireless mesh counterpart.
Convolutional Neural Networks (CNNs) have become integral to safety-critical applications, raising concerns about their fault tolerance. Conventional hardware-dependent fault tolerance methods, such as Triple Modular Redundancy (TMR), are computationally expensive and impose a substantial overhead on CNNs. Whereas fault tolerance techniques can be applied either at the hardware level or at the model level, the latter provides more flexibility without sacrificing generality. This paper introduces a model-level hardening approach for CNNs that integrates error correction directly into the neural network. The approach is hardware-agnostic and requires no changes to the underlying accelerator device. Analyzing the vulnerability of parameters enables the duplication of selected filters/neurons so that their output channels are effectively corrected by an efficient and robust correction layer. The proposed method demonstrates fault resilience nearly equivalent to TMR-based correction but with significantly reduced overhead. Nevertheless, it still carries an inherent overhead relative to the baseline CNNs. To tackle this issue, a cost-effective parameter-vulnerability-based pruning technique is proposed that outperforms conventional pruning, yielding smaller networks with negligible accuracy loss. Remarkably, the hardened pruned CNNs perform up to 24% faster than the hardened unpruned ones.
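A minimal sketch of the idea, assuming ReLU activations and that fault-induced errors tend to inflate channel magnitudes: selected (critical) convolution filters are duplicated, and an element-wise minimum over each duplicated channel pair serves as the correction. The minimum-based correction rule and the DuplicatedChannelCorrection wrapper are illustrative assumptions, not the paper's exact correction layer.

```python
# Illustrative sketch (assumed correction rule, not the paper's exact layer):
# duplicate critical filters and fuse each (original, duplicate) channel pair.
import torch
import torch.nn as nn

class DuplicatedChannelCorrection(nn.Module):
    """Hypothetical wrapper: appends duplicates of critical filters and corrects
    each critical output channel with an element-wise minimum (assumes ReLU and
    that faults mostly inflate activation values)."""
    def __init__(self, conv: nn.Conv2d, critical):
        super().__init__()
        w = conv.weight.data
        dup_w = torch.cat([w, w[critical]], dim=0)           # original + duplicated filters
        self.conv = nn.Conv2d(conv.in_channels, dup_w.shape[0], conv.kernel_size,
                              conv.stride, conv.padding, bias=False)
        self.conv.weight.data = dup_w
        self.n_orig, self.critical = w.shape[0], critical

    def forward(self, x):
        y = torch.relu(self.conv(x))
        out, dup = y[:, :self.n_orig].clone(), y[:, self.n_orig:]
        for k, c in enumerate(self.critical):
            out[:, c] = torch.minimum(out[:, c], dup[:, k])   # assumed correction
        return out

layer = DuplicatedChannelCorrection(nn.Conv2d(3, 8, 3, padding=1, bias=False), critical=[1, 5])
print(layer(torch.randn(1, 3, 32, 32)).shape)                 # torch.Size([1, 8, 32, 32])
```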
The reliability of Artificial Neural Networks (ANNs) has emerged as a prominent research topic due to their increasing use in safety-critical applications. Long Short-Term Memory (LSTM) ANNs have demonstrated significant advantages in healthcare applications, primarily thanks to their robust processing of time-series data and their memory capabilities. This paper presents, for the first time, a comprehensive and fine-grained analysis of the resilience of LSTM-based ANNs in the context of gait analysis, using fault injection into the weights. Additionally, we improve their resilience by replacing faulty weights with zero, enabling the ANNs to withstand environments that are up to 20 times harsher while experiencing up to 7 times fewer critical faults than an unprotected ANN.
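As a rough illustration of the fault model and the zero-replacement protection, the toy sketch below flips a single bit in a float32 weight array and then zeroes any weight that is non-finite or falls outside the healthy magnitude range; the range-based detection criterion is an assumption for illustration, not the paper's exact mechanism.

```python
# Hedged sketch: single bit-flip injection into float32 weights, plus a simple
# protection pass that zeroes suspicious weights (assumed detection criterion).
import struct
import numpy as np

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0..31) in the IEEE-754 float32 encoding of value."""
    as_int = struct.unpack("<I", struct.pack("<f", value))[0]
    return struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))[0]

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=1000).astype(np.float32)
bound = float(np.abs(weights).max())           # magnitude bound of healthy weights

# inject a fault into a random weight at a random bit position
i, b = rng.integers(weights.size), rng.integers(32)
weights[i] = flip_bit(float(weights[i]), int(b))

# protection: replace suspicious (out-of-range or non-finite) weights with zero
faulty = ~np.isfinite(weights) | (np.abs(weights) > bound)
weights[faulty] = 0.0
print(f"zeroed {faulty.sum()} suspicious weight(s)")
```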
Artificial Intelligence (AI), and in particular Machine Learning (ML), is now used in a wide range of applications due to its ability to learn how to solve complex problems. Over the past decade, rapid advances in ML have produced Deep Neural Networks (DNNs) consisting of large numbers of neurons and layers. DNN Hardware Accelerators (DHAs) are leveraged to deploy DNNs in target applications, including safety-critical applications where hardware faults/errors can lead to catastrophic consequences. Therefore, the reliability of DNNs is an essential subject of research. In recent years, several studies assessing the reliability of DNNs have been published, proposing various reliability assessment methods for a variety of platforms and applications. Hence, there is a need to summarize the state of the art and identify the gaps in the study of DNN reliability. In this work, we conduct a Systematic Literature Review (SLR) on reliability assessment methods for DNNs to collect the relevant research as comprehensively as possible, categorize it, and address the open challenges. Through this SLR, three kinds of methods for DNN reliability assessment are identified: Fault Injection (FI), Analytical, and Hybrid methods. Since the majority of works assess DNN reliability by FI, we comprehensively characterize the different approaches and platforms of the FI method; Analytical and Hybrid methods are also presented. The identified methods are further examined with respect to the DNN platforms they target and the reliability evaluation metrics they use. Finally, we highlight the advantages and disadvantages of the identified methods and address the open challenges in the research area. We conclude that Analytical and Hybrid methods are lightweight yet sufficiently accurate, and have the potential to be extended in future research and used to establish novel DNN reliability assessment frameworks.
The superior performance of Deep Neural Networks (DNNs) has led to their application in various aspects of human life. Safety-critical applications are no exception and impose rigorous reliability requirements on DNNs. Quantized Neural Networks (QNNs) have emerged to tackle the complexity of DNN accelerators; however, they are more prone to reliability issues. In this paper, a recent analytical resilience assessment method is adapted to QNNs to identify critical neurons based on a Neuron Vulnerability Factor (NVF). Thereafter, a novel method for splitting the critical neurons is proposed that enables the design of a Lightweight Correction Unit (LCU) in the accelerator without redesigning its computational part. The method is validated by experiments on different QNNs and datasets. The results demonstrate that the proposed fault-correction method incurs half the overhead of selective Triple Modular Redundancy (TMR) while achieving a similar level of fault resiliency.
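To make the notion of a Neuron Vulnerability Factor concrete, the sketch below estimates, for a toy fully connected model, the NVF of a hidden neuron as the fraction of injected faults on that neuron's output that change a prediction; the highest-NVF neurons then become candidates for protection. The brute-force fault model and the toy network are assumptions for illustration; the paper adapts an analytical assessment rather than injection.

```python
# Illustrative NVF estimation on a toy model (assumed fault model: large random
# perturbation of one neuron's output); the paper's analytical method differs.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()
x = torch.randn(32, 1, 28, 28)               # stand-in input batch
golden = model(x).argmax(dim=1)              # fault-free predictions
hidden = model[2]                            # hook after the ReLU (64 hidden neurons)

@torch.no_grad()
def nvf(neuron: int, trials: int = 20) -> float:
    """Fraction of injected faults on `neuron` that change any prediction."""
    flips = 0
    for _ in range(trials):
        def inject(_m, _inp, out):
            out = out.clone()
            out[:, neuron] = torch.randn(out.shape[0]) * 100.0   # assumed fault model
            return out
        handle = hidden.register_forward_hook(inject)
        flips += int((model(x).argmax(dim=1) != golden).any())
        handle.remove()
    return flips / trials

critical = sorted(range(64), key=nvf, reverse=True)[:8]   # most vulnerable neurons
print("candidate neurons for a correction unit:", critical)
```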
Deep Neural Networks (DNNs) and their accelerators are being deployed ever more frequently in safety-critical applications, leading to increasing reliability concerns. The traditional and accurate method for assessing DNN reliability is fault injection, which, however, suffers from prohibitive time complexity. While analytical and hybrid fault-injection/analytical methods have been proposed, they are either inaccurate or specific to particular accelerator architectures. In this work, we propose a novel accurate, fine-grained, metric-oriented, and accelerator-agnostic method called DeepVigor that provides vulnerability value ranges for DNN neurons' outputs. An outcome of DeepVigor is an analytical model representing vulnerable and non-vulnerable ranges for each neuron, which can be exploited to develop different techniques for improving DNN reliability. Moreover, DeepVigor derives reliability assessment metrics based on vulnerability factors for bits, neurons, and layers using these vulnerability ranges. The proposed method is not only faster than fault injection but also provides extensive and accurate information about DNN reliability, independent of the accelerator. The experimental evaluations in the paper indicate that the proposed vulnerability ranges are 99.9% to 100% accurate, even when evaluated on previously unseen test data. It is also shown that the obtained vulnerability factors represent the criticality of bits, neurons, and layers proficiently. DeepVigor is implemented in the PyTorch framework and validated on complex DNN benchmarks.
Deep Learning, and in particular Deep Neural Networks (DNNs), is nowadays widely used in many scenarios, including safety-critical applications such as autonomous driving. In this context, besides energy efficiency and performance, reliability plays a crucial role, since a system failure can jeopardize human life. As with any other device, the reliability of hardware architectures running DNNs has to be evaluated, usually through costly fault injection campaigns. This paper explores the approximation and fault resiliency of DNN accelerators. We propose using approximate (AxC) arithmetic circuits to emulate errors in hardware without performing fault injection on the DNN. To allow fast evaluation of AxC DNNs, we developed an efficient GPU-based simulation framework. Furthermore, we propose a fine-grained analysis of fault resiliency by examining fault propagation and masking in the networks.
Sequence alignment is the most widely used operation in bioinformatics. With the exponential growth of biological sequence databases, searching a database for the optimal alignment of a query sequence (which can be on the order of hundreds of millions of characters long) requires excessive processing power and memory bandwidth. Sequence alignment algorithms can potentially benefit from the processing power of massively parallel processors due to their simple arithmetic operations, coupled with the inherent fine-grained and coarse-grained parallelism they exhibit. However, the limited memory bandwidth in conventional computing systems prevents exploiting the maximum achievable speedup. In this paper, we propose a processing-in-memory architecture as a viable solution to the excessive memory bandwidth demand of bioinformatics applications. The design is composed of a set of simple and lightweight processing elements, customized to the sequence alignment algorithm, integrated into the logic layer of an emerging 3D DRAM architecture. Experimental results show that the proposed architecture achieves up to 2.4x speedup and a 41% reduction in power consumption compared to a processor-side parallel implementation.
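For context, the core kernel that such processing elements would implement is a dynamic-programming alignment recurrence. The snippet below is a plain software reference of Smith-Waterman local alignment with a linear gap penalty; the scoring values are illustrative, and the paper's hardware may target a different alignment variant.

```python
# Software reference of the Smith-Waterman local alignment recurrence
# (illustrative scoring; the in-memory processing elements are hardware units).
import numpy as np

def smith_waterman(a: str, b: str, match=2, mismatch=-1, gap=-2) -> int:
    """Return the optimal local alignment score of sequences a and b."""
    H = np.zeros((len(a) + 1, len(b) + 1), dtype=np.int32)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i, j] = max(0,
                          H[i - 1, j - 1] + s,   # match/mismatch
                          H[i - 1, j] + gap,     # gap in b
                          H[i, j - 1] + gap)     # gap in a
    return int(H.max())

print(smith_waterman("ACACACTA", "AGCACACA"))
```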
On-device transfer learning involves fine-tuning pretrained neural networks on new input data directly on edge devices. The memory limitations of edge devices necessitate memory-efficient fine-tuning methods. Fine-tuning involves two primary phases: the forward pass and the backward pass. The forward pass generates output activations, and the backward pass computes gradients and updates the parameters accordingly. Although the forward pass only demands temporary memory to store a layer's input and output activations, the backward pass may require storing the output activations of all layers to compute gradients. This makes the memory cost of the backward pass the main contributor to the huge training memory demands of deep neural networks (DNNs), and it has been the focus of many studies. However, little attention has been paid to how the temporary activation memory involved in the forward pass can also act as the memory bottleneck, which is the main focus of this paper. This paper aims to mitigate this bottleneck by pruning unimportant channels from layers that require significant temporary activation memory. Experimental results demonstrate that the proposed method reduces the peak activation memory and total memory cost of MobileNetV2 by 65% and 59%, respectively, at the cost of a 3% accuracy drop.
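The quantity being reduced can be sketched as follows: per-layer temporary activation memory in the forward pass is approximately the layer's input plus output activation bytes, and the peak over layers is the bottleneck. The layer shapes below are illustrative, not MobileNetV2's actual configuration.

```python
# Hedged sketch: estimating peak temporary activation memory of the forward pass.
import numpy as np

layers = [                        # (name, input shape, output shape), CHW without batch
    ("conv1",  (3, 224, 224), (32, 112, 112)),
    ("block2", (32, 112, 112), (24, 56, 56)),
    ("block3", (24, 56, 56),  (32, 28, 28)),
]
BYTES = 4                         # float32 activations

def act_bytes(shape):
    return int(np.prod(shape)) * BYTES

per_layer = {name: act_bytes(i) + act_bytes(o) for name, i, o in layers}
peak = max(per_layer.values())
print(per_layer, "peak:", peak / 2**20, "MiB")
# Pruning channels in the layer that sets the peak shrinks its output shape
# and therefore directly lowers this peak activation memory.
```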
This paper presents a novel machine learning framework for detecting paroxysmal atrial fibrillation (PxAF), a pathological characteristic of the electrocardiogram (ECG) that can lead to fatal conditions such as heart attack. To enhance the learning process, the framework involves a generative adversarial network (GAN) together with neural architecture search (NAS) in the data preparation and classifier optimization phases. The GAN is innovatively invoked to overcome the class imbalance of the training data by producing synthetic ECG for the PxAF class in a certified manner, and the effect of the certified GAN is statistically validated. Instead of using a general-purpose classifier, the NAS automatically designs a highly accurate convolutional neural network architecture customized for the PxAF classification task. Experimental results show that the proposed framework achieves an accuracy of 99.0%, which not only improves on the state of the art by up to 5.1%, but also improves on the classification performance of two widely accepted baseline methods, ResNet-18 and Auto-Sklearn, by [Formula: see text] and [Formula: see text].
This paper presents a comprehensive software-based technique capable of detecting soft errors in embedded systems. Soft errors can be categorized into Control Flow Errors (CFEs) and data errors: CFEs erroneously change the flow of the program, while data errors corrupt the results. In this paper, a new comprehensive method is presented to detect both, based on a combination of the authors' previous works. To evaluate the proposed method, a new factor is defined that considers three main parameters simultaneously: fault coverage, memory overhead, and performance overhead. Since these parameters are very important in safety-critical applications, they should be improved concurrently. The experimental results on SPEC2000 benchmarks show that the Evaluation Factor of the proposed method is 50% better than that of the Relationship Signatures for Control Flow Checking with Data Validation (RSCFCDV) methods suggested in the literature.
The functionality advancements and novel customer features that are currently found in modern automotive systems require high-bandwidth and low-latency in-vehicle communication, which becomes even more compelling for autonomous vehicles. In a recent effort to meet these requirements, the IEEE Time-Sensitive Networking (TSN) task group has developed a set of standards that introduce novel features in switched Ethernet. TSN standards offer, for example, a common notion of time through accurate and reliable clock synchronization, delay bounds for real-time traffic, time-driven transmissions, improved reliability, and much more. In order to fully utilize the potential of these novel protocols in the automotive domain, TSN should be seamlessly integrated into the state-of-the-art and state-of-practice model-based development processes for automotive embedded systems. Some of the core phases in these processes include software architecture modeling, timing predictability verification, simulation, and hardware realization and deployment. Moreover, throughout the development of automotive embedded systems, the safety and security requirements specified on these systems need to be duly taken into account. In this context, this work provides an overview of TSN in automotive applications and discusses the recent technological developments relevant to the adoption of TSN in automotive embedded systems. The work also points out the open challenges and future research directions.
The ever-shrinking size of transistors has made Networks-on-Chip (NoCs) susceptible to faults. A single error in the NoC can disrupt the entire communication. In this paper, we introduce Defender, a fault-tolerant router architecture capable of tolerating permanent faults in all parts of the router. We employ structural modifications of the baseline router design to achieve fault tolerance. In Defender, we provide fault tolerance for the input ports and the routing computation unit by grouping neighboring ports together; a default-winner strategy provides fault resilience for the virtual channel arbiters and switch allocators; and multiple routes to the crossbar are provided to tolerate its faults. Defender thus provides improved fault tolerance in all router stages compared to the currently prevailing fault-tolerant router architectures. Reliability analysis using the Silicon Protection Factor (SPF) and Mean Time To Failure (MTTF) metrics confirms that Defender is 10.78 times more reliable than the baseline unprotected router and also more reliable than the current state-of-the-art architectures.
Controller Area Network (CAN) and Ethernet are expected to co-exist in the automotive industry, as Ethernet provides high-bandwidth communication while CAN remains a legacy, cost-effective solution. Due to the shortcomings of conventional switched Ethernet, such as its lack of determinism, the IEEE Time-Sensitive Networking (TSN) task group developed a set of standards that enhance switched Ethernet with low-jitter and deterministic communication. Considering these two network domains, we investigate various design approaches for a gateway that connects a CAN domain to a TSN domain. We present three gateway forwarding techniques and develop end-to-end delay analysis methods for them. By applying the analysis methods to synthetic use cases, we show that the intuitive existing approach of encapsulating multiple CAN frames into a single Ethernet frame is not necessarily an efficient solution. In fact, we demonstrate several cases where it is preferable to encapsulate only one CAN frame into a TSN frame, in particular when a high-speed TSN network is used. The results have a significant impact on the development of such gateways, as the implementation of one-to-one frame encapsulation is considerably simpler than the more complex gateway forwarding techniques.
This paper performs a comparative evaluation of various generations of Controller Area Network (CAN), including classical CAN, CAN Flexible Data-Rate (FD), and CAN Extra Long (XL). We use response-time analysis for the evaluation. In this regard, we identify that the state of the art lacks a response-time analysis for CAN XL. Hence, we discuss the worst-case transmission time calculations for CAN XL frames and incorporate them into the existing CAN analysis to support response-time analysis of CAN XL frames. Using the extended analysis, we perform a comparative evaluation of the three generations of CAN on an automotive industrial use case. In essence, we show that CAN FD is more advantageous than classical CAN and CAN XL for frames with payloads of up to 8 bytes, despite the fact that CAN XL supports higher bit rates. For frames with 12-64 byte payloads, CAN FD performs better than CAN XL when running at the same bit rate, but CAN XL performs better when running at a higher bit rate. Additionally, we found that CAN XL performs better than classical CAN and CAN FD when the frame payload is over 64 bytes, even when it runs at the same or higher bit rate than CAN FD.
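As background, the sketch below computes the standard worst-case transmission time of a classical CAN frame (including worst-case bit stuffing), which is the kind of per-frame quantity a response-time analysis builds on. The CAN FD/XL counterparts additionally account for the dual bit rate (arbitration vs. data phase) and different stuffing/CRC rules and are not reproduced here.

```python
# Classical CAN worst-case transmission time with worst-case bit stuffing
# (standard formulation); CAN FD/XL extensions with dual bit rates are omitted.

def can_wctt(payload_bytes: int, bit_rate_bps: float, extended_id: bool = False) -> float:
    """Worst-case transmission time of one classical CAN frame, in seconds."""
    g = 54 if extended_id else 34            # protocol overhead bits subject to stuffing
    tau = 1.0 / bit_rate_bps                 # duration of one bit
    bits = g + 8 * payload_bytes + 13 + (g + 8 * payload_bytes - 1) // 4
    return bits * tau

print(can_wctt(8, 500e3) * 1e6, "us")        # 8-byte frame at 500 kbit/s -> 270 us
```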
The Time-Sensitive Networking (TSN) amendments and protocols add capabilities on top of standard 802.1 Ethernet for guaranteeing the timeliness of both (isochronous) Scheduled Traffic (ST) and shaped audio-video (AVB) communication in distributed applications. ST streams are guaranteed via an offline-computed schedule controlling the time-aware gate mechanism of IEEE 802.1Qbv, while AVB real-time streams are shaped via a credit-based shaper (CBS) and scheduled with lower priority than ST. Although the two traffic classes use different TSN mechanisms, they are interrelated, as the ST schedule influences the latency of AVB traffic. In this paper, we propose a method for integrating ST schedule synthesis with an analysis of the AVB class featuring IEEE 802.1Qbu frame preemption under different configurations, in order to reduce the interference between the two classes. We first present a new worst-case response-time (WCRT) analysis for the AVB traffic class in TSN networks with preemption, considering an arbitrary number of AVB queues and different configurations of the CBS credit behavior. Then, we integrate the creation of ST schedule tables with the schedulability analysis of AVB traffic using a heuristic algorithm featuring frame preemption and a novel routing mechanism aimed at maximizing AVB schedulability. Finally, we evaluate our approach on both real-world and synthetic use cases, showing its efficiency in terms of schedule creation runtime as well as in increasing the schedulability of lower-priority AVB traffic.
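To make the CBS credit behavior that the WCRT analysis reasons about concrete, here is a small time-stepped simulation sketch: credit accumulates at idleSlope while AVB frames wait or credit is negative, drains at sendSlope during their transmission, and a queued frame may only start once credit is non-negative. The discretized model and the slope/frame values are illustrative simplifications of IEEE 802.1Qav, not the paper's analysis.

```python
# Illustrative time-stepped credit-based shaper (CBS) model; slopes are given as
# fractions of the link rate and credit is tracked in seconds of link time.
def cbs_trace(frame_tx_times, idle_slope, send_slope, dt=1e-6, horizon=2e-3):
    credit, t, queue, trace = 0.0, 0.0, list(frame_tx_times), []
    remaining = 0.0                            # remaining tx time of the frame in service
    while t < horizon:
        if remaining > 0:                      # transmitting: sendSlope (< 0) applies
            credit += send_slope * dt
            remaining -= dt
        elif queue and credit >= 0:            # eligible: start the next queued frame
            remaining = queue.pop(0)
        elif queue or credit < 0:              # waiting or recovering: idleSlope applies
            credit += idle_slope * dt
        else:                                  # idle with positive credit: reset to zero
            credit = 0.0
        trace.append((t, credit))
        t += dt
    return trace

# Two 125 us AVB frames on a link where the AVB class is allotted 30% bandwidth.
trace = cbs_trace([125e-6, 125e-6], idle_slope=0.3, send_slope=-0.7)
print(min(credit for _, credit in trace))      # most negative credit reached
```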
The performance of microprocessors under many modern workloads is mainly limited by the off-chip memory bandwidth. The emerging processing-in-memory paradigm presents a unique opportunity to reduce data movement overheads by moving computation closer to memory. State-of-the-art processing-in-memory proposals stack a logic layer on top of one or multiple memory layers in a 3D fashion and leverage the logic layer to build near-memory processing units, which are either application-specific accelerators or general-purpose cores. In this paper, we present NeuroPIM, a new processing-in-memory architecture that uses a neural network as the memory-side general-purpose accelerator. This design is mainly motivated by the observation that in many real-world applications, some program regions, or even the entire program, can be replaced by a neural network trained to approximate the program's output. NeuroPIM benefits from both the flexibility of general-purpose processors and the superior performance of application-specific accelerators. Experimental results show that NeuroPIM provides up to 41% speedup over a processor-side neural network accelerator and up to 8x speedup over a general-purpose processor.
Non-volatile metal-oxide resistive random access memory (ReRAM) is an emerging alternative to current memory technologies. The unique capability of ReRAM to perform analog and digital arithmetic and logic operations enables this technology to incorporate both computation and memory on the same unit. Due to this interesting property, there has been a growing trend in recent years to implement emerging data-intensive applications on ReRAM structures. A typical ReRAM-based processing-in-memory architecture may consist of tens to hundreds of ReRAM units (mats) that can either store or process data. To support such a large-scale ReRAM structure, this paper proposes a scalable network-on-ReRAM architecture. The proposed network employs a novel associative router architecture designed around ReRAM-based content-addressable memories. With its in-memory packet processing capability, this router yields higher throughput and resource utilization than a conventional router. The router is technology-compatible with ReRAM and, as our evaluations show, employing it to build a network-on-ReRAM makes the emerging ReRAM-based processing-in-memory architectures more scalable and performance-efficient.
This book presents and discusses innovative ideas in the design, modelling, implementation, and optimization of hardware platforms for neural networks. The rapid growth of server, desktop, and embedded applications based on deep learning has brought about a renaissance of interest in neural networks, with applications including image and speech processing, data analytics, robotics, healthcare monitoring, and IoT solutions. Efficient implementation of neural networks to support complex deep learning-based applications is a significant challenge for embedded and mobile computing platforms with limited computational/storage resources and a tight power budget. Even for cloud-scale systems, it is critical to select the right hardware configuration based on the neural network complexity and system constraints in order to increase power and performance efficiency. Hardware Architectures for Deep Learning provides an overview of this new field, from principles to applications, for researchers, postgraduate students, and engineers who work on learning-based services and hardware platforms.
For the past three decades, interconnection networks have been developed based on two major theories, one by Dally and the other by Duato. In this article, we introduce EbDa, a simplified theoretical basis that directly allows designing an acyclic channel dependency graph and verifying routing algorithms for freedom from deadlock. EbDa is composed of three theorems that enable extracting all allowable turns without dealing with turn models.
Freedom from deadlock is one of the most important issues when designing routing algorithms for on-chip/off-chip networks. Many works have built upon Dally's theory, which proves that a network is deadlock-free if there is no cyclic dependency in the channel dependency graph. However, finding such an acyclic graph has been very challenging, which limits Dally's theory to networks with a low number of channels. In this paper, we introduce three theorems that directly lead to routing algorithms with an acyclic channel dependency graph. We also propose a partitioning methodology, enabling a design to reach the maximum adaptiveness for the n-dimensional mesh and k-ary n-cube topologies with any given number of channels. In addition, deadlock-free routing algorithms can be derived, ranging from maximally fully adaptive routing down to deterministic routing. The proposed theorems can drastically remove the difficulties of designing deadlock-free routing algorithms.
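A small sketch of the core check behind Dally's condition: model channels as nodes of a channel dependency graph (CDG), add an edge wherever the routing function may hold one channel while requesting another, and test the graph for cycles. The example edges below are illustrative, not taken from the paper.

```python
# Cycle detection on a channel dependency graph (Dally's deadlock-freedom check).
from collections import defaultdict

def has_cycle(edges) -> bool:
    graph = defaultdict(list)
    for a, b in edges:
        graph[a].append(b)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)

    def dfs(u) -> bool:
        color[u] = GRAY
        for v in graph[u]:
            if color[v] == GRAY or (color[v] == WHITE and dfs(v)):
                return True                      # back edge found: cyclic dependency
        color[u] = BLACK
        return False

    return any(color[u] == WHITE and dfs(u) for u in list(graph))

# XY routing never turns from a Y channel back to an X channel, so its CDG is
# acyclic; adding such turns (hypothetical channel names) introduces a cycle.
xy_edges = [("x01", "y13"), ("x23", "y20")]
print(has_cycle(xy_edges))                                                     # False
print(has_cycle(xy_edges + [("y13", "x32"), ("x32", "y20"), ("y20", "x01")]))  # True
```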
Due to the high cost of 3D IC process technology, the semiconductor industry is targeting 2.5D ICs with interposers as a fast and low-cost alternative for integrating dissimilar technologies. In this paper, we propose an independent network-on-chip die, called Network-on-Die (NoD), for 2.5D ICs, which operates as a communication backbone for heterogeneous many-core systems on an interposer. The NoD is responsible for routing packets from a source router to a destination router, while the connections between routers and cores pass through the interposer. This technique eliminates the complexity of routing algorithms in heterogeneous systems by turning the irregular NoC of 2.5D ICs into a regular, optimized one in the NoD. The performance evaluation is verified through RTL simulations for a heterogeneous many-core system with varying die sizes and asymmetric shapes. We provide theoretical justification for our simulation results.
Deep Learning (DL) has recently become a topic of study in different applications including healthcare, in which timely detection of anomalies in the Electrocardiogram (ECG) can play a vital role in patient monitoring. This paper presents a comprehensive review of recent DL methods applied to ECG signal classification. The study considers various types of DL methods, such as Convolutional Neural Networks (CNNs), Deep Belief Networks (DBNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), and Gated Recurrent Units (GRUs). Among the 75 studies reported in 2017 and 2018, CNN is dominantly observed as the technique of choice for feature extraction, appearing in 52% of the studies. DL methods showed high accuracy in the correct classification of Atrial Fibrillation (AF) (100%), Supraventricular Ectopic Beats (SVEB) (99.8%), and Ventricular Ectopic Beats (VEB) (99.7%) using GRU/LSTM, CNN, and LSTM, respectively.
Solving Integer Linear Programming (ILP) models generally lies in the category of NP-hard problems. Therefore, as the size of ILP models grows, the efficiency of exact algorithms for solving them drops significantly, and for large models obtaining a result becomes infeasible. The Genetic Algorithm (GA) is a metaheuristic capable of adjusting and redesigning its parameters and operations according to the characteristics of ILP models. Still, GA faces a huge search space for large models, and parallelization is a suitable technique to tackle this problem. This paper presents a scalable parallel GA to solve large ILP models derived from the behavioral synthesis of digital circuits. We show that although the models contain non-binary variables, binary variables alone are sufficient for encoding the chromosomes. We also use 'unknown' values for some genes to decrease the likelihood of inconsistency in the encoded constraints (see the sketch below). Our experiments verify the efficiency and scalability of the proposed algorithm on multicore platforms. The proposed method outperforms IBM ILOG CPLEX 12.6 and the MI-LXPM algorithm on ILP models with 550 to 2258 integer/binary decision variables. The results also indicate that the saturation point for parallel processing elements when solving the large ILP models is at least 60.
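A toy sketch of the described encoding: each gene is 0, 1, or an explicit 'unknown' placeholder, and a constraint check treats unknown genes optimistically so that partially specified chromosomes are not rejected prematurely. The constraint form and probabilities are illustrative assumptions, not the paper's exact encoding.

```python
# Illustrative chromosome encoding with binary genes plus an UNKNOWN placeholder.
import random

UNKNOWN = -1

def random_chromosome(n_binary_vars: int, p_unknown: float = 0.2):
    return [UNKNOWN if random.random() < p_unknown else random.randint(0, 1)
            for _ in range(n_binary_vars)]

def satisfies(chromosome, constraint):
    """constraint: (coeffs, bound) meaning sum(c_i * x_i) <= bound.
    UNKNOWN genes contribute their best case, so only definite violations fail."""
    coeffs, bound = constraint
    known = sum(c * g for c, g in zip(coeffs, chromosome) if g != UNKNOWN)
    best_unknown = sum(min(c, 0) for c, g in zip(coeffs, chromosome) if g == UNKNOWN)
    return known + best_unknown <= bound

chrom = random_chromosome(10)
print(chrom, satisfies(chrom, ([1] * 10, 6)))
```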
Solving Integer Linear Programming (ILP) models generally lies in the category of NP-hard problems, and finding the optimal answer for large models is a computational challenge. Genetic algorithms are a family of metaheuristics capable of adjusting and redesigning their parameters and operations according to the characteristics of ILP models. On the other hand, a genetic algorithm still performs a large number of operations to solve large models, and parallel processing is a suitable technique to tackle this problem. This paper introduces an LP-relaxation-based parallel genetic algorithm that uses a population-based incremental learning technique to provide a scalable solver for large ILP models derived from the behavioral synthesis of digital circuits. In the proposed algorithm, each chromosome represents a subspace of possible solutions, and each generation is produced based on a probability vector as well as elitism. Our experiments verify the efficiency of the proposed algorithm on multicore platforms, as it outperformed four previous genetic algorithms for solving mixed integer programming problems. The proposed genetic algorithm solved 20 ILP models with up to 5183 integer/binary decision variables in less than 20 minutes using four 16-core AMD Opteron 6386 SE processors. The results also indicate that, for models with more than 4000 variables, the speedup and efficiency of the proposed parallel genetic algorithm on 60 CPU cores exceed 18x and 30%, respectively.
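The population-based incremental learning step referred to above can be sketched as follows: each generation is sampled from a probability vector over the binary genes, and the vector is then pulled toward the elite individual. The toy fitness function and parameter values are stand-ins; the paper combines this with LP relaxation and parallel evaluation.

```python
# Generic population-based incremental learning (PBIL) step; fitness is a stand-in.
import numpy as np

rng = np.random.default_rng(0)
n_vars, pop_size, learn_rate = 30, 40, 0.1
prob = np.full(n_vars, 0.5)                     # P(gene = 1) for each binary variable

def fitness(x):                                 # stand-in objective (maximize ones)
    return x.sum()

for generation in range(50):
    population = (rng.random((pop_size, n_vars)) < prob).astype(int)
    elite = population[np.argmax([fitness(ind) for ind in population])]
    prob = (1 - learn_rate) * prob + learn_rate * elite   # PBIL probability update
print(prob.round(2))
```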
Parallel hardware accelerators for large-scale neural networks typically consist of several processing nodes, arranged as a multi- or many-core system-on-chip and connected by a network-on-chip (NoC). Recent proposals also benefit from emerging 3D memory-on-logic architectures to provide sufficient bandwidth for neural networks. Handling the heavy traffic between neurons and memory, as well as the multicast-based inter-neuron traffic, which often varies over time, is the most challenging design consideration for the network-on-chip in such accelerators. To address these issues, this paper presents a reconfigurable network-on-chip architecture for 3D memory-on-logic neural network accelerators. The reconfigurable NoC can adapt its topology to the on-chip traffic patterns and can also be configured as a tree-like structure to support the multicast-based neuron-to-neuron and memory-to-neuron traffic of neural networks. The evaluation results show that the proposed architecture manages the multicast-based traffic of neural networks better than several state-of-the-art topologies and considerably increases throughput and power efficiency.
In this paper, we discuss the challenges of using neural networks (NNs) in safety-critical applications. We address the challenges one by one, with aviation safety in mind, and then introduce a possible implementation to overcome them. Only a small portion of the solution has been implemented physically, and much work remains as future work. Our current understanding is that a real implementation in a safety-critical system would be extremely difficult: firstly, in designing the intended function of the NN, and secondly, in designing the monitors needed to achieve deterministic and fail-safe behavior of the system. We conclude that only the most valuable applications of NNs should be considered meaningful to implement in safety-critical systems.
Time-Sensitive Networking (TSN) is a set of ongoing projects within the IEEE standardization that guarantee timeliness and low-latency communication based on switched Ethernet for industrial applications. The huge demand mainly comes from industries where intensive data transmission is required, such as modern vehicles in which cameras, lidars, and other high-bandwidth sensors are connected. The TSN standards evolve over time, hence the hardware needs to change along with the modifications. In addition, high-performance hardware is required to obtain the full benefit of the standards. In this paper, we present a research plan for developing novel techniques to support a parameterized and modular hardware IP core of a multi-stage TSN switch fabric in the VHSIC (Very High Speed Integrated Circuit) Hardware Description Language (VHDL), which can be deployed on any Field-Programmable Gate Array (FPGA) device. We also present the challenges on the way towards this goal.
In this chapter, we present ClosNN, a specialized NoC for neural networks based on the well-known Clos topology. Clos is perhaps the most popular Multistage Interconnection Network (MIN) topology and is commonly used as the basis of switching infrastructures in various commercial telecommunication and network routers and switches.
This paper presents a schedulability analysis for the Best-Effort (BE) traffic class in Time-Sensitive Networking (TSN) networks. The presented analysis considers several features of the TSN standards, including the Credit-Based Shaper (CBS), the Time-Aware Shaper (TAS), and frame preemption. Although the BE class in TSN is primarily used for traffic with no strict timing requirements, some industrial applications prefer to utilize this class for non-hard real-time traffic instead of the classes that use the CBS. The reason mainly lies in the fact that the complexity of TSN configuration becomes significantly high when the time-triggered traffic via the TAS and the other classes via the CBS are used together. We demonstrate the applicability of the presented analysis on a vehicular application use case. We show that a network designer can obtain information on the schedulability of the BE traffic, based on which the network configuration can be further refined with respect to the application requirements.
In this paper, we present a bandwidth reservation analysis for Audio-Video Bridging (AVB) traffic in the Time-Sensitive Networking (TSN) standards. The proposed analysis is based on the existing worst-case response-time analysis and can be used to calculate the minimum bandwidth required to guarantee the schedulability of messages in the AVB classes. The proposed analysis allocates per-link bandwidth to AVB traffic that is sufficient to ensure its schedulability when a combination of the Credit-Based Shaper and Time-Aware Shaper mechanisms is used. We evaluate the proposed analysis on an automotive industrial use case, comparing the schedulability of AVB traffic under the proposed analysis with the utilization-based bandwidth reservation recommended by the TSN standards.
In this paper, we identify that the existing end-to-end data propagation delay analysis for distributed embedded systems can produce pessimistic (over-estimated) results when the nodes are synchronized. This is particularly the case for the Scheduled Traffic (ST) class in Time-Sensitive Networking (TSN), which is scheduled offline according to the IEEE 802.1Qbv standard while the nodes are synchronized according to the IEEE 802.1AS standard. We present a comprehensive system model for distributed embedded systems that incorporates all of the above-mentioned aspects as well as all traffic classes in TSN. We extend the analysis to support both synchronized and non-synchronized ECUs as well as offline schedules on the networks. The extended analysis can now be used to analyze all traffic classes in TSN when the nodes are synchronized, without introducing any pessimism in the analysis results. We evaluate the proposed model and the extended analysis on a vehicular industrial use case.
End-to-end data-propagation delay analysis allows verification of important timing constraints, such as age and reaction, that are often specified on chains of tasks and messages in real-time systems. We identify that the existing analysis does not support distributed task chains that include Time-Sensitive Networking (TSN) messages. To this end, this paper extends the existing analysis to allow the end-to-end timing analysis of distributed task chains that include TSN messages. The extended analysis supports all types of traffic in TSN, including the Scheduled Traffic (ST), Audio Video Bridging (AVB), and Best-Effort (BE) traffic. Furthermore, the extended analysis accounts for the synchronization among the end stations that are connected via TSN. The applicability of the analysis is demonstrated using an automotive application case study.
The IEEE Time-Sensitive Networking (TSN) standards' amendment 802.1Qbv provides real-time guarantees for Scheduled Traffic (ST) streams through the Time-Aware Shaper (TAS) mechanism. In this paper, we develop offline schedule optimization objective functions for configuring the TAS for ST streams that are effective in achieving a high Quality of Service (QoS) for lower-priority Best-Effort (BE) traffic. This becomes useful if real-time streams from legacy protocols are configured to be carried by the BE class, or if the BE class is used for value-added (but non-critical) services. We present three alternative objective functions, namely Maximization, Sparse, and Evenly Sparse, together with a set of constraints on ST streams. Based on simulated stream traces in the OMNeT++/INET TSN NeSTiNg simulator, we compare our proposed schemes with the most commonly applied objective function in terms of the overall maximum end-to-end delay and deadline misses of BE streams. The results confirm that changing the schedule synthesis objective to our proposed schemes ensures timely delivery and lower end-to-end delays for BE streams.
This paper investigates the effects of various parameters of high-priority traffic classes on the Best-Effort (BE) traffic in networks based on the IEEE Time-Sensitive Networking (TSN) standards. In this regard, the paper discusses ongoing work and presents preliminary results using a TSN simulator. The results indicate that several parameters of the high-priority traffic, such as periods, offsets, and preemption modes, can have a significant impact on the quality of service (e.g., guaranteed message delivery and message delays) of the BE traffic.
In this paper, we present an end-to-end timing model to capture timing information from software architectures of distributed embedded systems that use network communication based on the Time-Sensitive Networking (TSN) standards. Such a model is required as an input to perform end-to-end timing analysis of these systems. Furthermore, we present a methodology that aims at automated extraction of instances of the end-to-end timing model from component-based software architectures of the systems and the TSN network configurations. As a proof of concept, we implement the proposed end-to-end timing model and the extraction methodology in the Rubus Component Model (RCM) and its tool chain Rubus-ICE that are used in the vehicle industry. We demonstrate the usability of the proposed model and methodology by modeling a vehicular industrial use case and performing its timing analysis.
Extraction of end-to-end timing information from software architectures of vehicular systems to support their timing analysis is a daunting challenge. To address this challenge, this paper presents a systematic method to extract this information from vehicular software architectures that can be distributed over several electronic control units connected by Time-Sensitive Networking (TSN) networks. As a proof of concept, the proposed extraction method is applied to an industrial component model, namely the Rubus Component Model (RCM), and its toolchain. Furthermore, the usability of the proposed method is demonstrated in an industrial use case from the vehicular domain.
Designing and simulating large networks based on the Time-Sensitive Networking (TSN) standards requires complex and demanding configuration in the design and pre-simulation phases. The existing configuration and simulation frameworks support only manual configuration of TSN networks, which hampers their applicability to large TSN networks, especially in complex industrial embedded system applications. This paper proposes a modular framework that automates offline scheduling in TSN networks, facilitating automated network configuration at design time and before simulation, as well as interpretation of the simulation results. To demonstrate and evaluate the applicability of the proposed framework, a large TSN network is automatically configured and its performance is evaluated by measuring the end-to-end delays of time-critical flows in a state-of-the-art simulation framework, namely NeSTiNg.
Protocols enable things to connect and communicate, thus making the Internet of Things possible. The performance aspects of Internet of Things protocols, vital to their widespread utilization, have received much attention. However, one aspect of IoT protocols that is essential to their adoption in the real world is a protocol's feature set. Comparative analyses based on competing features and properties are rarely, if ever, discussed in the literature. In this paper, we define 19 attributes in 5 categories that are essential for IoT stakeholders to consider. These attributes are then used to contrast four IoT protocols: MQTT, HTTP, CoAP, and XMPP. Furthermore, we discuss scenarios where an assessment based on comparative strengths and weaknesses would be beneficial. The provided comparison model can be easily extended to include protocols like MQTT-SN, AMQP, and DDS.
In machine learning systems, several factors impact the performance of a trained model. The most important ones include the model architecture, the amount of training time, and the dataset size and diversity. In the realm of safety-critical machine learning, the datasets used need to reflect the environment in which the system is intended to operate, in order to minimize the generalization gap between training and real-world inputs. Datasets should be thoroughly prepared, and requirements on the properties and characteristics of the collected data need to be specified. In our work, we present a case study in which a synthetic dataset is generated based on real-world flight data from the ADS-B system, containing thousands of approaches to several airports, to identify real-world statistical distributions of relevant variables to vary within our dataset sampling space. We also investigate the effects of training a model on synthetic data to different extents, including training on translated image sets (using domain adaptation). Our results indicate that airport location is the most critical parameter to vary. We also conclude that all experiments benefited in performance from pre-training on synthetic data rather than using only real data; however, this did not hold true in general for images translated via domain adaptation.
In machine learning systems, several factors impact the performance of a trained model. The most important ones include the model architecture, the amount of training time, and the dataset size and diversity. We present a method for analyzing datasets from a use-case scenario perspective, detecting and quantifying out-of-distribution (OOD) data at the dataset level. Our main contribution is the novel use of similarity metrics for evaluating the robustness of a model, by introducing relative Fréchet Inception Distance (FID) and relative Kernel Inception Distance (KID) measures. These measures are relative to a baseline in-distribution dataset and are used to estimate how the model will perform on OOD data (i.e., to estimate the model's accuracy drop). We find a correlation between our proposed relative FID/relative KID measures and the drop in Average Precision (AP) accuracy on unseen data.
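As a rough sketch of the kind of measure involved, the snippet below computes the standard FID between two feature sets and a "relative" value normalized by a baseline in-distribution split. The exact definition of the relative FID/KID in the paper is not restated here, so the ratio-based normalization and the synthetic features are illustrative assumptions.

```python
# FID between two feature sets, plus an assumed "relative FID" normalization.
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a, cov_b = np.cov(feats_a, rowvar=False), np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(np.sum((mu_a - mu_b) ** 2) + np.trace(cov_a + cov_b - 2 * covmean))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, (500, 64))       # in-distribution reference features
held_out = rng.normal(0, 1, (500, 64))       # another in-distribution split
candidate = rng.normal(0.5, 1.2, (500, 64))  # suspected out-of-distribution features

relative_fid = fid(candidate, baseline) / fid(held_out, baseline)
print(round(relative_fid, 2))                # >> 1 suggests OOD data
```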
Nowadays, many modern applications, e.g., autonomous systems and cloud data services, need to capture and process a large amount of raw data at runtime, which ultimately necessitates a high-performance computing model. Deep Neural Networks (DNNs) have already revealed their learning capabilities in runtime data processing for modern applications. However, DNNs are becoming deeper and more sophisticated to gain higher accuracy, which requires remarkable computing capacity. Relying on high-performance cloud infrastructure to supply the required computational throughput is often not feasible. Instead, we aim for a near-sensor processing solution, which lowers the need for network bandwidth, increases privacy and power efficiency, and allows guaranteeing worst-case response times. Toward this goal, we introduce the ADONN framework, which aims to automatically design a highly robust DNN architecture for embedded devices as the processing unit closest to the sensors. ADONN adroitly searches the design space to find improved neural architectures. Our proposed framework takes advantage of a multi-objective evolutionary approach that exploits a pruned design space inspired by a dense architecture. Unlike recent works that have mainly tried to generate highly accurate networks, ADONN also considers network size as a second objective, to build a highly optimized network that fits limited computational resource budgets while delivering a comparable accuracy level. In comparison with the best result on the CIFAR-10 dataset, a network generated by ADONN achieves up to a 26.4x compression rate while losing only 4% accuracy. In addition, ADONN maps the generated DNN onto commodity programmable devices, including an ARM processor, a high-performance CPU, a GPU, and an FPGA.
Autonomous vehicles have a great influence on our lives. These vehicles are more convenient and more energy efficient, provide a higher safety level, and offer cheaper driving solutions. Decreasing CO2 emissions and the risk of vehicular accidents are further benefits of autonomous vehicles. However, realizing a fully autonomous system is challenging, and the proposed solutions are still in their infancy. Providing a testbed for evaluating new algorithms helps researchers and hardware developers verify the real impact of their solutions, and such a testing environment offers low-cost infrastructure that improves the time-to-market of novel ideas. In this paper, we propose Auto Rio, a cutting-edge indoor testbed for developing autonomous vehicles.
Ternary Neural Networks (TNNs) compress network weights and activations into a 2-bit representation, resulting in remarkable network compression and energy efficiency. However, there remains a significant accuracy gap between TNNs and their full-precision counterparts. Recent advances in Neural Architecture Search (NAS) promise opportunities for automated optimization of various deep learning tasks, but this area remains unexplored for optimizing TNNs. This paper proposes TAS, a framework that drastically reduces the accuracy gap between TNNs and their full-precision counterparts by integrating quantization into the network design. We found that directly applying NAS to the ternary domain yields accuracy degradation, as the search settings are customized for full-precision networks. To address this problem, we propose (i) a new cell template for ternary networks with maximum gradient propagation, and (ii) a novel learnable quantizer that adaptively relaxes the ternarization mechanism based on the distribution of the weights and activations. Experimental results reveal that TAS delivers 2.64% higher accuracy and 2.8x memory savings over competing methods with the same bit-width resolution on the CIFAR-10 dataset. These results suggest that TAS is an effective method that paves the way for the efficient design of the next generation of quantized neural networks.
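A generic sketch of a learnable ternary quantizer (not TAS's exact relaxation): weights are mapped to {-alpha, 0, +alpha} using a magnitude threshold, with a straight-through estimator so that gradients reach both the underlying weights and the learnable scale. The 0.7 * mean(|w|) threshold is a common heuristic assumed here for illustration.

```python
# Illustrative learnable ternary quantizer with a straight-through estimator.
import torch
import torch.nn as nn

class LearnableTernary(nn.Module):
    def __init__(self, init_scale: float = 1.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(init_scale))      # learnable scale

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        threshold = 0.7 * w.abs().mean()                          # assumed threshold rule
        ternary = torch.sign(w) * (w.abs() > threshold).float()   # values in {-1, 0, +1}
        ternary_ste = ternary.detach() + w - w.detach()           # straight-through estimator
        return self.alpha * ternary_ste                           # {-alpha, 0, +alpha}

quantizer = LearnableTernary()
w = torch.randn(4, 4, requires_grad=True)
quantizer(w).sum().backward()
print(w.grad.unique(), quantizer.alpha.grad)   # gradients reach both w and alpha
```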
Deep Neural Networks (DNNs) are compute-intensive learning models with growing applicability in a wide range of domains. Due to their computational complexity, DNNs benefit from implementations that utilize custom hardware accelerators to meet performance and response-time as well as classification accuracy constraints. In this paper, we propose the DeepMaker framework, which aims to automatically design a set of highly robust DNN architectures for embedded devices as the processing units closest to the sensors. DeepMaker explores and prunes the design space to find improved neural architectures. Our proposed framework takes advantage of a multi-objective evolutionary approach that exploits a pruned design space inspired by a dense architecture. DeepMaker considers accuracy along with network size as two objectives to build a highly optimized network that fits limited computational resource budgets while delivering an acceptable accuracy level. In comparison with the best result on the CIFAR-10 dataset, a network generated by DeepMaker achieves up to a 26.4x compression rate while losing only 4% accuracy. In addition, DeepMaker maps the generated CNN onto programmable commodity devices, including an ARM processor, a high-performance CPU, a GPU, and an FPGA.
Stereo vision cameras are flexible sensors, as they provide heterogeneous information such as color, luminance, disparity map (depth), and the shape of objects. Today, Convolutional Neural Networks (CNNs) achieve the highest accuracy for disparity map estimation [1]. However, CNNs require considerable computing capacity to process billions of floating-point operations in real time. Moreover, commercial stereo cameras produce very large images (e.g., 10 megapixels [2]), which imposes additional computational cost on the system. The problem becomes more pronounced when targeting resource-limited hardware for the implementation. In this paper, we propose DenseDisp, an automatic framework that designs a Siamese neural architecture for disparity map estimation in a reasonable time. DenseDisp leverages a meta-heuristic multi-objective exploration to discover hardware-friendly architectures by considering accuracy and network FLOPS as the optimization objectives. We explore the design space with four different fitness functions to improve the accuracy-FLOPS trade-off and convergence time of DenseDisp. According to the experimental results, DenseDisp provides up to a 39.1x compression rate while losing around 5% accuracy compared to the state-of-the-art results.