The increased proliferation of automotive systems is leading to a paradigm shift in the automotive system architecture. Several, now distributed, applications will be consolidated on fewer, more powerful platforms, containing tens or hundreds of compute cores. Clustered many-core processors are a promising candidate for such systems, since each cluster provides enough computational power to host complex applications, while their intrinsic hardware architecture isolates different clusters from each other. The described PhD project works towards methods that allow the consolidation of automotive applications on clustered many-core architectures, while all their timing requirements are maintained. A contention-free execution framework is proposed that successfully diminishes the access-delays due to contention on shared resources within a cluster. In order to integrate complex end-to-end constraints on multi-rate chains, a method is proposed that allows the analysis of such chains and generates job-level dependencies. Such job-level dependencies can then be used to integrate the end-to-end constraints into the proposed execution framework. The applicability of the proposed methods to industrial problems is demonstrated via industrial case studies.
Automotive systems have transitioned from basic transportation utilities to sophisticated systems. The rapid increase in functionality comes along with a steep increase in software complexity. This manifests itself in a surge of the number of functionalities as well as the complexity of existing functions. To cope with this transition, current trends shift away from today’s distributed architectures towards integrated architectures, where previously distributed functionality is consolidated on fewer, more powerful, computers. This can ease the integration process, reduce the hardware complexity, and ultimately save costs.
One promising hardware platform for these powerful embedded computers is the many-core processor. A many-core processor hosts a vast number of compute cores, that are partitioned on tiles which are connected by a Network-on-Chip. These natural partitions can provide exclusive execution spaces for different applications, since most resources are not shared among them. Hence, natural building blocks towards temporally and spatially separated execution spaces exist as a result of the hardware architecture.
In addition to the traditional task-local deadlines, automotive applications are often subject to timing constraints on the data propagation through a chain of semantically related tasks. Such requirements pose challenges to system designers, as they can verify them only after the system synthesis (i.e., very late in the design process).
In this thesis, we present methods that transform complex timing constraints on the data propagation delay to precedence constraints between individual jobs. An execution framework for the cluster of the many-core is proposed that allows access to cluster external memory while it avoids contention on shared resources by design. A partitioning and configuration of the Network-on-Chip provides isolation between the different applications and reduces the access time from the clusters to external memory. Moreover, methods that facilitate the verification of data propagation delays in each development step are provided.
The increased complexity of today’s industrial embedded systems stands in need for more computational power while most systems must adhere to a restricted energy consumption, either to prolong the battery lifetime or to reduce operational costs. The many-core processor is therefore a natural fit. Due to the simple architecture of the compute cores, and therefore their good analyzability, such processors are additionally well suited for real-time applications. In our research, we focus on two particular problems which need to be addressed in order to pave the way into the many-core era. The first area is power and thermal aware execution frameworks, where we present different energy aware extensions to well known load balancing algorithms, allowing them to dynamically scale the number of active cores depending on their workload. In contrast, an additional framework is presented which balances workloads to minimize temperature gradients on the die. The second line of works focuses on industrial standards in the face of massively parallel platforms, where we address the automotive and automation domain. We present an execution framework for IEC 61131-3 applications, allowing the consolidation of several IEC 61131-3 applications on the same platform. Additionally, we discuss several architectural options for the AUTOSAR software architecture on such massively parallel platforms.
As of today, AUTOSAR is the de facto standard in the automotive industry, providing a common software architecture and development process for automotive applications. While this standard is originally written for singlecore operated Electronic Control Units (ECU), new guidelines and recommendations have been added recently to provide support for multicore architectures. This update came as a response to the steady increase of the number and complexity of the software functions embedded in modern vehicles, which call for the computing power of multicore execution environments. In this paper, we enumerate and analyze the design options and the challenges of porting AUTOSAR-based automotive applications onto multicore platforms. In particular, we investigate those options when considering the emerging many-core architectures that provide a more 'scalable' environment than the traditional multicore systems. Such platforms are suitable to enable massive parallel execution, and their design is more suitable for partitioning and isolating the software components.
Software design for automotive systems is highly complex due to the presence of strict data age constraints for event chains in addition to task-specific requirements. These age constraints define the maximum time for the propagation of data through an event chain consisting of independently triggered tasks. Tasks in event chains can have different periods, introducing over- and under-sampling effects, which additionally aggravate their timing analysis. Furthermore, different functionality in these systems is developed by different suppliers before the final system integration on the ECU. The software itself is developed in a hardware-agnostic manner, and this uncertainty and limited information at the early design phases may not allow effective analysis of end-to-end delays during that phase. In this paper, we present a method to compute end-to-end delays given the information available in the design phases, thereby enabling timing analysis throughout the development process. The presented methods are evaluated with extensive experiments where the decreasing pessimism with increasing system information is shown.
Automotive embedded systems are subjected to stringent timing requirements that need to be verified. One of the most complex timing requirements in these systems is the data age constraint. This constraint is specified on cause-effect chains and restricts the maximum time for the propagation of data through the chain. Tasks in a cause-effect chain can have different activation patterns and different periods, which introduce over- and under-sampling effects that additionally aggravate the end-to-end timing analysis of the chain. Furthermore, the level of timing information available at various development stages (from modeling of the software architecture to the software implementation) varies considerably; the complete timing information is available only at the implementation stage. This uncertainty and limited timing information can restrict the end-to-end timing analysis of these chains. In this paper, we present methods to compute end-to-end delays based on different levels of system information. The characteristics of different communication semantics are further taken into account, thereby enabling timing analysis throughout the development process of such heterogeneous software systems. The presented methods are evaluated with extensive experiments. As a proof of concept, an industrial case study demonstrates the applicability of the proposed methods following a state-of-the-practice development process.
Many industrial embedded systems have timing constraints on the data propagation through a chain of independent tasks. These tasks can execute at different periods, which leads to under- and oversampling of data. In such situations, understanding and validating the temporal correctness of end-to-end delays is not trivial. Many industrial areas further face distributed development where different functionalities are integrated on the same platform after the development process. The large effect of scheduling decisions on the end-to-end delays can lead to expensive redesigns of software parts due to the lack of analysis at early design stages. Job-level dependencies are one solution for this challenge, and means of scheduling such systems are available. In this paper we present MECHAniSer, a tool targeting the early analysis of end-to-end delays in multi-rate cause-effect chains with specified job-level dependencies. The tool further provides the possibility to synthesize job-level dependencies for a set of cause-effect chains in a way such that all end-to-end requirements are met. The usability and applicability of the tool to industrial problems is demonstrated via a case study.
Today's automotive embedded systems comprise a multitude of functionalities, many with complex timing requirements. Besides task-specific timing requirements, such applications often have timing requirements for the propagation of data through a chain of tasks. An important metric for control applications is the data age, which is addressed in this work. The analysis of such systems is non-trivial because tasks involved in the data propagation may execute at different periods, which leads to over- and undersampling within one chain. This work presents a novel method to compute worst- and best-case end-to-end latencies for such systems. A second contribution synthesizes job-level dependencies for such task sets in a way that data paths which exceed the age constraint are eliminated. An extensive evaluation is performed on synthetic task sets and the applicability to industrial applications is demonstrated in a case study.
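To make the over- and undersampling effect concrete, the following Python sketch traces data backwards through a multi-rate chain and reports the largest age of the data at the instant the last task publishes its output. It is only an illustration under strong assumptions that are not taken from the abstract: all tasks are strictly periodic and released synchronously at time 0, and each job reads its input at its release and publishes its output at the end of its period (a logical-execution-time style of communication). The function name `max_age_at_output` is hypothetical, and the sketch is not the worst- and best-case analysis proposed in the paper.

```python
from math import lcm  # multi-argument lcm requires Python 3.9+


def max_age_at_output(periods):
    """Largest age of data at the publish instant of the last task.

    `periods` lists the task periods in chain order (integers).
    Assumptions (illustrative only): synchronous release at time 0;
    job k of a task with period T reads at k*T and publishes at (k+1)*T.
    """
    hyper = lcm(*periods)
    # Warm-up margin so that at least one full hyperperiod of jobs
    # of the last task has a complete data path back to the first task.
    horizon = 2 * sum(periods[:-1]) + hyper
    last = periods[-1]
    worst = None
    for j in range(horizon // last + 1):
        read_time = j * last  # release (= read instant) of job j
        complete = True
        # Walk upstream: find the latest job that has already published.
        for p in reversed(periods[:-1]):
            k = read_time // p - 1  # job k publishes at (k+1)*p <= read_time
            if k < 0:
                complete = False  # no upstream data available yet
                break
            read_time = k * p  # that job read its own input at its release
        if complete:
            age = (j + 1) * last - read_time
            worst = age if worst is None else max(worst, age)
    return worst


if __name__ == "__main__":
    print(max_age_at_output([5, 10]))   # fast producer, slow consumer
    print(max_age_at_output([10, 5]))   # slow producer, fast consumer
```

With a slow producer and a fast consumer, the same input is consumed by several consumer jobs, so the age grows until the producer publishes again; this is the oversampling case mentioned above.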
The majority of embedded control systems are modeled with several chains of independently triggered tasks, also known as multi-rate effect chains. These chains often have stringent end-to-end timing requirements that should be satisfied before running the system. MECHAniSer is one of the tools that support end-to-end timing analysis of such chains. In addition, the tool provides the possibility to synthesize job-level dependencies for these chains such that all end-to-end timing requirements are satisfied. In this paper we showcase an extension of MECHAniSer that supports the analysis of mixed chains that contain a mix of independent and dependent tasks.
As of today, AUTOSAR is the de facto standard in the automotive industry, providing a common software architecture and development process for automotive applications. While this standard is originally written for singlecore operated Electronic Control Units (ECU), new guidelines and recommendations have been added recently to provide support for multicore architectures. This update came as a response to the steady increase of the number and complexity of the software functions embedded in modern vehicles, which call for the computing power of multicore execution environments. In this paper, we enumerate and analyze the design options and the challenges of porting AUTOSAR-based automotive applications onto multicore platforms. In particular, we investigate those options when considering the emerging many-core architectures that provide a more scalable environment than the traditional multicore systems. Such platforms are suitable to enable massive parallel execution, and their design is more suitable for partitioning and isolating the software components.
Next generations of compute-intensive real-time applications in automotive systems will require more powerful computing platforms. One promising power-efficient solution for such applications is to use clustered many-core architectures. However, ensuring that real-time requirements are satisfied in the presence of contention in shared resources, such as memories, remains an open issue. This work presents a novel contention-free execution framework to execute automotive applications on such platforms. Privatization of memory banks together with defined access phases to shared memory resources is the backbone of the framework. An Integer Linear Programming (ILP) formulation is presented to find the optimal time-triggered schedule for the on-core execution as well as for the access to shared memory. Additionally a heuristic solution is presented that generates the schedule in a fraction of the time required by the ILP. Extensive evaluations show that the proposed heuristic performs only 0.5% away from the optimal solution while it outperforms a baseline heuristic by 67%. The applicability of the approach to industrially sized problems is demonstrated in a case study of a software for Engine Management Systems.
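As a rough illustration of the contention-free idea described above, the following Python sketch checks that no two shared-memory access phases overlap in a given time-triggered schedule. The tuple layout `(core, start, end)` and the function name are assumptions made for this example; the sketch only validates a schedule and is neither the ILP formulation nor the heuristic from the paper.

```python
def memory_phases_conflict_free(phases):
    """Return True if no two shared-memory access phases overlap in time.

    `phases` is a list of (core, start, end) tuples with start < end
    (hypothetical layout). Phases are treated as half-open intervals,
    so one phase may begin exactly when another one ends.
    """
    ordered = sorted(phases, key=lambda phase: phase[1])
    # After sorting by start time, any overlap shows up between neighbours.
    return all(prev[2] <= nxt[1] for prev, nxt in zip(ordered, ordered[1:]))


if __name__ == "__main__":
    ok = [(0, 0, 2), (1, 2, 5), (0, 5, 6)]
    clash = [(0, 0, 3), (1, 2, 5)]
    print(memory_phases_conflict_free(ok))     # phases are back to back
    print(memory_phases_conflict_free(clash))  # core 1 starts too early
```

A check of this kind could serve as a sanity test on the output of any schedule generator, independent of how the schedule was produced.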
Technological advances have increased the transistor density, thereby ushering in multi- and more recently many-core systems, distinguished by the presence of hundreds of cores on a single chip. For such a platform, the Network-on-Chip (NoC) has emerged as a scalable and efficient interconnect fabric to realize the communication across an ever increasing number of processor cores, memories, and specialized IP blocks both on- and off-chip. In this paper, we highlight some key problems in NoC-based architectures that must be addressed before the deployment of real-time applications onto these platforms becomes possible. A paradigm shift from function-centric to data- and communication-centric approaches is required. Combining hardware- and software-based flow-regulation seems to be the only way to ensure that NoCs go beyond the best-effort service and address the requirements of diverse applications.
Fixed Priority Scheduling (FPS) is the de facto standard in industry and it is the scheduling algorithm used in OSEK/AUTOSAR. Applications in such systems are compositions of so-called runnables, the functional entities of the system. Runnables are mapped to operating system tasks during system synthesis. In order to improve system performance it is proposed to execute runnables non-preemptively while varying the task's threshold between runnables. This allows simpler resource access, can reduce the stack usage of the system, and improve the schedulability of the task sets. FPDS, as a special case of fixed-priority scheduling with deferred preemptions, executes subjobs non-preemptively and preemption points have preemption thresholds, providing exactly the proposed behavior. However, OSEK/AUTOSAR-conform systems cannot execute such schedules. In this paper we present an approach allowing the execution of FPDS schedules. In our approach we exploit pseudo resources in order to implement FPDS. It is further shown that our optimal algorithm produces a minimum number of resource accesses. In addition, a simulation-based evaluation is presented in which the number of resource accesses as well as the number of required pseudo-resources by the proposed algorithms are investigated. Finally, we report the overhead of resource access primitives using our measurements performed on an AUTOSAR-compliant operating system.
Model-based development and component-based software engineering have emerged as a promising approach to deal with enormous software complexity in automotive systems. This approach supports the development of software architectures by interconnecting (and reusing) software components (SWCs) at various abstraction levels. Automotive software architectures are often modeled with chains of SWCs, also called cause-effect chains, that are constrained by timing requirements. Based on the variations in activation patterns of SWCs, a single model of a cause-effect chain at a higher abstraction level can conform to several valid refined models of the chain at a lower abstraction level, which is closer to the system implementation. As a consequence, the total number of valid implementation-level models generated by the existing techniques increases exponentially, thereby significantly increasing the runtime of the timing analysis engines and limiting the scalability of the existing techniques. This paper computes an upper bound on the activation pattern combinations that may result from a system of cause-effect chains in a given high-level model of the software architecture. An efficient algorithm is presented that traverses only a reduced number of possible combinations of the cause-effect chains, resulting in the timing analysis of a significantly lower number of implementation-level models of the software architecture. A proof of concept is provided by conducting a case study that shows a significant reduction in the runtime of timing analysis engines, i.e., the timing behavior of the considered system is verified by performing the timing analysis of only 27% of all possible combinations of the cause-effect chains.
Developing automotive software is becoming increasingly challenging due to the continuous increase in its size and complexity. The development challenge is amplified when the industrial requirements dictate extensions to the legacy (previously developed) automotive software while requiring the existing timing requirements to be met. To cope with these challenges, sufficient techniques and tooling to support the modeling and timing analysis of such systems at earlier development phases are needed. Within this context, we focus on the extension of software component chains in the software architectures of automotive legacy systems. Selecting the sampling frequency, i.e., the period, for newly added software components is crucial to meet the timing requirements of the chains. The challenges in selecting periods are identified. It is further shown how to automatically assign periods to software components, such that the end-to-end timing requirements are met while the runtime overhead is minimized. An industrial case study is presented that demonstrates the applicability of the proposed solution to industrial problems.
A majority of multi-rate real-time systems are constrained by a multitude of timing requirements, in addition to the traditional deadlines on well-studied response times. This means, the timing predictability of these systems not only depends on the schedulability of certain task sets but also on the timely propagation of data through the chains of tasks from sensors to actuators. In the automotive industry, four different timing constraints corresponding to various data propagation delays are commonly specified on the systems. This paper identifies and addresses the source of pessimism as well as optimism in the calculations for one such delay, namely the reaction delay, in the state-of-the-art analysis that is already implemented in several industrial tools. Furthermore, a generic framework is proposed to compute all the four end-to-end data propagation delays, complying with the established delay semantics, in a scheduler and hardware-agnostic manner. This allows analysis of the system models already at early development phases, where limited system information is present. The paper further introduces mechanisms to generate job-level dependencies, a partial ordering of jobs, which need to be satisfied by any execution platform in order to meet the data propagation timing requirements. The job-level dependencies are first added to all task chains of the system and then reduced to its minimum required set such that the job order is not affected. Moreover, a necessary schedulability test is provided, allowing for varying the number of CPUs. The experimental evaluations demonstrate the tightness in the reaction delay with the proposed framework as compared to the existing state-of-the-art and practice solutions.
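For intuition on what a necessary schedulability test with a varying number of CPUs can look like, the sketch below uses the classical utilization argument: a periodic task set cannot be scheduled on `m` CPUs if its total utilization exceeds `m` or if any single task needs more time than its period. This is a textbook condition chosen as a stand-in; the test in the paper additionally accounts for job-level dependencies and may differ. The `(wcet, period)` tuple layout and the function names are assumptions for this example.

```python
from fractions import Fraction
from math import ceil


def total_utilization(tasks):
    """Sum of wcet/period over all tasks, computed exactly."""
    return sum((Fraction(wcet, period) for wcet, period in tasks), Fraction(0))


def passes_necessary_test(tasks, cpus):
    """Necessary (not sufficient) condition for schedulability on `cpus` CPUs.

    `tasks` is a list of (wcet, period) integer pairs (hypothetical layout).
    A False result proves infeasibility; a True result proves nothing.
    """
    if any(wcet > period for wcet, period in tasks):
        return False  # a single job cannot run on two CPUs at once
    return total_utilization(tasks) <= cpus


def min_cpus_lower_bound(tasks):
    """Smallest CPU count that is not ruled out by total utilization alone."""
    return max(1, ceil(total_utilization(tasks)))


if __name__ == "__main__":
    demo = [(2, 5), (3, 10), (8, 10)]  # utilization 0.4 + 0.3 + 0.8
    print(min_cpus_lower_bound(demo))
    print(passes_necessary_test(demo, 1), passes_necessary_test(demo, 2))
```

Because the condition is only necessary, it is best suited for quickly discarding platform configurations with too few CPUs before a more detailed analysis is run.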
Access to shared memory is one of the main challenges for many-core processors. One group of scheduling strategies for such platforms focuses on separating a task's access to shared memory from its code execution. This makes it possible to orchestrate the access to shared local and off-chip memory in such a way that access contention between different compute cores is avoided by design. In this work, an execution framework is introduced that leverages local memory by statically allocating a subset of tasks to cores. This reduces the access times to shared memory, as off-chip memory access is avoided, and in turn improves the schedulability of such systems. A Constraint Programming (CP) formulation is presented that selects the statically allocated tasks and generates the complete system schedule. Evaluations show that the proposed approach yields an up to 21% higher schedulability ratio than related work, and a case study demonstrates its applicability to industrial problems.
Many-core processors can provide the computational power required by future complex embedded systems. However, their adoption is not trivial, since several sources of interference on COTS many-core platforms have adverse effects on the resulting performance. One main source of performance degradation is the contention on the Network-on-Chip, which is used for communication among the compute cores via the off-chip memory. Available analysis techniques for the traversal time of messages on the NoC do not consider many of the architectural features found on COTS platforms. In this work, we target a state-of-the-art many-core processor, the Kalray MPPA®. A novel partitioning strategy for reducing the contention on the NoC is proposed. Further, we present an analysis technique dedicated to the proposed partitioning strategy, which considers all architectural features of the COTS NoC. Additionally, it is shown how to configure the parameters for flow-regulation on the NoC, such that the Worst-Case Traversal Time (WCTT) is minimal and buffers never overflow. The benefits of our approach are evaluated based on extensive experiments that show that contention is significantly reduced compared to the unconstrained case, while the proposed analysis outperforms a state-of-the-art analysis for the same platform. An industrial case study shows the tightness of the proposed analysis.
The advent of many-core processors came with the increase in computational power needed for future applications. However, new challenges arrived at the same time, especially for the real-time community. Each core on such a processor is a heat source and uneven usage can lead to hot spots on the processor, affecting its lifetime and reliability. For real-time systems, it is therefore of paramount importance to keep the temperature differences between the individual cores below critical values, in order to prevent premature failure of the system. We argue that this problem cannot be solved by traditional approaches, since the growing number of cores makes them intractable. We rather argue to split the problem in the spatial domain and control the temperature on core level. The cores control their temperature by rearranging the load in a predictable manner during runtime. To achieve this, a feedback controller is implemented on each core. We conclude our work with a simulation-based evaluation of the proposed approach, comparing its performance against a previously presented algorithm.
Programmable logic controllers are widely used for the control of automation systems. The standard IEC 61131-3 defines the execution model as well as the programming languages for such systems. Nowadays, actuators and sensors connect to the programmable logic controller via automation buses. While such buses, as well as the sensors and actuators, become more and more powerful, a shift away from the current distributed operation of automation systems, close to the field level, becomes possible. Instead, execution of complex control functions can be relocated to more powerful hardware, and technologies. This paper presents an execution framework for IEC 61131-3, based on a many-core processor. The presented execution model exploits the characteristics of the IEC 61131-3 applications as well as the characteristics of the many-core processor, yielding a predictable execution. We present the platform architecture and an algorithm to allocate a number of IEC 61131-3 conform applications. Experimental as well as simulation-based evaluation is provided.
Many-core systems, processors incorporating numerous cores interconnected by a Network-on-Chip (NoC), provide the computing power needed by future applications. High power density caused by the steadily shrinking transistor size, which is still following Moore's law, leads to a number of problems such as overheating cores, affecting processor reliability and lifetime. Embedded real-time systems are exposed to a changing ambient temperature and thus need to adapt their configuration in order to keep the individual core temperature below critical values. In our approach a hysteresis controller is implemented on each core, triggering a redistribution of the cores and the transition into idle state, allowing the core to cool down. We propose two approaches, one global and one local, to redistribute the tasks and relieve overheating cores during runtime. We evaluate the two proposed approaches by comparing them against each other based on simulations.
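The per-core hysteresis controller mentioned above can be pictured with the following minimal Python sketch. It only shows the two-threshold switching behaviour that defines a hysteresis controller; the class name, the threshold values, and the idea of returning a boolean "cooling" flag are assumptions for illustration, and the task redistribution that the paper triggers on a state change is not modelled.

```python
class HysteresisController:
    """Two-threshold controller for a single core (illustrative only).

    The core enters the cooling (idle) state once its temperature reaches
    `t_high` and leaves it only after the temperature has dropped to
    `t_low`. The gap between the thresholds prevents rapid toggling.
    """

    def __init__(self, t_high, t_low):
        if t_low >= t_high:
            raise ValueError("t_low must be below t_high")
        self.t_high = t_high
        self.t_low = t_low
        self.cooling = False

    def update(self, temperature):
        """Feed one temperature sample; return True while the core cools."""
        if not self.cooling and temperature >= self.t_high:
            self.cooling = True   # here a real system would migrate tasks away
        elif self.cooling and temperature <= self.t_low:
            self.cooling = False  # core is available for load again
        return self.cooling


if __name__ == "__main__":
    ctrl = HysteresisController(t_high=80, t_low=70)  # hypothetical values
    samples = [65, 79, 81, 75, 71, 70, 78]
    print([ctrl.update(t) for t in samples])
```

Note that the sample of 75 degrees keeps the core in the cooling state although it is below the upper threshold; a single-threshold controller would switch back immediately and could oscillate.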
In this paper we present a low overhead thermal management approach to increase reliability of many-core embedded real-time systems. Each core is controlled by a feedback controller. We adapt the utilization of the core in order to decrease the dynamic power consumption and thus the corresponding heat development. Sophisticated control mechanisms allow us to migrate the load in advance, before reaching critical temperature values and thus we can migrate in a safe way with a guarantee to meet all deadlines.
In this work we focus on the task mapping problem for many-core real-time systems. The growing number of cores connected by a Network-on-Chip (NoC) calls for sophisticated mapping techniques to meet the growing demands of real-time applications. Hardware should be used in an efficient way such that unnecessary resource usage is avoided. Because of the NP-hardness of the problem, heuristic and meta-heuristic techniques are used to find good solutions. We further consider periodic communication between tasks and we focus on a static mapping solution.
Load balancing is widely used to optimize response times and throughput of software systems. When considering embedded systems, however, additional optimization goals like energy consumption become relevant. In this paper, we explore the use of load balancing in embedded multicore applications. We present extensions to three prominent load balancing schemes, enabling them to dynamically scale the number of active cores. We integrated the algorithms in a proprietary operating system targeting multicore embedded systems. Our evaluation, which is based on a telecommunication (VoIP) scenario, shows that a significant reduction in energy consumption is possible.
Modern automotive software systems consist of hundreds of heterogeneous software applications, belonging to separated function domains and often developed within distributed automotive ecosystems consisting of original equipment manufacturers, tier-1 and tier-2 companies. Hence, the development of modern automotive software systems is a formidable challenge. A well-known instrument for coping with the tremendous heterogeneity and complexity of modern automotive software systems is the use of architectural languages as a way of enabling different and specific views over these systems. However, the use of different architectural languages might come with the cost of reduced interoperability and automation as different languages might have weak to no integration. In this article, we tackle the challenge of integrating two architectural languages heavily used in the automotive domain for the design and timing analysis of automotive software systems: AMALTHEA and Rubus Component Model. The main contributions of this paper are i) a mapping scheme for the translation of an AMALTHEA architecture into a Rubus Component Model architecture where high-precision timing analysis can be run, and the back annotation of the analysis results on the starting AMALTHEA architecture; ii) the implementation of the proposed scheme, which uses the concept of model transformations for enabling a full-fledged automated integration; iii) the application of such automation on three industrial automotive systems being the brake-by-wire, the full blown engine management system and the engine management system. We discuss and evaluate the proposed contributions using an online expert survey and the above-mentioned use cases. Based on the evaluation results, we conclude that the proposed automation mechanism is correct and applicable in industrial contexts. Besides, we observe that the performance of the automation mechanism does not degrade when translating large models with several thousands of elements. Eventually, we conclude that experts in this field find the proposed contribution industrially relevant.
This paper focuses on the mapping between two industrial architectural languages: AMALTHEA and Rubus Component Model. Both languages are heavily used within the automotive domain for the design and timing analysis of automotive software, respectively. The main contribution of this paper is a mapping scheme between the two architectural languages enabling i) the translation of an AMALTHEA architecture into a Rubus Component Model architecture where high-precision timing analysis can be performed ii) and the back-propagation of the analysis results on the AMALTHEA architecture. We validate the applicability of the proposed mapping scheme using an industrial use case from the automotive domain: the brake-by-wire system. We discuss the industrial relevance and lessons learnt of this work using expert interviews.
The automotive E/E architectures are evolving from the traditional distributed architectures to upcoming consolidated domain architectures and possibly future centralised architectures. This paper demonstrates modelling and timing analysis of real-time embedded systems on contemporary automotive E/E architectures using the Rubus-ICE tool suite. The Rubus concept and tool suite, developed and evolved based on close academic-industrial collaboration, have been used in the automotive industry for over 25 years. The paper also demonstrates recent extensions and discusses proposals to support the modelling and timing analysis of the systems on future E/E architectures.
This paper addresses the scheduling of industrial time-critical applications on multi-core embedded systems. A novel scheduling technique under partitioned scheduling is proposed that minimizes inter-core data-propagation delays between tasks that are activated with different periods. The proposed technique is based on the AUTOSAR standard compliant read-execute-write model for the execution of tasks to guarantee temporal isolation when accessing the shared resources. The technique is evaluated through a series of experiments using a large number of task sets to assess its scalability as well as the resulting schedulability ratio, which is still 18% for two cores that are both utilized 90%. Furthermore, an automotive industrial case study is performed to demonstrate the applicability of the proposed technique to industrial systems. The case study also presents a comparative evaluation of the schedules generated by (i) the proposed technique and (ii) the Rubus-ICE industrial tool suite with respect to jitter, inter-core data-propagation delays and their impact on data age of task chains that span multiple cores.
The Network-on-Chip (NoC) is the on-chip interconnection medium of choice for modern massively parallel processors and System-on-Chip (SoC) in general. Fixed-priority based preemptive scheduling using virtual-channels is a solution to support real-time communications in on-chip networks. For the priority assignment problem in the context of NoCs, heuristic-based priority assignment algorithms are more practical, because the search space grows exponentially with the number of flows. In our previous work, we have proposed a graph-based heuristic priority assignment algorithm (called GHSA) for NoC communications, where we show that taking the dependencies between flows into account can significantly reduce the search space. However, GHSA only works for NoCs with distinct priorities. Routers in such platforms may incur a large buffer cost when the number of flows is high. The applicability can thus be limited in reality. One solution to reduce the buffer cost is to allow priority sharing of different flows. In this paper, we propose a dependency-graph based priority assignment algorithm (called eGHSA) targeting NoCs with shared virtual-channels. A number of experiments as well as a case study based on an automotive application are presented, which clearly show that eGHSA improves the efficiency compared to the existing solution in the literature.
Network-on-Chip (NoC) is a communication subsystem which has been widely utilized in many-core processors and system-on-chips in general. In this paper, we focus on a Round-Robin Arbitration (RRA) based wormhole-switched NoC which is a common architecture used in most of the existing implementations. In order to execute real-time applications on such a NoC based platform, a number of given real-time requirements need to be fulfilled. One of the most typical requirements is schedulability, which refers to whether real-time packets can be delivered within the given time durations. Timing analysis is a common tool to verify the schedulability of a real-time system. Unfortunately, the existing timing analyses of RRA-based NoCs either provide overly pessimistic estimates, which result in over-allocated resources, or require a large amount of processing, which limits the applicability in reality. Therefore, in this paper, we present an improved timing analysis, aiming to provide more accurate estimates along with acceptable computation time. From the evaluation results, we can clearly observe the improvement achieved by the proposed timing analysis.
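To make the term Round-Robin Arbitration concrete, the sketch below models only the rotating grant policy of a single router output port in Python. It is an illustrative assumption, not the router model or the timing analysis of the paper: the class name `RoundRobinArbiter` is hypothetical, and wormhole-specific behaviour (an output being held until a packet's tail flit has passed, buffer sizes, back-pressure) is deliberately left out.

```python
from typing import List, Optional


class RoundRobinArbiter:
    """Rotating-priority arbiter for one output port (illustrative only)."""

    def __init__(self, num_ports: int) -> None:
        self.num_ports = num_ports
        # Start so that input port 0 is checked first on the first grant.
        self.last_granted = num_ports - 1

    def grant(self, requests: List[bool]) -> Optional[int]:
        """Return the granted input port, or None if nobody is requesting."""
        # Scan the ports starting right after the most recently granted one.
        for offset in range(1, self.num_ports + 1):
            port = (self.last_granted + offset) % self.num_ports
            if requests[port]:
                self.last_granted = port
                return port
        return None


# Usage: three input ports competing for the same output.
arbiter = RoundRobinArbiter(3)
grants = [arbiter.grant([True, True, True]) for _ in range(4)]
```

Under this policy a requesting port waits for at most one grant to each of the other ports before being served, which is the kind of bound a worst-case analysis of such networks builds on.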
Network-on-Chip (NoC) is a communication sub-system which has been widely utilized in many-core processors and system-on-chips in general. In order to execute time-critical applications on a NoC-based platform, the timing behavior of the network needs to be predicted during system design. One of the most important timing requirements is regarding schedulability, which refers to determining if a real-time packet can be delivered within a specific time duration. To verify the fulfillment of such timing requirement, a proper timing analysis is mandatory. Our work focuses on a Round-Robin Arbitration (RRA) based wormhole-switched NoC, which is a common architecture used in many of the existing implementations. Recursive Calculus (RC) is one of the existing analysis approaches for RRA-based NoCs which has been utilized in many research works. However, RC does not take buffer-effects into account. As a result, while performing RC on most of the existing RRA-based NoC designs, it can produce unsafe estimates which is not acceptable for time-critical systems. In this paper, we identify the optimistic problem of RC, and we propose a Revised Recursive Calculus (RRC) which extends RC by considering buffer-effects as well as supporting packetization.
The Network-on-Chip is the on-chip interconnection medium of choice for modern massively parallel processors and System-on-Chip in general. Fixed-priority based preemptive scheduling using virtual-channels is a solution to support real-time communications in on-chip networks. However, the different characteristics of the Network-on-Chip compared to the single processor scheduling problem prevent the usage of known optimal algorithms (e.g. Audsley's algorithm) to assign priorities to messages. A heuristic search algorithm based approach (called the HSA) focusing on the priority assignment for on-chip communications has been presented in the literature. The HSA is much faster than an exhaustive search based solution, at the price of missing certain schedulable cases (i.e. it is non-optimal). In this paper, we present two undirected-graph based priority assignment algorithms, the GESA and the GHSA. In contrast to the previous work, we can decrease the search space significantly by taking the interference dependencies of different messages on the network into account. A number of experiments are conducted in order to evaluate the proposed algorithms. The results show that the GESA can always achieve higher schedulability ratios than the HSA, but may require longer processing time. On the other hand, the GHSA has the same performance as the HSA regarding the schedulability, but can significantly improve the efficiency.
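For context on the single-processor baseline named in the abstract above, the following Python sketch shows Audsley's optimal priority assignment combined with the classic response-time test for independent, preemptive, fixed-priority tasks with deadlines no larger than periods. It illustrates the single-processor setting only; it is not the HSA, the GESA, or the GHSA, and the tuple layout and function names are assumptions made for this example.

```python
import math
from typing import List, Optional, Tuple

# Hypothetical task tuple: (name, wcet, period, deadline), deadline <= period.
Task = Tuple[str, int, int, int]


def schedulable_at_lowest(task: Task, higher: List[Task]) -> bool:
    """Response-time test for `task` when all tasks in `higher` preempt it."""
    _, wcet, _, deadline = task
    response = wcet
    while response <= deadline:
        # Interference from every higher-priority task released in the window.
        nxt = wcet + sum(math.ceil(response / tj) * cj for _, cj, tj, _ in higher)
        if nxt == response:
            return True  # fixed point reached within the deadline
        response = nxt
    return False


def audsley(tasks: List[Task]) -> Optional[List[str]]:
    """Return task names from highest to lowest priority, or None."""
    unassigned = list(tasks)
    lowest_first: List[Task] = []
    while unassigned:
        for candidate in unassigned:
            others = [t for t in unassigned if t is not candidate]
            # Any task that is schedulable at the lowest free level may take it.
            if schedulable_at_lowest(candidate, others):
                lowest_first.append(candidate)
                unassigned.remove(candidate)
                break
        else:
            return None  # no task fits this level: no feasible assignment
    return [t[0] for t in reversed(lowest_first)]


# Usage with three illustrative tasks.
example = [("a", 1, 4, 4), ("b", 2, 6, 6), ("c", 3, 12, 12)]
priority_order = audsley(example)
```

The algorithm relies on a task's schedulability depending only on which tasks have higher priority, not on their relative order. As the abstract notes, the characteristics of on-chip networks prevent the direct use of such known optimal algorithms.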
Network-on-Chip (NoC) is a preferred communication medium for massively parallel platforms. Fixed-priority based scheduling using virtual-channels is one of the promising solutions to support real-time traffic in on-chip networks. Most of the existing NoC implementations which can support fixed-priority based scheduling use a flit-level preemptive scheduling. Under such a mechanism, preemptions can happen between the transmissions of successive flits. In this paper, we present a modified framework where the non-preemptive region of each NoC packet increases from a single flit. Using the proposed approach, the response times of certain packet flows can be reduced, which can thus improve the schedulability of the whole network. As a result, the utilization of NoCs can be improved by admitting more real-time traffic. Schedulability tests regarding the proposed framework are presented along with the proof of the correctness. Moreover, a number of experiments as well as a case study based on an automotive application have been generated, where we can clearly observe the improvement of our solution compared to the original flit-level preemptive NoC.
The Network-on-Chip (NoC) is the preferred interconnection medium for massively parallel platforms. Targeting real-time applications, fixed-priority based NoCs with virtual-channels have been proposed as a promising solution. In order to verify if specific time requirements can be satisfied, schedulability tests are typically used. Several analysis approaches have been proposed targeting priority-based NoCs. However, due to the approximation considered in the analyses, the results may involve a large amount of pessimism. The applicability of the analyses is thus limited in practice. In this paper, we identify a number of properties of NoCs with shared priorities. An improved time analysis is proposed where pessimism can be significantly reduced for many cases. In order to evaluate the proposed analysis, a number of experiments have been generated along with a case study based on an automotive application. The improvement can be clearly observed from the evaluation results.
Network-on-Chip (NoC) is a preferred communication medium for massively parallel platforms. Fixed-priority based scheduling using virtual-channels is one of the promising solutions to support real-time traffic in on-chip networks. Most of the existing works regarding priority-based NoCs use a flit-level preemptive scheduling. Under such a mechanism, preemptions can only happen between the transmissions of successive flits but not during the transmission of a single flit. In this paper, we present a modified framework where the non-preemptive region of each NoC packet increases from a single flit. Using the proposed approach, the response times of certain traffic flows can be reduced, which can thus improve the schedulability of the whole network. As a result, the utilization of NoCs can be improved by admitting more real-time traffic. Schedulability tests regarding the proposed framework are presented along with the proof of the correctness. Additionally, we also propose a path modification approach on top of the non-preemptive region based method to further improve schedulability. A number of experiments have been performed to evaluate the proposed solutions, where we can observe significant improvement on schedulability compared to the original flit-level preemptive NoCs.
Network-on-Chip (NoC) is the interconnect of choice for many-core processors and system-on-chips in general. Most of the existing NoC designs focus on the performance with respect to average throughput, which makes them less applicable for real-time applications especially when applications have hard timing requirements on the worst-case scenarios. In this paper, we focus on a Round-Robin Arbitration (RRA) based wormhole-switched NoC which is a common architecture used in most of the existing implementations. We propose a novel segmentation algorithm targeting RRA-based NoCs in order to improve the schedulability of real-time traffic without modifying the hardware architecture. According to the evaluation results, the proposed segmentation solution can significantly improve the schedulability of the whole network.
Network-on-Chip (NoC) is the interconnect of choice for many-core processors and system-on-chips in general. Most of the existing NoC designs focus on the performance with respect to average throughput, which makes them less applicable for real-time applications especially when applications have hard timing requirements on the worst-case scenarios. In this paper, we focus on a Round-Robin Arbitration (RRA) based wormhole-switched NoC which is a common architecture used in most of the existing implementations. We propose a novel segmentation algorithm targeting RRA-based NoCs in order to improve the schedulability of real-time traffic without modifying the hardware architecture. Additionally, we also address the problem of transmitting both real-time traffic and best-effort traffic in the same NoC. The proposed solutions aim to provide timing guarantees to real-time traffic and achieve low latency for best-effort traffic. According to the evaluation results, the proposed segmentation solution can significantly improve the schedulability of the whole network.
The IEC 61131-3 standard, a widely used standard in the automation industry, defines various programming languages for programmable logic controllers. Today, the open source tools that comply with this standard do not support deployment of the applications on multi-core platforms. In this paper, we introduce a novel multi-step approach that aims to support automatic deployment of the automation control applications, developed using the IEC 61131-3 standard, to multi-core platforms. In the first step, the generated sequential code is partitioned. In the second step, the partitioned code is allocated to tasks while the tasks are mapped to various cores, without violating the dependencies, synchronization and communication constraints in the application. In order to provide a proof of concept, we develop a prototype by extending an existing tool that complies with the standard. We also perform a case study and a preliminary evaluation of the prototype.
Autonomous driving is one of the main challenges of modern cars. Computer vision and intelligent on-board decision making are crucial in autonomous driving and require heterogeneous processors with high computing capability under low power consumption constraints. The progress of parallel computing using heterogeneous processing units is further supported by software frameworks like OpenCL, OpenMP, CUDA, and C++AMP. These frameworks allow the allocation of parallel computation on different compute resources. This, however, creates a difficulty in allocating the right computation segments to the right processing units in such a way that the complete system meets all its timing requirements. In this paper, we consider pre-runtime static allocations of parallel tasks to perform their execution either sequentially on CPU or in parallel using a GPU. This allows for improving any unbalanced use of GPU accelerators in a heterogeneous environment. By evaluating several heuristic algorithms, we show that the overuse of accelerators results in a bottleneck of the entire system execution. The experimental results show that our allocation schemes that target a balanced use of GPU improve the system schedulability up to 90%.
During recent years, the interest in using heterogeneous computing architecture in industrial applications has increased dramatically. These architectures provide the computational power that makes them attractive for many industrial applications. However, most of these existing heterogeneous architectures suffer from the following limitations: difficulties of heterogeneous parallel programming and high communication cost between the computing units. To overcome these disadvantages, several leading hardware manufacturers have formed the HSA Foundation to develop a new hardware architecture: Heterogeneous System Architecture (HSA). In this paper, we investigate the suitability of using HSA for real-time embedded systems. A preliminary experimental study has been conducted to measure massive computing power and timing predictability of HSA.