Utilizing edge and cloud computing to improve the profitability of manufacturing is drastically increasing in modern industries. As a result, several challenges have arisen over the years that require urgent attention. Among these, coping with different faults in edge and cloud computing and recovering from permanent and temporary faults have become prominent issues. In this paper, we focus on the challenges of applying fault tolerance techniques to edge and cloud computing in the context of manufacturing, and we investigate the current state of the proposed approaches by categorizing them into several groups. Moreover, we identify critical gaps in the research domain as open research directions.
In the era of the IoT revolution, applications are becoming ever more sophisticated and accompanied by diverse functional and non-functional requirements, including those related to computing resources and performance levels. Such requirements make the development and implementation of these applications complex and challenging. Computing models, such as cloud computing, can provide applications with on-demand computation and storage resources to meet their needs. Although cloud computing is a great enabler for IoT and endpoint devices, its limitations make it unsuitable to fulfill all design goals of novel applications and use cases. Instead of only relying on cloud computing, leveraging and integrating resources at different layers (like IoT, edge, and cloud) is necessary to form and utilize a computing continuum. The layers’ integration in the computing continuum offers a wide range of innovative services, but it introduces new challenges (e.g., monitoring performance and ensuring security) that need to be investigated. A better grasp and more profound understanding of the computing continuum can guide researchers and developers in tackling and overcoming such challenges. Thus, this paper provides a comprehensive and unified view of the computing continuum. The paper discusses computing models in general with a focus on cloud computing, the computing models that emerged beyond the cloud, and the communication technologies that enable computing in the continuum. In addition, two novel reference architectures are presented in this work: one for edge–cloud computing models and the other for edge–cloud communication technologies. We demonstrate real use cases from different application domains (like industry and science) to validate the proposed reference architectures, and we show how these use cases map onto the reference architectures. Finally, the paper highlights key points that express the authors’ vision about efficiently enabling and utilizing the computing continuum in the future.
The paper presents an approach to address software- and hardware-related failures in edge-cloud environments, more precisely, in cloud manufacturing environments. The proposed approach, called TOLERANCER, is composed of distributed components that continuously interact in a peer-to-peer fashion. Such interaction aims to detect stress situations or node failures, and accordingly, TOLERANCER makes decisions to avoid or resolve any potential system failures. The efficacy of the proposed approach is validated through a set of experiments, and the performance evaluation shows that it responds effectively to different fault scenarios.
In the cloud computing model, resource sharing introduces major benefits for improving resource utilization and total cost of ownership, but it can create technical challenges with respect to runtime performance. In practice, orchestrators are required to allocate sufficient physical resources to each Virtual Machine (VM) to meet a set of predefined performance goals. To ensure a specific service level objective, the orchestrator needs to be equipped with a dynamic tool for assigning computing resources to each VM based on the run-time state of the target environment. To this end, we present LOOPS, a multi-loop control approach, to allocate resources to VMs based on the service level agreement (SLA) requirements and the run-time conditions. LOOPS is mainly composed of one essential unit to monitor VMs, and three control levels to allocate resources to VMs based on requests from that unit. A tailor-made controller is proposed with each level to regulate contention among collocated VMs, to reallocate resources if required, and to migrate VMs from one host to another. The three levels work together to meet the required SLA. The experimental results have shown that the proposed approach can meet applications' performance goals by assigning the resources required by cloud-based applications.
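As a rough illustration of how a multi-loop resource controller of this kind could be structured, the sketch below combines a per-VM loop that adjusts CPU shares against an SLA response-time target with a host-level loop that flags overloaded hosts for migration. The loop structure, gains, and thresholds are illustrative assumptions, not the LOOPS implementation.

```python
# Minimal sketch of an SLA-driven, two-level resource loop (illustrative only;
# names, gains, and thresholds are assumptions, not the LOOPS implementation).

def inner_loop(cpu_share, measured_rt, target_rt, k_p=0.05, share_max=1.0):
    """Per-VM loop: adjust the CPU share proportionally to the SLA error."""
    error = (measured_rt - target_rt) / target_rt
    new_share = min(share_max, max(0.05, cpu_share + k_p * error))
    saturated = new_share >= share_max and error > 0
    return new_share, saturated

def outer_loop(host_shares, capacity=1.0):
    """Host-level loop: if the sum of requested shares exceeds capacity,
    flag the host so a VM can be migrated elsewhere."""
    return sum(host_shares) > capacity

# Example: one control period for a VM missing its 200 ms response-time goal.
share, needs_more = inner_loop(cpu_share=0.4, measured_rt=0.26, target_rt=0.20)
migrate = outer_loop([share, 0.5, 0.3])
print(share, needs_more, migrate)
```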
In this paper, we investigate the problem of modeling time series as a process generated by (i) switching between several independent sub-models, where (ii) each sub-model has heteroskedastic noise and (iii) a polynomial bias describing the nonlinear dependency on the system input. First, we propose a generic nonlinear and heteroskedastic statistical model for the process. Then, we design a Maximum Likelihood (ML) parameter estimation method capable of handling heteroskedasticity and exploiting constraints on the model structure. We investigate solving the intractable ML optimization using population-based stochastic numerical methods. We then find possible model change-points that maximize the likelihood without over-fitting the measurement noise. Finally, we verify the usefulness of the proposed technique in a practically relevant case study, the execution time of odometry estimation for a robot operating a radar sensor, and evaluate the different proposed procedures using both simulations and field data.
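To illustrate the kind of population-based ML estimation referred to above, the following sketch fits a single polynomial-bias, heteroskedastic sub-model by minimizing the negative log-likelihood with SciPy's differential evolution. The model form, parameter bounds, and synthetic data are assumptions for illustration; the paper's switching structure, change-point search, and constraint handling are omitted.

```python
# Minimal sketch: fit one heteroskedastic, polynomial-bias sub-model by
# minimizing the negative log-likelihood with a population-based optimizer.
# Illustrative only; the switching/change-point logic is omitted and the
# parameter bounds are assumptions.
import numpy as np
from scipy.optimize import differential_evolution

rng = np.random.default_rng(0)
u = rng.uniform(0.0, 2.0, 300)                      # system input
y = 1.0 + 0.5 * u + 0.3 * u**2 \
    + rng.normal(scale=0.1 * np.exp(0.5 * u))       # polynomial bias + heteroskedastic noise

def neg_log_lik(theta):
    a0, a1, a2, c0, c1 = theta
    mu = a0 + a1 * u + a2 * u**2                    # polynomial mean
    sigma = np.exp(c0 + c1 * u)                     # input-dependent std deviation
    return np.sum(np.log(sigma) + 0.5 * ((y - mu) / sigma) ** 2)

bounds = [(-5, 5)] * 3 + [(-5, 1), (-2, 2)]
result = differential_evolution(neg_log_lik, bounds, seed=1)
print(result.x)   # estimated (a0, a1, a2, c0, c1)
```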
This study is based on a Moving Target Defence (MTD) algorithm designed to introduce uncertainty into the controller and an additional layer of uncertainty into intrusion detection. This randomness complicates the adversary's attempts to craft stealthy attacks while concurrently minimizing the impact of false-data injection attacks. Leveraging concepts from state observer design, the method establishes an optimization framework to determine the parameters of the random signals. These signals are strategically tuned to increase the detectability of stealthy attacks while reducing the deviation resulting from false data injection attempts. We propose here to use two different state observers and two associated MTD algorithms. The first one optimizes the parameters of the random signals to reduce the deviation resulting from false data injection attempts and maintain the stability of the closed-loop system with the desired level of performance. In contrast, the second one optimizes the parameters of the random signals to increase the detectability of stealthy attacks. Dividing the optimization problem into two separate optimization processes simplifies the search and makes it possible to attain higher values of the detection cost function. To illustrate the effectiveness of our approach, we present a case study involving a generic linear time-invariant system and compare the results with a recently published algorithm.
With the rapidly growing use of Multi-Agent Systems (MASs), which can exponentially increase system complexity, the problem of planning a mission for MASs has become more intricate. In some MASs, human operators are still involved in various decision-making processes, including manual mission planning, which can be an ineffective approach for any non-trivial problem. Mission planning and re-planning can be represented as a combinatorial optimization problem. Computing a solution to these types of problems is notoriously difficult and not scalable, posing a challenge even to cutting-edge solvers. As time is usually considered an essential resource in MASs, automated solvers have a limited time to provide a solution. The downside of this approach is that it can take a substantial amount of time for the automated solver to provide a sub-optimal solution. In this work, we are interested in the interplay between a human operator and an automated solver, and whether it is more efficient to let a human or an automated solver handle the planning and re-planning problems, or whether the combination of the two is a better approach. We thus propose an experimental setup to evaluate the effect of having a human operator included in the mission planning and re-planning process. Our tests are performed on a series of instances with gradually increasing complexity and involve a group of human operators and a metaheuristic solver based on a genetic algorithm. We measure the effect of the interplay on both the quality and structure of the output solutions. Our results show that the best setup is to let the operator come up with a few solutions before letting the solver improve them.
Self-adaptive software systems monitor their operation and adapt when their requirements fail due to unexpected phenomena in their environment. This paper examines the case where the environment changes dynamically over time and the chosen adaptation has to take such changes into account. In control theory, this type of adaptation is known as Model Predictive Control and comes with a well-developed theory and myriad successful applications. The paper focuses on modelling the dynamic relationship between requirements and possible adaptations. It then proposes a controller that exploits this relationship to optimize the satisfaction of requirements relative to a cost function. This is accomplished through a model-based framework for designing self-adaptive software systems that can guarantee a certain level of requirements satisfaction over time, by dynamically composing adaptation strategies when necessary. The proposed framework is illustrated and evaluated through a simulation of the Meeting-Scheduling System exemplar.
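A minimal receding-horizon sketch of this idea is given below: at each step, candidate adaptation sequences are enumerated over a short horizon, their predicted effect on requirements satisfaction is evaluated against a cost function, and only the first action of the best sequence is applied. The scalar satisfaction model, the candidate actions, and the weights are illustrative assumptions, not the framework proposed in the paper.

```python
# Minimal receding-horizon (MPC-style) sketch for choosing adaptations.
# The scalar satisfaction model, candidate actions, and weights are
# illustrative assumptions.
import itertools

ACTIONS = {"none": 0.0, "mild": 0.1, "strong": 0.25}   # effect on satisfaction
ADAPT_COST = {"none": 0.0, "mild": 1.0, "strong": 3.0}
DECAY = 0.9            # the environment slowly erodes satisfaction
HORIZON = 3
TARGET = 0.8

def predict(sat, action):
    """One-step model: satisfaction decays, adaptation pushes it back up."""
    return min(1.0, DECAY * sat + ACTIONS[action])

def plan(sat, horizon=HORIZON, w_track=10.0):
    """Enumerate action sequences and return the first action of the best one."""
    best_cost, best_first = float("inf"), "none"
    for seq in itertools.product(ACTIONS, repeat=horizon):
        s, cost = sat, 0.0
        for a in seq:
            s = predict(s, a)
            cost += w_track * max(0.0, TARGET - s) ** 2 + ADAPT_COST[a]
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first

# Apply only the first action, then re-plan at the next step (receding horizon).
print(plan(sat=0.6))
```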
Dynamic Software Product Lines (DSPLs) are a well-accepted approach to self-adaptation at runtime. In the context of DSPLs, there are plenty of reactive approaches that apply countermeasures as soon as a context change happens. In this paper, we propose a proactive approach, PRODSPL, that exploits an automatically learnt model of the system, anticipates future variations of the system, and generates the best DSPL configuration that can lessen the negative impact of future events on the quality requirements of the system. Predicting the future fosters adaptations that remain good for a longer time and therefore reduces the number of reconfigurations required, making the system more stable. PRODSPL formulates the generation of dynamic reconfigurations as a proactive controller over a prediction horizon, which includes a mapping of the valid configurations of the DSPL into linear constraints. Our approach is evaluated and compared with a reactive approach, DAGAME, also based on a DSPL, which uses a genetic algorithm to generate quasi-optimal feature model configurations at runtime. PRODSPL has been evaluated using a mobile strategy game and a set of randomly generated feature models. The evaluation shows that PRODSPL gives good results with regard to the quality of the configurations generated when it anticipates future events. Moreover, in doing so, PRODSPL drives the system to make as few reconfigurations as possible.
Industrial Augmented Reality (IAR) is a key enabling technology for Industry 4.0. However, its adoption poses several challenges because it requires the execution of computing-intensive tasks in devices with poor computational resources, which contributes to a faster draining of the device batteries. Proactive self-adaptation techniques could overcome these problems that affect the quality of experience by optimizing computational resources and minimizing user disturbance. In this work, we propose to apply ProDSPL, a proactive Dynamic Software Product Line, for the self-adaptation of IAR applications to satisfy the quality requirements. ProDSPL is compared against MODAGAME, a multi-objective DSPL approach that uses a genetic algorithm to generate quasi-optimal feature model configurations at runtime. The evaluation with randomly generated feature models running on mobile devices shows that ProDSPL gives results closer to the Pareto optimal than MODAGAME.
This is an extended abstract of the article: Inmaculada Ayala, Alessandro V. Papadopoulos, Mercedes Amor, Lidia Fuentes, ProDSPL: Proactive self-adaptation based on Dynamic Software Product Lines, Journal of Systems and Software, Volume 175, 2021, 110909, ISSN 0164-1212, https://doi.org/10.1016/j.jss.2021.110909.
It is a standard engineering practice to design feedback-based control to have a system follow a given trajectory. While the trajectory is continuous-time, the sequence of references is varied at discrete times, as it is normally computed by digital systems. In this work, we propose a method to determine the optimal discrete-time references to be applied over a time window of a given duration. The optimality criterion is the minimization of a weighted L2 norm between the achieved trajectory and a given target trajectory that is desired to be followed. The proposed method is then assessed over different simulation results, analyzing the design parameters' effects, and over a UAV use case. The code to reproduce the results is publicly available.
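The sketch below illustrates the underlying idea under simplifying assumptions: with a first-order closed-loop model, the discrete-time references over the window enter the output trajectory linearly, so the weighted L2 criterion reduces to a weighted least-squares problem. The model parameters, weights, and target are assumptions, and the sketch is not the method proposed in the paper.

```python
# Minimal sketch: pick discrete-time references r[0..N-1] that minimize a
# weighted L2 distance between the closed-loop output and a target trajectory.
# The first-order closed-loop model (a, b) and the weights are assumptions.
import numpy as np

a, b, N, y0 = 0.8, 0.2, 20, 0.0
target = np.ones(N)                       # desired trajectory (unit step here)
weights = np.ones(N)                      # L2 weighting over the window

# y[k+1] = a*y[k] + b*r[k]  =>  y = G r + g0, with y = (y[1], ..., y[N])
G = np.zeros((N, N))
for k in range(N):
    for j in range(k + 1):
        G[k, j] = (a ** (k - j)) * b
g0 = np.array([a ** (k + 1) for k in range(N)]) * y0

# Weighted least squares: minimize || W^(1/2) (G r + g0 - target) ||_2
W = np.sqrt(np.diag(weights))
r, *_ = np.linalg.lstsq(W @ G, W @ (target - g0), rcond=None)
print(np.round(r, 3))
```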
The Time-Sensitive Networking (TSN) standards provide a toolbox of features to be utilized in various application domains. The core TSN features include deterministic, zero-jitter, low-latency data transmission and the ability to transmit traffic with various levels of time-criticality on the same network. To achieve deterministic transmission, the TSN standards define a time-aware shaper that coordinates the transmission of Time-Triggered (TT) traffic. In this paper, we tackle the challenge of scheduling the TT traffic and propose a heuristic algorithm, called HERMES. Unlike the existing scheduling solutions, HERMES results in a significantly faster algorithm run-time and a high number of schedulable networks. HERMES can be configured in two modes, zero or relaxed reception jitter, while using multiple TT queues to improve schedulability. We compare HERMES with a constraint programming (CP)-based solution and show that HERMES performs better than the CP-based solution if multiple TT queues are used, both with respect to algorithm run-time and schedulability of the networks.
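For intuition, the sketch below shows a greedy, zero-jitter offset assignment for time-triggered frames on a single link over the hyperperiod. It is only a toy illustration of time-triggered scheduling, not the HERMES heuristic, and the frame set and microsecond granularity are assumptions.

```python
# Minimal greedy sketch of zero-jitter offset assignment for time-triggered
# frames on a single link (illustrative only; not the HERMES heuristic).
from functools import reduce
from math import gcd

def lcm(values):
    return reduce(lambda x, y: x * y // gcd(x, y), values)

def schedule(frames):
    """frames: list of (name, period_us, length_us); returns name -> offset."""
    hyper = lcm([p for _, p, _ in frames])
    busy = [False] * hyper                       # one-microsecond timeline slots
    offsets = {}
    for name, period, length in sorted(frames, key=lambda f: f[1]):
        for offset in range(period - length + 1):
            slots = [start + offset + i
                     for start in range(0, hyper, period)
                     for i in range(length)]
            if not any(busy[s] for s in slots):  # no overlap with earlier frames
                for s in slots:
                    busy[s] = True
                offsets[name] = offset
                break
        else:
            raise RuntimeError(f"{name} not schedulable with this heuristic")
    return offsets

print(schedule([("f1", 10, 2), ("f2", 20, 3), ("f3", 40, 5)]))
```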
This paper proposes a method to efficiently map legacy Ethernet-based traffic into Time-Sensitive Networking (TSN) traffic classes, considering different traffic characteristics. Traffic mapping is one of the essential steps for industries to gradually move towards TSN, which in turn significantly mitigates the management complexity of industrial communication systems. In this paper, we first identify the legacy Ethernet traffic characteristics and properties. Based on these characteristics, we present a mapping methodology to map them into different TSN traffic classes. We implement the mapping method in a tool, named the Legacy Ethernet-based Traffic Mapping Tool (LETRA), together with a TSN traffic scheduler, and perform a set of evaluations on different synthetic networks. The results show that the proposed mapping method obtains up to 90% improvement in the schedulability ratio of the traffic compared to an intuitive mapping method on a multi-switch network architecture.
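A toy illustration of such a mapping step is sketched below: legacy flows characterized by periodicity, deadline, and criticality are assigned to TSN traffic classes by simple rules. The thresholds and class choices are assumptions for illustration and do not reproduce the LETRA methodology.

```python
# Illustrative sketch of mapping legacy Ethernet flows to TSN traffic classes.
# The thresholds and class names below are assumptions, not the LETRA rules.
from dataclasses import dataclass

@dataclass
class Flow:
    name: str
    periodic: bool
    deadline_ms: float     # end-to-end deadline
    hard_real_time: bool

def map_to_tsn_class(flow: Flow) -> str:
    if flow.periodic and flow.hard_real_time and flow.deadline_ms <= 2.0:
        return "Scheduled Traffic (TT)"       # time-triggered, gate-controlled
    if flow.deadline_ms <= 10.0:
        return "AVB Class A"                  # bounded latency, credit-based
    if flow.deadline_ms <= 50.0:
        return "AVB Class B"
    return "Best Effort"

flows = [Flow("motion-control", True, 1.0, True),
         Flow("sensor-stream", True, 20.0, False),
         Flow("firmware-update", False, 1000.0, False)]
for f in flows:
    print(f.name, "->", map_to_tsn_class(f))
```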
In order to enable the adoption of Time Sensitive Networking (TSN) by the industry and be more environmentally sustainable, it is necessary to develop tools to integrate legacy systems with TSN. In this paper, we propose a solution for the coexistence of different time domains from different legacy systems with their corresponding synchronization protocols in a single TSN network. To this end, we experimentally identified the effects of replacing the communications subsystem of a legacy Ethernet-based network with TSN in terms of synchronization. Based on the results, we propose a solution called TALESS (TSN with Legacy End-Stations Synchronization). TALESS is able to identify the drift between the TSN communications subsystem and the integrated legacy devices (end-stations) and modify the TSN schedule to adapt to the different time domains to avoid the effects of the lack of synchronization between them. We validate TALESS through both simulations and experiments on a prototype. Thereby we demonstrate that thanks to TALESS, legacy systems are able to synchronize through TSN and even improve features such as their reception jitter or their integrability with other legacy systems.
Moving towards new technologies, such as Time Sensitive Networking (TSN), in industries should be gradual, with a proper integration process instead of replacing the existing technologies, to make it beneficial in terms of cost and performance. Within this context, this paper identifies the challenges of integrating a legacy EtherCAT network, a commonly used technology in the automation domain, into a TSN network. We show that clock synchronization plays an essential role in EtherCAT-TSN network integration, with important requirements. We propose a clock synchronization mechanism based on the TSN standards to obtain precise synchronization among EtherCAT nodes, resulting in efficient data transmission. Using a formal verification framework based on the UPPAAL tool, we show that the integrated EtherCAT-TSN network with the proposed clock synchronization mechanism achieves at least 3 times higher synchronization precision compared to not using any synchronization.
In this paper, we present our ongoing work on proposing solutions to integrate legacy end-stations into Time-Sensitive Network (TSN) communication systems where the legacy end-stations are synchronized via their legacy clock synchronization protocol. To this end, we experimentally identify the effects of lacking synchronization or partial synchronization in TSN networks. In the experiments we show the effects of clock synchronization in different scenarios on jitter and clock drifts. Based on the experiments, we propose preliminary solutions to overcome the identified effects.
Offline scheduling of Scheduled Traffic (ST) in Time-Sensitive Networks (TSN) without taking into account the quality of service of non-ST traffic, e.g., time-sensitive traffic such as Audio-Video Bridging (AVB) traffic, can potentially cause deadline misses for non-ST traffic. In this paper, we report our ongoing work to propose a solution that, regardless of the ST scheduling algorithm being used, can ensure meeting timing requirements for non-ST traffic. To do this, we define a frame called Guard Frame (GF) that will be scheduled together with all ST frames. We show that a proper design for the GFs will leave necessary porosity in the ST schedules to ensure that all non-ST traffic will meet their timing requirements.
In order to facilitate the adoption of Time Sensitive Networking (TSN) by the industry, it is necessary to develop tools to integrate legacy systems with TSN. In this article, we propose a solution for the coexistence of different time domains from different legacy systems, each with its corresponding synchronization protocol, in a single TSN network. To this end, we experimentally identified the effects of replacing the communications subsystem of a legacy Ethernet-based network with TSN in terms of synchronization. Based on the results, we propose a solution called TALESS (TSN with Legacy End-Stations Synchronization). TALESS can identify the drift between the TSN communications subsystem and the integrated legacy devices (end-stations) and then modify the TSN schedule to adapt to the different time domains to avoid the effects of the lack of synchronization between them. We validate TALESS through both simulations and experiments on a prototype. We demonstrate that thanks to TALESS, legacy systems can synchronize through TSN and even improve features such as their reception jitter or their integrability with other legacy systems.
A traditional approach to realize self-adaptation in software engineering (SE) is by means of feedback loops. The goals of the system can be specified as formal properties that are verified against models of the system. On the other hand, control theory (CT) provides a well-established foundation for designing feedback loop systems and providing guarantees for essential properties, such as stability, settling time, and steady-state error. Currently, it is an open question whether and how traditional SE approaches to self-adaptation consider properties from CT. Answering this question is challenging given the principal differences in representing properties in both fields. In this paper, we take a first step towards answering this question. We follow a bottom-up approach where we specify a control design (in Simulink) for a case inspired by Scuderia Ferrari (F1) and provide evidence for stability and safety. The design is then transferred into code (in C) that is further optimized. Next, we define properties that enable verifying whether the control properties still hold at the code level. Then, we consolidate the solution by mapping the properties in both worlds using specification patterns as a common language, and we verify the correctness of this mapping. The mapping offers a reusable artifact to solve similar problems. Finally, we outline opportunities for future work, particularly to refine and extend the mapping and investigate how it can improve the engineering of self-adaptive systems for both SE and CT engineers.
Two of the main paradigms used to build adaptive software employ different types of properties to capture relevant aspects of the system’s run-time behavior. On the one hand, control systems consider properties that concern static aspects like stability, as well as dynamic properties that capture the transient evolution of variables such as settling time. On the other hand, self-adaptive systems consider mostly non-functional properties that capture concerns such as performance, reliability, and cost. In general, it is not easy to reconcile these two types of properties or identify under which conditions they constitute a good fit to provide run-time guarantees. There is a need of identifying the key properties in the areas of control and self-adaptation, as well as of characterizing and mapping them to better understand how they relate and possibly complement each other. In this paper, we take a first step to tackle this problem by: (1) identifying a set of key properties in control theory, (2) illustrating the formalization of some of these properties employing temporal logic languages commonly used to engineer self-adaptive software systems, and (3) illustrating how to map key properties that characterize self-adaptive software systems into control properties, leveraging their formalization in temporal logics. We illustrate the different steps of the mapping on an exemplar case in the cloud computing domain and conclude with identifying open challenges in the area.
This paper highlights cloud computing as one of the principal building blocks of a smart factory, providing huge data storage space and a highly scalable computational capacity. The cloud computing system used in a smart factory should be time-predictable to satisfy the hard real-time requirements of various applications existing in manufacturing systems. Interleaving an intermediate computing layer, called fog, between the factory and the cloud data center is a promising solution to deal with the latency requirements of hard real-time applications. In this paper, a time-predictable cloud framework is proposed which is able to satisfy end-to-end latency requirements in a smart factory. To propose such an industrial cloud framework, we not only use existing real-time technologies, such as Industrial Ethernet and the Real-time XEN hypervisor, but also discuss unaddressed challenges. Among the unaddressed challenges, the partitioning of a given workload between the fog and the cloud is targeted. Addressing the partitioning problem not only provides a resource provisioning mechanism, but also yields a prominent design decision specifying how much computing resource is required to develop the fog platform and how large the minimum communication bandwidth between the fog and the cloud data center should be.
The pervasiveness and growing complexity of software systems is challenging software engineering to design systems that can adapt their behavior to withstand unpredictable, uncertain, and continuously changing execution environments. Control theoretical adaptation mechanisms have received growing interest from the software engineering community in recent years for their mathematical grounding, which allows formal guarantees on the behavior of the controlled systems. However, most of these mechanisms are tailored to specific applications and can hardly be generalized into broadly applicable software design and development processes. This paper discusses a reference control design process, from goal identification to the verification and validation of the controlled system. A taxonomy of the main control strategies is introduced, analyzing their applicability to software adaptation for both functional and non-functional goals. A brief extract on how to deal with uncertainty complements the discussion. Finally, the paper highlights a set of open challenges, both for the software engineering and the control theory research communities.
Multi-agent systems can be prone to failures during the execution of a mission, depending on different circumstances, such as the harshness of the environment they are deployed in. As a result, initially devised plans for completing a mission may no longer be feasible, and a re-planning process needs to take place to re-allocate any pending tasks. There are two main approaches to solve the re-planning problem (i) global re-planning techniques using a centralized planner that will redo the task allocation with the updated world state and (ii) decentralized approaches that will focus on the local plan reparation, i.e., the re-allocation of those tasks initially assigned to the failed robots, better suited to a dynamic environment and less computationally expensive. In this paper, we propose a hybrid approach, named GLocal, that combines both strategies to exploit the benefits of both, while limiting their respective drawbacks. GLocal was compared to a planner-only, and an agent-only approach, under different conditions. We show that GLocal produces shorter mission make-spans as the number of tasks and failed agents increases, while also balancing the tradeoff between the number of messages exchanged and the number of requests to the planner.
Moving nodes in a Mobile Wireless Sensor Network (MWSN) typically have two maintenance objectives: (i) extend the coverage of the network as long as possible to a target area, and (ii) extend the longevity of the network as much as possible. As nodes move and also route traffic in the network, their battery levels deplete differently for each node. Dead nodes lead to loss of connectivity and even to disengaging full parts of the network. Several reactive and rule-based approaches have been proposed to solve this issue by adapting redeployment to depleted nodes. However, in large networks a cooperative approach may increase performance by taking the evolution of node battery and traffic into account. In this paper, we present a hybrid agent-based architecture that addresses the problem of depleting nodes during the maintenance phase of a MWSN. Agents, each assigned to a node, collaborate and adapt their behaviour to their battery levels. The collaborative behavior is modeled through the willingness to interact abstraction, which defines when agents ask and give help to one another. Thus, depleting nodes may ask to be replaced by healthier counterparts and move to areas with less traffic or to a collection point. At the lower level, negotiations trigger a reactive navigation behaviour based on Social Potential Fields (SPF). It is shown that the proposed method improves coverage and extends network longevity in an environment without obstacles as compared to SPF alone.
Adaptive autonomy plays a major role in the design of multi-robot and multi-agent systems, where the need for collaboration to achieve a common goal is of primary importance. In particular, adaptation becomes necessary to deal with dynamic environments and scarce available resources. In this paper, we propose a mathematical framework for modelling the agents' willingness to interact and collaborate, and a dynamic adaptation strategy for controlling the agents' behavior, which accounts for factors such as progress toward a goal and the resources available for completing a task, among others. The performance of the proposed strategy is evaluated through a fire rescue scenario, where a team of simulated mobile robots needs to extinguish all the detected fires and save the individuals at risk, while having limited resources. The simulations are implemented as a ROS-based multi-agent system, and the results show that the proposed adaptation strategy provides a more stable performance than a static collaboration policy.
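A minimal sketch of how a willingness-to-interact signal could be updated and used to decide whether to ask for or offer help is shown below. The chosen factors, weights, and thresholds are illustrative assumptions rather than the exact model proposed in the paper.

```python
# Minimal sketch of a "willingness to interact" update rule. The factors,
# weights, and thresholds are illustrative assumptions, not the paper's model.
def update_willingness(w, battery, progress, queue_load, alpha=0.3):
    """All factors are normalized to [0, 1]; w is kept in [-1, 1]."""
    # Healthy battery and good progress raise willingness to give help;
    # a long queue of the agent's own tasks lowers it.
    drive = (battery - 0.5) + (progress - 0.5) - queue_load
    w = (1 - alpha) * w + alpha * drive
    return max(-1.0, min(1.0, w))

def decide(w, ask_threshold=-0.2, give_threshold=0.2):
    if w < ask_threshold:
        return "ask for help"
    if w > give_threshold:
        return "offer help"
    return "continue alone"

w = 0.0
for battery, progress, load in [(0.9, 0.8, 0.1), (0.4, 0.3, 0.6), (0.2, 0.2, 0.9)]:
    w = update_willingness(w, battery, progress, load)
    print(round(w, 2), decide(w))
```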
When multiple robots are required to collaborate in order to accomplish a specific task, they need to be coordinated to operate efficiently. To allow for scalability and robustness, we propose a novel distributed approach performed by autonomous robots based on their willingness to interact with each other. This willingness, based on their individual state, is used to inform a decision process of whether or not to interact with other robots within the environment. We study this new mechanism to form coalitions in the on-line multi-object κ-coverage problem, and evaluate its performance through two sets of experiments, in which we also compare it to other methods from the state of the art. In the first set, we focus on scenarios with static and mobile targets, as well as with a different number of targets. In the second, we carry out an extensive analysis of the best-performing methods focusing only on mobile targets, while also considering targets that appear and disappear during the course of the experiments. Results show that the proposed method is able to provide comparable performance to the best methods under study.
When multiple robots are required to collaborate in order to accomplish a specific task, they need to be coordinated in order to operate efficiently. To allow for scalability and robustness, we propose a novel distributed approach performed by autonomous robots based on their willingness to interact with each other. This willingness, based on their individual state, is used to inform a decision process of whether or not to interact with other robots within the environment. We study this new mechanism to form coalitions in the on-line multiobject κ-coverage problem, and compare it with six other methods from the literature. We investigate the trade-off between the number of robots available and the number of potential targets in the environment. We show that the proposed method is able to provide comparable performance to the best method in the case of static targets, and to achieve a higher level of coverage with respect to the other methods in the case of mobile targets.
Multi-robot systems can be prone to failures during plan execution, depending on the harshness of the environment they are deployed in. As a consequence, initially devised plans may no longer be feasible, and a re-planning process needs to take place to re-allocate any pending tasks. Two main approaches emerge as possible solutions: a global re-planning technique using a centralized planner that will redo the task allocation with the updated world state information, or a decentralized approach that will focus on local plan reparation, i.e., the re-allocation of those tasks initially assigned to the failed robots. The former approach produces an overall better solution, while the latter is less computationally expensive. The goal of this paper is to exploit the benefits of both approaches, while minimizing their drawbacks. To this end, we propose a hybrid approach that combines a centralized planner with decentralized multi-agent planning. In case of an agent failure, the local plan reparation algorithm tries to repair the plan through agent negotiation. If it fails to re-allocate all of the pending tasks, the global re-planning algorithm is invoked, which re-allocates all unfinished tasks from all agents. The hybrid approach was compared to a planner-only approach, and it was shown that it improves on the makespan of a mission in the presence of different numbers of failures, as a consequence of the local plan reparation algorithm.
In recent years, autonomous systems have become an important research area and application domain, with a significant impact on modern society. Such systems are characterized by different levels of autonomy and complex communication infrastructures that allow for collective decision-making strategies. There exist several publications that tackle ethical aspects in such systems, but mostly from the perspective of a single agent. In this paper we go one step further and discuss these ethical challenges from the perspective of an aggregate of autonomous systems capable of collective decision-making. In particular, in this paper, we propose the Caesar approach through which we model the collective ethical decision-making process of a group of actors—agents and humans, as well as define the building blocks for the agents participating in such a process, namely Caesar agents. Factors such as trust, security, safety, and privacy, which affect the degree to which a collective decision is ethical, are explicitly captured in Caesar. Finally, we argue that modeling the collective decision-making in Caesar provides support for accountability.
In recent works that analyzed execution-time variation of real-time tasks, it was shown that such variation may conform to regular behavior. This regularity may arise from multiple sources, e.g., due to periodic changes in hardware or program state, program structure, inter-task dependence, or inter-task interference. Such complexity can be better captured by a Markov Model, compared to the common approach of assuming independent and identically distributed random variables. However, despite the regularity that may be described with a Markov model, over time, the execution times may change, due to irregular changes in input, hardware state, or program state. In this paper, we propose a Bayesian approach to adapt the emission distributions of the Markov Model at runtime, in order to account for such irregular variation. A preprocessing step determines the number of states and the transition matrix of the Markov Model from a portion of the execution time sequence. In the preprocessing step, segments of the execution time trace with similar properties are identified and combined into clusters. At runtime, the proposed method switches between these clusters based on a Generalized Likelihood Ratio (GLR). Using a Bayesian approach, clusters are updated and emission distributions estimated. New clusters can be identified and clusters can be merged at runtime. The time complexity of the online step is $O(N^{2}+NC)$, where N is the number of states in the Hidden Markov Model (HMM), which is fixed after the preprocessing step, and C is the number of clusters.
Probabilistic approaches have gained attention over the past decade, providing a modeling framework that enables less pessimistic analysis of real-time systems. Among the different proposed approaches, Markov chains have been shown effective for analyzing real-time systems, particularly in estimating the pending workload distribution and deadline miss probability. However, the state of the art has mainly considered discrete emission distributions without investigating the benefits of continuous ones. In this paper, we propose a method for analyzing the workload probability distribution and bounding the deadline miss probability for a task executing in a reservation-based server, where execution times are described by a Markov model with Gaussian emission distributions. The evaluation is performed for the timing behavior of a Kalman filter for Furuta pendulum control. Deadline miss probability bounds are derived with a workload accumulation scheme. The bounds are compared to 1) measured deadline miss ratios of tasks running under the Linux Constant Bandwidth Server with SCHED_DEADLINE, 2) estimates derived from a Markov Model with discrete emission distributions (PROSIT), 3) simulation-based estimates, and 4) an estimate assuming independent execution times. The results suggest that the proposed method successfully upper-bounds the actual deadline miss probabilities. Compared to the discrete-emission counterpart, the computation time is independent of the range of the execution times under analysis, and resampling is not required.
Estimating the response times of real-time tasks and applications is important for the analysis and implementation of real-time systems. Probabilistic approaches have gained attention over the past decade, as they provide a modeling framework that allows for less pessimism for the analysis of real-time systems. Among the different proposed approaches, Markov chains have been shown to be effective for the analysis of real-time systems, in particular, in the estimate of the pending workload probability distribution and of the deadline miss probability. However, this has been analyzed only for discrete emission distributions, but not for continuous ones. In this paper, we propose a method for analyzing the workload probability distribution and bounding the deadline miss probability for a task executing in a Constant Bandwidth Server, where execution times are described by a Markov model with Gaussian emission distributions. In the evaluation, deadline miss probability bounds and estimates are derived with a workload accumulation scheme. The results are compared to simulation and measured deadline miss ratios from tasks under the Linux Constant Bandwidth Server implementation SCHED_DEADLINE.
It has been shown that in some robotic applications, where the execution times cannot be assumed to be independent and identically distributed, a Markov Chain with discrete emission distributions can be an appropriate model. In this paper we investigate whether execution times can be modeled as a Markov Chain with continuous Gaussian emission distributions. The main advantage of this approach is that the concept of distance is naturally incorporated. We propose a framework based on Hidden Markov Model (HMM) methods that 1) identifies the number of states in the Markov Model from observations and fits the Markov Model to observations, and 2) validates the proposed model with respect to observations. Specifically, we apply a tree-based cross-validation approach to automatically find a suitable number of states in the Markov model. The estimated models are validated against observations, using a data consistency approach based on log likelihood distributions under the proposed model. The framework is evaluated using two test cases executed on a Raspberry Pi Model 3B+ single-board computer running Arch Linux ARM patched with PREEMPT_RT. The first is a simple test program where execution times intentionally vary according to a Markov model, and the second is a video decompression using the ffmpeg program. The results show that in these cases the framework identifies Markov Chains with Gaussian emission distributions that are valid models with respect to the observations.
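As a rough illustration of fitting a Gaussian-emission Markov model to execution-time observations and selecting the number of states, the sketch below uses the third-party hmmlearn package and a simple held-out split in place of the paper's tree-based cross-validation and log-likelihood-based consistency validation.

```python
# Minimal sketch: fit a Gaussian-emission HMM to execution-time observations
# and pick the number of states by held-out log-likelihood. Uses hmmlearn and
# a plain train/test split instead of the paper's tree-based cross-validation,
# so it only illustrates the overall idea.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
# Synthetic execution times (seconds): blocks alternating between two regimes.
blocks = [rng.normal(0.010, 0.001, 100) if i % 2 == 0
          else rng.normal(0.018, 0.002, 100) for i in range(10)]
times = np.concatenate(blocks).reshape(-1, 1)
train, test = times[:800], times[800:]

best_n, best_ll, best_model = None, -np.inf, None
for n_states in range(1, 5):
    model = GaussianHMM(n_components=n_states, covariance_type="diag",
                        n_iter=100, random_state=0).fit(train)
    ll = model.score(test)                 # held-out log-likelihood
    if ll > best_ll:
        best_n, best_ll, best_model = n_states, ll, model
print("selected states:", best_n, "means:", best_model.means_.ravel())
```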
Probabilistic timing analysis techniques have been proposed for real-time systems to remedy the problems that deterministic estimates of the task's Worst-Case Execution Time and Worst-Case Response-Time can be both intractable and overly pessimistic. Often, assumptions are made that a task's response time and execution time probability distributions are independent of the other tasks. This assumption may not hold in real systems. In this paper, we analyze the timing behavior of a simple periodic task on a Raspberry Pi model 3 running Arch Linux ARM. In particular, we observe and analyze the distributions of wake-up latencies and execution times for the sequential jobs released by a simple periodic task. We observe that the timing behavior of jobs is affected by release events during the job's execution time, and of other processes running in between subsequent jobs of the periodic task. Using a data consistency approach we investigate whether it is reasonable to model the timing distribution of jobs affected by release events and intermediate processes as translations of the empirical timing distribution of non-affected jobs. According to the analysis, this paper shows that a translated distribution model of non-affected jobs is invalid for the execution time distribution of jobs affected by intermediate processes. Regarding the wake-up latency distribution with intermediate processes, a translated distribution model is improbable, but cannot be completely ruled out.
Security attacks on sensor data can deceive a control system and force the physical plant to reach an unwanted and potentially dangerous state. Therefore, attack detection mechanisms are employed in cyber-physical control systems to detect ongoing attacks, the most prominent one being a threshold-based anomaly detection method called CUSUM. Literature defines the maximum impact of stealth attacks as the maximum deviation in the plant's state that an undetectable attack can introduce, and formulates it as an optimization problem. This paper proposes an optimization-based attack with different saturation models, and it investigates how the attack duration significantly affects the impact of the attack on the state of the plant. We show that more dangerous attacks can be discovered when allowing saturation of the control system actuators. The proposed approach is compared with the geometric attack, showing how longer attack durations can lead to a greater impact of the attack while keeping the attack stealthy.
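For intuition, the sketch below implements a CUSUM-style detector on a residual signal and an additive sensor offset sized to the detector's drift term, illustrating why such injections can remain stealthy while persistently biasing the estimated state. The setup, detector parameters, and attack shape are illustrative assumptions, not the optimization-based attack formulated in the paper.

```python
# Minimal sketch of a CUSUM-style residual detector and a stealthy additive
# sensor attack sized to the drift term. Parameters are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
bias, threshold = 0.05, 1.0        # CUSUM drift term and alarm threshold

def cusum(residuals):
    s, alarms = 0.0, 0
    for r in residuals:
        s = max(0.0, s + abs(r) - bias)
        if s > threshold:
            alarms += 1
            s = 0.0                # reset after an alarm
    return alarms

n = 200
noise = rng.normal(0.0, 0.02, n)           # nominal residuals
attack = np.full(n, bias)                   # injected offset matched to the drift
print("alarms without attack:", cusum(noise))
print("alarms with stealthy attack:", cusum(noise + attack))
```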
Manipulating sensor data can deceive cyber-physical systems (CPSs), leading to hazardous conditions in physical plants. An Anomaly Detection System (ADS) like CUSUM detects ongoing attacks by comparing sensor signals with those generated by a model. However, physics-based methods are threshold-based, which can result in both false positives and undetectable attacks. This can lead to undetected attacks impacting the system state and potentially causing large deviations from the desired behavior. In this paper, we introduce a metric called transparency that uniquely quantifies the effectiveness of an ADS in terms of its ability to prevent state deviation. While existing research focuses on designing optimal zero-alarm stealth attacks, we address the challenge of detecting more sophisticated multi-alarm attacks that generate alarms at a rate comparable to the system noise. Through our analysis, we identify the conditions that require the inclusion of multi-alarm scenarios in worst-case impact assessments. We also propose an optimization problem designed to identify multi-alarm attacks by relaxing the constraints of a zero-alarm attack problem. Our findings reveal that multi-alarm attacks can cause a more significant state deviation than zero-alarm attacks, emphasizing their critical importance in the security analysis of control systems.
Stream processing applications extract value from raw data through Directed Acyclic Graphs of data analysis tasks. Shared-nothing (SN) parallelism is the de-facto standard to scale stream processing applications. Given an application, SN parallelism instantiates several copies of each analysis task, making each instance responsible for a dedicated portion of the overall analysis, and relies on dedicated queues to exchange data among connected instances. On the one hand, SN parallelism can scale the execution of applications both up and out, since threads can run task instances within and across processes/nodes. On the other hand, its lack of sharing can cause unnecessary overheads and hinder the scaling up when threads operate on data that could be jointly accessed in shared memory. This trade-off motivated us to study a way for stream processing applications to leverage shared memory and boost the scale up (before the scale out) while adhering to the widely adopted, SN-based APIs for stream processing applications. We introduce STRETCH, a framework that maximizes the scale up and offers instantaneous elastic reconfigurations (without state transfer) for stream processing applications. We propose the concept of Virtual Shared-Nothing (VSN) parallelism and elasticity and provide formal definitions and correctness proofs for the semantics of the analysis tasks supported by STRETCH, showing that they extend the ones found in common Stream Processing Engines. We also provide a fully implemented prototype and show that STRETCH's performance exceeds that of state-of-the-art frameworks such as Apache Flink and offers, to the best of our knowledge, unprecedented ultra-fast reconfigurations, taking less than 40 ms even when provisioning tens of new task instances.
Streaming analysis is widely used in a variety of environments, from cloud computing infrastructures to the network edge. In these contexts, accurate modeling of streaming operators' performance enables fine-grained prediction of application behavior without the need for costly monitoring. This is of utmost importance for computationally expensive operators like stream joins, whose throughput and latency are very sensitive to rate-varying data streams, especially when deterministic processing is required. In this paper, we present a modeling framework for estimating the throughput and the latency of stream join processing. The model is presented in an incremental, step-wise manner, starting from a centralized non-deterministic stream join and expanding up to a deterministic parallel stream join. The model describes how the dynamics of throughput and latency are influenced by the number of physical input streams, as well as by the amount of parallelism in the actual processing and the requirement for determinism. We present an experimental validation of the model with respect to the actual implementation. The proposed model can provide insights that are catalytic for understanding the behavior of stream joins under different system deployments, with special emphasis on the influences of determinism and parallelization.
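A back-of-the-envelope sketch in the same spirit is shown below: it estimates the comparison workload of a symmetric sliding-window join from the input rates and window size, and translates core utilization into a simple latency factor. The formulas and numbers are illustrative assumptions, not the modeling framework presented in the paper.

```python
# First-order sketch of stream-join cost (illustrative assumptions only;
# this is not the modeling framework from the paper).
def join_cost(rate_left, rate_right, window_s, cmp_per_core_per_s, cores):
    """Symmetric sliding-window join: each arriving tuple is compared against
    the opposite window's contents."""
    comparisons_per_s = rate_left * (rate_right * window_s) \
                        + rate_right * (rate_left * window_s)
    utilization = comparisons_per_s / (cmp_per_core_per_s * cores)
    # A simple M/M/1-style latency blow-up as utilization approaches 1.
    latency_factor = float("inf") if utilization >= 1 else 1.0 / (1.0 - utilization)
    return comparisons_per_s, utilization, latency_factor

for cores in (2, 4, 8):
    cmp_s, util, lat = join_cost(2000, 2000, 10, 5e7, cores)
    print(f"{cores} cores: util={util:.2f}, relative latency={lat:.2f}")
```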
The construction industry is increasingly equipping its machinery with sophisticated embedded systems and modern connectivity. Technology advancements in connected safety-critical systems are complex, with cyber-security becoming a more critical factor. Due to interdependencies and network connectivity, attack surfaces and vulnerabilities have increased significantly. Consequently, it is imperative to perform a risk assessment and implement robust security testing methods in order to prevent cyber-attacks on machinery segments. This paper presents a method for identifying potential security threats that also affect machine functional safety, facilitated by identifying threats in the threat modeling process and analyzing safety-security synergies. By identifying such risks, attack scenarios are created to simulate cyber-attacks and create test cases for validation. This approach integrates security testing into the current testing process by using penetration testing tools and utilizing a Hardware-in-the-Loop (HIL) test setup, and it is verified with a simulated Denial of Service attack over a CAN network.
In construction machinery, connectivity delivers advantages in terms of higher productivity, lower costs, and, most importantly, a safer work environment. As the machinery grows more dependent on internet-connected technologies, data security and product cybersecurity become more critical than ever. These machines carry more cyber risks than other automotive segments because of more complex software, a larger after-market, the use of the standardized SAE J1939 protocol, and connectivity through long-distance wireless communication channels (LTE interfaces for fleet management systems). Construction machinery also operates throughout the day, which means it is connected and monitored continuously. To date, construction machinery manufacturers are investigating the product cybersecurity challenges in threat monitoring, security testing, and establishing security governance and policies. There are limited security testing methodologies for the SAE J1939 CAN protocol. Several testing frameworks have been proposed for fuzz testing CAN networks according to [1]. This paper proposes security testing methods (fuzzing, penetration testing) for in-vehicle communication protocols in construction machinery.
Elasticity is one of the main features of cloud computing, allowing customers to scale their resources based on the workload. Many autoscalers have been proposed in the past decade to decide on behalf of cloud customers when and how to provision resources to a cloud application based on the workload, utilizing cloud elasticity features. However, in prior work, when a new policy is proposed, it is seldom compared to the state-of-the-art, and is often compared only to static provisioning using a predefined quality of service target. This reduces the ability of cloud customers and of cloud operators to choose and deploy an autoscaling policy, as there is seldom enough analysis of the performance of the autoscalers in different operating conditions and with different applications. In our work, we conduct an experimental performance evaluation of autoscaling policies, using workflows as the application model, a popular formalism for automating resource management for applications with well-defined yet complex structures. We present a detailed comparative study of general state-of-the-art autoscaling policies, along with two new workflow-specific policies. To understand the performance differences between the seven policies, we conduct various experiments and compare their performance in both pairwise and group comparisons. We report both individual and aggregated metrics. As many workflows have deadline requirements on the tasks, we study the effect of autoscaling on workflow deadlines. Additionally, we look into the effect of autoscaling on the accounted and hourly charged costs, and we evaluate the performance variability caused by the autoscaler selection for each group of workflow sizes. Our results highlight the trade-offs between the suggested policies, how they can impact meeting the deadlines, and how they perform in different operating conditions, thus enabling a better understanding of the current state-of-the-art.
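For concreteness, the sketch below shows a generic reactive threshold autoscaler of the kind such studies often use as a baseline; the utilization thresholds and scaling increments are illustrative assumptions, and it is not one of the specific policies evaluated in this work.

```python
# Generic reactive threshold autoscaler sketch (illustrative only; not one of
# the specific policies evaluated in the paper).
def autoscale(current_vms, pending_tasks, busy_vms,
              scale_up_util=0.8, scale_down_util=0.3,
              min_vms=1, max_vms=50):
    """Scale on utilization = busy / provisioned, plus queued work."""
    utilization = busy_vms / max(current_vms, 1)
    if utilization > scale_up_util or pending_tasks > current_vms:
        return min(max_vms, current_vms + max(1, pending_tasks // 2))
    if utilization < scale_down_util and pending_tasks == 0:
        return max(min_vms, current_vms - 1)
    return current_vms

# One workflow burst followed by a quiet period.
print(autoscale(current_vms=4, pending_tasks=10, busy_vms=4))   # scale up
print(autoscale(current_vms=9, pending_tasks=0, busy_vms=2))    # scale down
```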
Simplifying the task of resource management and scheduling for customers, while still delivering complex Quality-of-Service (QoS), is key to cloud computing. Many autoscaling policies have been proposed in the past decade to decide on behalf of cloud customers when and how to provision resources to a cloud application utilizing cloud elasticity features. However, in prior work, when a new policy is proposed, it is seldom compared to the state-of-the-art, and is often compared only to static provisioning using a predefined QoS target. This reduces the ability of cloud customers and of cloud operators to choose and deploy an autoscaling policy. In our work, we conduct an experimental performance evaluation of autoscaling policies, using workflows as the application model, a commonly used formalism for automating resource management for applications with a well-defined yet complex structure. We present a detailed comparative study of general state-of-the-art autoscaling policies, along with two new workflow-specific policies. To understand the performance differences between the seven policies, we conduct various forms of pairwise and group comparisons. We report both individual and aggregated metrics. Our results highlight the trade-offs between the suggested policies, and thus enable a better understanding of the current state-of-the-art.
This paper proposes a compositional modeling framework for the optimal energy management of a district network. The focus is on cooling of buildings, which can possibly share resources to the purpose of reducing maintenance costs and using devices at their maximal efficiency. Components of the network are described in terms of energy fluxes and combined via energy balance equations. Disturbances are accounted for as well, through their contribution in terms of energy. Different district configurations can be built, and the dimension and complexity of the resulting model will depend both on the number and type of components and on the adopted disturbance description. Control inputs are available to efficiently operate and coordinate the district components, thus enabling energy management strategies to minimize the electrical energy costs or track some consumption profile agreed with the main grid operator.
As the next generation of diverse workloads like autonomous driving and augmented/virtual reality evolves, computation is shifting from cloud-based services to the edge, leading to the emergence of a cloud-edge compute continuum. This continuum promises a wide spectrum of deployment opportunities for workloads that can leverage the strengths of the cloud (scalable infrastructure, high reliability) and the edge (energy efficiency, low latencies). Despite its promises, the continuum has only been studied in silos of various computing models, thus lacking strong end-to-end theoretical and engineering foundations for computing and resource management across the continuum. Consequently, developers resort to ad hoc approaches to reason about performance and resource utilization of workloads in the continuum. In this work, we conduct a first-of-its-kind systematic study of various computing models, identify salient properties, and make a case to unify them under a compute continuum reference architecture. This architecture provides an end-to-end analysis framework for developers to reason about resource management, workload distribution, and performance analysis. We demonstrate the utility of the reference architecture by analyzing two popular continuum workloads, deep learning and industrial IoT. We have developed an accompanying deployment and benchmarking framework and a first-order analytical model for quantitative reasoning about continuum workloads. The framework is open-sourced and available at https://github.com/atlarge-research/continuum.
Distributed control systems constitute the automation solution backbone in domains where downtime is costly. Redundancy reduces the risk of faults leading to unplanned downtime. The Industry 4.0 appetite to utilize the device-to-cloud continuum increases the interest in network-based, hardware-agnostic controller software. Functionality such as controller redundancy must adhere to the new ground rules of pure network dependency. In a standby controller redundancy, only one controller is the active primary. When the primary fails, the backup takes over. A typical network-based failure detection uses a cyclic message with a known interval, a.k.a. a heartbeat. Such a failure detection interprets heartbeat absences as a failure of the supervisee; consequently, a network partition could be indistinguishable from a node failure. Hence, in a network partitioning situation, a conventional heartbeat-based failure detection causes more than one active controller in the redundancy set, resulting in inconsistent outputs. We present a failure detection algorithm that uses network reference points to prevent network partitioning from leading to dual primary controllers. In other words, a failure detection that prioritizes consistency over availability.
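The sketch below illustrates the consistency-first decision rule in its simplest form: a backup promotes itself only when the primary's heartbeats are missing and a network reference point is still reachable, so a partition on the backup's side does not produce a second primary. The class structure, names, and timeout are illustrative assumptions, not the proposed algorithm itself.

```python
# Minimal sketch of a consistency-first failover decision: the backup promotes
# itself only if the primary's heartbeats stopped AND a network reference
# point is still reachable (so the silence is unlikely to be a partition on
# the backup's side). Names and the timeout are illustrative assumptions.
import time

HEARTBEAT_TIMEOUT_S = 0.5

class BackupController:
    def __init__(self, reachable_reference_points):
        self.last_heartbeat = time.monotonic()
        self.reachable = reachable_reference_points  # injected probe function
        self.role = "backup"

    def on_heartbeat(self):
        self.last_heartbeat = time.monotonic()

    def evaluate(self):
        heartbeat_lost = time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT_S
        if heartbeat_lost and self.reachable():
            self.role = "primary"        # primary really seems down: take over
        elif heartbeat_lost:
            self.role = "isolated"       # likely partitioned: stay passive
        return self.role

# Simulate: no heartbeats for a while, but the reference point still answers.
node = BackupController(reachable_reference_points=lambda: True)
node.last_heartbeat -= 1.0
print(node.evaluate())                   # -> "primary"
```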