Refine
Has Fulltext
- yes (14)
Is part of the Bibliography
- yes (14)
Document Type
- Doctoral Thesis (14)
Language
- English (14)
Keywords
- Cloud Computing (5)
- Benchmarking (3)
- Leistungsbewertung (3)
- Auto-Scaling (2)
- Energieeffizienz (2)
- Energy Efficiency (2)
- Forecasting (2)
- Metrics (2)
- Modellierung (2)
- Prognose (2)
Institute
Today’s cloud data centers consume an enormous amount of energy, and energy consumption will rise in the future. An estimate from 2012 found that data centers consume about 30 billion watts of power, resulting in about 263TWh of energy usage per year. The energy consumption will rise to 1929TWh until 2030. This projected rise in energy demand is fueled by a growing number of services deployed in the cloud. 50% of enterprise workloads have been migrated to the cloud in the last decade so far. Additionally, an increasing number of devices are using the cloud to provide functionalities and enable data centers to grow. Estimates say more than 75 billion IoT devices will be in use by 2025.
The growing energy demand also increases the amount of CO2 emissions. Assuming a CO2-intensity of 200g CO2 per kWh will get us close to 227 billion tons of CO2. This emission is more than the emissions of all energy-producing power plants in Germany in 2020.
However, data centers consume energy because they respond to service requests that are fulfilled through computing resources. Hence, it is not the users and devices that consume the energy in the data center but the software that controls the hardware. While the hardware is physically consuming energy, it is not always responsible for wasting energy. The software itself plays a vital role in reducing the energy consumption and CO2 emissions of data centers. The scenario of our thesis is, therefore, focused on software development.
Nevertheless, we must first show developers that software contributes to energy consumption by providing evidence of its influence. The second step is to provide methods to assess an application’s power consumption during different phases of the development process and to allow modern DevOps and agile development methods. We, therefore, need to have an automatic selection of system-level energy-consumption models that can accommodate rapid changes in the source code and application-level models allowing developers to locate power-consuming software parts for constant improvements. Afterward, we need emulation to assess the energy efficiency before the actual deployment.
In today's world, circumstances, processes, and requirements for systems in general-in this thesis a special focus is given to the context of Cyber-Physical Systems (CPS)-are becoming increasingly complex and dynamic.
In order to operate properly in such dynamic environments, systems must adapt to dynamic changes, which has led to the research area of Self-Adaptive Systems (SAS).
These systems can deal with changes in their environment and the system itself.
In our daily lives, we come into contact with many different self-adaptive systems that are designed to support and improve our way of life.
In this work we focus on the two domains Intelligent Transportation Systems (ITS) and logistics as both domains provide complex and adaptable use cases to prototypical apply the contributions of this thesis.
However, the contributions are not limited to these areas and can be generalized also to other domains such as the general area of CPS and Internet of Things including smart grids or even intelligent computer networks.
In ITS, real-time traffic control is an example adaptive system that monitors the environment, analyzes observations, and plans and executes adaptation actions.
Another example is platooning, which is the ability of vehicles to drive with close inter-vehicle distances.
This technology enables an increase in road throughput and safety, which directly addresses the increased infrastructure needs due to increased traffic on the roads.
In logistics, the Vehicle Routing Problem (VRP) deals with the planning of road freight transport tours.
To cope with the ever-increasing transport volume due to the rise of just-in-time production and online shopping, efficient and correct route planning for transports is important.
Further, warehouses play a central role in any company's supply chain and contribute to the logistical success.
The processes of storage assignment and order picking are the two main tasks in mezzanine warehouses highly affected by a dynamic environment.
Usually, optimization algorithms are applied to find solutions in reasonable computation time.
SASes can help address these dynamics by allowing systems to deal with changing demands and constraints.
For the application of SASes in the two areas ITS and logistics, the definition of adaptation planning strategies is the key success factor.
A wide range of adaptation planning strategies for different domains can be found in the literature, and the operator must select the most promising strategy for the problem at hand.
However, the No-Free-Lunch theorem states that the performance of one strategy is not necessarily transferable to other problems.
Accordingly, the algorithm selection problem, first defined in 1976, aims to find the best performing algorithm for the current problem.
Since then, this problem has been explored more and more, and the machine learning community, for example, considers it a learning problem.
The ideas surrounding the algorithm selection problem have been applied in various use cases, but little research has been done to generalize the approaches.
Moreover, especially in the field of SASes, the selection of the most appropriate strategy depends on the current situation of the system.
Techniques for identifying the situation of a system can be found in the literature, such as the use of rules or clustering techniques.
This knowledge can then be used to improve the algorithm selection, or in the scope of this thesis, to improve the selection of adaptation planning strategies.
In addition, knowledge about the current situation and the performance of strategies in similar previously observed situations provides another opportunity for improvements.
This ongoing learning and reasoning about the system and its environment is found in the research area Self-Aware Computing (SeAC).
In this thesis, we explore common characteristics of adaptation planning strategies in the domain of ITS and logistics presenting a self-aware optimization framework for adaptation planning strategies.
We consider platooning coordination strategies from ITS and optimization techniques from logistics as adaptation planning strategies that can be exchanged during operation to better reflect the current situation.
Further, we propose to integrate fairness and uncertainty handling mechanisms directly into the adaptation planning strategies.
We then examine the complex structure of the logistics use cases VRP and mezzanine warehouses and identify their systems-of-systems structure.
We propose a two-stage approach for vertical or nested systems and propose to consider the impact of intertwining horizontal or coexisting systems.
More specifically, we summarize the six main contributions of this thesis as follows:
First, we analyze specific characteristics of adaptation planning strategies with a particular focus on ITS and logistics.
We use platooning and route planning in highly dynamic environments as representatives of ITS and we use the rich Vehicle Routing Problem (rVRP) and mezzanine warehouses as representatives of the logistics domain.
Using these case studies, we derive the need for situation-aware optimization of adaptation planning strategies and argue that fairness is an important consideration when applying these strategies in ITS.
In logistics, we discuss that these complex systems can be considered as systems-of-systems and this structure affects each subsystem.
Hence, we argue that the consideration of these characteristics is a crucial factor for the success of the system.
Second, we design a self-aware optimization framework for adaptation planning strategies.
The optimization framework is abstracted into a third layer above the application and its adaptation planning system, which allows the concept to be applied to a diverse set of use cases.
Further, the Domain Data Model (DDM) used to configure the framework enables the operator to easily apply it by defining the available adaptation planning strategies, parameters to be optimized, and performance measures.
The framework consists of four components: (i) Coordination, (ii) Situation Detection, (iii) Strategy Selection, and (iv) Parameter Optimization.
While the coordination component receives observations and triggers the other components, the situation detection applies rules or clustering techniques to identify the current situation.
The strategy selection uses this knowledge to select the most promising strategy for the current situation, and the parameter optimization applies optimization algorithms to tune the parameters of the strategy.
Moreover, we apply the concepts of the SeAC domain and integrate learning and reasoning processes to enable ongoing advancement of the framework.
We evaluate our framework using the platooning use case and consider platooning coordination strategies as the adaptation planning strategies to be selected and optimized.
Our evaluation shows that the framework is able to select the most appropriate adaptation strategy and learn the situational behavior of the system.
Third, we argue that fairness aspects, previously identified as an important characteristic of adaptation planning strategies, are best addressed directly as part of the strategies.
Hence, focusing on platooning as an example use case, we propose a set of fairness mechanisms to balance positive and negative effects of platooning among all participants in a platoon.
We design six vehicle sequence rotation mechanisms that continuously change the leader position among all participants, as this is the position with the least positive effects.
We analyze these strategies on roads of different sizes and with different traffic volumes, and show that these mechanisms should also be chosen wisely.
Fourth, we address the uncertainty characteristic of adaptation planning strategies and propose a methodology to account for uncertainty and also address it directly as part of the adaptation planning strategies.
We address the use case of fueling planning along a route associated with highly dynamic fuel prices and develop six utility functions that account for different aspects of route planning.
Further, we incorporate uncertainty measures for dynamic fuel prices by adding penalties for longer travel times or greater distance to the next gas station.
Through this approach, we are able to reduce the uncertainty at planning time and obtain a more robust route planning.
Fifth, we analyze optimization of nested systems-of-systems for the use case rVRP.
Before proposing an approach to deal with the complex structure of the problem, we analyze important constraints and objectives that need to be considered when formulating a real-world rVRP.
Then, we propose a two-stage workflow to optimize both systems individually, flexibly, and interchangeably.
We apply Genetic Algorithms and Ant Colony Optimization (ACO) to both nested systems and compare the performance of our workflow with state-of-the-art optimization algorithms for this use case.
In our evaluation, we show that the proposed two-stage workflow is able to handle the complex structure of the problem and consider all real-world constraints and objectives.
Finally, we study coexisting systems-of-systems by optimizing typical processes in mezzanine warehouses.
We first define which ergonomic and economic constraints and objectives must be considered when addressing a real-world problem.
Then, we analyze the interrelatedness of the storage assignment and order picking problems; we identify opportunities to design optimization approaches that optimize all objectives and aim for a good overall system performance, taking into account the interdependence of both systems.
We use the NSGA-II for storage assignment and Ant Colony Optimization (ACO) for order picking and adapt them to the specific requirements of horizontal systems-of-systems.
In our evaluation, we compare our approaches to state-of-the-art approaches in mezzanine warehouses and show that our proposed approaches increase the system performance.
Our proposed approaches provide important contributions to both academic research and practical applications.
To the best of our knowledge, we are the first to design a self-aware optimization framework for adaptation planning strategies that integrates situation-awareness, algorithm selection, parameter tuning, as well as learning and reasoning.
Our evaluation of platooning coordination shows promising results for the application of the framework.
Moreover, our proposed strategies to compensate for negative effects of platooning represent an important milestone, which could lead to higher acceptance of this technology in society and support its future adoption in the real world.
The proposed methodology and utility functions that address uncertainty are an important step to improving the capabilities of SAS in an increasingly turbulent environment.
Similarly, our contributions to systems-of-systems optimization are major contributions to the state of logistics and systems-of-systems research.
Finally, we select real-world use cases for the application of our approaches and cooperate with industrial partners, which highlights the practical relevance of our contributions.
The reduction of manual effort and required expert knowledge in our self-aware optimization framework is a milestone in bridging the gap between academia and practice.
One of our partners integrated the two-stage approach to tackling the rVRP into its software system, improving both time to solution and solution quality.
In conclusion, the contributions of this thesis have spawned several research projects such as a long-term industrial project on optimizing tours and routes in parcel delivery funded by Bayerisches Verbundforschungsprogramm (BayVFP) – Digitalisierung and further collaborations, opening up many promising avenues for future research.
The importance of proactive and timely prediction of critical events is steadily increasing, whether in the manufacturing industry or in private life. In the past, machines in the manufacturing industry were often maintained based on a regular schedule or threshold violations, which is no longer competitive as it causes unnecessary costs and downtime. In contrast, the predictions of critical events in everyday life are often much more concealed and hardly noticeable to the private individual, unless the critical event occurs. For instance, our electricity provider has to ensure that we, as end users, are always supplied with sufficient electricity, or our favorite streaming service has to guarantee that we can watch our favorite series without interruptions. For this purpose, they have to constantly analyze what the current situation is, how it will develop in the near future, and how they have to react in order to cope with future conditions without causing power outages or video stalling.
In order to analyze the performance of a system, monitoring mechanisms are often integrated to observe characteristics that describe the workload and the state of the system and its environment. Reactive systems typically employ thresholds, utility functions, or models to determine the current state of the system. However, such reactive systems cannot proactively estimate future events, but only as they occur. In the case of critical events, reactive determination of the current system state is futile, whereas a proactive system could have predicted this event in advance and enabled timely countermeasures. To achieve proactivity, the system requires estimates of future system states. Given the gap between design time and runtime, it is typically not possible to use expert knowledge to a priori model all situations a system might encounter at runtime. Therefore, prediction methods must be integrated into the system. Depending on the available monitoring data and the complexity of the prediction task, either time series forecasting in combination with thresholding or more sophisticated machine and deep learning models have to be trained.
Although numerous forecasting methods have been proposed in the literature, these methods have their advantages and disadvantages depending on the characteristics of the time series under consideration. Therefore, expert knowledge is required to decide which forecasting method to choose. However, since the time series observed at runtime cannot be known at design time, such expert knowledge cannot be implemented in the system. In addition to selecting an appropriate forecasting method, several time series preprocessing steps are required to achieve satisfactory forecasting accuracy. In the literature, this preprocessing is often done manually, which is not practical for autonomous computing systems, such as Self-Aware Computing Systems. Several approaches have also been presented in the literature for predicting critical events based on multivariate monitoring data using machine and deep learning. However, these approaches are typically highly domain-specific, such as financial failures, bearing failures, or product failures. Therefore, they require in-depth expert knowledge. For this reason, these approaches cannot be fully automated and are not transferable to other use cases. Thus, the literature lacks generalizable end-to-end workflows for modeling, detecting, and predicting failures that require only little expert knowledge.
To overcome these shortcomings, this thesis presents a system model for meta-self-aware prediction of critical events based on the LRA-M loop of Self-Aware Computing Systems. Building upon this system model, this thesis provides six further contributions to critical event prediction. While the first two contributions address critical event prediction based on univariate data via time series forecasting, the three subsequent contributions address critical event prediction for multivariate monitoring data using machine and deep learning algorithms. Finally, the last contribution addresses the update procedure of the system model. Specifically, the seven main contributions of this thesis can be summarized as follows:
First, we present a system model for meta self-aware prediction of critical events. To handle both univariate and multivariate monitoring data, it offers univariate time series forecasting for use cases where a single observed variable is representative of the state of the system, and machine learning algorithms combined with various preprocessing techniques for use cases where a large number of variables are observed to characterize the system’s state. However, the two different modeling alternatives are not disjoint, as univariate time series forecasts can also be included to estimate future monitoring data as additional input to the machine learning models. Finally, a feedback loop is incorporated to monitor the achieved prediction quality and trigger model updates.
We propose a novel hybrid time series forecasting method for univariate, seasonal time series, called Telescope. To this end, Telescope automatically preprocesses the time series, performs a kind of divide-and-conquer technique to split the time series into multiple components, and derives additional categorical information. It then forecasts the components and categorical information separately using a specific state-of-the-art method for each component. Finally, Telescope recombines the individual predictions. As Telescope performs both preprocessing and forecasting automatically, it represents a complete end-to-end approach to univariate seasonal time series forecasting. Experimental results show that Telescope achieves enhanced forecast accuracy, more reliable forecasts, and a substantial speedup. Furthermore, we apply Telescope to the scenario of predicting critical events for virtual machine auto-scaling. Here, results show that Telescope considerably reduces the average response time and significantly reduces the number of service level objective violations.
For the automatic selection of a suitable forecasting method, we introduce two frameworks for recommending forecasting methods. The first framework extracts various time series characteristics to learn the relationship between them and forecast accuracy. In contrast, the other framework divides the historical observations into internal training and validation parts to estimate the most appropriate forecasting method. Moreover, this framework also includes time series preprocessing steps. Comparisons between the proposed forecasting method recommendation frameworks and the individual state-of-the-art forecasting methods and the state-of-the-art forecasting method recommendation approach show that the proposed frameworks considerably improve the forecast accuracy.
With regard to multivariate monitoring data, we first present an end-to-end workflow to detect critical events in technical systems in the form of anomalous machine states. The end-to-end design includes raw data processing, phase segmentation, data resampling, feature extraction, and machine tool anomaly detection. In addition, the workflow does not rely on profound domain knowledge or specific monitoring variables, but merely assumes standard machine monitoring data. We evaluate the end-to-end workflow using data from a real CNC machine. The results indicate that conventional frequency analysis does not detect the critical machine conditions well, while our workflow detects the critical events very well with an F1-score of almost 91%.
To predict critical events rather than merely detecting them, we compare different modeling alternatives for critical event prediction in the use case of time-to-failure prediction of hard disk drives. Given that failure records are typically significantly less frequent than instances representing the normal state, we employ different oversampling strategies. Next, we compare the prediction quality of binary class modeling with downscaled multi-class modeling. Furthermore, we integrate univariate time series forecasting into the feature generation process to estimate future monitoring data. Finally, we model the time-to-failure using not only classification models but also regression models. The results suggest that multi-class modeling provides the overall best prediction quality with respect to practical requirements. In addition, we prove that forecasting the features of the prediction model significantly improves the critical event prediction quality.
We propose an end-to-end workflow for predicting critical events of industrial machines. Again, this approach does not rely on expert knowledge except for the definition of monitoring data, and therefore represents a generalizable workflow for predicting critical events of industrial machines. The workflow includes feature extraction, feature handling, target class mapping, and model learning with integrated hyperparameter tuning via a grid-search technique. Drawing on the result of the previous contribution, the workflow models the time-to-failure prediction in terms of multiple classes, where we compare different labeling strategies for multi-class classification. The evaluation using real-world production data of an industrial press demonstrates that the workflow is capable of predicting six different time-to-failure windows with a macro F1-score of 90%. When scaling the time-to-failure classes down to a binary prediction of critical events, the F1-score increases to above 98%.
Finally, we present four update triggers to assess when critical event prediction models should be re-trained during on-line application. Such re-training is required, for instance, due to concept drift. The update triggers introduced in this thesis take into account the elapsed time since the last update, the prediction quality achieved on the current test data, and the prediction quality achieved on the preceding test data. We compare the different update strategies with each other and with the static baseline model. The results demonstrate the necessity of model updates during on-line application and suggest that the update triggers that consider both the prediction quality of the current and preceding test data achieve the best trade-off between prediction quality and number of updates required.
We are convinced that the contributions of this thesis constitute significant impulses for the academic research community as well as for practitioners. First of all, to the best of our knowledge, we are the first to propose a fully automated, end-to-end, hybrid, component-based forecasting method for seasonal time series that also includes time series preprocessing. Due to the combination of reliably high forecast accuracy and reliably low time-to-result, it offers many new opportunities in applications requiring accurate forecasts within a fixed time period in order to take timely countermeasures. In addition, the promising results of the forecasting method recommendation systems provide new opportunities to enhance forecasting performance for all types of time series, not just seasonal ones. Furthermore, we are the first to expose the deficiencies of the prior state-of-the-art forecasting method recommendation system.
Concerning the contributions to critical event prediction based on multivariate monitoring data, we have already collaborated closely with industrial partners, which supports the practical relevance of the contributions of this thesis. The automated end-to-end design of the proposed workflows that do not demand profound domain or expert knowledge represents a milestone in bridging the gap between academic theory and industrial application. Finally, the workflow for predicting critical events in industrial machines is currently being operationalized in a real production system, underscoring the practical impact of this thesis.
Enterprise applications in virtualized data centers are often subject to time-varying workloads, i.e., the load intensity and request mix change over time, due to seasonal patterns and trends, or unpredictable bursts in user requests. Varying workloads result in frequently changing resource demands to the underlying hardware infrastructure. Virtualization technologies enable sharing and on-demand allocation of hardware resources between multiple applications. In this context, the resource allocations to virtualized applications should be continuously adapted in an elastic fashion, so that "at each point in time the available resources match the current demand as closely as possible" (Herbst el al., 2013). Autonomic approaches to resource management promise significant increases in resource efficiency while avoiding violations of performance and availability requirements during peak workloads.
Traditional approaches for autonomic resource management use threshold-based rules (e.g., Amazon EC2) that execute pre-defined reconfiguration actions when a metric reaches a certain threshold (e.g., high resource utilization or load imbalance). However, many business-critical applications are subject to Service-Level-Objectives defined on an application performance metric (e.g., response time or throughput). To determine thresholds so that the end-to-end application SLO is fulfilled poses a major challenge due to the complex relationship between the resource allocation to an application and the application performance. Furthermore, threshold-based approaches are inherently prone to an oscillating behavior resulting in unnecessary reconfigurations.
In order to overcome the deficiencies of threshold-based
approaches and enable a fully automated approach to dynamically control the resource allocations of virtualized applications, model-based approaches are required that can predict the impact of a reconfiguration on the application performance in advance. However, existing model-based approaches are severely limited in their learning capabilities. They either require complete performance models of the application as input, or use a pre-identified model structure and only learn certain model parameters from empirical data at run-time. The former requires high manual efforts and deep system knowledge to create the performance models. The latter does not provide the flexibility to capture the specifics of complex and heterogeneous system architectures.
This thesis presents a self-aware approach to the resource management in virtualized data centers. In this context, self-aware means that it automatically learns performance models of the application and the virtualized infrastructure and reasons based on these models to autonomically adapt the resource allocations in accordance with given application SLOs. Learning a performance model requires the extraction of the model structure representing the system architecture as well as the estimation of model parameters, such as resource demands. The estimation of resource demands is a key challenge as they cannot be observed directly in most systems.
The major scientific contributions of this thesis are:
- A reference architecture for online model learning in virtualized systems. Our reference architecture is based on a set of model extraction agents. Each agent focuses on specific tasks to automatically create and update model skeletons capturing its local knowledge of the system and collaborates with other agents to extract the structural parts of a global performance model of the system. We define different agent roles in the reference architecture and propose a model-based collaboration mechanism for the agents. The agents may be bundled within virtual appliances and may be tailored to include knowledge about the software stack deployed in a specific virtual appliance.
- An online method for the statistical estimation of resource demands. For a given request processed by an application, the resource time consumed for a specified resource within the system (e.g., CPU or I/O device), referred to as resource demand, is the total average time the resource is busy processing the request. A request could be any unit of work (e.g., web page request, database transaction, batch job) processed by the system. We provide a systematization of existing statistical approaches to resource demand estimation and conduct an extensive experimental comparison to evaluate the accuracy of these approaches. We propose a novel method to automatically select estimation approaches and demonstrate that it increases the robustness and accuracy of the estimated resource demands significantly.
- Model-based controllers for autonomic vertical scaling of virtualized applications. We design two controllers based on online model-based reasoning techniques in order to vertically scale applications at run-time in accordance with application SLOs. The controllers exploit the knowledge from the automatically extracted performance models when determining necessary reconfigurations. The first controller adds and removes virtual CPUs to an application depending on the current demand. It uses a layered performance model to also consider the physical resource contention when determining the required resources. The second controller adapts the resource allocations proactively to ensure the availability of the application during workload peaks and avoid reconfiguration during phases of high workload.
We demonstrate the applicability of our approach in current virtualized environments and show its effectiveness leading to significant increases in resource efficiency and improvements of the application performance and availability under time-varying workloads. The evaluation of our approach is based on two case studies representative of widely used enterprise applications in virtualized data centers. In our case studies, we were able to reduce the amount of required CPU resources by up to 23% and the number of reconfigurations by up to 95% compared to a rule-based approach while ensuring full compliance with application SLO. Furthermore, using workload forecasting techniques we were able to schedule expensive reconfigurations (e.g., changes to the memory size) during phases of load load and thus were able to reduce their impact on application availability by over 80% while significantly improving application performance compared to a reactive controller. The methods and techniques for resource demand estimation and vertical application scaling were developed and evaluated in close collaboration with VMware and Google.
A key functionality of cloud systems are automated resource management mechanisms at the infrastructure level. As part of this, elastic scaling of allocated resources is realized by so-called auto-scalers that are supposed to match the current demand in a way that the performance remains stable while resources are efficiently used.
The process of rating cloud infrastructure offerings in terms of the quality of their achieved elastic scaling remains undefined. Clear guidance for the selection and configuration of an auto-scaler for a given context is not available. Thus, existing operating solutions are optimized in a highly application specific way and usually kept undisclosed.
The common state of practice is the use of simplistic threshold-based approaches. Due to their reactive nature they incur performance degradation during the minutes of provisioning delays. In the literature, a high-number of auto-scalers has been proposed trying to overcome the limitations of reactive mechanisms by employing proactive prediction methods.
In this thesis, we identify potentials in automated cloud system resource management and its evaluation methodology. Specifically, we make the following contributions:
We propose a descriptive load profile modeling framework together with automated model extraction from recorded traces to enable reproducible workload generation with realistic load intensity variations. The proposed Descartes Load Intensity Model (DLIM) with its Limbo framework provides key functionality to stress and benchmark resource management approaches in a representative and fair manner.
We propose a set of intuitive metrics for quantifying timing, stability and accuracy aspects of elasticity. Based on these metrics, we propose a novel approach for benchmarking the elasticity of Infrastructure-as-a-Service (IaaS) cloud platforms independent of the performance exhibited by the provisioned underlying resources.
We tackle the challenge of reducing the risk of relying on a single proactive auto-scaler by proposing a new self-aware auto-scaling mechanism, called Chameleon, combining multiple different proactive methods coupled with a reactive fallback mechanism.
Chameleon employs on-demand, automated time series-based forecasting methods to predict the arriving load intensity in combination with run-time service demand estimation techniques to calculate the required resource consumption per work unit without the need for a detailed application instrumentation. It can also leverage application knowledge by solving product-form queueing networks used to derive optimized scaling actions. The Chameleon approach is first in resolving conflicts between reactive and proactive scaling decisions in an intelligent way.
We are confident that the contributions of this thesis will have a long-term impact on the way cloud resource management approaches are assessed. While this could result in an improved quality of autonomic management algorithms, we see and discuss arising challenges for future research in cloud resource management and its assessment methods: The adoption of containerization on top of virtual machine instances introduces another level of indirection. As a result, the nesting of virtual resources increases resource fragmentation and causes unreliable provisioning delays. Furthermore, virtualized compute resources tend to become more and more inhomogeneous associated with various priorities and trade-offs. Due to DevOps practices, cloud hosted service updates are released with a higher frequency which impacts the dynamics in user behavior.
Software frameworks for Realtime Interactive Systems (RIS), e.g., in the areas of Virtual, Augmented, and Mixed Reality (VR, AR, and MR) or computer games, facilitate a multitude of functionalities by coupling diverse software modules. In this context, no uniform methodology for coupling these modules does exist; instead various purpose-built solutions have been proposed. As a consequence, important software qualities, such as maintainability, reusability, and adaptability, are impeded.
Many modern systems provide additional support for the integration of Artificial Intelligence (AI) methods to create so called intelligent virtual environments. These methods exacerbate the above-mentioned problem of coupling software modules in the thus created Intelligent Realtime Interactive Systems (IRIS) even more. This, on the one hand, is due to the commonly applied specialized data structures and asynchronous execution schemes, and the requirement for high consistency regarding content-wise coupled but functionally decoupled forms of data representation on the other.
This work proposes an approach to decoupling software modules in IRIS, which is based on the abstraction of architecture elements using a semantic Knowledge Representation Layer (KRL). The layer facilitates decoupling the required modules, provides a means for ensuring interface compatibility and consistency, and in the end constitutes an interface for symbolic AI methods.
Virtualization allows the creation of virtual instances of physical devices, such as network and processing units. In a virtualized system, governed by a hypervisor, resources are shared among virtual machines (VMs). Virtualization has been receiving increasing interest as away to reduce costs through server consolidation and to enhance the flexibility of physical infrastructures. Although virtualization provides many benefits, it introduces new security challenges; that is, the introduction of a hypervisor introduces threats since hypervisors expose new attack surfaces.
Intrusion detection is a common cyber security mechanism whose task is to detect malicious activities in host and/or network environments. This enables timely reaction in order to stop an on-going attack, or to mitigate the impact of a security breach. The wide adoption of virtualization has resulted in the increasingly common practice of deploying conventional intrusion detection systems (IDSs), for example, hardware IDS appliances or common software-based IDSs, in designated VMs as virtual network functions (VNFs). In addition, the research and industrial communities have developed IDSs specifically designed to operate in virtualized environments (i.e., hypervisorbased IDSs), with components both inside the hypervisor and in a designated VM. The latter are becoming increasingly common with the growing proliferation of virtualized data centers and the adoption of the cloud computing paradigm, for which virtualization is as a key enabling technology.
To minimize the risk of security breaches, methods and techniques for evaluating IDSs in an accurate manner are essential. For instance, one may compare different IDSs in terms of their attack detection accuracy in order to identify and deploy the IDS that operates optimally in a given environment, thereby reducing the risks of a security breach. However, methods and techniques for realistic and accurate evaluation of the attack detection accuracy of IDSs in virtualized environments (i.e., IDSs deployed as VNFs or hypervisor-based IDSs) are lacking. That is, workloads that exercise the sensors of an evaluated IDS and contain attacks targeting hypervisors are needed. Attacks targeting hypervisors are of high severity since they may result in, for example, altering the hypervisors’s memory and thus enabling the execution of malicious code with hypervisor privileges. In addition, there are no metrics and measurement methodologies
for accurately quantifying the attack detection accuracy of IDSs in virtualized environments with elastic resource provisioning (i.e., on-demand allocation or deallocation of virtualized hardware resources to VMs). Modern hypervisors allow for hotplugging virtual CPUs and memory on the designated VM where the intrusion detection engine of hypervisor-based IDSs, as well as of IDSs deployed as VNFs, typically operates. Resource hotplugging may have a significant impact on the attack detection accuracy of an evaluated IDS, which is not taken into account by existing metrics for quantifying IDS attack detection accuracy. This may lead to inaccurate measurements, which, in turn, may result in the deployment of misconfigured or ill-performing IDSs, increasing
the risk of security breaches.
This thesis presents contributions that span the standard components of any system
evaluation scenario: workloads, metrics, and measurement methodologies. The scientific contributions of this thesis are:
A comprehensive systematization of the common practices and the state-of-theart on IDS evaluation. This includes: (i) a definition of an IDS evaluation design space allowing to put existing practical and theoretical work into a common context in a systematic manner; (ii) an overview of common practices in IDS evaluation reviewing evaluation approaches and methods related to each part of the design space; (iii) and a set of case studies demonstrating how different IDS evaluation approaches are applied in practice. Given the significant amount of existing practical and theoretical work related to IDS evaluation, the presented systematization is beneficial for improving the general understanding of the topic by providing an overview of the current state of the field. In addition, it is beneficial for identifying and contrasting advantages and disadvantages of different IDS evaluation methods and practices, while also helping to identify specific requirements and best practices for evaluating current and future IDSs.
An in-depth analysis of common vulnerabilities of modern hypervisors as well as a set of attack models capturing the activities of attackers triggering these vulnerabilities. The analysis includes 35 representative vulnerabilities of hypercall handlers (i.e., hypercall vulnerabilities). Hypercalls are software traps from a kernel of a VM to the hypervisor. The hypercall interface of hypervisors, among device drivers and VM exit events, is one of the attack surfaces that hypervisors expose. Triggering a hypercall vulnerability may lead to a crash of the hypervisor or to altering the hypervisor’s memory. We analyze the origins
of the considered hypercall vulnerabilities, demonstrate and analyze possible attacks that trigger them (i.e., hypercall attacks), develop hypercall attack models(i.e., systematized activities of attackers targeting the hypercall interface), and discuss future research directions focusing on approaches for securing hypercall interfaces.
A novel approach for evaluating IDSs enabling the generation of workloads that contain attacks targeting hypervisors, that is, hypercall attacks. We propose an approach for evaluating IDSs using attack injection (i.e., controlled execution of attacks during regular operation of the environment where an IDS under test is deployed). The injection of attacks is performed based on attack models that capture realistic attack scenarios. We use the hypercall attack models developed as part of this thesis for injecting hypercall attacks.
A novel metric and measurement methodology for quantifying the attack detection accuracy of IDSs in virtualized environments that feature elastic resource provisioning. We demonstrate how the elasticity of resource allocations in such environments may impact the IDS attack detection accuracy and show that using existing metrics in such environments may lead to practically challenging and inaccurate measurements. We also demonstrate the practical use of the metric we propose through a set of case studies, where we evaluate common conventional IDSs deployed as VNFs.
In summary, this thesis presents the first systematization of the state-of-the-art on IDS evaluation, considering workloads, metrics and measurement methodologies as integral parts of every IDS evaluation approach. In addition, we are the first to examine the hypercall attack surface of hypervisors in detail and to propose an approach using attack injection for evaluating IDSs in virtualized environments. Finally, this thesis presents the first metric and measurement methodology for quantifying the attack detection accuracy of IDSs in virtualized environments that feature elastic resource provisioning.
From a technical perspective, as part of the proposed approach for evaluating IDSsthis thesis presents hInjector, a tool for injecting hypercall attacks. We designed hInjector to enable the rigorous, representative, and practically feasible evaluation of IDSs using attack injection. We demonstrate the application and practical usefulness of hInjector, as well as of the proposed approach, by evaluating a representative hypervisor-based IDS designed to detect hypercall attacks. While we focus on evaluating the capabilities of IDSs to detect hypercall attacks, the proposed IDS evaluation approach can be generalized and applied in a broader context. For example, it may be directly used to also evaluate security mechanisms of hypervisors, such as hypercall access control (AC) mechanisms. It may also be applied to evaluate the capabilities
of IDSs to detect attacks involving operations that are functionally similar to hypercalls,
for example, the input/output control (ioctl) calls that the Kernel-based Virtual Machine (KVM) hypervisor supports. For IDSs in virtualized environments featuring elastic resource provisioning, our approach for injecting hypercall attacks can be applied in combination with the attack detection accuracy metric and measurement methodology we propose. Our approach for injecting hypercall attacks, and our metric and measurement methodology, can also be applied independently beyond the scenarios considered in this thesis. The wide spectrum of security mechanisms in virtualized environments whose evaluation can directly benefit from the contributions of this thesis (e.g., hypervisor-based IDSs, IDSs deployed as VNFs, and AC mechanisms) reflects the practical implication of the thesis.
Energy efficiency of computing systems has become an increasingly important issue over the last decades. In 2015, data centers were responsible for 2% of the world's greenhouse gas emissions, which is roughly the same as the amount produced by air travel.
In addition to these environmental concerns, power consumption of servers in data centers results in significant operating costs, which increase by at least 10% each year.
To address this challenge, the U.S. EPA and other government agencies are considering the use of novel measurement methods in order to label the energy efficiency of servers.
The energy efficiency and power consumption of a server is subject to a great number of factors, including, but not limited to, hardware, software stack, workload, and load level.
This huge number of influencing factors makes measuring and rating of energy efficiency challenging. It also makes it difficult to find an energy-efficient server for a specific use-case. Among others, server provisioners, operators, and regulators would profit from information on the servers in question and on the factors that affect those servers' power consumption and efficiency. However, we see a lack of measurement methods and metrics for energy efficiency of the systems under consideration.
Even assuming that a measurement methodology existed, making decisions based on its results would be challenging. Power prediction methods that make use of these results would aid in decision making. They would enable potential server customers to make better purchasing decisions and help operators predict the effects of potential reconfigurations.
Existing energy efficiency benchmarks cannot fully address these challenges, as they only measure single applications at limited sets of load levels. In addition, existing efficiency metrics are not helpful in this context, as they are usually a variation of the simple performance per power ratio, which is only applicable to single workloads at a single load level. Existing data center efficiency metrics, on the other hand, express the efficiency of the data center space and power infrastructure, not focusing on the efficiency of the servers themselves. Power prediction methods for not-yet-available systems that could make use of the results provided by a comprehensive power rating methodology are also lacking. Existing power prediction models for hardware designers have a very fine level of granularity and detail that would not be useful for data center operators.
This thesis presents a measurement and rating methodology for energy efficiency of servers and an energy efficiency metric to be applied to the results of this methodology. We also design workloads, load intensity and distribution models, and mechanisms that can be used for energy efficiency testing. Based on this, we present power prediction mechanisms and models that utilize our measurement methodology and its results for power prediction.
Specifically, the six major contributions of this thesis are:
We present a measurement methodology and metrics for energy efficiency rating of servers that use multiple, specifically chosen workloads at different load levels for a full system characterization.
We evaluate the methodology and metric with regard to their reproducibility, fairness, and relevance. We investigate the power and performance variations of test results and show fairness of the metric through a mathematical proof and a correlation analysis on a set of 385 servers. We evaluate the metric's relevance by showing the relationships that can be established between metric results and third-party applications.
We create models and extraction mechanisms for load profiles that vary over time, as well as load distribution mechanisms and policies. The models are designed to be used to define arbitrary dynamic load intensity profiles that can be leveraged for benchmarking purposes. The load distribution mechanisms place workloads on computing resources in a hierarchical manner.
Our load intensity models can be extracted in less than 0.2 seconds and our resulting models feature a median modeling error of 12.7% on average. In addition, our new load distribution strategy can save up to 10.7% of power consumption on a single server node.
We introduce an approach to create small-scale workloads that emulate the power consumption-relevant behavior of large-scale workloads by approximating their CPU performance counter profile, and we introduce TeaStore, a distributed, micro-service-based reference application. TeaStore can be used to evaluate power and performance model accuracy, elasticity of cloud auto-scalers, and the effectiveness of power saving mechanisms for distributed systems.
We show that we are capable of emulating the power consumption behavior of realistic workloads with a mean deviation less than 10% and down to 0.2 watts (1%). We demonstrate the use of TeaStore in the context of performance model extraction and cloud auto-scaling also showing that it may generate workloads with different effects on the power consumption of the system under consideration.
We present a method for automated selection of interpolation strategies for performance and power characterization. We also introduce a configuration approach for polynomial interpolation functions of varying degrees that improves prediction accuracy for system power consumption for a given system utilization.
We show that, in comparison to regression, our automated interpolation method selection and configuration approach improves modeling accuracy by 43.6% if additional reference data is available and by 31.4% if it is not.
We present an approach for explicit modeling of the impact a virtualized environment has on power consumption and a method to predict the power consumption of a software application. Both methods use results produced by our measurement methodology to predict the respective power consumption for servers that are otherwise not available to the person making the prediction.
Our methods are able to predict power consumption reliably for multiple hypervisor configurations and for the target application workloads. Application workload power prediction features a mean average absolute percentage error of 9.5%.
Finally, we propose an end-to-end modeling approach for predicting the power consumption of component placements at run-time. The model can also be used to predict the power consumption at load levels that have not yet been observed on the running system.
We show that we can predict the power consumption of two different distributed web applications with a mean absolute percentage error of 2.2%. In addition, we can predict the power consumption of a system at a previously unobserved load level and component distribution with an error of 1.2%.
The contributions of this thesis already show a significant impact in science and industry. The presented efficiency rating methodology, including its metric, have been adopted by the U.S. EPA in the latest version of the ENERGY STAR Computer Server program. They are also being considered by additional regulatory agencies, including the EU Commission and the China National Institute of Standardization. In addition, the methodology's implementation and the underlying methodology itself have already found use in several research publications.
Regarding future work, we see a need for new workloads targeting specialized server hardware. At the moment, we are witnessing a shift in execution hardware to specialized machine learning chips, general purpose GPU computing, FPGAs being embedded into compute servers, etc. To ensure that our measurement methodology remains relevant, workloads covering these areas are required. Similarly, power prediction models must be extended to cover these new scenarios.
Nowadays, data centers are becoming increasingly dynamic due to the common adoption of virtualization technologies. Systems can scale their capacity on demand by growing and shrinking their resources dynamically based on the current load. However, the complexity and performance of modern data centers is influenced not only by the software architecture, middleware, and computing resources, but also by network virtualization, network protocols, network services, and configuration. The field of network virtualization is not as mature as server virtualization and there are multiple competing approaches and technologies. Performance modeling and prediction techniques provide a powerful tool to analyze the performance of modern data centers. However, given the wide variety of network virtualization approaches, no common approach exists for modeling and evaluating the performance of virtualized networks.
The performance community has proposed multiple formalisms and models for evaluating the performance of infrastructures based on different network virtualization technologies. The existing performance models can be divided into two main categories: coarse-grained analytical models and highly-detailed simulation models. Analytical performance models are normally defined at a high level of abstraction and thus they abstract many details of the real network and therefore have limited predictive power. On the other hand, simulation models are normally focused on a selected networking technology and take into account many specific performance influencing factors, resulting in detailed models that are tightly bound to a given technology, infrastructure setup, or to a given protocol stack.
Existing models are inflexible, that means, they provide a single solution method without providing means for the user to influence the solution accuracy and solution overhead. To allow for flexibility in the performance prediction, the user is required to build multiple different performance models obtaining multiple performance predictions. Each performance prediction may then have different focus, different performance metrics, prediction accuracy, and solving time.
The goal of this thesis is to develop a modeling approach that does not require the user to have experience in any of the applied performance modeling formalisms. The approach offers the flexibility in the modeling and analysis by balancing between: (a) generic character and low overhead of coarse-grained analytical models, and (b) the more detailed simulation models with higher prediction accuracy.
The contributions of this thesis intersect with technologies and research areas, such as: software engineering, model-driven software development, domain-specific modeling, performance modeling and prediction, networking and data center networks, network virtualization, Software-Defined Networking (SDN), Network Function Virtualization (NFV). The main contributions of this thesis compose the Descartes Network Infrastructure (DNI) approach and include:
• Novel modeling abstractions for virtualized network infrastructures. This includes two meta-models that define modeling languages for modeling data center network performance. The DNI and miniDNI meta-models provide means for representing network infrastructures at two different abstraction levels. Regardless of which variant of the DNI meta-model is used, the modeling language provides generic modeling elements allowing to describe the majority of existing and future network technologies, while at the same time abstracting factors that have low influence on the overall performance. I focus on SDN and NFV as examples of modern virtualization technologies.
• Network deployment meta-model—an interface between DNI and other meta- models that allows to define mapping between DNI and other descriptive models. The integration with other domain-specific models allows capturing behaviors that are not reflected in the DNI model, for example, software bottlenecks, server virtualization, and middleware overheads.
• Flexible model solving with model transformations. The transformations enable solving a DNI model by transforming it into a predictive model. The model transformations vary in size and complexity depending on the amount of data abstracted in the transformation process and provided to the solver. In this thesis, I contribute six transformations that transform DNI models into various predictive models based on the following modeling formalisms: (a) OMNeT++ simulation, (b) Queueing Petri Nets (QPNs), (c) Layered Queueing Networks (LQNs). For each of these formalisms, multiple predictive models are generated (e.g., models with different level of detail): (a) two for OMNeT++, (b) two for QPNs, (c) two for LQNs. Some predictive models can be solved using multiple alternative solvers resulting in up to ten different automated solving methods for a single DNI model.
• A model extraction method that supports the modeler in the modeling process by automatically prefilling the DNI model with the network traffic data. The contributed traffic profile abstraction and optimization method provides a trade-off by balancing between the size and the level of detail of the extracted profiles.
• A method for selecting feasible solving methods for a DNI model. The method proposes a set of solvers based on trade-off analysis characterizing each transformation with respect to various parameters such as its specific limitations, expected prediction accuracy, expected run-time, required resources in terms of CPU and memory consumption, and scalability.
• An evaluation of the approach in the context of two realistic systems. I evaluate the approach with focus on such factors like: prediction of network capacity and interface throughput, applicability, flexibility in trading-off between prediction accuracy and solving time. Despite not focusing on the maximization of the prediction accuracy, I demonstrate that in the majority of cases, the prediction error is low—up to 20% for uncalibrated models and up to 10% for calibrated models depending on the solving technique.
In summary, this thesis presents the first approach to flexible run-time performance prediction in data center networks, including network based on SDN. It provides ability to flexibly balance between performance prediction accuracy and solving overhead. The approach provides the following key benefits:
• It is possible to predict the impact of changes in the data center network on the performance. The changes include: changes in network topology, hardware configuration, traffic load, and applications deployment.
• DNI can successfully model and predict the performance of multiple different of network infrastructures including proactive SDN scenarios.
• The prediction process is flexible, that is, it provides balance between the granularity of the predictive models and the solving time. The decreased prediction accuracy is usually rewarded with savings of the solving time and consumption of resources required for solving.
• The users are enabled to conduct performance analysis using multiple different prediction methods without requiring the expertise and experience in each of the modeling formalisms.
The components of the DNI approach can be also applied to scenarios that are not considered in this thesis. The approach is generalizable and applicable for the following examples: (a) networks outside of data centers may be analyzed with DNI as long as the background traffic profile is known; (b) uncalibrated DNI models may serve as a basis for design-time performance analysis; (c) the method for extracting and compacting of traffic profiles may be used for other, non-network workloads as well.
Over the last decades, cybersecurity has become an increasingly important issue. Between 2019 and 2011 alone, the losses from cyberattacks in the United States grew by 6217%. At the same time, attacks became not only more intensive but also more and more versatile and diverse. Cybersecurity has become everyone’s concern. Today, service providers require sophisticated and extensive security infrastructures comprising many security functions dedicated to various cyberattacks. Still, attacks become more violent to a level where infrastructures can no longer keep up. Simply scaling up is no longer sufficient. To address this challenge, in a whitepaper, the Cloud Security Alliance (CSA) proposed multiple work packages for security infrastructure, leveraging the possibilities of Software-defined Networking (SDN) and Network Function Virtualization (NFV).
Security functions require a more sophisticated modeling approach than regular network functions. Notably, the property to drop packets deemed malicious has a significant impact on Security Service Function Chains (SSFCs)—service chains consisting of multiple security functions to protect against multiple at- tack vectors. Under attack, the order of these chains influences the end-to-end system performance depending on the attack type. Unfortunately, it is hard to predict the attack composition at system design time. Thus, we make a case for dynamic attack-aware SSFC reordering. Also, we tackle the issues of the lack of integration between security functions and the surrounding network infrastructure, the insufficient use of short term CPU frequency boosting, and the lack of Intrusion Detection and Prevention Systems (IDPS) against database ransomware attacks.
Current works focus on characterizing the performance of security functions and their behavior under overload without considering the surrounding infrastructure. Other works aim at replacing security functions using network infrastructure features but do not consider integrating security functions within the network. Further publications deal with using SDN for security or how to deal with new vulnerabilities introduced through SDN. However, they do not take security function performance into account. NFV is a popular field for research dealing with frameworks, benchmarking methods, the combination with SDN, and implementing security functions as Virtualized Network
Functions (VNFs). Research in this area brought forth the concept of Service Function Chains (SFCs) that chain multiple network functions after one another. Nevertheless, they still do not consider the specifics of security functions. The mentioned CSA whitepaper proposes many valuable ideas but leaves their realization open to others.
This thesis presents solutions to increase the performance of single security functions using SDN, performance modeling, a framework for attack-aware SSFC reordering, a solution to make better use of CPU frequency boosting, and an IDPS against database ransomware.
Specifically, the primary contributions of this work are:
• We present approaches to dynamically bypass Intrusion Detection Systems (IDS) in order to increase their performance without reducing the security level. To this end, we develop and implement three SDN-based approaches (two dynamic and one static).
We evaluate the proposed approaches regarding security and performance and show that they significantly increase the performance com- pared to an inline IDS without significant security deficits. We show that using software switches can further increase the performance of the dynamic approaches up to a point where they can eliminate any throughput drawbacks when using the IDS.
• We design a DDoS Protection System (DPS) against TCP SYN flood at tacks in the form of a VNF that works inside an SDN-enabled network. This solution eliminates known scalability and performance drawbacks of existing solutions for this attack type.
Then, we evaluate this solution showing that it correctly handles the connection establishment and present solutions for an observed issue. Next, we evaluate the performance showing that our solution increases performance up to three times. Parallelization and parameter tuning yields another 76% performance boost. Based on these findings, we discuss optimal deployment strategies.
• We introduce the idea of attack-aware SSFC reordering and explain its impact in a theoretical scenario. Then, we discuss the required information to perform this process.
We validate our claim of the importance of the SSFC order by analyzing the behavior of single security functions and SSFCs. Based on the results, we conclude that there is a massive impact on the performance up to three orders of magnitude, and we find contradicting optimal orders
for different workloads. Thus, we demonstrate the need for dynamic reordering.
Last, we develop a model for SSFC regarding traffic composition and resource demands. We classify the traffic into multiple classes and model the effect of single security functions on the traffic and their generated resource demands as functions of the incoming network traffic. Based on our model, we propose three approaches to determine optimal orders for reordering.
• We implement a framework for attack-aware SSFC reordering based on this knowledge. The framework places all security functions inside an SDN-enabled network and reorders them using SDN flows.
Our evaluation shows that the framework can enforce all routes as desired. It correctly adapts to all attacks and returns to the original state after the attacks cease. We find possible security issues at the moment of reordering and present solutions to eliminate them.
• Next, we design and implement an approach to load balance servers while taking into account their ability to go into a state of Central Processing Unit (CPU) frequency boost. To this end, the approach collects temperature information from available hosts and places services on the host that can attain the boosted mode the longest.
We evaluate this approach and show its effectiveness. For high load scenarios, the approach increases the overall performance and the performance per watt. Even better results show up for low load workloads, where not only all performance metrics improve but also the temperatures and total power consumption decrease.
• Last, we design an IDPS protecting against database ransomware attacks that comprise multiple queries to attain their goal. Our solution models these attacks using a Colored Petri Net (CPN).
A proof-of-concept implementation shows that our approach is capable of detecting attacks without creating false positives for benign scenarios. Furthermore, our solution creates only a small performance impact.
Our contributions can help to improve the performance of security infrastructures. We see multiple application areas from data center operators over software and hardware developers to security and performance researchers. Most of the above-listed contributions found use in several research publications.
Regarding future work, we see the need to better integrate SDN-enabled security functions and SSFC reordering in data center networks. Future SSFC should discriminate between different traffic types, and security frameworks should support automatically learning models for security functions. We see the need to consider energy efficiency when regarding SSFCs and take CPU boosting technologies into account when designing performance models as well as placement, scaling, and deployment strategies. Last, for a faster adaptation against recent ransomware attacks, we propose machine-assisted learning for database IDPS signatures.