As digitalization progresses, our IT landscapes are, unsurprisingly, becoming increasingly interconnected: IT systems and services are linked via interfaces, which makes them ever more complex. A hard, existential dependency between two systems or services should be avoided wherever possible. If one IT system fails, the others must be able to cope with that failure and must not fail as well. If that does happen, the follow-up costs can be substantial.

How expensive is downtime?
But is such a system failure really that bad? That depends largely on where the failure happens. A survey conducted by ITIC in 2019 revealed that one hour of downtime of critical IT systems costs the surveyed organizations more than $100,000 on average. 86% of the organizations even stated that in their environment these costs exceed $300,000 per hour of downtime. Every organization can therefore calculate for itself at what cumulative downtime of critical systems the associated costs become painful; at $100,000 per hour, for example, five hours of cumulative downtime already add up to half a million dollars.
The dark side of the Force
One thing is certain: at some point, something goes wrong in every IT system, and it can fail.
Experience shows: In. Every. System.
Why is that? After all, we already take a lot of measures to make our IT systems and services secure and of high quality. And those measures are really expensive!
Everyone in IT is familiar with technical debt today. At some point, developers or integrators took a shortcut to solve a problem and never really cleaned up the resulting “trash.” Then they took the next shortcut. And the next one. Fortunately, technical debt can be measured, found with static code analysis, and paid off.
With dark debt, that is not possible. Dark debt consists of errors and weak points in our IT landscapes that cannot be discovered with static code analysis. It is often only discovered when a problem actually occurs. And then the damage is already done.
One of the few ways to make dark debt visible before a problem occurs is to vary the conditions under which our IT systems and services run and to observe how the systems and services behave, including how they interact with one another. If we discover weak points or errors in this way, we correct them or minimize their impact as far as possible. In this way, we make our systems more resilient to (unplanned) changes.
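What such an observation can look like in practice is sketched below: while a dependency of the system under test is deliberately degraded (for example, stopped or slowed down by hand), a small probe polls a health endpoint and records status and latency. This is a minimal sketch; the endpoint URL and the polling interval are assumptions for illustration, not part of any specific tool.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical example: observe how a service behaves while one of its
// dependencies is being degraded (e.g. stopped or slowed down manually).
// The URL below is an assumption, not an endpoint from the article.
public class ObserveUnderFault {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
        HttpRequest healthCheck = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/actuator/health")) // assumed endpoint
                .timeout(Duration.ofSeconds(2))
                .GET()
                .build();

        // Poll the health endpoint for one minute and log status and latency.
        // Run this while the dependency is degraded and compare with a normal run.
        for (int i = 0; i < 60; i++) {
            long start = System.nanoTime();
            try {
                HttpResponse<Void> response =
                        client.send(healthCheck, HttpResponse.BodyHandlers.discarding());
                long millis = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("status=%d latency=%dms%n", response.statusCode(), millis);
            } catch (Exception e) {
                System.out.printf("request failed: %s%n", e.getClass().getSimpleName());
            }
            Thread.sleep(1_000);
        }
    }
}
```

Comparing the two runs shows whether the system merely slows down, degrades gracefully, or fails outright.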
The magic word is: resilience!
Resilience is the ability of systems to react to change without losing their basic function. In the context of IT systems, such changes can be diverse:
- changes in:
  - the number of users
  - the amount of data
  - the computational effort
  - the available resources
  - the response-time behavior of linked systems (see the sketch after this list)
- functional changes
- bug fixes
- errors that have occurred
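To pick one change from the list, a suddenly slower linked system: the sketch below shows, under the assumption of a hypothetical remote call fetchFromLinkedSystem(), how a caller can bound its waiting time and fall back to a default value so that it keeps its basic function.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Sketch: protect a caller against a change in the response-time behavior
// of a linked system by bounding the wait and falling back to a default.
// fetchFromLinkedSystem() is a hypothetical stand-in for a remote call.
public class TimeoutWithFallback {

    static String fetchFromLinkedSystem() {
        try {
            Thread.sleep(Duration.ofSeconds(5).toMillis()); // simulate a suddenly slow dependency
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "fresh value";
    }

    public static void main(String[] args) {
        String result = CompletableFuture
                .supplyAsync(TimeoutWithFallback::fetchFromLinkedSystem)
                .completeOnTimeout("cached default value", 500, TimeUnit.MILLISECONDS)
                .join();
        // The caller stays responsive even though the linked system got slower.
        System.out.println(result);
    }
}
```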
A reduction in load, for example due to a falling number of users, should be something every IT system can handle without difficulty. It becomes harder when the IT system is configured in such a way that previously allocated hardware resources are released in this case and are then no longer available to the system.
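Within a single JVM, one way to make such a scale-down reversible is a worker pool that releases idle threads after a while but can grow again when load returns. The pool sizes and the idle timeout below are assumptions chosen for illustration.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch: a worker pool that gives resources back when load drops and
// grows again when load returns. Pool sizes and timeout are assumptions.
public class ElasticWorkerPool {

    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4,                              // core threads kept under normal load
                16,                             // upper bound under peak load
                30, TimeUnit.SECONDS,           // idle threads are released after 30 s
                new LinkedBlockingQueue<>(100));
        pool.allowCoreThreadTimeOut(true);      // even core threads may be released when idle

        // Simulate a burst of load, then silence: the pool shrinks back on its own.
        for (int i = 0; i < 50; i++) {
            int task = i;
            pool.execute(() -> System.out.println("handled request " + task));
        }
        Thread.sleep(35_000);
        System.out.println("threads still alive after idle period: " + pool.getPoolSize());
        pool.shutdown();
    }
}
```

The important property is that the released resources can be reclaimed automatically when the load comes back, instead of being gone for good.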
In order to make the entire IT landscape resilient to change, you must therefore find out how a critical IT system, or ideally all critical components of an organization's IT landscape, react to errors and other changes, and take appropriate measures based on the knowledge gained.
Chaos Engineering
That sounds like a life's work: on the one hand, an organization can own any number of critical IT systems and services; on the other hand, IT systems tend to change constantly as requirements and framework conditions are adapted.
One discipline that has become increasingly popular in the context of resilience in recent years is chaos engineering. The word chaos in chaos engineering causes unease among many decision-makers who lack a deep understanding of the topic. A proven way to address this is to talk about resilience engineering instead of chaos engineering. After all, the goal is not to spread chaos, but to make IT systems resilient.
The Chaos Engineering process is ultimately a recurring cycle.
1. Architecture overview
Create or update an abstraction of the IT landscape to be considered and the dependencies between the systems and services it contains.
Make assumptions about where errors and vulnerabilities are potentially contained in the IT landscape and what effects these errors/vulnerabilities may have.
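How such an abstraction can be captured so that it remains queryable is sketched below with a simple dependency map; all system names are invented examples, and a real landscape would of course hold far more detail.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch: capture the architecture abstraction as a simple dependency map
// so it can be queried. All system names here are invented examples.
public class DependencyMap {

    // "caller" -> systems it depends on
    static final Map<String, Set<String>> DEPENDS_ON = Map.of(
            "web-shop", Set.of("order-service", "search-service"),
            "order-service", Set.of("payment-service", "inventory-db"),
            "search-service", Set.of("search-index"));

    // Which systems are directly affected if the given system fails?
    static List<String> directlyAffectedBy(String failedSystem) {
        return DEPENDS_ON.entrySet().stream()
                .filter(entry -> entry.getValue().contains(failedSystem))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println("if payment-service fails, directly affected: "
                + directlyAffectedBy("payment-service"));
    }
}
```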
2. Chaos Backlog
Create a backlog from these assumptions. For each entry, try to estimate the probability that the error occurs or the vulnerability is triggered, as well as the effect this would have.
Use these estimates to prioritize the backlog entries.
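One possible, deliberately simple way to record and rank such backlog entries is sketched below; the 1-to-5 scales and the example entries are assumptions, not part of the method itself.

```java
import java.util.Comparator;
import java.util.List;

// Sketch: chaos backlog entries ranked by probability x impact.
// The 1-5 scales and the example entries are assumptions.
public class ChaosBacklog {

    record BacklogItem(String assumption, int probability, int impact) {
        int priority() {
            return probability * impact;
        }
    }

    public static void main(String[] args) {
        List<BacklogItem> backlog = List.of(
                new BacklogItem("order-service crashes when inventory-db is unreachable", 3, 5),
                new BacklogItem("search latency doubles under month-end load", 4, 2),
                new BacklogItem("payment-service retries amplify load during an outage", 2, 4));

        backlog.stream()
                .sorted(Comparator.comparingInt(BacklogItem::priority).reversed())
                .forEach(item -> System.out.printf("%2d  %s%n", item.priority(), item.assumption));
    }
}
```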
3. Chaos Experiment
3.a) For the highest-prioritized backlog entry, consider how an experiment in a controlled environment can prove or refute the assumed effects of the error/vulnerability.
3.b) Carry out the experiment in the controlled environment and compare the result with the original expectations.
3.c) If you are satisfied with the result, i.e. the system or service remains reasonably stable, you can regard the Chaos Backlog item as closed. The measured result may have an impact on other backlog items and make it necessary to adjust them.
3.d) If the result of the experiment is not satisfactory, adjust the examined system or service so that it reacts adequately to the experiment, e.g. without a crash, an OutOfMemoryError, etc.
3.e) Repeat the experiment to prove that the adjustment is effective.
This cycle can, and should, be repeated until the result of the experiment is satisfactory.
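Steps 3.a to 3.e can be condensed into a small skeleton like the one below. What "steady state", "inject fault" and "roll back" mean is entirely specific to the system under test; the interface and the runner are assumptions for illustration, not a fixed API.

```java
// Sketch: the experiment loop from steps 3.a-3.e as a minimal skeleton.
// The interface and the runner are assumptions, not a fixed API.
public class ExperimentRunner {

    interface ChaosExperiment {
        boolean steadyStateHolds();   // e.g. error rate and latency within expected bounds
        void injectFault();           // e.g. stop a dependency, add latency, fill the disk
        void rollback();              // restore the controlled environment
    }

    static boolean run(ChaosExperiment experiment) {
        if (!experiment.steadyStateHolds()) {
            throw new IllegalStateException("system not in steady state, aborting experiment");
        }
        try {
            experiment.injectFault();
            // Compare the observed behavior with the original expectation (3.b/3.c).
            return experiment.steadyStateHolds();
        } finally {
            experiment.rollback();
        }
    }

    public static void main(String[] args) {
        // A trivial stand-in experiment; a real one would act on the system under test.
        ChaosExperiment noOp = new ChaosExperiment() {
            public boolean steadyStateHolds() { return true; }
            public void injectFault() { System.out.println("fault injected"); }
            public void rollback() { System.out.println("environment restored"); }
        };
        System.out.println(run(noOp) ? "hypothesis confirmed" : "weak point found, adjust the system");
    }
}
```

If the steady-state check no longer holds after the fault injection, a weak point has been found (3.d) and the experiment is repeated once the system has been adjusted (3.e).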

The cycle then continues with step 3 above, this time with the next most highly prioritized backlog item.
Depending on the organization, it is worthwhile to repeat the assessment of the weak points in the architecture from step 1 above once a quarter, or whenever there is a significant change in the system landscape or even in an individual critical system or service, and to adjust it as necessary. This in turn affects the chaos backlog and the prioritization of its items.
Tips & Tricks
Even with chaos engineering, you can do things wrong, but you don't have to. Here is an overview of the most common mistakes:

Summary
Chaos engineering is a discipline that helps us to manage the complexity of our IT world and to make IT systems resilient to many unforeseen changes. With a good understanding of the dependencies between systems and services, a backlog that addresses potential problems, and the will and capacity to process this backlog, the resilience of IT systems and services can be significantly increased. The effort that an organization invests in chaos engineering and the learning curve that the organization must go through will definitely pay off through fewer failures and a better understanding of its own IT system landscape.