The system health

What is a healthy information system?

In one of my previous workplaces, I was faced with this exact question. I remember it well: it was during one of the more intense debugging sessions that we got a message from our principal architect. He was kind enough to prepare a survey form to make it easier for us developers to express our thoughts. It was quite exhaustive, so I put it away until I had more spare time (and focus).

It started out straightforward, with common questions that were only a minor chore to answer. But it quickly ramped up, to the point that I had to accept the simple truth - I don't know. It felt absurd, especially after many years of experience and countless hours spent getting systems out of a catatonic state caused by misconfiguration, network issues, high memory strain, or data imbalance. I could only give basic explanations that sounded too general even to me. But how do you tell if a system is actually healthy or not? I decided to find out, but the facts were not that apparent and were buried deep in the internet ("system health" is not a good search term to Google in this case) - it would have taken me too much time.

I just decided to skip this survey (luckily, it was optional) and leave the investigation for another time. Apparently, the other devs were feeling the same, as the number of responses was in the single-digit range. Later, we were presented with a good set of pointers that would reveal more of the aspects a healthy information system should have. This post is mostly inspired by the same facts we were presented with that day.

The human health

For the past 2 years, we have been bombarded by all sorts of health advice, most of it driven by the Covid pandemic. Let's set aside all of the controversies regarding this topic and just state the obvious - health is the most precious thing we can have at any given time. But how can you tell if you are healthy or not? Listening for "weird" signs your body is giving you? Measuring your body temperature? Making doctor's appointments? Self-diagnosing on WebMD :) ? If you feel healthy and you want to stay that way, what should you do? Should you visit the doctor? How regular should the checkups be? What preventive remedies should you take? What genetic "red flags" might you have? It's a simple question, and yet it doesn't look like it can be answered with an equally simple answer.

If you take into account how many dangers we face on a daily basis and how reckless we can sometimes be, it's easy to become a hypochondriac. Fortunately, the human body is gifted with a natural defense system - the immune system. It works "in the background", without us noticing, and it's pretty effective in protecting and healing our bodies. However, it has its limits, of course, and our duty is to help it and make it stronger (or at least try).

It's the same analogy I like to use when discussing information system health. How can we monitor our application? What to look for? How do we decide if it's healthy or not? What can we do to make it more resilient?

Now, I'm not a doctor and my health advice shouldn't be taken seriously - I will skew the reality just to prove (badly) my point.

The average healthy adult male likely runs 100 meters in between 15 and 20 seconds, depending on age, of course. What would you say if a high-performance athlete, for example Usain Bolt, started consistently running 15-second 100m sprints? Would you consider him healthy or sick - or something else? It's hard to comprehend if you just watch those runs on a TV screen, but in real life, to achieve a time of 9.58 seconds on a 100m run, Usain Bolt has to enter warp speed compared to the average Joe's 15-20 seconds. Yet everything comes with a price.

The image below neatly illustrates the simplified view on the main aspects of a healthy person.

The system health conundrum

The same running analogy is probably applicable to information systems. We should treat our systems as virtual entities trying to serve and fulfill our customers' needs. If a virtual entity gets sick, it will not be able to fulfill its purpose properly, or even at all. Discussing what qualities are important for healthy human beings, and then imagining what they might imply for our virtual friends, is probably the best way to tackle this issue. Unfortunately, this demands vigorous effort not only from the tech staff but also from product and business staff. More than once I have worked in an environment where health concerns looked completely different from different perspectives, introducing more obstacles and confusion than necessary. No man has ever had open-heart surgery while having his appendix removed and getting a haircut at the same time (or at least I hope so). That's why the initiatives should be prioritized and aligned between all the actors involved.

When you talk to the tech gurus, most of the time they are obsessed with performance and monitoring or, to be more precise: how many requests can our system take in a minute? How did we get to this question? What number is the best number? Who decided it? And more importantly, do we need it? Managers are concerned about velocity, and business people about the influx of new customers and the satisfaction of existing ones. More often than not, those numbers are not taken seriously or their priorities are completely scrambled. The result is probably optimization of less important parts of the system, monitoring of the wrong metrics, assumptions built upon sparse sets of data, and large amounts of money spent on tools that are not helping much.

The Crisis-driven care

Without a unifying idea of system health, we are not only unable to fix the issues that are currently roaming around, but we also can't tell whether any of our systems are at long-term risk. Often, our eyes are set only on the tools that we think can help us.

Tools are great, don't get me wrong, but without sensible interpretation they just spew data out. What is the exact limit of our system? What alerts should we trigger when we get to the yellow or red zone? What are the predictions for the future and what remedies should be applied? When is the "flu season"?

Not knowing the answers to these questions will make us relatively good at crisis management - we can tell when our systems and services are failing, and we regularly jump in to fix them. Most of this work happens in "panic mode" and the lessons learned are often too expensive.

Preventive measures

Monitoring health is necessary, but being good at applying preventive measures is where the magic happens. That means visualizing system health in a way that helps us prevent services from failing in the first place. Having the system constantly driven around in an ambulance is probably not anyone's goal, and it quickly becomes tedious, dangerous, and finally unsustainable. The system becomes tired, bruised, and weak until, finally, the plug has to be pulled. Preventing such grim scenarios in the early stages should be the main goal of well-aligned health improvement initiatives. But as always, it's easier said than done.


1. Healthy Body

Aerobic fitness

For humans, aerobic fitness determines our readiness to continue with the given activity as long as necessary.

Software solutions sometimes leak resources. The most commonly leaked resource is probably memory, but database transactions, network connections, and disk resources are also prone to leaking. No matter how high or low the resource quantities are, given enough time the leaks will bring the system to its knees. A healthy system does not leak resources but gracefully returns or releases the assets it has used. Unhealthy systems leak: unable to release resources, they crawl to a halt until exhaustion forces a restart.
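To make the "gracefully returning the used assets" part a bit more tangible, here is a minimal Python sketch. The ConnectionPool class is hypothetical (it only counts slots and hands out a placeholder object instead of a real connection); the point is just that the release happens in a finally block, so a failing request cannot leak its slot.

```python
from contextlib import contextmanager


class ConnectionPool:
    """Hypothetical fixed-size pool that only tracks how many slots are in use."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.in_use = 0

    @contextmanager
    def connection(self):
        if self.in_use >= self.size:
            raise RuntimeError("pool exhausted")
        self.in_use += 1
        try:
            yield object()  # stand-in for a real connection handle
        finally:
            # Runs on success and on error, so the slot is always returned.
            self.in_use -= 1


def handle_request(pool: ConnectionPool, fail: bool = False) -> str:
    """Borrow a connection for the duration of one request."""
    with pool.connection():
        if fail:
            raise ValueError("request failed")
        return "ok"
```

Watching a counter like in_use over time is one cheap way to spot a leak: in a healthy system it keeps returning to zero when the traffic stops.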

Is your system able to continue for as long as necessary? Or does it need to restart itself constantly in order to recover the resources it needs to continue?

Strength

Physical strength is the measure of a human's exertion of force on physical objects. Without the right amount of strength, daily activities would be challenging for any of us. Just imagine how much lifting, pushing, and pulling we do during just one day without thinking about it. But now and then something extraordinary happens - your car breaks down, for example - and you need to do extra hard work to push it to the side of the road. All the extra work and strength training you've done before will be instrumental in this activity.

Different systems require different levels of processing power, just like our daily activities can differ from someone else's. Some systems have roles that require more heavy lifting than others. How strong does your system need to be for the given job? It's important to know how many resources the system needs to be able to deal with extraordinary activities. Occasional benchmarking is beneficial here - we should find out the exact limits, the amount of work that our system can actually handle, and see how that stacks up against the reality we experience. Otherwise, you risk discovering the hard way what the limits really are. More often than not, it happens when you least expect it and it becomes the outage you cannot afford.

How much processing power does your system really need and use? Is your system idling and wasting precious resources? Are the requirements rising or falling? Can you identify the patterns? Does the heavy lifting vary during the day, week, or month? Can your system push a bit beyond the limits without crashing?

Weight

The ideal weight differs from person to person. While being slightly over or under the ideal value is probably fine, having a more prominent imbalance could pose a bigger risk to overall health.

The weight of a system is the amount of data it stores, maintains and takes responsibility for. It's not an exact number and can't be determined without taking into account the type of system it is. Once the different aspects of the system are identified, it becomes apparent that every system should have a target weight - the amount of data it is expected to be able to store, index, and surface to queries. Exceeding that value will make the system slow and unresponsive, while staying below it means the system is likely not productive.

How much data is your system handling? Is your system storing data that it does not need? Is this slowing it down? Is your system producing enough data?

Immune Response

As stated in the post's introduction the vital mechanism to fend off attacks is the immune system. Services fail, that's the hard truth, especially the services relying on the internet connection (remember AWS outages?).

Systems should know when they are unhealthy and, where possible, be able to compensate for it. Systems that don't produce enough diagnostic data will eventually fail, leaving us scratching our heads and struggling to figure out what exactly happened. If your system is over-reacting or paranoid, on the other hand, it will most often produce mountains of logs, irrelevant messages and cryptic stack traces. Too much or too little diagnostics are both equally problematic and solving the issues would look like finding a needle in a haystack - not fun.

If your system struggles with brute force attacks and requires manual intervention every time, it is not fit for life on the internet. On the other hand, if your system exposes ports to the public unnecessarily, how soon can you expect to find bots sniffing around? What should we do to be able to monitor the exposed ports before something bad happens?

Can you tell when your system fails? Is it able to heal itself? Or does it expect constant nursing? Can you distinguish between minor and major issues and threats? Does your system over-react or is it indifferent?

2. Healthy Mind

Stress Response

The ability to handle stress well proved to be crucial when it comes to good mental health. Being under ongoing stress, however, leads not only to impaired mental balance but also to high levels of physical stress which degrades every part of our bodies.

While an attack or an illness is one of the most common forms of stress, we all get put under pressure by many different things or situations. What's more important, we often tend to transfer our stress onto other people or the environment. Systems that have the ability to adapt to stress without passing that pressure onto other systems, or even onto real people, provide real reassurance in a crisis.

Is your system able to scale when it's under pressure? Does it cache frequently used data? Is it draining pressure from downstream systems? Do you stress-test your system? When was the last time you tested your system with twice its peak traffic?

Adaptability

The world we live in changes constantly and mostly quickly (remember the world prior to Covid), forcing us to keep adapting to a changing environment and landscape. The ability to adapt to change is a key part of our survival. Similarly, our business changes every day, and our systems have to show similar adaptability in order to survive and be efficient.

Software mostly does not age well. Tending to the software and rejuvenating it should be a constant process. Just try to remember what technologies were used before you started your professional IT career. Software practices evolve.

But older software systems tend to get set in their ways and struggle to adapt - as a result, most of them get retired. Larger systems, on the other hand, acquire their own momentum, which makes them very difficult to adjust, and they often completely miss the opportunity to change. Old systems become progressively less secure (new exploits are discovered), less useful, and less attractive to maintain - and in the end more expensive to run. By adopting new practices, systems become cheaper and safer to run and maintain - but also more attractive (incorporating state-of-the-art technologies). Companies often fail to see this in time and become locked into the old ways of doing things - staying in the comfort zone for too long.

How good is your organization at adapting to the changes? Can you introduce new things/technologies to your system quickly and easily? Do you tend to use fresh technologies and standards? Can you update your system and release it at a moment's notice when problems arise? Is it covered by automated regression tests? Is it of a manageable size for quality control purposes?  

Responsibility & Determination

Responsibility, according to Wikipedia, is a feeling of commitment or expectation to perform some action in general or if certain circumstances arise. Determination, on a similar note, is a positive emotional feeling that involves persevering towards a difficult goal in spite of obstacles. The ability and focus to protect and care for themselves and others are the traits of responsible people. Determined people know exactly what things they want to do and what skills to possess in order to reach the given goals.

Systems without clear responsibilities tend to become bloated as they are often treated as a catch-all bucket for unrelated functions. Such systems are generally really hard to understand and have a lot of side effects. They are complex and expensive to maintain, and during their lifetimes they probably introduce more problems than they actually solve.

Being determined means the system must either succeed in achieving its goals according to its responsibilities or let us know that it was unable to do so. Besides monitoring, the system might involve mechanisms such as retry policies, good fall-back mechanisms, circuit breakers, and many others in order to be resilient and prepared to serve its duty even during hard periods.
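As an illustration of "succeed or let us know", here is a rough Python sketch of a retry policy with exponential backoff and an optional fall-back. The function name, the default values and the injectable sleep parameter are my own assumptions for the example, not a reference to any particular library; a production system would likely add jitter, narrower exception handling and a circuit breaker on top.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    fallback: Optional[Callable[[], T]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `operation` up to `attempts` times, doubling the delay each time."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as error:  # a real system would catch narrower errors
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    if fallback is not None:
        # Degrade gracefully, e.g. serve a cached or default value.
        return fallback()
    # No fallback: fail loudly so the caller knows the goal was not met.
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

The important bit is the last line: when the system cannot fulfill its duty, it says so explicitly instead of swallowing the error.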

In conclusion, it's hard for any system (or person, for that matter) to be responsible and determined if its core purpose is not clear, firmly identified and strongly embraced.

What core problem is your system trying to solve? Could you identify and briefly describe its main purpose or the core domain it tries to tackle? Is the system solving a particular customer problem or is it providing specific in-house service? Can you tell if your system is having mixed responsibilities? If yes, how did it come to the point you had to mix the responsibilities? Is your system able to fulfill the goals when the up/down-stream services are failing? When it fails to achieve the goals, is there a way to know about it?

3. Healthy Habits

Regular Exercise

By moving, you are strengthening your muscles, which improves stability, balance, and coordination. Stretching helps maintain your muscle health as well. Regular movement and moderate exercising will improve a person's overall physical health but also mental health and wellbeing.

A system - or any part of a system - that does not get used on a regular basis will degrade, much like our bodies. The code gets flabby and underused, often failing when you try to use it. Keeping a system - or a feature - well exercised is the only way to know that it won't fail. Underused systems should be retired. Another strategy would be to cut the system down to size.

Are all parts of your system being used? If not, why are they still there? How would you approach cutting them down to fit the exact needs?

Diet

The food we eat directly affects our gut health (or the balance of good and bad bacteria) and influences the production of neurotransmitters. After all, we are what we eat. A poor diet can lead to ill health, lack of energy, loss of speed and a reduction in overall fitness.

Information systems consume data. Accepting large quantities of poor data is like overeating on ice cream and diet coke. A system that accepts large quantities of poor-quality data will become incomprehensible; on the other hand, a system that doesn't accept enough data is practically useless. Controlling the quality of data is mandatory for a healthy system. You should prepare your system to consume or accept only data of the expected quality and format. Not paying attention to this wastes resources on storing useless information and also raises the risk of corrupting the existing data. Rejecting data of poor quality or format is a common practice, as long as the system is able to notify you about it. Dieting practices should be tuned to the exact needs of the system.
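A small Python sketch of such a "diet" could look like the following. The record shape (an id string and a positive amount) is purely hypothetical; what matters is that bad records are rejected at the door together with a reason, so the system can notify someone instead of silently dropping or storing them.

```python
from typing import Any, Dict, List, Optional, Tuple

Record = Dict[str, Any]


def check_record(record: Record) -> Optional[str]:
    """Return None when the record is acceptable, else the reason for rejecting it."""
    if not isinstance(record.get("id"), str) or not record["id"]:
        return "missing or empty id"
    amount = record.get("amount")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        return "amount is not a number"
    if amount <= 0:
        return "amount must be positive"
    return None


def split_batch(records: List[Record]) -> Tuple[List[Record], List[Tuple[Record, str]]]:
    """Separate a batch into accepted records and (record, reason) rejections."""
    accepted: List[Record] = []
    rejected: List[Tuple[Record, str]] = []
    for record in records:
        reason = check_record(record)
        if reason is None:
            accepted.append(record)
        else:
            rejected.append((record, reason))
    return accepted, rejected
```

The rejected list is what you would log, count, or send back to the producer; a rising rejection rate is a health signal on its own.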

What data does your system accept? Can you control data quality through all the layers of your system? Can it discard or reject data and requests that it doesn't understand or shouldn't use? Does it accept a predictable amount of data over time?

Hygiene

Wash your hands! It sounds silly, but people still need to be reminded about such basic hygiene measures even in the 21st century. The Covid pandemic just amplified the importance of such a simple habit.

There are people who are really determined and spend all of their time trying to gain access to personal data through shady practices on the internet. Some of them are after money, some after personal data, while others are looking for compromising material. Losing your money due to poor practices that a service implements lands the company on the bad reputation list. How many leaks have you heard about in recent years? Even giants like Facebook or LinkedIn are prone to this, with Twitch as the latest example of a big data leak. Depending on the country, the data we are dealing with has to comply with some rules, GDPR to name one. Some of us have made careers dealing only with data hygiene, and our clients' data security depends on us. A system is only as strong and secure as its weakest link. It's never a good feeling when the company loses a couple of million dollars in a couple of minutes, no matter who's at fault.

Do you have proper authorization/authentication of your customers? What about internal users and tools? How do you handle private data? Are the passwords your system is storing safe? How about data and communication encryption? When did you last actually check the system for holes in data sanitization?
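On the password question specifically, the usual baseline is to never store the password itself, only a salted, deliberately slow hash of it. Below is a minimal sketch using PBKDF2 from the Python standard library. The function names and the iteration count are assumptions made for this example; treat it as an illustration of the idea and pick the algorithm and work factor according to your own security policy.

```python
import hashlib
import hmac
import os
from typing import Tuple

ITERATIONS = 600_000  # assumed work factor; tune it to your hardware and policy


def hash_password(password: str) -> Tuple[bytes, bytes]:
    """Return (salt, digest); store both, never the plain password."""
    salt = os.urandom(16)  # a fresh random salt per password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest and compare it in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, expected)
```

The per-password salt means two users with the same password end up with different digests, and the constant-time comparison avoids leaking information through timing.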

4. Healthy Community

Reliability

Having individuals with the ability to provide expected results without constant supervision and maintenance makes the whole team more productive. Such individuals have a tremendous ability to concentrate but also are aware of their surroundings and are able to efficiently delegate.

Each system must be reliable - it must react in the ways we expect it to. Having an unpredictable system brings a lot of confusion and stress to both customers and the development team alike. Constant supervision, upkeep, and other ways of reassurance get really expensive in the longer run. Understanding how reliable the system is is of vital importance to have a good product everyone can be proud of.

How much downtime has your system had in the past months? Can you pinpoint the issues by investigating the logs? What would be your prediction about the next downtime? Should it restart itself when it fails or do you need to act? How does its failure impact the rest of the systems?

Pace

Pace consistency is at the heart of a runner’s training and is important for endurance races like the marathon. Achieving pace consistency seems simple, but mechanically our bodies make constant and subtle adjustments. Having a good pace is the main ingredient in reaching goals effectively. But how can we reach a good pace and what do we do when fatigue strikes?

Modern society is made of impatient people. Attention spans are shortening. Holding the customer's attention is the highest priority of many systems, to the point that it becomes detrimental. Failing to gain customers' attention is serious, but it is even more important not to unsettle or unnerve them in the process of providing the services.

On a similar note, having a system that's hard to maintain could produce unsettled development teams. Slow development or release cycles will definitely make teams feel less productive, less effective, and therefore less motivated. Being too impatient to introduce features is the other side of the coin. It leads to hectic and unfocused product development and to teams whose attention span is too short to introduce positive changes. We love improvements, and measuring a system's speed can help us understand where we need to focus our efforts. Finding a consistent pace has the benefit of producing the expected output and quality, but also a high level of trust and reliability.

Can you tell when your system is running slowly or at an inconsistent pace? Can you identify the parts of your application that could be reworked in order to hold the customer's attention better? Do you know what scenarios in your application are too slow to complete? Do you measure the speed of your key operations in production? What are the numbers/medians? Do you know what part of the system is responsible for the largest customer churn?
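If "what are the numbers/medians?" sounds abstract, here is a tiny Python sketch of how such numbers could be collected for one operation. The helper names are made up for the example, and in production you would normally rely on your monitoring stack rather than timing calls by hand, but the idea is the same: look at the median for the typical pace and at a high percentile for the slow tail.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure(operation: Callable[[], object], runs: int = 50) -> List[float]:
    """Time `operation` `runs` times and return the durations in seconds."""
    durations: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Summarize durations; needs at least two samples."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    cut_points = statistics.quantiles(durations, n=20)
    return {
        "median": statistics.median(durations),
        "p95": cut_points[-1],
        "max": max(durations),
    }
```

A median that stays flat while the p95 creeps up is a typical sign of an inconsistent pace: most customers are fine, but some of them are waiting far too long.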

Independence

Independence is highly related to reliability. Any person who relies too heavily on others to do the daily tasks will struggle to cope without them. Such persons are highly dependent upon others and therefore probably not reliable in many aspects.

In the context of systems, unnecessary dependency is strongly frowned upon. We have all read the many articles about making our systems/code less dependent upon each other. Having a system with many dependencies decreases the robustness of the system in question - any issue in the dependency chain could yield cascade effects and failures. All organisms, organizations, and software are vulnerable to cascade effects. In software, when some dependency fails it can take out the next, and the chain effect will go on until the system breaks, pulling the business and the organization down with it. Modern cloud environments are especially prone to these issues, and it's important to know how independent - or not - your system is. Knowing the facts will move your efforts in the right direction so that you can take measures to compensate for failure, where possible.

What systems are essential for your system to run? How will your system behave when one of these dependencies fails? What dependencies are nice to have? What dependencies need to be available in (almost) real-time? Are there any dependencies that, when unavailable, will cause yours to fail even if you don't use them?

Sociability

Communication boils down to the exchange of information by speaking, writing, or using some other medium (a coding language, to name one). Communication is the key. It's important to be able to communicate well with others, whether in the team, in information systems, or in any kind of relationship really. The ability to communicate effectively through different means is what makes us unique in this part of the universe (at this moment, at least).

Our systems must be able to socialize and communicate effectively by using standard, predictable, protocols. Having an outdated protocol or a way to communicate is not in our best interest and can be deemed as anti-social. But, communication between systems is just one aspect of the whole sociability concept - the systems should be able to clearly communicate and socialize with both customers and development teams.

Nowadays, we use many tools for communication - Slack, Teams, Zoom, Telegram, WhatsApp, Signal, Skype, Twitter, Reddit, Instagram, LinkedIn, email, just to name a few. Communication is deeply ingrained in our beings, and technology has just provided new and convenient ways to communicate. How can we utilize these tools in order to make communication better? Beware, this is not just about sharing email messages or IMs. We, as developers, should also communicate through log messages, variable names, class names, and proper reports. Bringing proper protocols and standards to this part of communicating with each other could make the difference between good teams and excellent ones. Product people should communicate with tech through various documents and meetings but also through charts, diagrams, and numbers. And let's not forget about our customers - we don't just communicate with our customers through customer support, we actually communicate through our entire system. Providing a sleek user interface is just one way of communicating and we shouldn't stop there - clear messages, notifications, and errors, but also an intuitive user experience and visual cues, are some of the most important ways to communicate with customers without direct verbal or written communication. I'm not going to cover promotional and other marketing communication here, but it exists and is important for gaining new customers.

Does your system use a single communication standard and protocol? Do you utilize the latest and most secure ways of communication? How clear are your APIs? Do you have clear and consistent documentation? Do you think the code itself should be the way to communicate and document things? Do you rely on code comments? Do you utilize tools like linters and formatters? Are your coding standards well documented? What are your expectations when it comes to code quality? Are your log messages consistently formatted, well written, and well spelled? Are you logging at the right levels and is finding the proper message easy? How about the UI? Do you send coherent messages to customers? How do you communicate errors and warnings in the UI? Do you send promotional emails/notifications? Do you track customer unsubscription rates? How spammy is your system?

Patience

Patience is the ability or capacity to accept or tolerate delay, problems, or facing obstacles without becoming annoyed or anxious.

One of the biggest hurdles in the development process is to distinguish between urgent and important tasks:

Urgent tasks are mostly tasks that have an immediate deadline or a deadline that has already passed. Most of the time, tasks become urgent because they had to be accomplished but we kept deferring them. These tasks do not necessarily have a significant impact on your life; in fact, they may be very trivial or silly - but they have to be done. In many cases, such tasks seem trivial but bring disruption to the development process.

Important tasks, on the other hand, do not need to have a deadline looming over the person. They are important because of the impact that they can have on the system in question. Again, these need not be time-consuming or effort-intensive and may not require you to do it immediately.

Know the difference between urgent and important. Many times I've experienced situations where these two were treated as synonyms. Working in a team, we have to decide whether something is urgent, requiring an immediate response, or whether it can wait until a more convenient time. Distinguishing between those two concepts is what patience is all about, but we miss it so often.

What is your distinction between urgent and important tasks? Do you feel they are used improperly on some occasions? Does your system expect an instant response when it doesn't need it? Have you considered sending a message instead of making an API call? Can you deliver information from a cache? Can you flush that cache when necessary?
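The last two questions can be sketched in a few lines of Python. The TTLCache class below is a hypothetical, single-threaded example (the name, the ttl_seconds parameter and the injectable clock are my assumptions): it serves a value from memory while it is fresh, reloads it after it expires, and can be flushed on demand when you know the underlying data has changed.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny in-memory cache with a time-to-live and an explicit flush."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable, load: Callable[[], Any]) -> Any:
        """Return the cached value, calling `load` only when it is missing or stale."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        value = load()
        self._entries[key] = (now, value)
        return value

    def flush(self) -> None:
        """Drop everything, e.g. right after the source data was changed."""
        self._entries.clear()
```

A patient caller accepts data that is up to ttl_seconds old in exchange for not hammering the downstream system on every request.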

5. Healthy Environment

Stable climate

The Earth's climate is stable within certain limits. This benign climate is arguably the main reason why our species has been able to progress this far. To this day, our massively complex society relies on the ability to utilize agriculture and build cities and infrastructure in places that won't be flooded, washed away, or torn down. But changes happen. We keep an eye on, or migrate to, other, friendlier areas in search of more resources. Or we simply need to move due to some natural disaster. Keeping an eye on our surroundings is engrained in our human nature and an integral part of our ability to survive. Thanks to the development of science, we don't need to rely only on our senses, as we also have tools to measure, track and predict possible climate changes. For example, by looking at the measurements made in the past decades we need to ask ourselves whether climate change will happen, whether global warming is a genuine threat, and whether the Earth will be able to restore the balance, or whether we as humanity need to intervene (or can we)?

Just as we need a stable climate, an information system requires a stable environment in order to operate and survive. No matter if you are running your application in a virtual machine, on a single dedicated server, a cluster of machines, or cloud infrastructure, having a predictable and stable environment will bring peace and prosperity to your product. Your team should be aware of their surroundings, keep an eye on the upcoming updates, known vulnerabilities, and alternatives - and be capable of predicting the next possible threats. In the same manner, we should always be careful not to bring disruptive changes to our environment, risking instabilities and disasters.

Do you patch and update the operating systems regularly? How would you know if you are missing a crucial update? Do you have a reasonably simple way to either detect or change the vulnerable parts of the system? Do you keep track of the alternatives? Should you consider asking for or purchasing additional support from 3rd party vendors for the libraries, packages, and software you use? Do you run unnecessary software in the same environment along with your information system or apps? Who can introduce changes to the environment and install more software? Can you detect issues or failures with your hardware resources in time? Do you have alarms set in place? How quickly and easily could you migrate to another server instance?

Clean environment

Clean air and water are definitely the most important elements for the prosperity of this planet. However, you don't need to spend more than a few minutes to see how careless we can be about them or the environment as a whole. We keep struggling with more and more vehicles, factories, domestic and other waste. Our population is rapidly growing in numbers and taking care of our environment should be of utmost importance.

As your system experiences growth in traffic, it's probably inevitable that some parts of it will get polluted. More than once, the virtual machines/servers I connected to were polluted with all sorts of stale data, huge cache files, executables, compressed files, and random notes all around - sometimes even plain text files with credentials or sensitive data were there, just for fun I guess.

What policies would you enforce in order to keep your system's environment clean? Can you detect if a breach has happened? Are you backing up, and are your backups actually pollution-free?

Resources

We are dependent on various resources in order to survive. Drinkable water and breathable air share the highest priority, but right after them comes food. Nature was kind enough to provide plentiful resources for us to enjoy, but the real increase in the human population came with the discovery of agriculture. The cultivation of crops and livestock still relies on natural resources, and since those resources are limited we tend to migrate to richer areas once we deplete them. It's easy to deduce what would happen if we depleted all the resources quicker than they could renew - costs ramp up, hunger strikes and our survival is in jeopardy.

Hardware resources seem cheap, but as our systems grow the costs ramp up pretty quickly. If we lose control of our system, it could yield high monthly sums and probably high blood pressure along the way. This is particularly evident in cloud and serverless environments, with their ability to use more resources for scaling in order to support occasional high loads. But the internet is rich with horror stories about unexpected or uncontrolled costs produced by misconfiguration, leaks, or even exploits - infiltrated crypto miners can prove to be really expensive. Many times, we dismiss the idea of high costs or looming exploits - but just remember how many NPM packages were infected with all sorts of malware, sitting still and insidious in our repositories. Knowing how many resources we have at our disposal, monitoring their usage, and setting spending limits accordingly is one way to avoid nasty monthly surprises.
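As a rough illustration of "monitoring usage and setting spending limits", here is a Python sketch that projects the month-end bill from the spend so far and maps it to a green/yellow/red zone. The function names, the linear projection and the 80% warning ratio are all assumptions made for this example; real cloud billing is rarely this linear, and most providers offer their own budget alerts.

```python
def project_month_end(spent_so_far: float, day_of_month: int, days_in_month: int) -> float:
    """Naive linear projection: assume the rest of the month costs like the days so far."""
    return spent_so_far / day_of_month * days_in_month


def budget_status(
    spent_so_far: float,
    day_of_month: int,
    days_in_month: int,
    limit: float,
    warn_ratio: float = 0.8,
) -> str:
    """Return "green", "yellow" or "red" for the current spending trend."""
    projected = project_month_end(spent_so_far, day_of_month, days_in_month)
    if spent_so_far >= limit or projected >= limit:
        return "red"  # already over, or on track to exceed the limit
    if projected >= warn_ratio * limit:
        return "yellow"  # getting close; worth a look before it hurts
    return "green"
```

For example, with a limit of 1000 and 300 spent by day 10 of a 30-day month, the projection is 900, which lands in the yellow zone.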

How many resources are you spending on a monthly basis? Do you think this amount is reasonable? Are you able to detect if some processes are wasting your resources? Do you have proper spending limits set? Can you overspend and by how much? If your system is running on-premises or you don't have an easy way to scale, should you consider migration and when? Does your system actually need more resources or can you release some? What are your predictions?