Distributed systems and diagnostics like in the movies

Marco Bartolini
11 min read · Sep 9, 2021

I have always been fascinated by movies like Passengers, Terminator, Star Wars, and many more, where complex pieces of technology, after receiving substantial damage, deliver the classic catchy line: “running diagnostics…”.

At this point, the screens are decorated with super-advanced UIs [fig. 1] that clearly show what is going on, and the AI, pilot, cyborg, terminator, or doctor takes immediate action based on the diagnostics report, doing things like reducing power consumption, starting an automated surgery procedure, rerouting all the energy to a specific engine, or, even better, performing a good ol’ reboot.

fig. 1 — the movie Passengers continuously shows this distributed system diagnostics console

Sometimes imagination is not far from reality. In software engineering, we are constantly faced with the challenge of running distributed systems and keeping them running so that whatever service they provide is available 24/7. Today I will take you on an imaginary space trip. You will be the captain of a spacecraft named whatever you want; my personal suggestion is “WeAreGoingToCrash II.”

There are many benefits to running diagnostic checks on a distributed system. Automating the diagnostics saves time and money and removes error-prone manual triaging steps, making a painful but essential job much easier. In this article, I will try to answer a few questions: Is this possible? How does it work? But before we get started with this topic, let’s set a bit of common ground, shall we? What does “distributed system” mean, and what does it mean to run diagnostics on a distributed system?

What is a distributed system?

fig. 2 — the gods of software architecture

Once upon a time, the gods of software architecture were keeping their service on a single machine. Everyone was happy, and everything was working, until one day somebody started to use the service hosted inside the machine. It was mayhem, the machine went down along with the service running inside, so the gods decided to create a bigger machine. The ritual was called “vertical scaling,” everyone was happy, and everything was working, until one day, somebody started to use the service deployed inside the machine. It was mayhem, and it looked familiar. The machine went down along with the service running inside. That day the gods learned one lesson. No matter how big the machine, the worshippers of the service will always grow in number if they find the service useful and amusing.

The gods then gathered around a huge stone, and they carved on that stone the “non-functional requirements” list:
— performance
— scalability
— reliability
— availability
— extensibility
— maintainability
— manageability
— security
They were satisfied with the result of their discovery, but soon they realized that a single machine was not the answer to this wonderful list of requirements or to the worshippers’ problems. They needed at least two or more machines to share the growing load of requests. Still, to the worshippers, the service had to appear as a single one from the outside. They could not afford to have the same service under two different names; the gods of sales and marketing would have killed them. And this is how distributed systems were mythologically born. Someone will argue that we do not need more machines, just good code, but that’s another story.
The gods were now facing a wonderful new set of problems derived from the distributed nature of the service. Making a worshipper happy is like killing the hydra: for every problem you solve, two more shall rise.
Mythology aside, we can say that a distributed system, in its simplest definition, is a group of computers working together to appear as a single computer to the end user. These machines have a shared state, operate concurrently, and fail independently without affecting the whole system’s uptime.

What does it mean to run diagnostics?

fig. 3 — The Millennium Falcon from Star Wars

Let’s say you are monitoring a complex system like a spacecraft (called WeAreGoingToCrash II).
Before getting from point A to point B, your priority would be to make sure the spacecraft is in perfect health. You would like to understand at a glance that everything is ok and, if it is not, to be alerted immediately with a clear message and possibly an automatic resolution to the problem, one that takes you away from any error-prone manual step or painful triaging. Even better, you would like the system to detect and resolve everything without your direct supervision. But how do we achieve that? Is it even possible?

First things first, we need real-time data. Working on old or stale data will be of no use. You do not want to find yourself solving problems of the past while the future comes crashing down upon you. The second thing we need is to make this real-time data meaningful. Let’s take an example. We want to monitor our spacecraft. Many things are going on in a spacecraft, and we have many interrelated pieces, so we first need to isolate everything into single units and then draw a map of how they are related, for instance [fig. 4]:

fig. 4 — Spacecraft single units

So now, given the above group of units, we can safely say that each unit has its own set of things that make it healthy and are worth checking. For the oxygen tank, we want to know the tank’s integrity, whether there is a leak, the amount of oxygen inside, and whether the valves and the tubes are well connected and working. Now you will notice how important the “real-time data” part is: what if we discover that the oxygen tanks are empty three or four hours later? It probably would not be a problem anymore, would it? Dead astronauts do not need oxygen after all.
Now that we have identified our units, we need to identify how they are interconnected. To be more specific, we need to define the possible failure modes and their effects (Failure Mode and Effects Analysis, aka FMEA) and automate the diagnosis.
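As a rough illustration, here is what the oxygen tank’s own checks could look like in Python. The check names, thresholds, and readings are hypothetical; they only sketch the idea of per-unit health checks, not the actual checks of any real system.

# A minimal sketch (Python) of the oxygen tank's own health checks.
# The check names, thresholds, and readings are hypothetical.

def check_tank_integrity(pressure_drop_per_min: float) -> str:
    # A fast pressure drop suggests a leak or a cracked tank.
    if pressure_drop_per_min > 0.5:
        return "FAILURE"
    if pressure_drop_per_min > 0.1:
        return "WARNING"
    return "OK"

def check_oxygen_level(level_percent: float) -> str:
    # Below 20% we want attention; below 5% we are in trouble.
    if level_percent < 5:
        return "FAILURE"
    if level_percent < 20:
        return "WARNING"
    return "OK"

def check_valves(valves_connected: list) -> str:
    # Every valve must report itself as connected and operational.
    return "OK" if all(valves_connected) else "FAILURE"

print(check_tank_integrity(0.05), check_oxygen_level(87.0), check_valves([True, True, False]))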

In diagnostics, there are two major types of data structures to consider: trees and graphs.

fig 5 — Spacecraft single units organized as a tree

In our simplistic case, we have a linear tree [fig. 5]: the spaceship (root) depends on many units, but only two units are key for the other units. Without them, the other “subsystems” will not operate, or they can be considered useless. What’s the use of the lights if you can’t turn them on? In case you are wondering, the engine is not electric, which is why it is a sibling of the solar panel.
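To make the structure concrete, here is a minimal Python sketch of how the tree in fig. 5 could be represented. The exact set of child units is an assumption based on the figure, not the actual model.

# A minimal sketch (Python) of the unit tree from fig. 5.
# Each unit has a name and the child units that depend on it.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Unit:
    name: str
    children: List["Unit"] = field(default_factory=list)

# Level 3: units that depend on the solar panel.
lights = Unit("lights")
oxygen_tanks = Unit("oxygen tanks")

# Level 2: the two key units; the engine is not electric,
# so it is a sibling of the solar panel, not a child.
solar_panel = Unit("solar panel", [lights, oxygen_tanks])
engine = Unit("engine")

# Level 1: the spaceship root.
spaceship = Unit("WeAreGoingToCrash II", [solar_panel, engine])
print(spaceship)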
Now that we have a tree, it’s about time to run our diagnostics on this distributed system that we identify as a spaceship.
We sit in the control room, sipping our coffee and chit-chatting with our co-pilot, and everything looks good [fig. 6]:

fig. 6 — diagnostics reporting all good

At a certain point, the lights in the control room start to flash red, and an alarm starts to sound. We access the control panel, and we see this.

fig. 7 — diagnostics failures on the oxygen tanks

At a glance, it is clear we have an issue with the oxygen tank [fig. 7]. The health checks of the tanks are failing. The spaceship root is marked red since this failure compromises the well-being of the entire crew. We interact with the oxygen tanks unit to see the details.

fig. 8 — detail of the oxygen tanks diagnostics report

It is clear that there is an issue with the valves in the system [fig. 8]. The oxygen levels are dropping, not critically yet, but we need to intervene as soon as possible to repair the damage. Now, in our fantasy world, the control panel will show us a series of automated options like “repair automatically” or “call in the robots.” Once the issue is solved, the diagnostics report that everything is ok again, and we can resume our coffee chit-chat. Just for the sake of our explanation, let’s say that this diagnosis [fig. 9] accompanied the previous alarm.

fig. 9 — a failure in the solar panel

In this case, since the solar panel is failing, all the units depending on it will be considered failed, since they cannot operate. Another case could be just a warning [fig. 10].

fig. 10 — diagnostic warning on the lights

In this specific case, the lights are partially working, and we do not have lights available in a specific part of the spaceship. This is not a critical issue; it just degrades the overall status of the spaceship without preventing it from doing its job. This is why the diagnostics report just a warning, and you will have to eat a hot meal in the canteen surrounded by the darkness of space.
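These two examples suggest a simple propagation rule, sketched below in Python under my own assumptions: a failed unit drags down every unit that depends on it (as in fig. 9), while a warning only degrades the overall status without failing anything (as in fig. 10).

# A minimal sketch (Python) of how statuses could propagate through the tree.
# The tree and the per-unit statuses are illustrative, not the real model.

ORDER = {"OK": 0, "WARNING": 1, "FAILURE": 2}

tree = {
    "spaceship": ["solar panel", "engine"],
    "solar panel": ["lights", "oxygen tanks"],
    "engine": [], "lights": [], "oxygen tanks": [],
}

def effective(unit, own, parent_failed=False):
    # A unit whose dependency has failed cannot operate, so it is considered failed too.
    status = "FAILURE" if parent_failed else own[unit]
    results = {unit: status}
    for child in tree[unit]:
        results.update(effective(child, own, parent_failed=(status == "FAILURE")))
    return results

def overall(results):
    # The spaceship root is marked as badly as its worst unit.
    return max(results.values(), key=ORDER.get)

# fig. 9: the solar panel fails, so the lights and oxygen tanks fail with it.
own = {"spaceship": "OK", "solar panel": "FAILURE", "engine": "OK",
       "lights": "OK", "oxygen tanks": "OK"}
print(overall(effective("spaceship", own)))   # FAILURE

# fig. 10: the lights only raise a warning, so the spaceship is merely degraded.
own["solar panel"], own["lights"] = "OK", "WARNING"
print(overall(effective("spaceship", own)))   # WARNING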

Earlier I mentioned the graph among the possible diagnostics data structures; let’s look at it [fig. 11].

fig. 11 — Spacecraft single units organized as a graph

This is a nice upgrade to our craft. The engineers recognized that we could use part of the energy stored in the main battery to cover power outages caused by problems with the solar panel.
Bonus questions: how should the diagnostics report if:
— the solar panel is compromised?
— the main battery is compromised?
— the main battery and the solar panel are both compromised?
If you have an idea, please do not be shy and share it in the comments.

Now that we have a grasp of diagnostics and how they work, let’s dive into applying automated diagnostics to a distributed system like our spaceship.

Diagnose a distributed system in practice

We are the software engineers tasked by the captain with creating the software that will run the system’s diagnostics. As we said, we want real-time data. The first guess is for each device/unit to push data to a centralized system that analyzes the incoming inputs and calculates the result. That sounds like a great idea because it is very efficient, we do something only when something happens, but in this case, it is the wrong approach. Here, pulling is better suited than pushing as a strategy since it is more reliable. We are monitoring a system, so we cannot rely on the system itself to give us feedback. If the solar panel goes down, there will be no unit sending feedback.
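To make the pull argument concrete, here is a tiny Python sketch, assuming an HTTP diagnostics endpoint; the URL and path are hypothetical. Because the poller owns the timeout, a unit that dies silently still shows up as a failure, which is exactly what a push model cannot guarantee.

# A tiny sketch (Python) of the pull approach: the orchestrator polls each unit
# and owns the timeout, so a unit that died silently is still marked as FAILURE.
# The endpoint URL and path are hypothetical.

import requests

def poll_unit(base_url: str) -> str:
    try:
        resp = requests.get(f"{base_url}/diagnostics", timeout=2)
        resp.raise_for_status()
        return resp.json().get("status", "FAILURE")
    except requests.RequestException:
        # No answer (or an error) from the unit: with a push model we would
        # simply never hear from it; with polling we can mark it failed.
        return "FAILURE"

print(poll_unit("http://solar-panel.local:8080"))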

fig. 12 — diagnostic orchestrator flow

From a ten-thousand-feet point of view, we want to feed our spaceship’s unit data structure to an orchestration process and get back the overall status [fig. 12]. The orchestration will traverse the tree with parallel executions, level by level (1 –> 2 –> 3), and report the status of the units.
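A minimal Python sketch of this orchestration follows, assuming the tree from fig. 5 and a stand-in run_unit_diagnostics function in place of the real per-unit call: it visits the tree one level at a time and diagnoses each level’s units in parallel.

# A minimal sketch (Python) of the orchestration flow in fig. 12:
# walk the tree level by level (1 -> 2 -> 3) and run each level's units in parallel.

from concurrent.futures import ThreadPoolExecutor

tree = {
    "spaceship": ["solar panel", "engine"],
    "solar panel": ["lights", "oxygen tanks"],
    "engine": [], "lights": [], "oxygen tanks": [],
}

def run_unit_diagnostics(unit: str) -> str:
    # Placeholder for the real call that asks the unit to run its own checks.
    return "OK"

def orchestrate(root: str) -> dict:
    statuses = {}
    level = [root]
    with ThreadPoolExecutor() as pool:
        while level:
            # All units of the current level are diagnosed in parallel.
            results = pool.map(run_unit_diagnostics, level)
            statuses.update(dict(zip(level, results)))
            # Next level: the children of every unit we just visited.
            level = [child for unit in level for child in tree[unit]]
    return statuses

print(orchestrate("spaceship"))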

The traversal would be totally different for a graph structure [fig. 11]. In that case, we would start from the vertex representing the spaceship and then traverse all the sibling vertices until we reach what we can call the bottom of the structure; after that, we would have to traverse the graph in reverse to derive the full status of the system. If you are interested in the graph use case, drop a comment, and we can talk about it.

Going back to the unit status, the status can be:
— OK
— WARNING
— FAILURE
You can choose any values you want: numbers, chars, floats, or even booleans. The most important thing is to know whether the spaceship is about to explode or is safe to take off.
It would be a nice touch if you could add some details, like the name of the diagnostic check, the result of the check, and a message that can contain the possible issue faced by the unit:

name: Check Oxygen Tank 1 valve status
message: No issue found
status: OK

name: Check Oxygen Tank 2 valve status
message: Maintenance is due in 1 day
status: WARNING

name: Check Oxygen Tank 3 valve status
message: valve 3 disconnected
status: FAILURE
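A small sketch of how these reports could be modelled in Python; the Status enum and CheckResult record are illustrative names I am introducing here, not part of the original design.

# A minimal sketch (Python) of the check report shown above:
# a status enum plus a small record carrying name, message, and status.

from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    OK = "OK"
    WARNING = "WARNING"
    FAILURE = "FAILURE"

@dataclass
class CheckResult:
    name: str
    message: str
    status: Status

report = [
    CheckResult("Check Oxygen Tank 1 valve status", "No issue found", Status.OK),
    CheckResult("Check Oxygen Tank 2 valve status", "Maintenance is due in 1 day", Status.WARNING),
    CheckResult("Check Oxygen Tank 3 valve status", "valve 3 disconnected", Status.FAILURE),
]
print(report)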

To implement this orchestration, there are two types of diagnostics algorithms we want to run to make sure we are working on real-time data. We will call them OFF-cycle and IN-cycle diagnostics.

An OFF-cycle diagnostic is based on polling: we schedule a process that runs the diagnostics of each unit, traversing the tree/graph structure and getting feedback from every device. It then aggregates the results and updates the general data structure that represents the system's overall status. An OFF-cycle diagnostic is expensive from a resource point of view, but it is the most reliable way to get the full picture of the system.
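A minimal Python sketch of the OFF-cycle idea, assuming a fixed 60-second interval and an orchestrate() stand-in for the full traversal described above; both are my assumptions for illustration.

# A minimal sketch (Python) of an OFF-cycle diagnostic:
# on a fixed schedule, run the full traversal and refresh the overall picture.

import time

def orchestrate() -> dict:
    # Stand-in for the full tree/graph traversal described above.
    return {"spaceship": "OK", "solar panel": "OK", "engine": "OK"}

def off_cycle_loop(interval_seconds: int = 60):
    while True:
        statuses = orchestrate()                       # poll every unit
        overall = ("FAILURE" if "FAILURE" in statuses.values()
                   else "WARNING" if "WARNING" in statuses.values()
                   else "OK")
        print(f"overall status: {overall}", statuses)  # update the control panel
        time.sleep(interval_seconds)                   # wait for the next cycle

# off_cycle_loop()  # runs forever; left commented out on purpose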

An example of an IN-cycle diagnostic: I turn on the engine, and the engine does not start. The diagnostic of that single unit immediately reports an issue with the engine cooling system that prevents the engine from being operative. In this case, we tried to use the unit, and we got real-time feedback. If the diagnostic reports a failure, the system will trigger an OFF-cycle diagnostic prematurely to get the full picture.
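A tiny Python sketch of the IN-cycle idea; start_engine() and trigger_off_cycle() are hypothetical stand-ins for the real operation and for kicking off the full traversal early.

# A minimal sketch (Python) of an IN-cycle diagnostic: we use the unit,
# it fails, we get real-time feedback, and we trigger a full OFF-cycle run early.

def start_engine() -> dict:
    # Stand-in for the real operation; here the cooling system blocks the start.
    return {"name": "Start engine", "message": "cooling system fault", "status": "FAILURE"}

def trigger_off_cycle():
    # Stand-in: kick off the full tree traversal ahead of schedule.
    print("running OFF-cycle diagnostics now...")

result = start_engine()            # we tried to use the unit...
print(result["message"])           # ...and got real-time feedback
if result["status"] == "FAILURE":
    trigger_off_cycle()            # a failure prematurely triggers the full picture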

An important aspect to consider is that every unit should be treated as a UNIT: the diagnostic orchestrator asks each single UNIT to run its own diagnostics and report the checks’ summary and its final status. Why must the unit run the diagnostic checks itself, rather than another unit running them on its behalf? Firstly, if the unit is not responding, we assume the unit is in a failed state. Secondly, only the unit knows its own diagnostics, and it is the only one that can perform those checks from inside the unit itself. Just because something exists or is reachable from all the other units does not mean that the unit we are analyzing can find or reach it like everyone else. This is why the unit must run its own diagnostics.
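A minimal sketch of a unit exposing its own diagnostics, assuming Python and Flask purely for illustration; the route, the check functions, and the aggregation rule are my assumptions. The point is that the orchestrator only calls the endpoint, while the checks themselves run inside the unit.

# A minimal sketch (Python/Flask) of a unit exposing its own diagnostics:
# the orchestrator only calls the endpoint; the checks run inside the unit.

from flask import Flask, jsonify

app = Flask(__name__)

def run_own_checks() -> list:
    # Only this unit knows which checks make sense for it.
    return [
        {"name": "Check valve status", "message": "No issue found", "status": "OK"},
        {"name": "Check oxygen level", "message": "Level at 87%", "status": "OK"},
    ]

@app.route("/diagnostics")
def diagnostics():
    checks = run_own_checks()
    worst = ("FAILURE" if any(c["status"] == "FAILURE" for c in checks)
             else "WARNING" if any(c["status"] == "WARNING" for c in checks)
             else "OK")
    return jsonify({"status": worst, "checks": checks})

# app.run(port=8080)  # each unit serves its own report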

Once the OFF-Cycle and IN-Cycle diagnostics are in place, we can say that the skeleton of our distributed system diagnostic is ready to be tested, delivered, and used.

Conclusion

We had a quick look at what distributed systems are and what diagnostics are. We “dived” a bit into the logic behind creating a distributed system diagnostic and the algorithms and data structures involved.
We can say that a reliable way of running diagnostics on a distributed system is an orchestration that traverses the data structure representing the system, using OFF-cycle and IN-cycle algorithms to gather the single-unit diagnostics and combine them into the final result.
Does it work? Definitely yes. I have created diagnostics for years to monitor different types of software architecture and to automate failure detection in distributed systems composed of either physical devices, like cameras, fans, lights, and vehicles, or simple software components, like microservices.

There are still many open questions that I did not talk about in this article, questions like: How do we scale complex distributed systems? How do we make the system reliable? Who watches the watchmen (the orchestrator)?
Stay tuned.


Marco Bartolini

Father, Principal software engineer @Workday, focusing on my kids, microservices, cloud and CD.