
Safety I and II

The Setup

My wife is currently studying a Master's degree (MSc in Health and Medical Simulation). As part of this course she just had a subject in which they had to research the concept of Safety II, which is a change of approach in how to deal with medical safety. I watched a couple of videos with her, and I became enthralled with the idea. I could see some parallels to development in the software world, so I decided to investigate a bit further.

The Paper

And because I have learned that you should always go to the source, I went to the white paper written by Hollnagel, Wears and Braithwaite, which you can find on the NHS website.

At the core of Safety II is the recognition that there is a difference between complexity and complication, and that sometimes there is not a single easy answer.

I remember the case of a junior doctor who was struck off the register (no longer able to work as a doctor) because of the death of a six-year-old boy. But once you start to read through the whole series of events (for example, on the BBC website) you realize that it wasn't one doctor making a mistake, but a systemic failure at multiple levels.

Blaming a single person is the easy way out: you have a linear chain of causal events that leads to a person making a mistake. But under the premise of the white paper, that doesn't work for complex systems (like the body or a hospital).

In Safety II the recommendation is, rather than looking at why things go wrong, to look at why things go right, with the premise that, because of our adaptability, we humans do manage, most of the time, to make things work while the system is failing around us.

An additional idea is Work-As-Imagined vs Work-As-Done, which represents the difference between what we think happens and what actually happens.

Some thoughts

I do recommend reading the paper; for me it has been captivating. And there are some ideas from it that I want to go through quickly.

First, adaptability, which is, I believe, one of the two central tenets of the paper. We tend to believe humans are the most adaptive species, mostly because we no longer depend on slow evolutionary adaptation. In day-to-day situations we can easily think outside the box, improvise or just wing it.

But I think there is a second inherent human capability that helps a lot: pattern recognition. There is one episode of House M.D. where the main character is himself lying in a hospital bed after an operation, and keeps getting readings from the vitals machine at his side. He calls the nurse and informs her of an imminent… heart attack?? Even though the readings were not those of a heart attack, he was able to recognise the patterns that would lead to one.

I am not up to date with the current thinking on chess and chess engines, but I remember pattern recognition being described as the way we humans discern what is needed to beat an opponent. Chess engines, in contrast, worked on pure brute-force algorithms (though it is possible that with machine learning they have moved away from pure brute force).

The other tenet of the paper we have talked about earlier: the difference between complex and complicated. The latter has a linear relationship between cause and effect; it is deterministic in nature. The former doesn't have that linear relationship; it is not necessarily fully stochastic, but it makes assigning cause and effect much more difficult.

Systemic issues abound everywhere, and sometimes we don't even think of those issues as issues at all. As the original paper is about hospital safety, let me say this: we know that long working hours negatively affect cognition, and yet, in hospitals, doctors and nurses do 12-hour shifts. Do you think that is safe? Most of the time things work out, because other parts of the system don't fail, but sometimes several of these issues conflate at the same time and end in the death of patients.

How unusual is it for multiple events to unluckily happen simultaneously? Look, for example, at the Boeing 737 MAX, where the final NTSC report identified nine contributing problems that led to the accident. Preventing any one of those nine could have stopped it.

Conflating systemic issues and human flexibility a bit, I am always amazed that there are not more deaths on the roads than we have. The number of near misses that I see every time I am on the road is shocking.

The Setup (part two)

This is one of my favourite sayings, which I heard/read from Kent Beck: Simple is not Easy

In the white paper the authors make a distinction between the ideas of Complexity and Complication. If you have seen, for example, the Cynefin framework, you have been exposed to this idea. How to investigate and how to solve each one is different.

We know that the more moving parts a software system has, the more complicated the system is. We have measures for our software, like cyclomatic complexity.
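To make that concrete, here is a minimal sketch (mine, not from the paper; the function and sample code are made up) that approximates cyclomatic complexity by counting decision points in Python source. Real analysis tools do this far more thoroughly, but the idea is simply: more branches, more possible paths, more complication.

```python
import ast

def approx_cyclomatic_complexity(source: str) -> int:
    """Very rough cyclomatic complexity: 1 + number of decision points."""
    decisions = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.If, ast.For, ast.While, ast.ExceptHandler)):
            decisions += 1
        elif isinstance(node, ast.BoolOp):
            # each extra operand in an `and`/`or` chain adds another path
            decisions += len(node.values) - 1
    return 1 + decisions

sample = """
def triage(score, has_allergy):
    if score > 7 or has_allergy:
        return "review"
    if score > 3:
        return "routine check"
    return "discharge"
"""

print(approx_cyclomatic_complexity(sample))  # 4: two ifs, one `or`, plus 1
```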

But when we talk about tech, in most cases we are talking about complicated systems. Enough expertise will solve the issue (though, as mentioned in the previous post, that expertise is difficult to acquire).

Nearly everything else that we have to deal with fits into the realm of complexity, as it involves human behaviour.

Work-As-Imagined vs Work-As-Done is present everywhere I look. Two clear examples: the way most projects are planned is based on assumptions about how the work is going to go, which tend to diverge massively from what happens once you start developing the product/project; and whenever an activity like Value Stream Mapping is used, you start seeing how what people think their systems and companies do differs from what actually happens.

My own personal experience with issues in the computing world is that nearly every error or mistake I have seen is the result of a series of missteps, any one of which, if caught, could have stopped the error. We have tools and processes that can help remove most issues; what is left happens either because the system is complex enough that we cannot see the consequences, or because people have lost the flexibility to remedy those errors.

I have started to recognise some patterns: if there are heavy processes, humans are not able to use their adaptability to manoeuvre, as you have to go by the book; if there are strong hierarchies, warning calls are not heeded; a non-learning culture tends to leave a trail of issues behind. And while usually each individual issue doesn't cause any kind of major problem, sometimes they conflate into a big error.

As a last point, I found it interesting to think about Observability and how it fits with the Safety-II approach: what it allows you to do is investigate how things are working when everything is going right, so you can see the signs ahead of things going wrong.
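As a rough sketch of what I mean (an illustration of the idea with hypothetical names, not any particular tool or the paper's own example), the code below emits a structured event on the successful path too, not only on failures, so that normal work leaves a trace you can study:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("orders")

def place_order(order_id: str, items: int) -> None:
    # Hypothetical handler: the point is that the *successful* path emits a
    # structured event too, so normal variability becomes visible over time.
    start = time.monotonic()
    # ... the actual work would happen here ...
    log.info(json.dumps({
        "event": "order_placed",
        "order_id": order_id,
        "items": items,
        "duration_ms": round((time.monotonic() - start) * 1000, 3),
        "outcome": "success",
    }))

place_order("A-1001", items=3)
```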

Next steps

Hollnagel has written a book, Safety-I and Safety-II, that I am interested in reading.

Sidney Dekker has talked about Safety II as well, and has a book of interest to me called Just Culture.

Out of the studies done by my wife around this idea, I also discovered Humanistic Systems.