norootcause.surfingcomplexity.com
Student of complex systems failures, resilience engineering, cognitive systems engineering. Will talk your ear off about @resilienceinsoftware.org
1,010 posts 1,454 followers 521 following

Can we put the AIs on-call yet?

Hot take: a change freeze is itself a type of change

The incident that just happened has exposed the most salient risks. But that’s not the same as the risks most likely to bite you next. If your action items crowd out ongoing work to address other risks, you might make things worse.

Holy shit, this guy will NOT shut up about Fight Club.

New blog post: Not causal chains, but interactions and adaptations surfingcomplexity.blog/2025/05/19/n...

Tradeoffs, tradeoffs everywhere

Weird how something can happen and several months later something completely unrelated happens. Really makes you think.

Right now, your system is broken in a thousand different ways that you don’t even know about, without obvious symptoms. One of those breakages led to your last incident. A different existing breakage will lead to your next incident.

Schools are on the cutting edge!

I hear “we need to teach kids AI in school” and, I dunno, man, sounds to me like the problem is that these kids have learned all too well how to use AI in a school context.

Another fun exercise to think through: what are the things that your incident responders are most likely going to struggle with during the next high-severity incident?

Incident metric I’d like to see: how aware were we, in advance of the incident, about the risk that manifested as that incident? For each incident, assign a numerical rating (e.g., 0-100) of a priori risk awareness, and then look at how well your organization actually understands its risks.
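A minimal sketch of how that rating could be aggregated, assuming each incident record gets a 0–100 awareness score assigned during the retrospective; the incident data and field names here are made up for illustration, not from any real tooling:

```python
# Sketch: aggregate "a priori risk awareness" scores across incidents.
# 0 = nobody saw this risk coming; 100 = it was a well-known, tracked risk.
from dataclasses import dataclass
from statistics import mean

@dataclass
class Incident:
    name: str
    risk_awareness: int  # rated retrospectively, after the incident review

# Hypothetical example data.
incidents = [
    Incident("payments outage", risk_awareness=80),  # known risk, on the backlog
    Incident("cert expiry", risk_awareness=95),      # tracked, alert even existed
    Incident("retry storm", risk_awareness=10),      # genuine surprise
]

# The mean is a rough signal of how well the org understands its risks;
# the count of near-zero scores (true surprises) is arguably more telling.
print(f"mean awareness: {mean(i.risk_awareness for i in incidents):.0f}/100")
print(f"surprises (<25): {sum(1 for i in incidents if i.risk_awareness < 25)}")
```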

Tell the story of the incident from multiple perspectives. Multiple people were involved, and they each have a different view of what happened.

The system and the environment

In my experience, if you ask someone “how are things going reliability-wise in your company?”, the response generally falls into one of two buckets:
1. 🔥 Everything’s on fire!
2. 🤷 We don’t actually know how things are going

I’ll start believing that the RCA approach to incident analysis provides genuine insight into the nature of complex systems failures when users of it stop being surprised by the fact that they keep getting surprised by incidents.

Why do they call them surgeons and not operators?

It’s wild that queueing theory is a whole research field. “You know the whole ‘waiting in line’ thing? Like, in the grocery store? What if we went super-deep into that?”

You know those work meetings that you would rather not attend but feel obligated to show up to? Imagine being able to send an AI ambassador to these meetings in your stead. Eventually it’s probably just all AIs in the meeting.

FLAT EARTHER: *drops a bunch of change* and that’s how god created our solar system

Who called it insomnia and not resisting a rest

Have any journal editors been caught using LLMs as reviewers yet? How would we ever even find out?

Folks, why aren’t we using LLMs for generating schedule estimates for development work??? We all loathe making those estimates, and we can just blame the LLM if the actual development time deviates from the estimate. “The LLM must have hallucinated a bad estimate”.