Engineering Robust Server Software 2025 Guest Lecture Slide Notes
April 2, 2025
Incident Response
- Hi I'm Michael Ansel
- Thank you for showing up at 8:30 in the morning
- Today, I'm going to be talking about Incident Response, which is a fancy way of saying that something important is broken and we need to fix it as quickly as possible. It's usually a surprise and depending on the job, this could be during the day, overnight, or on the weekend. Whatever it is, it's bad and you need to act now.
Who is this guy
- Let's start with a little bit about my background and where my advice is coming from
- I graduated from Duke a little over 10 years ago, in fact:
- I sat in those same chairs. And it's kinda weird to be here instead of back there.
- I imagine most of y'all are somewhat familiar with my first employer thanks to the lecture on High Availability of storage controllers.
- NetApp makes storage appliances for big businesses.
- I interned at NetApp down the road in Research Triangle Park and continued on after graduation
- My job was to automate the configuration process for hardware devices, which meant a lot of troubleshooting where I was constantly trying to reverse engineer the internals of enterprise hardware. I was new to all of this enterprise hardware, so I spent a lot of time looking at systems I didn't understand and trying to make sense of them.
- I moved out to California to work at Box, another enterprise data storage company, but this time in the cloud, as a Site Reliability Engineer
- As an SRE, my job was to be the primary responder anytime anything went wrong. If the website was down, my team was on the hook to figure out why and how to fix it. When things weren't on fire, we worked with development teams to build more resilient services; I even have a patent on an incredibly effective high availability architecture we created.
- Over time at Box, I leaned more and more into the security side of the business, and then made it a full time job when I headed over to Twitch as a Security Engineer.
- As a security engineer, my job was to build tooling and patterns for the company that enabled them to do the right things automatically, just focused on security rather than availability. One of my favorite ways of describing my job at both Box and Twitch is that I was always trying to make it easy and obvious to do the right thing.
- I later moved into management at Twitch and now I'm leading the entire security program at Lambda, but the goal remains the same.
STORY TIME
- Let's jump right into the heart of incident response with a story of an incident I worked at the beginning of my career. This was about two years after graduation and in my first year at Box.
- Customer using product to distribute software updates, seems fine
- Send an email every time a file was downloaded
- Basic multiplication: files x computers = a lot of email jobs
- Discovered a previously unknown flaw that caused a super-linear increase in job processing time with the number of jobs in the queue; basically, the more jobs in the queue, the longer each individual job takes to process
- Put 100k jobs in, everything grinds to a halt
- Met with the engineering team; fix would take multiple weeks to redesign
- But, if we pull all the jobs and re-insert slowly, we can keep up
- Do that by hand for the rest of the day. With the jobs pulled out, the site was back online and I was just working through the backlog.
- Next day, new burst of jobs, and the site goes down again. Pull everything and spoonfeed it back in all day long. Make a little headway day over day by starting to automate the process (a rough sketch of that drain-and-spoonfeed loop follows this story).
- Repeat each day until finally the customer stopped sending out software updates.
- Each day, I would get the site back up, but it was sitting on a razor's edge. If the queue exploded, the site went down. If I didn't process the job queue, important work wasn't getting done. Every morning, the site went down until I started manually managing the queues.
- Because of the timing on this, I wasn't actually getting out of bed: I would wake up to a page, respond all day long, and then it was time for bed when I finished.
- I know that might not seem like much now, but in a pre-covid world, it wasn't actually normal to spend all day working from bed.
- Whew, marathon, right?
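- For anyone reading these notes later: here's a rough sketch of what that drain-and-spoonfeed loop amounted to. This is not Box's actual code; the queue client, thresholds, and job names are all made up for illustration, and in the real system the workers ran as a separate service.

```python
# A minimal sketch of the mitigation described above: drain the overloaded
# queue, then trickle jobs back in slowly enough that processing stays sane.
import time
from collections import deque

MAX_IN_FLIGHT = 50      # assumed "safe" queue depth before processing degrades
REINSERT_DELAY = 0.01   # seconds between re-inserts; in reality, tuned by hand

def drain(live_queue: deque) -> deque:
    """Pull every pending job into a local holding area so the site recovers."""
    holding = deque()
    while live_queue:
        holding.append(live_queue.popleft())
    return holding

def spoon_feed(live_queue: deque, holding: deque, worker) -> None:
    """Trickle held jobs back in; `worker` stands in for the real job processors."""
    while holding or live_queue:
        if holding and len(live_queue) < MAX_IN_FLIGHT:
            live_queue.append(holding.popleft())
        if live_queue:
            worker(live_queue.popleft())
        time.sleep(REINSERT_DELAY)

if __name__ == "__main__":
    live = deque(f"send-email-{i}" for i in range(200))  # stand-in for 100k jobs
    backlog = drain(live)                                # queue empty, site back up
    spoon_feed(live, backlog, worker=lambda job: None)   # grind through the backlog
    print("backlog cleared")
```

- The design choice worth noticing: nothing here fixes the queuing flaw. It just keeps the live queue shallow enough that the known-bad behavior never kicks in, which is exactly the theme we'll come back to with "Fix the Failure, not the Fault."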
Incident response is important
- Let's unpack this story
- Why was it urgent to get this resolved? Because the business was losing money as long as the site was impacted. Amazon makes it easy to do the math since outage = no revenue.
- Outages are expensive and need to be resolved quickly
- If Amazon is completely down, they lose a million dollars a minute.
- The clock starts before your phone goes off, so you're already behind
- START THE TIMER
- Hands in the air: when I say go, pull up the most recent email from Dr. Bletsch or Dr. Rogers
- Amazon just lost a million dollars
- So yes, incident response is very important
The best solution is avoiding the problem, right?
- Since the best solution is to avoid the problem entirely, why don't we just design for 100% uptime and not need to fix broken things?
- Because we can't.
- In your High Availability lecture, you learned that if you make everything redundant, you can be resilient to component failure. This is absolutely correct...as long as you make everything redundant.
- Remember, that means everything, including your software, including the humans, including physics.
Space shuttle
- The space shuttle launch process is an example of what it looks like to design for 100% availability: if something goes wrong, people die in very public ways, so they pour incredible amounts of money into ensuring that doesn't happen.
- With the shuttle, decisions need to be made with sub-second timing, which means it needs to be done by a computer.
- Since it is done by a computer, there is no room for surprises or human judgment, which means that the software must be perfect.
- In order to achieve this, they developed two separate launch control programs to run on the shuttle, written by two different teams that weren't allowed to talk to each other, so that the two implementations were unlikely to share the same bugs.
- The primary implementation runs in lockstep across four separate processors, checking in with each other constantly. If they ever disagree, the backup implementation takes over and executes a safe shutdown of the launch.
Challenger explosion
- And yet, with all of this redundancy and protection, the shuttle program still suffered two catastrophic failures that resulted in the loss of everyone on board.
- Both failures happened in non-redundant parts of the shuttle: the solid rocket boosters and the thermal protection system.
- So why are incidents unavoidable? Because you can't protect against everything. Use redundancy to protect against what you know is going to fail, and get good at incident response for everything else.
Just fix it
- Okay, so something broke, why is it so hard to fix it and move on? Why do we need an entire lecture about responding to things breaking?
Boromir meme
- If only it were that easy
- Too large, too complicated, too segmented
- Probably hard to grasp without industry experience, so I'm going to use a comparative example instead
Spaceteam
- How many of you have played spaceteam before?
- A co-op game in which everyone has a different set of tasks to complete simultaneously, but the controls needed to complete them are spread across everyone else's panels, so the only way to finish is with everyone else's help.
- This is a pretty good simulation of the chaos that ensues when there is a major outage. Everyone is running around trying to fix it and asking everyone else for help.
- Homework: it's free on mobile, download it, play with a group of 8 people
How do you make this better?
- That sounds rough. Surely there is a way to make it less rough, right? Yup, and I'll even go so far as to assert that with some practice incident response can be incredibly rewarding. Some of my fondest memories and best friends came out of incident response. So how did I get there?
- There are two key things I would like you to always keep in mind during an incident
- Write down what is happening and Fix the Failure not the Fault
- We'll discuss each one in turn.
Write down what is happening
- If you do something, write it down.
- If you see that no one is writing things down, start writing down what you see happening.
- If no one is saying anything, start asking questions.
- If your questions go unanswered, state what you think is happening and something magical will happen: Engineers can't abide wrong information in the world so someone will either jump in to correct you or you will get confirmation that you are correct.
- What should you be writing down? Obviously you want to know what people are doing, but you also want to know what people are thinking. What questions are people asking? Most interestingly, which questions are going unanswered? Where are people lost or confused?
- Where should you write things down? Anywhere, as long as everyone can see it. Open a Google Doc, send a Slack message, whatever works for your team.
- Let's look back at the queuing story.
- If I had been sharing my work as I went along, someone else could have taken over the next day instead of forcing me to continue. Alternatively, they could have taken my notes and spent the day building the automation for me while I cleaned up the queues. Either way, we could have spread out the load and made it less exhausting for me.
- Outside of just helping in the moment, as we'll see shortly, these notes are invaluable after the outage is over.
Fix the Failure not the Fault
- Let's move on to the second thing you should keep in mind
- Fix the Failure, not the Fault
- You may recall that I never actually fixed the problem, and yet I got the site healthy again on the first day and each day afterwards.
Error propagation
- We're in an academic environment, so let's have some academic rigor. This is the standard model of error propagation.
- Dr. Sorin's Fault-Tolerance class covers this better than I can; highly recommended; CPU fundamentals are useful even in the cloud
- Faults cause Errors, which propagate to produce Failures
- A fault is the starting point. Something unexpected or unintended has happened at a low level. This could be anything from unexpected input to your software or gamma radiation causing a bit flip on a circuit board.
- A fault isn't inherently bad. It's just something that doesn't contribute towards a correct output.
- Faults cause errors, which can then cause other errors. Errors are just the intermediate states of things going more and more wrong, but not yet visible to someone outside the system.
- Eventually, an error turns into a Failure, which is a bad outcome that is visible outside the system and can't be quietly recovered from.
- In incident response, you see a failure and want to stop the error propagation behind it
- Sometimes you fix the fault, but there's usually a faster way
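- A contrived example for these notes (not from the lecture or any real system) to make the chain concrete: a toy billing pipeline where a bad input (the fault) silently becomes a wrong intermediate value (the error) and only shows up when the customer sees an incorrect invoice (the failure).

```python
# Toy illustration of fault -> error -> failure propagation.

def parse_usage(raw: str) -> int:
    # FAULT: an unexpected character (the letter O instead of a zero) is
    # silently dropped instead of being rejected.
    digits = "".join(ch for ch in raw if ch.isdigit())
    return int(digits or 0)

def compute_bill(usage: int, rate_cents: int) -> int:
    # ERROR: the wrong usage value flows through; nothing looks broken yet.
    return usage * rate_cents

def render_invoice(amount_cents: int) -> str:
    # FAILURE: the customer sees an incorrect total, so the problem is now
    # visible outside the system.
    return f"Invoice total: ${amount_cents / 100:.2f}"

if __name__ == "__main__":
    raw_meter_reading = "12O3"                 # someone typed O instead of 0
    usage = parse_usage(raw_meter_reading)     # 123 instead of 1203
    bill = compute_bill(usage, rate_cents=7)   # the error propagates
    print(render_invoice(bill))                # prints $8.61 instead of $84.21
```

- Notice that the fault on its own is harmless until it propagates; catching or rejecting it at the parsing step would stop the chain before anyone outside the system ever noticed.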
Substation
- Let's take a look at another situation to make this a little more concrete. Imagine someone shoots up an electrical substation and knocks out power for thousands of people
- This isn't hypothetical. This happened down the road in Moore County North Carolina at the end of 2022.
Moore County power outage
- READ THE SLIDE top to bottom
- But in an incident, you don't actually know what the fault is, so in practice you're reading from the bottom up
- As soon as you find a way to stop error propagation, you're done
- In this case, we need a transformer, but our current transformer is beyond quick repair.
- So, how did Duke Energy solve this?
Phantom Menace
- They took a tip from George Lucas
- HIT PLAY
- Just like in the movie, they've mitigated the impact, but need to do heavy repairs before they can continue.
- They survived, and that is what Fixing the Failure is all about. Survival.
Mitigate the impact
- Something went wrong, and with heroic effort and some duct tape, you got it working again.
- Incident response doesn't stop with recovery, in fact it's only the tip of the iceberg.
Retrospective
- It's time to review what happened. Many monikers: READ THE SLIDE
- These all mean the same thing
- Why did it happen? Why didn't we catch it before it got so bad? How can we make it less painful to fix in the future? Just how bad was the outage? Can we prevent this from happening in the future?
- The nuance of effective postmortems is yet another lecture. Lots of great resources on the Internet and I'll include some at the end.
B17 photo
- The B-17 bomber was used extensively during World War 2, but it had a rather mysterious crash record. A significant number of crashes didn't happen in battle, but instead happened right as the planes were landing back at home. Not on the way home, but at the airstrip. Right when they were about to land. That's weird. You would think if you've survived the battle that you would be fine, or at least not always fail at the same moment.
- The military tried a bunch of ideas to mitigate the impact, mostly unsuccessfully.
- They ended up stopping the crashes in a rather effective way: they ended the war.
B17 dashboard
- As it turns out, in the cockpit, there were two identical switches right next to each other: one switch would "do the right thing" for landing, and the other would cause you to suddenly fall out of the sky and die. Tired pilots were accidentally hitting the wrong switch and crashing the plane. Over and over and over.
How can you fix this problem?
- Without telling you the answer, I want to hear from you.
- With the data provided, how would you stop planes from crashing?
- TAKE SUGGESTIONS
- Training: great suggestion, and maybe a bit of a trap here. I will assert that training is never enough and that you will always find yourself back in a postmortem asking the same questions: why weren't they trained on this? Why didn't they remember this from the training? You should always assume that you have intelligent humans who want to do the right thing, so if something goes wrong, it can't be the humans' fault. Instead, there must be something more you can do.
Shape Coding
- Make the two different controls as different as possible so that mixing them up becomes effectively impossible
- And then make it the exact same on every plane everywhere in the world
- As a fun aside, a different failure in the B17 prototype was the direct reason for pre-flight checklists
737 MAX
- Keeping with the airplane theme, a few years ago Boeing's latest airliner, the 737 MAX, had a bit of an issue: it was crashing unexpectedly and killing people. Boeing blamed it on the pilots (MAYBE: which, based on what we just talked about, should instantly be setting off alarm bells for you). Among other problems, one of the major ones was that Boeing had broken a previously streamlined workflow.
- Normally in a commercial aircraft, if the computer goes haywire, there are buttons right next to your thumbs that you can hit that will instantly disable everything and give the pilot full control. This came about for a reason, but Boeing forgot that. They added a second switch that you needed to remember to hit.
- A lot of people died because they didn't remember why the system was designed the way it was.
Not just What, but Why
- It is crucial that you think about not just what went wrong, but why it went wrong so you can fix the why.
- Let's return to the queuing story: why did the job queue unexpectedly fail when there were a large number of jobs?
- TAKE SUGGESTIONS
- Simply because we never tested it at that scale. It was built years earlier, when the company was much smaller, and we never characterized its behavior under this new level of load.
Mitigate the impact et al
- So you've done these things: READ THE SLIDE
Getting ready for next time
- Time to move on to the next step: getting ready for the next incident
- We established earlier that incidents are unavoidable, so once you're done with this incident, you're really just preparing for the next one
Rewrite history
- This is where you are going to leverage your written notes heavily: you are going to look at what happened last time and figure out how to do it faster
- If you assume that the same thing happens, and you react the same way, then you'll get the same result. Cool, but this presentation is about doing better.
- The second time around, we know more than we did before, so I'll bet there are a few things that don't need to be figured out again; you can just do them instead of re-inventing the wheel each time.
- So, write those things down and then whoever is responding can just "do the thing".
- These are called runbooks or playbooks or incident response procedures
- They amount to doing your thinking as a team when you're not under stress and ensuring everyone has the same knowledge.
- It's basically like preparing for an open book exam with unlimited notes.
Optimize the system for Future You
- Almost every incident results in the participants learning something new about how systems interact.
- Look at your notes again: when were you asking questions that you didn't know how to answer at the time?
- How much time did you spend trying to find the failing system?
- Most incidents are solved by just finding the right graph or log entry that shows "when X happens, everything falls apart", so make that as easy as possible.
- Build better instrumentation so you can see what is happening inside the system,
- build dashboards that collect together relevant graphs,
- and create alerts that automatically draw your attention to a failing system.
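- As a concrete (and entirely hypothetical) example of that last point, tied back to the queuing story: a watcher that checks the one metric that predicted the failure and pages a human before the site falls over. The metric source and paging hook below are placeholders, not a real monitoring API.

```python
# Sketch of an alert on the metric that predicts the known failure mode.
import time

QUEUE_DEPTH_WARN = 10_000   # assumed threshold where job processing starts degrading
CHECK_INTERVAL = 60         # seconds between checks

def get_queue_depth() -> int:
    """Placeholder: a real version would query your metrics store or the queue itself."""
    return 0

def page_oncall(message: str) -> None:
    """Placeholder: a real version would go to your paging or chat system."""
    print(f"PAGE: {message}")

def watch_queue() -> None:
    """Poll the queue depth and page before the known failure mode kicks in."""
    while True:
        depth = get_queue_depth()
        if depth > QUEUE_DEPTH_WARN:
            page_oncall(f"Job queue depth is {depth}; processing is about to degrade")
        time.sleep(CHECK_INTERVAL)
```

- In practice you would encode this in your monitoring system rather than a hand-rolled loop, but the idea is the same: turn the lesson from the last incident into an alert so Future You hears about it before the customers do.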
Build better duct tape
- Just like we can't prevent incidents from happening, we can't prepare for all incidents that might happen.
- Thankfully, we can look at what has broken as a strong indicator of where things might break again in the future. If something was complex enough to break once, it will probably break again.
- So build tools for the things that have broken.
- Callback: Moore County power outage and the mobile transformer that saved the day; that transformer wasn't created proactively, it was created because transformers had failed before. Getting shot up was new, but transformer failures aren't.
- Callback: Box Queuing issue.
- I built the tool, and it saw regular use at Box over the next several years.
- One of the engineers saw that and remembered it, so when they left to go to another company and built a system with a queue, they copied the tool.
- Surprise, that company was Twitch, so when I showed up and that system was handed off to me, the incident response tooling I needed was already built.
- Communicating really does pay off
Thank You - Future You
- Future You appreciates you paying attention to this presentation and fixing all these problems before the next incident
I've been talking about security the whole time
- It turns out all of these concepts apply to security as well; you just end up doing most of the work proactively.
- You don't want to wait for a threat actor to compromise your systems, so you do it yourself through penetration tests, threat hunts, and bug bounty programs.
- You work backwards from the unintended exposure and understand how you could have detected it sooner and prevented it from happening.
- You build procedures and runbooks to enable you to remediate exposures as quickly as possible.
- You build forensic capture tooling so that you can cut off a threat actor without losing the ability to analyze what happened.
- It's all the same, just with higher stakes.
Conclusion
- READ THE SLIDE
- Incident response is important, unavoidable, and difficult. It requires intentional effort to get good at. It's also a lot of fun.
- Write down what is happening in the moment. What are people doing? What are they thinking? What is slowing them down?
- Mitigate the impact first (fix the failure, not the fault) and then use your notes to fix everything else when the business is no longer losing money
- Review every dimension of the incident: How bad was this? What actually went wrong? How can you mitigate faster? How can you prevent this from happening again in the future?
- Do Future You a favor. Future You is tired of this. Make Future You's job easy so Future You can get back to the party or sleep or whatever they were doing when everything broke
Questions
- That's all I've got for you today. Slides and additional reading are available at the link on the slide.
- I'm happy to answer questions about anything in the presentation, incident response, Site Reliability Engineering, Security Engineering, management, frankly anything that would be helpful to y'all.
- I'm in no rush to leave, so I'll be here until they kick me out.
- Thanks for listening and good luck with your first incident.