how we went from being austronaus to being mission control managing systems in a age of dynamic complexity -- Laura Nolan # intro co-author of the "the sre book" from google senior staff at slack # All cloud providers have outages even with full hardcore designs to be fault tolerant to full DC failures, they sometimes have large failures. on the other side, a single lowly server, with 4-5y uptimes are not unheard off. # The old ways configuration by hand load-balancer backend pools statically config no auto scaling ... # Times have changed Automate everything job orchestration autoscaling routing, failover, balancing traffic We have largely succeeded # other pressures better perf and latency, focus on tail latency reduce toil of managing systems react faster to failure consistent prod (infra as code, no pet systems) reduce compliance risks related to humans accessing prod # Dynamic control pane architecture pattern basic: a service pool (emits signals) a signal aggregator a controller (reads signals), acts on the service pool This is a very common pattern, this is a basic autoscaler, or HPA or whatnot. this is now a core part of your prod system, don't keep this in "janky python" Now you can also have: a global controller ontop of a multiple set of basic ones that are in distinct DCs or zones.. For example a global DNS loadbalancer you can now deal with inbalanced systems and even out the load. Another example is Google's global Sw-defined network used for batch/bulk. there is a global optmizer and bandwith broker it receives metrics from zone-local collectors of metrics it acts on Traffic-Engineering servers on each zone # This is just not automation This is mission-critical systems. This is the automatic operator of the critical systems. "We don't run our systems, these systems run our systems" We loose "direct observation" and "mechanical sympathy" with these systems. # Now lets talk about Control Plane Failures - December 24, 2012, AWS Elastic LB See: https://aws.amazon.com/message/680587 accidental deletion of state Full recovery took 24h Post incident action item was: lock down write access to the ELB control plane state Operators need mental model of: - how the system works - how the automation system works So, this is 4x harder to do. - April 11, 2016, GCE See: https://status.cloud.google.com/incident/compute/16007 Full network outage This was a propagation of a IP Anycast that is busted. only when the propagation reached the last good system, the outage starts. so, lack of signals/metrics that things are about to go bad. Multiple network control planes that interacted poorly: - canary system - control plane using it - delayed propagation of changes w/ delayed sideeffects Classic complex systems failure involving multible bugs and latent problems Testing these systems are a real challenge. - June 2, 2019: Google network outage elevated packet loss for 4h25m See: https://status.cloud.google.com/incident/cloud-networking/19009 Maintenances are common and automated. This is the "usual work" of a network system. this is another case of multiple bad things happening, time to detection was fast. time to repair was high becaue the network was in the shit, the global state was damaged with the full outage. There was loss a global state that didn't have a "checkpoint" system or similar. (I remember reading this one, and it was a very nasty one..) # Learnings Testing failsafe or fail static behaviour is scary. We're adverse to "fail" systems to run on failsafe just for normal operation. (Imagine if they did, and caused a global outage..exaclty because of a bug on the failsafe..) Use regional or zonal control systems. Avoid global systems Reduce blast radius Test your control planes really really hard. Plan for time needed for operators to stay familiar with the underlying operations. no matter how much automation, you need a disaster recovery plan, you need to EXERCISE your plans. We need DRILLS. (added by me: Main lisbon's hospital runs every saturday on the power generators to ensure it can run without main power) they DRILL IT WEEKLY!! SOMETIMES HUMANS ARE BETTER it is good to not fully automate some things. build tools, but do they need ZERO intervention? Make all your control systems to log into the same place. they are working independently, but we need to understand the sequence of events and relate them if needed. Have a "KILL SWITCH", consider a "switch to manual"