Release It! Second Edition by Michael Nygard
This great book covers a variety of topics in the space between the code being built and the user interacting with it.
It covers four slices of the journey:
- Creating Stability
- Designing for Production
- Delivering the System
- Solving Systemic Problems
Each section starts with a case study, or war story, from the author's experience, which illustrates what can go wrong and how quickly. It also gives pointers on the sorts of factors to think about when approaching these problems.
Some key messages I took from the book:
It's often more important for your team to be flexible than efficient. To respond quickly to issues in a deployed system, or to address a new requirement, you need to move through the development process fast. Having a team that can code-test-build-deploy without handoffs to other teams speeds this up.
My current team is unusual in my organisation in that we can operate independently, as our system sits off to one side from the main platform. We can go from a bug fix being committed to git on a dev's machine to the fix being in Production in 30 mins. While it may, on the face of it, be more efficient to have specialists in different disciplines (a separate team managing the servers/cloud and the deployments is probably the commonest example), this comes at the expense of flexibility when you may need it the most. "A container ship trades efficiency for flexibility".
- Your development ecosystem should be treated as a Production environment. This is a frustration I have seen at most places. Internal package feeds should be able to deliver the packages requested. Build agents should be available and kept up-to-date with the requirements needed to build the software. Machines, be they servers/VMs or developer laptops, should be up to the job. Azure DevOps should be up (recently that's been a bit off the mark…). More importantly, an outage in this Production environment should be treated with the seriousness of one in the customer-facing one.
In a bit more detail, here is what I took from each section.
Creating Stability
Plan for failures. Use Circuit Breakers. Build in Crumple Zones. Couple loosely.
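The Circuit Breaker pattern is worth sketching: after a run of failures, calls to a struggling dependency fail fast instead of piling up waiting on it, and the breaker retries only after a cooling-off period. A minimal sketch in Python (the class name, thresholds and error handling here are my own illustration, not the book's code):

```python
import time

class CircuitBreaker:
    """Fail fast after repeated failures; allow a retry after a cool-off."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

The key property is that once the breaker is open, callers get an immediate error rather than a blocked thread, which is exactly the loose coupling the book is after.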
"A robust system keeps processing transactions, even when transient impulses, persistent stresses, or component failures disrupt normal processing. "
For every I/O call ask "What are the ways this can go wrong?"
Don't trust client libraries to handle connections cleanly.
Blocked threads are the main cause of responsiveness issues.
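The commonest way threads end up blocked is an I/O call with no deadline. A small sketch of the habit worth building (a raw-socket example of my own, assuming a simple request/response protocol; the point is simply that both connect and read carry a timeout):

```python
import socket

def fetch_with_timeout(host, port, payload, timeout=2.0):
    """Never issue a network call without a deadline; a missing timeout
    is how request-handling threads end up blocked forever."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.settimeout(timeout)  # applies to reads, not just the connect
        conn.sendall(payload)
        return conn.recv(4096)
```

A `socket.timeout` here is a recoverable, reportable failure; a thread stuck in an untimed `recv` is an invisible one.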
Use two-factor monitoring, i.e., in addition to internal monitoring, monitor responsiveness from the outside to capture the user experience.
Make domain objects immutable.
Cache carefully: don't cache data that is cheap to get. "Keeping something in cache is a bet that the cost of generating it once, plus the cost of hashing and lookups, is less than the cost of generating it every time it's needed."
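When the bet does pay off, the cache still needs a size bound so it can't eat all your memory. A quick illustration with Python's stdlib (the `time.sleep` stands in for whatever slow fetch you're avoiding; this is my example, not the book's):

```python
from functools import lru_cache
import time

@lru_cache(maxsize=1024)  # always bound the cache: unbounded caches leak memory
def expensive_lookup(key):
    time.sleep(0.01)  # stand-in for a genuinely costly fetch or computation
    return key.upper()
```

If the body were cheap, the hashing and lookup overhead of the decorator would be pure loss; that's the bet in the quote above.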
Stagger your cron jobs to avoid an avalanche at 0001 hours.
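The simplest way to stagger them without editing every crontab is to add random jitter inside the job wrapper, so a fleet of machines sharing the same schedule spreads itself out. A sketch (the wrapper name and default window are mine):

```python
import random
import time

def run_with_jitter(job, max_jitter_seconds=300):
    """Sleep a random interval before running, so machines that share a
    crontab entry don't all hit shared services at 00:01 exactly."""
    delay = random.uniform(0, max_jitter_seconds)
    time.sleep(delay)
    return job()
```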
There are a number of patterns & anti-patterns for system stability.
Automate what you can but build in limits so that you don't end up automating your operation to a halt.
Designing for Production
SiteScope can be used to simulate a customer base's traffic.
You can recover much more quickly if you can restart components/services rather than whole servers. Remember that rebuilding the cache can be what delays the restart, or rather, delays the point at which the service becomes useful again.
Beware of differences in internal clock times between servers; use an NTP server to keep them in sync.
It's hard to debug in a container; log to an external target.
Manage your dependencies; don't download from NuGet straight into Production.
Log widely to give transparency: "Debugging a transparent system is vastly easier, so transparent systems will mature faster than opaque ones"
When your system is overloaded you need a method to shed the load, to try to help you recover. You need to be able to do this early on in the request handling pipeline, not after it's consumed a lot of resources.
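One way to shed early is an admission check at the very front of the pipeline: if the in-flight count is already at the limit, reject immediately with a 503 rather than queueing work that will time out anyway. A minimal sketch (class name, limit and tuple-shaped responses are all my own illustration):

```python
import threading

class LoadShedder:
    """Reject work at the front door once concurrency hits a limit,
    instead of queueing requests that will time out anyway."""

    def __init__(self, max_in_flight=100):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, request_fn):
        if not self._slots.acquire(blocking=False):
            # shed before the request consumes any real resources
            return ("503 Service Unavailable", None)
        try:
            return ("200 OK", request_fn())
        finally:
            self._slots.release()
```

The rejected caller gets a fast, honest failure it can retry elsewhere, and the server protects the requests it has already accepted.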
He introduces the idea that monitoring isn't just about system health, in the sense that we want a healthy system. It's also about the financial health of the organisation:
- "We should build our transparency in terms of revealing the way that the recent past, current state and future state connect to revenue and costs"
- Check queue lengths: a non-zero queue means something is slow, and that is a potential loss of revenue.
Canary deployments (to a subset of your VMs/instances) limit the risk of a deployment.
Security: treat Forbidden (403) as Not Found (404) to stop an attacker finding a door to break in through. It's easier to break into something if you know there is a locked door rather than a wall.
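In handler terms this means the "missing" and "not allowed" branches must produce byte-identical responses. A tiny sketch (the function, `store` and `can_access` callback are hypothetical names of my own):

```python
def resource_response(user, resource_id, store, can_access):
    """Return the same 404 whether the resource is missing or the caller
    isn't allowed to see it, so probing can't map out what exists."""
    resource = store.get(resource_id)
    if resource is None or not can_access(user, resource):
        return (404, "Not Found")  # never reveal "exists but forbidden"
    return (200, resource)
```

Watch response timing too: if the forbidden branch is measurably slower than the missing branch, the door is still visible.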
Delivering the System
He speaks a lot about version handling and how versions relate to upgrades, e.g., of schemas and of document structure in NoSQL databases.
Solving Systemic Problems
Make sure QA are testing to guarantee stability and functionality in Production, not just to work in a much more limited QA environment.
Code adaptation: be prepared to retire the B in an A/B test. Don't starve A of resources while attempting to make B worthwhile.
Microservices: use with care. You probably don't need them if you don't have scalability concerns as the debugging overhead can be considerable.
Chaos engineering: if your dashboard is all green then your monitoring tools aren't good enough, as something somewhere will be below par.