When he wasn’t chasing the Pink Panther, Inspector Clouseau had a way of testing and sharpening his reflexes: he employed a martial arts expert called Cato Fong to attack him at just the moment when he was least expecting it. The aim was to ensure Clouseau would always be alert to an imminent attack.
This generated lots of laughs for Pink Panther audiences, but the same principle, under the moniker Chaos Monkey, is being applied by Netflix engineers to their own network infrastructure. Netflix created Chaos Monkey when it moved its operation over to the AWS cloud. It targets Auto Scaling Groups (ASGs) within the cloud and terminates virtual machines at random. To date, according to Netflix, it has terminated in excess of 65,000 virtual instances.
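The random-termination idea described above can be sketched in a few lines of Python. This is only an illustrative simulation, not Netflix's implementation: it makes no AWS calls, and the group names, instance IDs and the `pick_victims` and `terminate` helpers are all hypothetical.

```python
import random


def pick_victims(groups, probability=1.0, rng=None):
    """Pick at most one instance per group to terminate.

    `groups` maps a (hypothetical) ASG name to a list of instance IDs.
    Each non-empty group is hit with the given probability.
    """
    if rng is None:
        rng = random.Random()
    victims = {}
    for name, instances in groups.items():
        # Skip empty groups; otherwise roll the dice for this group.
        if instances and rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims


def terminate(groups, victims):
    """Return a copy of `groups` with the chosen victims removed."""
    return {
        name: [i for i in instances if i != victims.get(name)]
        for name, instances in groups.items()
    }


# Made-up inventory standing in for real Auto Scaling Groups.
groups = {
    "web-asg": ["i-web-1", "i-web-2", "i-web-3"],
    "api-asg": ["i-api-1", "i-api-2"],
    "empty-asg": [],
}

victims = pick_victims(groups, probability=1.0, rng=random.Random(42))
survivors = terminate(groups, victims)
```

In a real deployment the auto scaling group would be expected to replace each terminated instance, which is exactly the behaviour the exercise is meant to verify.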
Attack of the Chaos Monkey
Chaos Monkey tests the administration and engineering team and drives them to build an always-ready, robust service infrastructure. Just like Cato, it attacks at random, sending the engineers’ stress levels through the roof but forcing them to adapt and learn. They have to live with this constant threat and prepare for the next outage.
Chaos Monkey came to Ireland not so long ago, when Pavlo Baron gave a demonstration during the first Erlang Factory Lite event in Dublin, held at AOL’s HQ during the summer. Even before that, Ireland’s own John Kavanagh released a version of Chaos Monkey for the Slim Framework into the wild more than two years ago.
The ultimate benefit of Chaos Monkey is that it makes network infrastructure more resilient to the outages that will inevitably occur in a live business operation. Paradoxically, deliberately breaking things puts the administration team more firmly in control of the network or cloud infrastructure, because a team that is always fixing a network develops a deeper and wider understanding of it.
It’s better to find out sooner rather than later that a new deployment or configuration doesn’t work; later could be 4am on a Sunday morning or Christmas Day. By default the app causes outages only on non-holiday weekdays, but it can be scheduled to terminate instances at any time. Of course, it makes more sense to run it when a team of administrators is around so they can learn from what’s going on.
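A scheduling guard of this kind might look like the sketch below. It assumes the default window is working hours on non-holiday weekdays; the 9am to 3pm window, the holiday list and the `may_terminate` function are assumptions made for illustration, not Chaos Monkey's actual configuration.

```python
from datetime import date, datetime, time

# Hypothetical holiday calendar; a real one would come from configuration.
HOLIDAYS = {date(2013, 12, 25)}


def may_terminate(now, holidays=HOLIDAYS, start=time(9, 0), end=time(15, 0)):
    """Return True only when staff are likely to be around to respond."""
    if now.weekday() >= 5:       # Saturday (5) or Sunday (6)
        return False
    if now.date() in holidays:   # e.g. Christmas Day
        return False
    # Inside the (assumed) working-hours window.
    return start <= now.time() < end


monday_morning = may_terminate(datetime(2013, 12, 23, 10, 0))
sunday_4am = may_terminate(datetime(2013, 12, 22, 4, 0))
christmas_day = may_terminate(datetime(2013, 12, 25, 10, 0))
```

Here `monday_morning` is allowed, while `sunday_4am` and `christmas_day` are blocked, so failures are injected only when someone is on hand to learn from them.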
The ultimate aim is to ensure that, regardless of what goes wrong, the service can still be delivered and brand integrity maintained.
But the long-term value for the company is the knowledge accrued from disruption and failure. This is another example of the entrepreneurial culture extending into the 2013 corporation. Good entrepreneurs are serial entrepreneurs, with a portfolio of failed projects and a vast dataset of knowledge about their industry as a consequence.
So putting IT staff through a simulated entrepreneurial experience can create a team of switched-on, keen-minded problem solvers.
And Netflix hasn’t stopped with just a Chaos Monkey: it has released a whole Simian Army of chaos primates to help engineers keep their cloud infrastructure operating in top form!