The service sector is taking on more and more responsibility for sustaining Ireland’s recovery. That makes a perennial question among business thinkers more pertinent than ever: what makes an ideal service? A common and cogent answer is a service that unfailingly anticipates the client’s needs.

This would certainly make a business in the service sector successful, but a more fundamental quality is a service that’s uninterrupted and uninterruptible, one so perfect in its continuity that not even an act of God can take it down.

Netflix are trying to achieve this in a counterintuitive way. They’re letting loose a range of disruptive programmes in their cloud environment that simulate the effect of a troop of monkeys running wild in an engine room.

The applications, collectively called the Simian Army and now open source, identify weak points in Netflix’s infrastructure by simulating systemic problems. These drill the engineers in coding solutions and in maintaining Netflix’s 100% up-time. In the real world, problems like these could bring the operation to a standstill.

For service providers like Netflix, the cloud offers an advantage: programmes like Chaos Monkey and the other members of the Simian Army can simulate every conceivable operational glitch and failing without affecting Netflix’s service delivery itself. While Chaos Monkey is running riot in a controlled simulation, Netflix’s service is running smoothly in the background.

In practice the Simian Army is monitored while it’s probing and testing the infrastructure, and there’s a team of engineers on hand to meet any challenge that comes out of the exercise.


The objective is that eventually every conceivable weak juncture in the system will be identified and the solution will be automated and implemented. Ultimately, there’ll be no nasty surprises for the team running the service.

Gorillas in the Midst…

And Netflix’s army is scaling: there’s a new recruit called Chaos Gorilla. Last year we wrote about Chaos Monkey, which brings down individual instances to expose single points of failure. Chaos Gorilla goes further and simulates an outage of an entire Amazon availability zone (AZ).

Scaling from monkey to gorilla reflects the difference between an instance failing and a complete AZ failing. For customers at the end of the service chain it’s the difference between a manageable annoyance and a full-blown catastrophe, one that would cost Netflix in brand reputation and customer satisfaction.
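To make the difference in scale concrete, here is a minimal toy sketch in Python. It is not Netflix’s code: the `Instance` class, the zone names and the `chaos_monkey` and `chaos_gorilla` functions are all hypothetical, and the "fleet" is just a list in memory. It only illustrates that one programme removes a single instance while the other removes everything in one zone.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """A hypothetical stand-in for one cloud instance."""
    instance_id: str
    zone: str
    healthy: bool = True


def build_fleet(zones, per_zone):
    """Create per_zone instances in each of the named zones."""
    return [Instance(f"{zone}-{n}", zone) for zone in zones for n in range(per_zone)]


def chaos_monkey(fleet, rng):
    """Knock out one randomly chosen healthy instance and return it."""
    victim = rng.choice([i for i in fleet if i.healthy])
    victim.healthy = False
    return victim


def chaos_gorilla(fleet, zone):
    """Knock out every healthy instance in one zone and return them."""
    victims = [i for i in fleet if i.zone == zone and i.healthy]
    for instance in victims:
        instance.healthy = False
    return victims


def surviving_zones(fleet):
    """Zones that still have at least one healthy instance."""
    return sorted({i.zone for i in fleet if i.healthy})


fleet = build_fleet(["zone-a", "zone-b", "zone-c"], per_zone=3)
chaos_monkey(fleet, random.Random(42))   # one instance lost, every zone still serving
chaos_gorilla(fleet, "zone-b")           # a whole zone lost
print(surviving_zones(fleet))
```

With three instances per zone, losing one instance leaves every zone in service, while the gorilla leaves the fleet relying entirely on the two remaining zones.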

Using Chaos Gorilla to probe for problems in the AZ means the engineering team is taking responsibility for their full cloud environment and not just leaving it to AWS.

The closest sibling to Chaos Monkey is Latency Monkey: this programme simulates a degradation of service to test whether services further down the service chain respond adequately. Just as Chaos Monkey morphs into Chaos Gorilla, Latency Monkey can scale up and simulate an entire service going down. The engineering team can then test solutions and run coding drills to resolve the issue.

Continuous deployment of code by cloud companies tests not only the operability of service delivery but its security as well. And just like the service’s operation, security has to be constant and can’t be compromised by constant product updates.

Security Monkey works in just the same way as Chaos Monkey, testing a system that’s going through high-velocity change. But it operates in the security sphere and simulates security configurations, allowing the team there to identify how they might impact on their AWS environment and to automate solutions to a wide range of problems.
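The idea of injecting delay and checking how a caller copes can be sketched in a few lines of Python. This is a hedged illustration only: `recommendations`, `with_injected_latency` and `call_with_fallback` are hypothetical names, and latency is modelled as a number returned alongside the response rather than as a real network delay, so the example runs instantly.

```python
def recommendations(user):
    """Hypothetical downstream service: returns (response, latency in ms)."""
    return [f"title-for-{user}"], 20


def with_injected_latency(service, extra_ms):
    """Wrap a service so that every call reports extra_ms more latency."""
    def degraded(request):
        response, latency_ms = service(request)
        return response, latency_ms + extra_ms
    return degraded


def call_with_fallback(service, request, timeout_ms, fallback):
    """Use the response if it arrived within the timeout, else the fallback."""
    response, latency_ms = service(request)
    if latency_ms > timeout_ms:
        return fallback
    return response


popular = ["popular-title"]
slow = with_injected_latency(recommendations, extra_ms=500)
down = with_injected_latency(recommendations, extra_ms=float("inf"))

print(call_with_fallback(recommendations, "ann", 100, popular))  # normal path
print(call_with_fallback(slow, "ann", 100, popular))             # degraded service
print(call_with_fallback(down, "ann", 100, popular))             # service effectively down
```

Setting the injected delay to infinity is a simple way to model the scaled-up case, where the dependent service never answers in time and the caller has to live on its fallback.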

“Its [Security Monkey’s] approach fits well with the general Simian Army approach of continuously monitoring and detecting potential anomalies and risky configurations”.

Patrick Kelley, Kevin Glisson, and Jason Chan (Netflix Cloud Security Team)

And Security Monkey’s value to Netflix security doesn’t end with simulations: it provides the security team with a reliable configuration history for forensic investigations. It also scales right across Netflix’s AWS services.

Not all of Netflix’s Simian helpers are agents of Chaos. There are hugely helpful programmes like Conformity Monkey, which searches for and shuts down instances that don’t adhere to best practice, as they can create problems later if they’re not dealt with in time. Doctor Monkey monitors the vital signs of instances running on the cloud; any unhealthy ones it finds are removed from service and terminated. Janitor Monkey is barely the same species as Chaos Monkey: it checks the cloud for clutter and resources that have been left lying around and gets rid of them.
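A configuration history of this kind can be pictured as a list of revisions per item, with a new revision stored only when something changes. The Python sketch below is an assumption-laden illustration, not Security Monkey’s actual data model: the `ConfigHistory` class, the `changed_keys` helper and the `sg-web` security group are all made up for the example.

```python
class ConfigHistory:
    """Hypothetical store that keeps every distinct revision of an item's config."""

    def __init__(self):
        self._revisions = {}

    def record(self, item, config):
        """Store config as a new revision if it differs from the latest one.

        Returns True when a new revision was stored, False otherwise.
        """
        revisions = self._revisions.setdefault(item, [])
        if revisions and revisions[-1] == config:
            return False
        revisions.append(dict(config))
        return True

    def revisions(self, item):
        """Return a copy of all revisions recorded for item, oldest first."""
        return list(self._revisions.get(item, []))


def changed_keys(old, new):
    """Names of the settings whose values differ between two revisions."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


history = ConfigHistory()
history.record("sg-web", {"port": 443, "source": "10.0.0.0/8"})
history.record("sg-web", {"port": 443, "source": "10.0.0.0/8"})  # unchanged, not stored
history.record("sg-web", {"port": 443, "source": "0.0.0.0/0"})   # opened to the world

before, after = history.revisions("sg-web")
print(changed_keys(before, after))
```

An investigator looking at this history could see both that the rule was opened up and what it looked like beforehand, which is the kind of trail a forensic review needs.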
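A clean-up sweep of the Janitor Monkey kind boils down to a rule about what counts as clutter. As a rough Python sketch, assuming a made-up rule (a resource is stale if it is unattached and has been idle longer than a threshold) and made-up resource records, it might look like this. None of the names below come from Netflix’s code.

```python
from datetime import datetime, timedelta


def find_stale(resources, now, max_idle):
    """Return ids of resources that are unattached and idle for longer than max_idle."""
    return [
        r["id"]
        for r in resources
        if not r["attached"] and now - r["last_used"] > max_idle
    ]


now = datetime(2014, 6, 30)
resources = [
    # Hypothetical inventory records
    {"id": "vol-1", "attached": True, "last_used": now - timedelta(days=90)},
    {"id": "vol-2", "attached": False, "last_used": now - timedelta(days=45)},
    {"id": "vol-3", "attached": False, "last_used": now - timedelta(days=2)},
]

print(find_stale(resources, now, max_idle=timedelta(days=30)))
```

Only the unattached resource that has sat idle past the threshold is flagged; anything still in use or only recently orphaned is left alone.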


Dawn of the Planet of the Cloud Apes

Cloud engineers have the twin challenges of automating their services and managing constant change. They must also be able to manage the rapid scaling of demand that comes with the vastness of Netflix’s global service. To do all this the engineers have to know their infrastructure intimately, and the only way to do that is to fix it constantly after simulated foul-ups and downtimes. What’s important in accruing this knowledge is anticipating problems and responding to them with tried and tested troubleshooting drills.

And with the Simian Army now open source and being taken up by other companies, we could see the Dawn of the Planet of the Cloud Apes in the very near future…