"DevOps" purports itself to be the solution to this problem, so I wanted to take a look at this concept and try to figure out its merits. I'm going to try to answer a few basic questions: What is DevOps? What are its philosophical origins? Is there reason to think it can deliver on it's promise? Also, every new idea has its critics so I want to examine the leading criticisms of DevOps and see what useful ideas we can take away from them.
OK, so what is DevOps, in a nutshell? DevOps is a set of business practices that an IT organization uses to maximizes the total value it delivers over time when we consider both new functionality and reliability, availability, support and the overall operations cost structure. I know, I know -- you're thinking "gee, that's a little vague". It is, because I haven't said what the "set of business practices" actually is. There are several specific ones, and I'll get to them, in a little bit. I promise. But first, I want to cover the origins, so that we can see that the techniques come from a very principled place.
Philosophical Origins of DevOps
I'm a software engineer, and whenever I talk to my fellow IT guys, I hear that software is special and unique. There is a lot of truth here, but it often stops us from looking to other arenas, like lean manufacturing, for ideas. A lot of "agile development" can be derived by applying lean thinking to software. Whether the people who coined the term "agile" knew this consciously is an interesting historical question, but it's largely irrelevant. If somebody set off a memory bomb that erased the agile manifesto and all the praxis that sprang from it, we'd start off the next day and ask "how can we eliminate waste from the software process, so we minimize everything but the direct creation of value as expressed by the voice of the customer?" In fact, I argue that lean thinking takes us a little further than "agile software development", because, as I think the industry is now coming to grips with, software development is not the entire value stream in IT. DevOps is the continuation of lean and agile applied to the entire IT value stream.
Lean came from manufacturing - it's no secret. When I pitch lean to people in IT I often get the reaction "but we aren't assembling cars". This is a standard mental model anti-pattern about lean. Lean applies to everything a business does, not just to manufacturing. Toyota didn't succeed with lean by just applying it to the assembly line. In fact, the real secret of applying lean to cars is that you have to put a lot of effort into making the assembly line lean in the first place, and eliminating waste in the new product introduction process is the killer advantage. Toyota's time to market with new models is what crushed the competition back in the day. The fact that their manufacturing floors had less inventory and shorter cycle times made their numbers look better, but people don't buy your cars because of your factory's excellent financials, they buy them because your company gets fresh thinking to market faster.
DevOps is the Manufacturing Engineering of IT
In manufacturing companies, the hard work of creating the assembly line is done by manufacturing engineering. Let's look at this idea a little more and you'll see that lean manufacturing engineering is really the key idea of the Toyota production system. These guys didn't wake up one day and say "let's set up some kanbans". You can't wait for the car design to be complete and then figure out how to build it. You have to bring the manufacturing mindset into the engineering world. Software development is more akin to automotive engineering. Who is most like manufacturing? Well, it's IT operations - the sysadmins. They keep the factory running. They like stability and highly repeatable processes. In this analogy, DevOps plays the role in IT that manufacturing engineering plays in a manufacturing setting.
So if DevOps is manufacturing engineering applied to IT, what is it really all about? Well, DevOps people have to be involved in the development process, solving the problems that ops will care about. And they are involved in the ops process, fixing things that development did wrong, so that we don't institutionalize them. It's always better to do the former, but you won't know what the former is until you've done the latter. Lean is a continuous improvement journey; it understands that you have to eliminate waste one improvement at a time. DevOps is about engineering the software so that operations is lean. So finally, we are ready to look at the specific techniques DevOps brings to the table.
What's In the DevOps Bag of Tricks
DevOps techniques fall into two buckets: cultural and technical. You have to do both because it's an "area of the rectangle" kind of thing. You probably have to do some of the cultural things before you have any hope of implementing the technical things. Otherwise the pointy-haired bosses descend on the enlightened engineers and ask them why they aren't doing what they should be doing, which is, of course, a question they don't really want answered.
The Cultural Factors DevOps Incorporates
There are a number of cultural conditions that DevOps leverages. It both requires and builds on these:
- Create a climate of continuous improvement. Stop thinking it's good to fix things over and over. Start thinking it's good to task people with prevention. Start trying hard to find and fix the bottleneck. Nothing else matters.
- Optimize the whole. Task somebody with both shipping new features and keeping operations smooth. These have to be techs who do stuff. Not just management. These people aren't devs and they aren't ops, they are both "DevOps" and neither. But they speak the language of both.
- Trust and Respect. Force ops and devs to spend time together. It's easy to bash people when they aren't in the room. The DevOps team from the previous point will make this happen because they own things that cut across both groups.
- Be obsessed with eliminating waste. A small team (6-10 people) should have 1 to 3 improvement projects in flight at a time, whose goal is to move the needle. Look at real results and real evidence. The team should prioritize and pick what to solve.
- Focus on recovery and prevention. Know your top cause or two of failure and work to prevent them. Deal with the rest by planning for failure. Focus on recovery time. Cut things like monitoring latency, app start time, and VM provisioning time in half.
- Build ops concerns into the software. Actually prioritize some developer resources to benefit operations over adding features for customers. Do it because it benefits customers in the long run. If this seems strange to you, you are afflicted with suboptimized thinking. Go talk to your ops. Life is not only about shipping features, but if you want to ship features faster, stop trying to ONLY ship features.
As you solve problems and eliminate waste, you need to be focused on taking the reward by shortening your cycle time and reducing work in progress. The less work in progress you have, the less total waste when the bigwigs change direction on you. The biggest waste I've seen in IT is the sudden "change in direction" where projects that were going fine are shelved to free up people. If you ship quarterly, canceling something to go after the new shiny object wastes up to 25% of your organization's resources for the year. If you ship weekly, it hurts but you can deal with it. If you ship daily, nobody even notices.
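The arithmetic behind those numbers can be sketched roughly as follows. This is only an illustration: it assumes the worst case is that one full release cycle of unshipped work gets thrown away, and it assumes a 260-day working year. The function name and constants are hypothetical.

```python
# Worst-case share of a year's effort lost when in-flight work is shelved.
# Assumption: at most one full release cycle of unshipped work is discarded.

WORKING_DAYS_PER_YEAR = 260  # assumption: 52 weeks x 5 business days


def worst_case_waste(cycle_days: float) -> float:
    """Fraction of the year's effort lost if one whole cycle is cancelled."""
    return cycle_days / WORKING_DAYS_PER_YEAR


# Release cadence expressed as business days per cycle.
cadences = {"quarterly": 65, "weekly": 5, "daily": 1}

for name, days in cadences.items():
    print(f"{name:>9}: {worst_case_waste(days):.1%}")
```

Under these assumptions the quarterly shop risks a quarter of its year, the weekly shop about 2%, and the daily shop well under 1%.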
So, don't ask how you can ship 10% faster. Ask how you can ship 10x more often. Of course, you won't be able to snap your fingers and instantly get there. There will be certain activities that by themselves take longer than 1/10 of your current delivery cycle. If you are on a two-week cycle (10 business days), and you say you want daily releases, you might find you spend 2 days on QA and a day doing the actual release. The only thing that matters is cutting these things by an order of magnitude. Do not fire your QA and release people; that is not productive. Move the bulk of their work out of the critical path.
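To make the two-week example concrete, here is a small sketch of how much of each cycle the fixed steps consume. The step durations come from the example above; the function name and the idea of a uniform 10x cut are assumptions for illustration only.

```python
# How much of one release cycle is eaten by fixed steps on the critical path?
# Durations are the hypothetical ones from the two-week example in the text.


def overhead_share(step_days: dict, cycle_days: float) -> float:
    """Fraction of one release cycle consumed by the given steps."""
    return sum(step_days.values()) / cycle_days


steps = {"QA": 2.0, "release": 1.0}      # days per release today
current_cycle, target_cycle = 10.0, 1.0  # business days

print(f"today:        {overhead_share(steps, current_cycle):.0%}")
print(f"daily, as-is: {overhead_share(steps, target_cycle):.0%}")

# Assumption: each step is cut by an order of magnitude, for example by
# automating it or moving most of the work off the critical path.
cut = {name: days / 10 for name, days in steps.items()}
print(f"daily, 10x:   {overhead_share(cut, target_cycle):.0%}")
```

With the steps unchanged, a daily cycle would need three days of QA and release work per day, which is impossible. With each step cut tenfold, the overhead returns to the same share of the cycle it has today.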
Think it can't be done? IMVU can ship 50 times a day. Etsy can ship 25 times a day. Flickr can ship 10+ times a day. They've solved these problems. Do what they do. Which brings us to the main course, in my view:
The Technical Practices of DevOps
Here are the contents of the technical bag of tricks that defines DevOps, at least as I see it in mid-2011:
- Infrastructure Automation. Use cloud and virtualization. Have standard images. No exceptions. Eliminate questions and "creativity" in the provisioning process. Use tools like Puppet and Chef. Measure your provisioning time. Cut it to minutes or seconds.
- Standardized Runbooks. Each application and service that you build can't have its own story. Developers don't get to change how their app is started, what its installation looks like, where its logs go, where its configuration goes, or what container it deploys to. DevOps writes this once and stuff that doesn't comply doesn't ship, because we adopt:
- Fully Automated Deployments. The app should be in one artifact, its configuration in another. The deployer takes one bit of information (the app/service name) and looks for updates in the one standard way. If they exist, they are pulled down and installed. One click deployments, then...
- Continuous Deployment. One click deployment is one click too many. Build a pipeline and when all the tests pass, no clicks happen and the code is promoted and installed.
- Advanced Test Driven Development. Not just unit and integration tests. I'm talking English-language Behavior Driven Development (e.g. Cucumber), including for your UIs. 100%, no exceptions. Have your quality/compliance team do audits to make sure. Even this is not enough:
- Behavior Driven Infrastructure. Use the test driven concept to loop back to infrastructure automation. Do not fill out request forms to get stuff. Ever, even if it's automated. Write Behavior Driven Infrastructure tests that deploy to your monitoring infrastructure that assert that your environments will be pingable, will be ssh'able, will be https'able, and will have the feature behavior defined by your BDD. Fail the test, provision, pass some / fail some, deploy, pass all tests, monitor forever.
- Minimal Marketable Features (MMFs). If it's possible to split your feature, do so. When developers finish stuff, assign them to in-flight MMFs first until those are "full". Stop starting and start finishing. Only pull new features into WIP when forced to because a free developer can't help anything in flight go faster. Management can juggle the roadmap or backlog all they want until it's WIP.
- Ship When Done. I've never understood timeboxed iterations. I call them calendar complacency. Many agile proponents haven't heard: Timeboxed Iterations are Dead.
- Runbook Automation. Take common failure modes and automate their responses. Have a socket leak slowly filling up your file handles? No!? Good. Monitor them and automate the bounce anyway. Have bad memory?!? No!? Good. Automate a from-scratch deployment anyway.
- Feature Flags and Dark Launches. Every environment besides production is waste. Get rid of it. Keep the virtual segregation of non-production code with feature flags. Prove it works in production before users see it with dark launches, not with an expensive "production-like" staging system. Turn broken stuff off by unflipping a bit.
- Perpetual Beta. Let customers control who can see "Beta" stuff. Call this user acceptance testing, so that you can get rid of the waste. Let internal customers pull value by controlling when "beta" ends for a feature. Deliver features fast enough so there is always something in Beta.
- Automated Recovery. When web server #3 has some issue, what should the response be? Spin up a new VM, redeploy the app, put it in the pool, and throw away #3. Hone your ability to do this quickly. Measure it in seconds.
- Continuous Delivery. Remove all human intervention in the pipeline between the writing of your feature acceptance criteria as a BDD/BDI test and its handoff to customers when the tests pass. Understand continuous delivery vs continuous deployment.
- Metrics. Measure stuff that matters. Time is money, so measure how long things take. MTTR is critical, so things like failover time, rebuild-from-scratch time, and app start time need to be measured. Performance is important, so latency, throughput, and the like should be quantified. There are only a couple of other useful things: test code coverage is good, and cyclomatic complexity of your code is good.
- Process Tooling. Have a single source control solution. Allow all techs to see everything in it, across all teams. Invest in CI environments, monitoring/BDI, and both runbook and infrastructure automation tools. DevOps should own administration of these, and be expected to use them to demonstrate waste elimination in the IT process. DevOps delivers metrics data, as above, and owns the plan and its execution to improve those metrics by leveraging software process tooling.
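As one illustration from the list above, here is a minimal sketch of feature flags with a dark launch. Everything in it is hypothetical: the `FeatureFlags` class, the `new_search` flag name, and the two search functions are stand-ins, and a real system would load flags from configuration or a flag service rather than keep them in memory.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlags:
    """Hypothetical in-memory flag store, for illustration only."""
    enabled: set = field(default_factory=set)  # visible to users
    dark: set = field(default_factory=set)     # runs in production, output hidden

    def is_on(self, name: str) -> bool:
        return name in self.enabled

    def is_dark(self, name: str) -> bool:
        return name in self.dark

    def turn_off(self, name: str) -> None:
        """'Unflip the bit': remove the feature from both sets."""
        self.enabled.discard(name)
        self.dark.discard(name)


def old_search(query: str) -> list:
    return [query.lower()]


def new_search(query: str) -> list:
    return [query.lower().strip()]


def search(query: str, flags: FeatureFlags, log: list) -> list:
    """Serve the old path unless the flag is on; shadow-run the new path when dark."""
    if flags.is_on("new_search"):
        return new_search(query)
    result = old_search(query)
    if flags.is_dark("new_search"):
        try:
            shadow = new_search(query)  # exercised in production, never shown
            log.append(("match" if shadow == result else "mismatch", query))
        except Exception as exc:        # a dark failure must not hurt users
            log.append(("error", repr(exc)))
    return result


flags = FeatureFlags(dark={"new_search"})
log = []
print(search(" Foo", flags, log), log)
```

The design choice worth noting is that the dark path can only write to a log: users always get the old result until the flag is moved to `enabled`, and `turn_off` removes the new code from the request path entirely.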
IMVU releases 50+ times a day, Etsy releases 25 times a day, and Flickr releases 10+ times a day. These are big sites with lots of users and complicated use cases. These software teams have created really strong cooperation between their development and operations teams. They are probably running orders of magnitude more tests than you are in a fraction of the time, because they have focused on removing waste this way for years. So yes, these techniques work. All I've done here is itemize the techniques these guys say work, and explain how they are really just the application of lean thinking to the IT value stream.
However, like any buzzword in IT, "DevOps" has its critics. Some of the criticisms are standard nay-sayer cynicism. Some criticisms are legitimate warnings not to turn a good idea into a mindless checklist solution. Let's look at some of the common push-back on DevOps.
One criticism is that DevOps is an "elitist sysadmin club to rebrand an existing problem" (see Wikipedia for origins). Well, elitism is bad, OK. The fact that sysadmins might be involved is rather expected. Nothing in the cultural or technical bag of tricks above requires you to be an elitist sysadmin to be successful, or to be a member of any particular club. Quite the contrary, this bag of tricks is open to anyone seeking to maximize the value IT ships. The more accurate part of the criticism is that DevOps may very well be rebranding an existing problem. The fact that half the software industry still ships monthly or less often probably means that the previous brand didn't work too well on them. As long as there is complacency about the gratuitous waste in IT, rebranding the solution makes sense. The point is that the solution -- apply lean thinking to eliminate your waste one project at a time -- works, and is not likely to be displaced by a better one.
Another deflection seems to be "we don't need to ship that often". This is just the voice of mediocrity. Solving problems and delivering real improvements is hard, so why bother. These people will fight to avoid change, but not too hard, because they are lazy and true resistance takes work. The CIO and his VPs have to lead. If they are questioning the value of greater agility, either go work somewhere else or wait for them to feel the heat from above. Businesses are demanding faster change and IT can't deliver that with quarterly and monthly release cycles anymore. If they try, they get bruised when the business changes its mind on things and weeks or months of effort is wasted.
Another criticism is that the whole DevOps "movement" is here to sell "books, training, conferences, the whole bit", and that "your organization will not be fixed by some sunshine-up-your-ass methodology you read about in a blog or hear about at a conference". See Ted Dziuba's blog post, DevOps is a Poorly Executed Scam. There's some truth to the notion that any good idea quickly gets over-marketed by people who peddle buzzword compliance. On the other hand, it's really cynical to suggest that anything used to sell books, training, and conferences doesn't work. Should we stop reading books and stop going to conferences? That's silly. And not everybody is selling something. I'm not. Real people with common problems like to share what works with each other.
Improvement is hard. No expensive talking head in a suit can tell you which subset of the 15 technical practices above solves the top reason your organization can't ship software faster. But the people that ship 10+ times a day got there by doing things on this list. Only you can figure out how to apply it to your organization. So get started.