Sunday, August 21, 2011

Availability in the CAP Theorem is Not What You Think

I just realized that "Availability" in the CAP Theorem is not what I thought it was. My expectation was that an explicit latency specification (e.g., return within 2 seconds) is part of a service's "agreed function", and must be met for the service to be "up". Availability in the CAP theorem means something different.

I've been taking another look at the CAP theorem, because I have to build a RESTful web service that supports US and European users, each with very low latency and a pretty robust availability specification. If you look at the ITIL definition of availability, it's phrased as the percentage of time a service can perform its agreed function. As I said, part of its "agreed function" is to return quickly.

In the real world, you measure availability with monitoring tools that poll on some frequency and verify that the service responds appropriately within some period of time. The monitoring tools open a thread, issue their network request(s), and have a timer that expires and declares the service "down". No monitoring tool will leak a thread by waiting indefinitely. Anyway, back to the CAP theorem.
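
To make this concrete, here's a minimal sketch of such a probe in bash, assuming a hypothetical health URL; anything slower than the 2 second bound counts as "down":

#!/bin/bash
# Minimal availability probe: "up" means a good response within the bound.
# The URL is a hypothetical placeholder; curl's --max-time is the timer
# that expires and declares the service "down".
url="https://service.example.com/health"
if curl --silent --fail --max-time 2 "$url" > /dev/null; then
  echo "$(date) UP"
else
  echo "$(date) DOWN (no valid response within 2 seconds)"
fi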

The definitive source of information about the CAP theorem is its proof, by Gilbert and Lynch. This is a short paper, and is approachable and useful if you aren't scared of proofs. In the section on Availability (2.2), they define it like so:
For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response. That is, any algorithm used by the service must eventually terminate. In some ways this is a weak definition of availability: it puts no bound on how long the algorithm may run before terminating, and therefore allows unbounded computation.
So taking 10 hours or 3 days to return the answer is "available" under this definition. This really is more of a halting condition.

This realization changes the way I think about the CAP theorem a little. The definition of "partition tolerance" contemplates working correctly in spite of message loss: "No set of failures less than total network failure is allowed to cause the system to respond incorrectly". But if there is no bound on how long the system can take, it can deal with network partitions by retrying each message exchange until it succeeds. I suspect the theorem remains true because this tactic introduces arbitrary delay in processing, and arbitrary delay is what breaks consistency, which requires a total ordering of changes, each appearing to complete at a single point in time across the whole system.
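
To see why the unbounded definition defuses partitions, picture a node that just keeps retrying; a sketch, with a hypothetical peer URL:

# Retry each message exchange until it succeeds. Under Gilbert and Lynch's
# definition this node stays "available" no matter how long the partition
# lasts, because every request eventually terminates with a response.
until curl --silent --fail --max-time 5 "https://peer.example.com/ack" > /dev/null; do
  sleep 1   # the partition may last hours; we don't care
done
echo "message exchange completed"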

The upshot of this is that I don't think anybody really cares whether a real world system has the availability property. We generally want something both stronger and weaker: "most" of the time (e.g., 99.9%), we want some fixed bound on latency. It's stronger because of the fixed bound and weaker because we are allowed to fail some small percentage of the time. A real system probably only fails to meet such a response time bound when there is a failure, so we really are specifying "how rare is a partition".

This makes Abadi's PACELC approach all the more useful. A real world availability criterion effectively tells us two things: what the latency spec is when we are not partitioned, and how reliable the system must be in terms of not having partitions. The PACELC approach has you trade latency vs consistency in the happy path (unpartitioned) case, and when packet loss happens, you trade availability vs consistency.


Sunday, July 31, 2011

Feature Branches are Not Evil. In fact, they Rock.

There have been several notable blogs lately decrying the faults of feature branches. Both Martin Fowler and Jez Humble have blogged on this topic. Their position contravenes both the proven success of massive projects like the Linux kernel, which has made forking a way of life, and the tenets of lean thinking, which tell us to do value added tasks (here, merging) just in time. Their argument is that feature branches cause you to accumulate code destined for release in multiple places, and that this fragmentation impedes the benefits of continuous integration, makes refactoring harder, and generally inhibits communication via code. These code quality assurance practices should be pushed upstream, but premature integration is not the best way to do this.

Both authors appear to assume an enterprise development setting where all contributors are committers, and they don't explain why their advice rejects the standard practice of major open source communities, who wholesale ignore it. The rise of distributed SCMs such as git happened specifically to allow the fork-modify-merge development pattern, and the pull request has become an accepted best practice. Git was designed to support the Linux kernel's development process, which uses forking on a massive scale.

I will argue that there are good reasons to use feature branches in both enterprise and open source settings, and that there are good mitigations to the problems they discuss. I see feature branches as a form of inventory: they should be minimized in both duration and quantity, but like inventory, they exist in the real world to mitigate risk and variation.

I recommend a model where we create feature branches based on "minimal marketable features" (MMFs). By definition, such a feature is atomic in that no proper subset of it has value to users. In a corporate setting, a team should follow the advice to "stop starting and start finishing" by allocating available developers first to in-progress MMFs, until those are "full". Only when we can't do this should we pull a new feature from the queue. By following these two practices, we guarantee that feature branches that support MMFs are of minimal duration and multiplicity. I don't claim this alone addresses the issues raised by Fowler and Humble, but it will go a long way towards avoidance. You simply should not be working on epics, but instead break them down into MMFs, put them in your backlog, and work like hell to finish anything you start as quickly as possible -- not by "trying harder" but by managing the process to ensure it's lean: do not start something new if you can instead make something in-flight finish sooner.
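
The git mechanics for this are nothing exotic; a sketch of the MMF branch lifecycle, with a hypothetical branch name:

# One short-lived branch per MMF: create it, swarm it, merge it, delete it.
git checkout -b mmf-quick-reorder master   # start the MMF from master
# ...developers commit here until the MMF is done, done, done...
git checkout master
git merge --no-ff mmf-quick-reorder        # integrate only when it's finished
git branch -d mmf-quick-reorder            # minimal duration: the branch dies now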

I'd also like to point out that, whether we work in an open source community or an enterprise setting, we really don't know that a feature will be successfully completed and incorporated into the product until it is. Software development involves risk. Half of software projects fail, and there is evidence to prove it. Sometimes we encounter technical difficulties. Sometimes "the business" changes its mind about what it wants. Sometimes there are re-organizations. Sometimes we try to innovate and fail. Sometimes the lead developer gets sent to jail for murdering his wife or gets hit by a bus. In open source settings, the maintainers get final say on what goes in, and they have to turn away contributions that don't pass muster. They try to accept contributions, but they have to maintain their standards and design integrity. If you are a contributor, there is no absolute guarantee your work will be merged until it is.

Short of a feature totally failing, delay is a much more common risk. If we have two feature branches in flight, we simply don't know which one will ship first. We can estimate, predict, etc., but we can't become trapped by those predictions. When multiple features are in flight at once, we should remove all non-essential work from the value path and let whoever gets there first win. The other branches will then have to deal with rebasing, but notice that the work of merging only impacts one of the branches. I'll talk more later about how to reduce the pain of losing the race.
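
In git terms, losing the race just means the surviving branch rebases; a sketch with hypothetical branch names:

# Feature A shipped first, so feature B alone pays the integration cost.
git checkout feature-b
git rebase master    # replay B's commits on top of the newly shipped A
# resolve any conflicts, rerun the tests, and B is current again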

If we apply "lean thinking", we want to move quality functions earlier, and do value added activities at the last responsible moment. The problem with the positions taken by Humble and Fowler is that they treat a value added activity as a quality improving activity, so they push it upstream, doing it early and often. That's why they focus on refactoring, communication, and merge collision detection instead of focusing on minimizing the lead time to the next feature and the total cycle time. Letting one of the teams not have to do the merge before they ship helps both.

Suppose that we have features A and B, but deep into development on B we realize that there is a much better solution and that what we really want is C, so we scrap B entirely. This happens all the time in the real world, and it highlights that premature merging is a form of code inventory and a form of waste. It's bad enough that both teams had to deal with merging code from A and B, when the minimum waste solution would be to impact the cycle time of only one of them. But here, since B is cancelled, the merge ended up being non-value added work, and worse, it was actively harmful, because to ship A we have to back out the code for B. If you are a developer on A, you are pretty pissed off by this.

When you work on MMFs, which are worthless until done, the notion that feature branches accumulate code "destined for release" is an optimistic but unproven assertion. We obviously wish it were certain, but we also need to realize that shipping code that doesn't actually function is waste, especially if the quantity of such code grows over time. Dead code bloats and confuses our codebase, and is technical debt. I know that "feature flags" are all the rage now, but we need to be clear about what their benefits are. It's not so that we can release partial, non-working code and hide it. That is a form of resource leak if we don't clean the flags up, especially when features are cancelled or delayed. I don't want 3000 feature flags piling up over the years. Feature flags mitigate the risk associated with change by simplifying back outs when something goes wrong. More importantly, feature flags allow us to do limited real-world testing in the production environment instead of some expensive staging environment. Eliminating whole non-production environments is a whole lotta waste saved. They also give customers control by enabling pull, so it's their choice when features are activated. Splitting delivery from activation is really the key benefit. We should only deliver working features, a point Humble and Fowler agree with, but recall that if we work on MMFs there are no partial working features.
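
The mechanics of splitting delivery from activation can be tiny; a sketch in shell, where the flag name and handler functions are hypothetical:

# Delivery vs activation: the new code ships dark, and a flag flips it on.
# FEATURE_NEW_CHECKOUT is a made-up flag read from the environment here;
# a real system would pull it from a config service, per customer.
if [ "${FEATURE_NEW_CHECKOUT:-off}" = "on" ]; then
  serve_new_checkout   # hypothetical handler for the new path
else
  serve_old_checkout   # hypothetical handler for the old path
fi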

OK, so what about the difficulties with continuous integration, refactoring, and communication that Humble and Fowler are worried about? Suppose we end up with feature branches A and B in flight at the same time; how can we make it all work? I agree with Jilles van Gurp, who blogged about git best practices for when we are forced to deal with this. I see two main issues: communication and early conflict detection. Refactoring comes in two forms: either I'm refactoring to support my feature, or I'm paying down technical debt for its own sake. The latter, I would argue, should be treated like any other feature, and in particular, refactoring as technical debt cleanup should also follow the MMF model, so that I'm not doing epic refactoring (pun intended). If you follow this, I don't think it really matters in the final analysis whether a feature involves some refactoring. Features A and B each introduce some changes, which might conflict. The only question is how you deal with it when they do.

I think the answer is relatively simple. I should create a special branch that automatically merges all live feature branches. Merge failures are reported back, and the teams should talk when they happen. Jenkins automates this, as described here. This integration branch is always expected to contain unstable development code. The key difference here is that I am trying to merge every time code is pushed to any feature branch. If one of my features is cancelled, I reset the integration branch based on the new set of in-flight feature branches (dropping the cancelled one). The essential contrast with what Humble and Fowler advocate is that I never expect this merge branch to be releasable, but they do. When a merge conflict is detected this way, the teams have to talk about what to do. They have options.
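
A sketch of what that automerge job might run on every push; the branch names are hypothetical, and a real setup would discover the in-flight set instead of hardcoding it:

#!/bin/bash
# Rebuild the throwaway integration branch from scratch and merge every
# live feature branch, reporting the first conflict so the teams talk.
set -e
git checkout master
git branch -D integration 2> /dev/null || true   # discard the old attempt
git checkout -b integration
for branch in feature-a feature-b; do            # the current in-flight set
  if ! git merge -m "automerge $branch" "$branch"; then
    echo "CONFLICT merging $branch -- time for the teams to talk"
    git merge --abort
    exit 1
  fi
done
echo "all in-flight feature branches merge cleanly"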

So in summary, I think MMF style feature branches that "ship when done" work well and don't suffer the problems Fowler and Humble worry about if we use the right tooling to automerge and communicate.

Saturday, July 23, 2011

Mac Terminal Colors & Git Prompt

I always get slightly annoyed by the default color settings in terminals. Fedora had issues with this for a long time. When I got a Mac, I was rather disappointed to see that with a black background the standard color set was hard to read. Anyway, I dug into setting bash terminal colors and thought I'd share.


To colorize your terminal you need these settings in your shell. I put them in my .bash_profile:
export CLICOLOR=1
export LSCOLORS=GxFxCxDxBxegedabagaced

The CLICOLOR flag turns on color, and LSCOLORS picks the colors for various types of objects in your directory structure. LSCOLORS works slightly differently than on Linux, as a Mac follows the BSD way. There are 11 color settings concatenated together here, each defined by a pair of characters. The first character is the text color, and the second is the background. A capital letter means bold (or enhanced, for backgrounds). The "x" character means default. The color map is a=black, b=red, c=green, d=yellow, e=blue, f=purple, g=cyan, h=white.

The 11 settings, in order, are: directory, symlink, socket, pipe, executable, block special, character special, executable with setuid, executable with setgid, directory writable to others with sticky bit, directory writable to others without sticky bit.
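
To make the encoding concrete, here's how the first few pairs of the value above decode (just a reading of the table, not new settings):

# LSCOLORS=GxFxCxDxBxegedabagaced, taken two characters at a time:
#   Gx -> directories: bold cyan text on the default background
#   Fx -> symlinks:    bold purple text on the default background
#   Cx -> sockets:     bold green text on the default background
#   Dx -> pipes:       bold yellow text on the default background
#   Bx -> executables: bold red text on the default background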

To control colorization of your bash prompt, we'll set the PS1 environment variable. The prompt and other shell output don't use the same color controls as above; instead they use ANSI escape sequences. For example, \033[0;36m is cyan. These aren't very easy to remember, so I created a .colors file (see below), sourced from .bashrc, that gives nice names to all the colors, letting me say ${cyan} instead of \033[0;36m.

Next I set an alias that will display the colors and give me examples:
alias colors='{
  echo -e -n "${black}black ${Black}Black ${on_white}${BLACK}BLACK$off "
  echo -e -n "${red}red ${Red}Red ${on_yellow}${RED}RED$off "
  echo -e -n "${green}green ${Green}Green ${on_blue}${GREEN}GREEN$off "
  echo -e -n "${yellow}yellow ${Yellow}Yellow ${on_red}${YELLOW}YELLOW$off "
  echo -e -n "${blue}blue ${Blue}Blue ${on_green}${BLUE}BLUE$off "
  echo -e -n "${purple}purple ${Purple}Purple ${on_cyan}${PURPLE}PURPLE$off "
  echo -e -n "${cyan}cyan ${Cyan}Cyan ${on_blue}${CYAN}CYAN$off "
  echo -e -n "${white}white ${White}White ${on_purple}${WHITE}WHITE$off \n"
}'

On my black background, running this looks something like
btaylor@vancouver ~ $ colors
black Black BLACK red Red RED green Green GREEN yellow Yellow YELLOW blue Blue BLUE purple Purple PURPLE cyan Cyan CYAN white White WHITE

You can colorize any output you want. Here's what I do in .bashrc to colorize my bash prompt with git branch awareness. It's a good idea to use big red "PRODUCTION" on production systems.

source ~/.colors
function color_my_prompt {
    local user_and_host="\[${Yellow}\]\u@\h"
    local current_location="\[${Cyan}\]\w"
    local git_branch_color="\[${Red}\]"
    local git_branch='`git branch 2> /dev/null | grep -e ^* | sed -E  s/^\\\\\*\ \(.+\)$/\(\\\\\1\)\ /`'
    local prompt_tail="\[${Purple}\]$"
    local last_color="\[${off}\]"
    export PS1="$user_and_host $current_location $git_branch_color$git_branch$prompt_tail$last_color "
}
color_my_prompt

This will show the git branch in red when I'm in a git repository:
btaylor@vancouver ~/src/cookbook (master) $ 

A couple of references that I adapted to make this: this stackoverflow.com post, and an arch linux wiki entry. Here's my .colors file:

btaylor@vancouver ~ $ cat .colors
# Reset
off='\033[0m'       # Text Reset

# Regular Colors
black='\033[0;30m'        # Black
red='\033[0;31m'          # Red
green='\033[0;32m'        # Green
yellow='\033[0;33m'       # Yellow
blue='\033[0;34m'         # Blue
purple='\033[0;35m'       # Purple
cyan='\033[0;36m'         # Cyan
white='\033[0;37m'        # White

# Bold
Black='\033[1;30m'       # Black
Red='\033[1;31m'         # Red
Green='\033[1;32m'       # Green
Yellow='\033[1;33m'      # Yellow
Blue='\033[1;34m'        # Blue
Purple='\033[1;35m'      # Purple
Cyan='\033[1;36m'        # Cyan
White='\033[1;37m'       # White

# Underline
_black_='\033[4;30m'       # Black
_red_='\033[4;31m'         # Red
_green_='\033[4;32m'       # Green
_yellow_='\033[4;33m'      # Yellow
_blue_='\033[4;34m'        # Blue
_purple_='\033[4;35m'      # Purple
_cyan_='\033[4;36m'        # Cyan
_white_='\033[4;37m'       # White

# Background
on_black='\033[0;40m'       # Black
on_red='\033[0;41m'         # Red
on_green='\033[0;42m'       # Green
on_yellow='\033[0;43m'      # Yellow
on_blue='\033[0;44m'        # Blue
on_purple='\033[0;45m'      # Purple
on_cyan='\033[0;46m'        # Cyan
on_white='\033[0;47m'       # White

# High Intensity
bLACK='\033[0;90m'       # Black
rED='\033[0;91m'         # Red
gREEN='\033[0;92m'       # Green
yELLOW='\033[0;93m'      # Yellow
bLUE='\033[0;94m'        # Blue
pURPLE='\033[0;95m'      # Purple
cYAN='\033[0;96m'        # Cyan
wHITE='\033[0;97m'       # White

# Bold High Intensity
BLACK='\033[1;90m'      # Black
RED='\033[1;91m'        # Red
GREEN='\033[1;92m'      # Green
YELLOW='\033[1;93m'     # Yellow
BLUE='\033[1;94m'       # Blue
PURPLE='\033[1;95m'     # Purple
CYAN='\033[1;96m'       # Cyan
WHITE='\033[1;97m'      # White

# High Intensity backgrounds
on_BLACK='\033[0;100m'   # Black
on_RED='\033[0;101m'     # Red
on_GREEN='\033[0;102m'   # Green
on_YELLOW='\033[0;103m'  # Yellow
on_BLUE='\033[0;104m'    # Blue
on_PURPLE='\033[0;105m'  # Purple
on_CYAN='\033[0;106m'    # Cyan
on_WHITE='\033[0;107m'   # White


Sunday, July 17, 2011

What is DevOps?

I've gotten really interested in the topic of DevOps lately, after I had an "aha!" moment where I realized that no amount of hope or "trying hard" will get around the fact that development wants change and operations wants stability. Most organizations have created a classic suboptimization problem where they've drawn their organization boundaries in such a way that no one is actually tasked with maximizing the total value delivered.

 "DevOps" purports itself to be the solution to this problem, so I wanted to take a look at this concept and try to figure out its merits. I'm going to try to answer a few basic questions: What is DevOps? What are its philosophical origins? Is there reason to think it can deliver on it's promise? Also, every new idea has its critics so I want to examine the leading criticisms of DevOps and see what useful ideas we can take away from them.

OK, so what is DevOps, in a nutshell? DevOps is a set of business practices that an IT organization uses to maximize the total value it delivers over time, considering both new functionality and reliability, availability, support, and the overall operations cost structure. I know, I know -- you're thinking "gee, that's a little vague". It is, because I haven't said what the "set of business practices" actually is. There are several specific ones, and I'll get to them in a little bit. I promise. But first, I want to cover the origins, so that we can see that the techniques come from a very principled place.

Philosophical Origins of DevOps

I'm a software engineer, and whenever I talk to my fellow IT guys, I often hear that software is special and unique. There is a lot of truth here, but it often stops us from looking to other arenas for ideas, like lean manufacturing. A lot of "agile development" can be derived by applying lean thinking to software. Whether the people who coined the term "agile" knew this consciously is an interesting historical question, but it's largely irrelevant. If somebody set off a memory bomb that erased the agile manifesto and all the praxis that sprang from it, we'd start off the next day and say "how can we eliminate waste from the software process, so we minimize everything but the direct creation of value as expressed by the voice of the customer". In fact, I argue that lean thinking takes us a little farther than "agile software development", because, as I think the industry is now coming to grips with, software development is not the entire value stream in IT. DevOps is the continuation of lean and agile applied to the entire IT value stream.

Lean came from manufacturing - it's no secret. When I pitch lean to people in IT I often get the reaction "but we aren't assembling cars". This is a standard mental-model anti-pattern about lean. Lean applies to everything a business does, not just to manufacturing. Toyota didn't succeed with lean by just applying it to the assembly line. In fact, the real secret of applying lean to cars is that you have to put a lot of effort into engineering a lean assembly line, and eliminating waste in the new product introduction process is the real killer. Toyota's time to market with new models is what crushed the competition back in the day. The fact that their manufacturing floors had less inventory and shorter cycle times made their numbers look better, but people don't buy your cars because of your factories' excellent financials; they buy them because you get fresh thinking to market faster.

DevOps is the Manufacturing Engineering of IT

In manufacturing companies, the hard work of creating the assembly line is done by manufacturing engineering. Let's look at this idea a little more and you'll see that lean manufacturing engineering is really the key idea of the Toyota production system. These guys didn't wake up one day and say "let's set up some kanbans".  You can't wait for the car design to be complete and then figure out how to build it. You have to bring the manufacturing mindset into the engineering world. Software development is more akin to automotive engineering. Who is most like manufacturing? Well, it's IT operations - the sysadmins. They keep the factory running. They like stability and highly repeatable processes. In this analogy, DevOps plays the role in IT that manufacturing engineering plays in a manufacturing setting.

So if DevOps is manufacturing engineering applied to IT, what is it really all about? Well, DevOps engineers have to be involved in the development process, solving the problems that ops will care about. And they are involved in the ops process, fixing things that development did wrong, so that we don't institutionalize them. It's always better to do the former, but you won't know what the former is until you've done the latter. Lean is a continuous improvement journey; it understands that you have to eliminate waste one improvement at a time. DevOps is about engineering the software so that operations is lean. So finally, we are ready to look at the specific techniques DevOps brings to the table.

What's In the DevOps Bag of Tricks

DevOps techniques fall into two buckets: cultural and technical. You have to do both because it's an "area of the rectangle" kind of thing. You probably have to do some of the cultural things before you have any hope of implementing the technical things, as otherwise the pointy haired bosses descend on the enlightened engineers and ask them why they aren't doing what they should be doing, which is, of course, a question they don't really want answered.


The Cultural Factors DevOps Incorporates

There are a number of cultural conditions that DevOps leverages. It both requires and builds on these:
  1. Create a climate of continuous improvement. Stop thinking it's good to fix things over and over. Start thinking it's good to task people with prevention. Start trying hard to find and fix the bottleneck. Nothing else matters.
  2. Optimize the whole. Task somebody with both shipping new features and keeping operations smooth. These have to be techs who do stuff. Not just management. These people aren't devs and they aren't ops, they are both "DevOps" and neither. But they speak the language of both.
  3. Trust and Respect. Force ops and devs to spend time together. It's easy to bash people when they aren't in the room. The DevOps team from #2 will be making this happen because they will be owning things that cut across.
  4. Be obsessed with eliminating waste. A small team (6-10 people) should have 1 to 3 improvement projects in flight at a time, whose goal is to move the needle. Look at real results and real evidence. The team should prioritize and pick what to solve.
  5. Focus on recovery and prevention. Know your top cause or two of failure and work to prevent them. Deal with the rest by planning for failure. Focus on recovery time. Cut things in half like monitoring latency, app start, VM provisioning, etc...
  6. Build ops concerns into the software. Actually prioritize some developer resources to benefit operations over adding features for customers. Do it because it benefits customers in the long run. If this seems strange to you, you are afflicted with suboptimized thinking. Go talk to your ops. Life is not only about shipping features, but if you want to ship features faster, stop trying to ONLY ship features.
These cultural things are about breaking down the "throw it over the wall" way that software is traditionally delivered to operations and replacing it with a problem solving mentality, and people whose job success depends on results on both sides of the wall. Such people will be highly motivated to dismantle the wall. Help them.

As you solve problems and eliminate waste, you need to be focused on taking the reward by shortening your cycle time and reducing work in progress. The less of this you have, the less total waste when the bigwigs change direction on you. The biggest waste I've seen in IT is sudden "changes in direction", where projects that were going fine are shelved to free up people. If you ship quarterly, canceling something to go after the new shiny object wastes up to 25% of your organization's resources for the year. If you ship weekly, it hurts but you can deal with it. If you ship daily, nobody even notices.

So, don't ask how you can ship 10% faster. Ask how you can ship 10x more often. Of course, you won't be able to snap your fingers and instantly get there. There will be certain activities that by themselves take longer than 1/10 of your current delivery cycle. If you are on a two week cycle (10 business days), and you say you want daily releases, you might find you spend 2 days on QA and a day doing the actual release. The only thing that matters is cutting these things by an order of magnitude. Do not fire your QA and release people, that is not productive. Move the bulk of their work out of the critical path.

Think it can't be done? IMVU can ship 50 times a day. Etsy can ship 25 times a day. Flickr can ship 10+ times a day. They've solved these problems. Do what they do. Which brings us to the main course, in my view:

The Technical Practices of DevOps

Here's the contents of the technical bag of tricks that defines DevOps, at least as I see it in mid-2011:
  1. Infrastructure Automation. Use cloud and virtualization. Have standard images. No exceptions. Eliminate questions and "creativity" in the provisioning process. Use puppet and chef. Measure your provisioning time. Cut it to minutes or seconds.
  2. Standardized Runbooks. Each application and service that you build can't have its own story. Developers don't get to change how their app is started, what its installation looks like, where its logs go, where its configuration goes, or what container it deploys to. DevOps writes this once, and stuff that doesn't comply doesn't ship, because we adopt:
  3. Fully Automated Deployments. The app should be in one artifact, its configuration in another. The deployer takes one bit of information (the app/service name) and looks for updates in the one standard way. If they exist, they are pulled down and installed. One click deployments, then...
  4. Continuous Deployment. One click deployment is one click too many. Build a pipeline and when all the tests pass, no clicks happen and the code is promoted and installed.
  5. Advanced Test Driven Development. Not just unit and integration tests. I'm talking English-language Behavior Driven Development (e.g., Cucumber), including for your UIs. 100%, no exceptions. Have your quality/compliance team do audits to make sure. Even this is not enough:
  6. Behavior Driven Infrastructure. Use the test driven concept to loop back to infrastructure automation. Do not fill out request forms to get stuff. Ever, even if it's automated. Write Behavior Driven Infrastructure tests, deployed to your monitoring infrastructure, that assert that your environments will be pingable, will be ssh'able, will be https'able, and will have the feature behavior defined by your BDD (see the sketch after this list). Fail the test, provision, pass some / fail some, deploy, pass all tests, monitor forever.
  7. Minimal Marketable Features (MMFs). If it's possible to split your feature, do so. When developers finish stuff, assign them to in-flight MMFs first until those are "full". Stop starting and start finishing. Only pull new features into WIP when forced to because a free developer can't help anything in flight go faster. Management can juggle the roadmap or backlog all they want until it's WIP. 
  8. Ship When Done. I've never understood timeboxed iterations. I call them calendar complacency. Many agile proponents haven't heard: Timeboxed Iterations are Dead.
  9. Runbook Automation. Take common failure modes and automate their responses. Have a socket leak slowly filling up your file handles? No!? Good. Monitor them and automate the bounce anyway. Have bad memory?!? No!? Good. Automate a from-scratch deployment anyway.
  10. Feature Flags and Dark Launches. Every environment besides production is waste. Get rid of it. Keep the virtual segregation of non-production code with feature flags. Prove it works in production before users see it with dark launches, not with an expensive "production-like" staging system. Turn broken stuff off by unflipping a bit.
  11. Perpetual Beta. Let customers control who can see "Beta" stuff. Call this user acceptance testing, so that you can get rid of the waste. Let internal customers pull value by controlling when "beta" ends for a feature. Deliver features fast enough so there is always something in Beta.
  12. Automated Recovery. When web server #3 has some issue, what should the response be? Spin up a new VM, redeploy the app, put it in the pool, and throw away #3. Hone your ability to do this quickly. Measure it in seconds.
  13. Continuous Delivery. Remove all human intervention in the pipeline between the writing of your feature acceptance criteria as a BDD/BDI test and its handoff to customers when the tests pass. Understand continuous delivery vs continuous deployment.
  14. Metrics. Measure stuff that matters. Time is money, so measure how long things take. MTTR is critical, so things like failover time, rebuild-from-scratch time, and app start time need to be measured. Performance is important, so latency, throughput, etc. should be quantified. There are only a couple of other useful things: test code coverage is good, and cyclomatic complexity of your code is good.
  15. Process Tooling. Have a single source control solution. Allow all techs to see everything in it, across all teams. Invest in CI environments, monitoring/BDI, and both runbook and infrastructure automation tools. DevOps should own administration of these, and be expected to use them to demonstrate waste elimination in the IT process. DevOps delivers metrics data, as above, and owns the plan and its execution to improve those metrics by leveraging software process tooling.
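
As promised in #6, here's a minimal sketch of a behavior driven infrastructure check in bash; the host is a hypothetical placeholder, and a real setup would run these assertions from the monitoring system on a schedule:

#!/bin/bash
# Assert that an environment is pingable, ssh'able, and https'able.
# Run it before provisioning (expect failures), after (expect passes),
# then forever from monitoring.
host="app01.example.com"   # hypothetical target
fail=0
ping -c 1 "$host" > /dev/null || { echo "FAIL: $host is not pingable"; fail=1; }
nc -z -w 2 "$host" 22         || { echo "FAIL: $host is not ssh'able"; fail=1; }
curl --silent --fail --max-time 5 "https://$host/" > /dev/null \
                              || { echo "FAIL: $host is not https'able"; fail=1; }
[ $fail -eq 0 ] && echo "PASS: $host behaves as specified"
exit $fail
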
Does This Stuff Work? What about Criticism?

IMVU releases 50+ times a day, etsy.com releases 25 times a day, and Flickr releases 10+ times a day. These are big sites with lots of users and complicated use cases. These software teams have created really strong cooperation between their development and operations teams. They are probably running orders of magnitude more tests than you are, in a fraction of the time, because they have focused on removing waste this way for years. So yes, these techniques work. All I've done here is itemize the techniques these guys say work, and explain how they are really just the application of lean thinking to the IT value stream.

However, like any buzzword in IT, "DevOps" has its critics. Some of the criticisms are standard nay-sayer cynicism. Some criticisms are legitimate warnings not to turn a good idea into a mindless checklist solution. Let's look at some of the common push-back on DevOps.

One criticism is that DevOps is an "elitist sysadmin club to rebrand an existing problem" (see Wikipedia for origins). Well, elitism is bad, OK. The fact that sysadmins might be involved is rather expected. Nothing in the cultural or technical bag of tricks above requires you to be an elitist sysadmin to be successful, or to be a member of any particular club. Quite the contrary, this bag of tricks is open to anyone seeking to maximize the value IT ships. More correctly though, DevOps may very well be rebranding an existing problem. The fact that half the software industry still ships monthly or slower probably means that the previous brand didn't work too well on them. As long as there is complacency about the gratuitous waste in IT, rebranding the solution makes sense. The point is that the solution -- apply lean thinking to eliminate your waste one project at a time -- works, and is not likely to be displaced by a better one.

Another deflection seems to be "we don't need to ship that often". This is just the voice of mediocrity. Solving problems and delivering real improvements is hard, so why bother. These people will fight to avoid change, but not too hard, because they are lazy and true resistance takes work. The CIO and his VPs have to lead. If they are questioning the value of greater agility, either go work somewhere else or wait for them to feel the heat from above. Businesses are demanding faster change and IT can't deliver that with quarterly and monthly release cycles anymore. If they try, they get bruised when the business changes its mind on things and weeks or months of effort is wasted.

Another criticism is that the whole DevOps "movement" is here to sell "books, training, conferences, the whole bit", and that "your organization will not be fixed by some sunshine-up-your-ass methodology you read about in a blog or hear about at a conference". See Ted Dziuba's DevOps is a Poorly Executed Scam blog. There's some truth to the notion that any good idea quickly gets over-marketed by people who peddle buzzword compliance. On the other hand, it's awfully cynical to suggest that anything used to sell books, training, and conferences doesn't work. Should we stop reading books and stop going to conferences? That's silly. And not everybody is selling something. I'm not. Real people with common problems like to share what works with each other.

Improvement is hard. No expensive talking head in a suit can tell you what subset of the 15 technical techniques above solves the top reason your organization can't ship software faster. But the people that ship 10+ times a day got there doing things on this list. Only you can figure out how to apply it to your organization. So get started.

Fixing Git Commit Messages

Problem: Commits in git are immutable, so how do you recover if you make a mistake when you commit? For example, I forgot to put my task number, "ITSM-197" at the beginning of my commit message. I want to fix this before pushing to a remote repo.

btaylor@vancouver ~/src/change-service (ITSM-197) $ git log --oneline master..
87b9f6b fix response elements outside of method element
3123405 ITSM-197 change approver to approval-level
56ecbb4 ITSM-197 remove duplicate declaration of status attribute
7f10503 ITSM-197 fix change request state machine representation in xsd


Whoops, commit 87b9f6b didn't follow the convention. I fix this with "git commit --amend -m", which replaces the tip of the current branch with a new, modified commit. You can leave the -m option off to change the commit's content instead: any staged changes are folded in, and your editor opens with the existing message so you can keep it.

btaylor@vancouver ~/src/change-service (ITSM-197) $  git commit --amend -m "ITSM-197 fix response elements outside of method element"
[ITSM-197 c7310f8] ITSM-197 fix response elements outside of method element
 1 files changed, 6 insertions(+), 6 deletions(-)

btaylor@vancouver ~/src/change-service (ITSM-197) $ git log --oneline master..
c7310f8 ITSM-197 fix response elements outside of method element
3123405 ITSM-197 change approver to approval-level
56ecbb4 ITSM-197 remove duplicate declaration of status attribute
7f10503 ITSM-197 fix change request state machine representation in xsd

Solved. Note that the chain of commits from the master branch has been rewritten to end with my new commit c7310f8 with my ITSM-197 message added. Let's verify that these have the same tree:

btaylor@vancouver ~/src/change-service (ITSM-197) $ git cat-file commit c7310f8
tree f1c5c976a169c513800d9cd99a776957f503886e
parent 3123405682b942e4875399c66265caad42260d64
author Bryan Taylor <btaylor@nospam.com> 1310325458 -0500
committer Bryan Taylor <btaylor@nospam.com> 1310395725 -0500

ITSM-197 fix response elements outside of method element
 
btaylor@vancouver ~/src/change-service (ITSM-197) $ git cat-file commit 87b9f6b
tree f1c5c976a169c513800d9cd99a776957f503886e
parent 3123405682b942e4875399c66265caad42260d64
author Bryan Taylor <btaylor@nospam.com> 1310325458 -0500
committer Bryan Taylor <btaylor@nospam.com> 1310325458 -0500

fix response elements outside of method element

Yep, they both have tree f1c5c97. Note that the previous 87b9f6b commit still exists, but it's no longer in the commit chain from master to the tip.