Saturday, March 17, 2012

Sizing of RESTful Paged Collection Resources

There's been a debate lately in my IT department about how to implement paged collections. To provide some context, we've agreed to use the REST architectural style for all of our web services. We have more than a handful of teams building such services, and almost every one of them at one point or another has to provide resources for some kind of collection. Specifically, the question is whether it's appropriate to offer the client control over the page size via a parameter such as "limit" in the resource's URI. I feel rather strongly that this is bad design, for a variety of reasons to follow.

First, let's review why we want to page our collections and what we are trying to achieve via paging. Foremost, we absolutely don't allow unbounded representation sizes, because we view them as a potential denial-of-service vulnerability for both clients and servers. Both clients and servers have finite resources, such as memory. Our persistence stores hold potentially big data sets that could very easily overwhelm our clients or servers, especially when we marshal and unmarshal a big representation. This consumes at least O(N) resources for N records, and is often worse. If we don't impose some sanity on the server side, a naughty or unwitting client might ask for a huge data set. The server might bog down trying to construct the response. If the client's request times out, you can rest assured that our friendly user will respond to the timeout error message by resubmitting the request. Iterate until your sysadmin gets the 3AM phone call. Or maybe the server manages to construct the monster data set and the client gets what he asked for. After 7 minutes of parsing a bazillion records, it crashes, and again the dreaded 3AM phone call happens, this time to a different sysadmin.

As an architect, I have a solution. I butt my nosy nose in and say "I don't care what clients say they need or what servers feel like implementing, THOU SHALT have an O(1) bound on representation sizes via a maximum number of records in a page". The developers hate me and the sleepless sysadmins love me and take me out drinking until 3AM and get no sleep anyway, but at least it was their choice. Oh, and the developers showed up too and now they like me again. Hooray, beer.

Fast forward a few months. Now all the teams accept that there will be page sizes. What do developers do when they are asked to pick a parameter like page size? One of two things: either they configure it with their favorite splendid, bitchin' config file DSL **OR** they figure somebody else should pick, so they invent ways to have their clients pick by taking in the page size from the URI. Some are clever and do both: they set the default in the config file and they put the parameter in the URI so that it's optional. QA still tests it. They don't have enough to do, so who cares. Inevitably the devs will have meeting notes where somebody unnamed said "sure, I guess we could supply the page size" in response to a carefully crafted leading question that translates to "is there any scenario you can possibly imagine where you would want to control the value that I might arbitrarily select and shove down your throat if you say no?" And of course, there is an undertone of "what kind of developer are you, passing on a chance to set a value that SOMEONE has to set?".

So now you have the context of the discussion. We all figured out pretty quickly that the fancy pants link relations "next", "prev", "first", "last" were our friends. For a while we argued endlessly over how to structure the URIs to name which page. Some people use an offset, some use a marked record, others use a page number. Then one day we stopped arguing because IT DOESN'T MATTER. We don't need a standard other than that we follow rel="next" to get to the NEXT page. And as it turns out, my friend Mark Nottingham has in fact formally standardized this in RFC 5005 and RFC 5988. We did still find a way to argue about whether next should take you to younger or older records. "Next should not mean earlier!" Then we read Mark's stuff more carefully and realized that again either way is fine, because it's the next RESOURCE in a server-defined ordering of resources. So we had to find something else to argue about.
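To make "just follow rel=\"next\"" concrete, here's a minimal sketch of a client walking a collection via RFC 5988 Link headers. The entry point URL is made up for illustration; only the link relation is contractual:

#!/usr/bin/env bash
# Page through a collection by following rel="next" links (RFC 5988).
# The entry point URL is hypothetical.
url='https://api.example.com/tickets'
while [ -n "$url" ]; do
  # -D dumps the response headers to a file while the body goes to stdout
  curl -s -D /tmp/hdrs "$url" | cat   # 'cat' stands in for your real page handler
  # Pull the next page URI out of:  Link: <uri>; rel="next"
  # (absent on the last page, which ends the loop)
  url=$(tr -d '\r' < /tmp/hdrs | sed -n 's/^[Ll]ink: *<\([^>]*\)>; *rel="next".*/\1/p')
done

Notice the client never constructs a URI and never mentions a page size; it just follows what the server hands it.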

So we landed on limit. Yes, there would be a maximum on limit, but could a client ask for a smaller page? That is a pressing and important question. I have an answer: NO. Damn it, No. I would forbid all services from EVER having a limit parameter in any URI and I would force them to actually decide a FIXED page size for each collection type. I'd even build it right into the media type definition and tell the services they don't even own specifying it. In the spirit of good will, I'm willing to compromise on this last point. Maybe I'll let them configure it in their config file, instead of documenting it in the media type, but under no circumstances should they offer their clients control of the page size. Damn it.

Why? Am I a crusty architect bent on making proclamations that confound developers and make them question the color of my soul!? Well, maybe, but that's not the point today. There are four good reasons:
1) YAGNI
2) Fixed page sizes support better HTTP caching
3) Homogeneous page sizes simplify response time SLAs and rate limiting
4) RESTful APIs should eschew RPC in favor of HATEOAS

Let's go through these in turn:

1) YAGNI - Clients don't need to configure it. You don't need to let them configure it. Get rid of all that code solving non-existent problems for nobody. Your clients will be just fine and happy. You don't have to add a stupid method to configure it, and they don't have to add a stupid parameter to their calls to give you the stupid value for the stupid parameter. Somebody always says "what about this hint of a shadow of a client that has to code for a big list in browsers and a small list on a mobile device? SHOULDN'T we help them by offering them a variable they can set to size our pages to their display needs?" Ummm, let me think.... NO. Keep your UI concerns out of my web service API. Your visible record set is not relevant to my page size. Go to dzone.com and look at how you should do it. Fill up your screen from your buffer and call "next" lazily to add to the buffer when the user gets close to displaying the last record. That way there is no bottom; scroll forever. Sweet.

2) Better HTTP caching - If your collection is a nice append-only kind of thing, guess what? People start from where they left off and scroll forward (which, as we established earlier, is a synonym for "in some consistent direction"). This means fixed page sizes take every client to the same set of pages. Cache hit rate, baby. Love it. Stick it in varnish or squid and yawn when 500 users poll you every second. Boring. Play Quake with the extra server juice while you compile the latest Linux kernel while you do some video editing. Give them configurable page sizes and just imagine the spindles of your storage physically spinning as the jackass who wants a page size of 37 traverses the same old data in new and exciting ways. Varnish? FAST. Spindles grinding for page size 37? SLOW.
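Here's a hypothetical exchange to make the point. The URL and headers are invented for illustration, but the shape is what a Varnish or squid in front of you would produce:

# With a fixed page size, every client that starts at the top and follows
# rel="next" requests the exact same URIs, so the cache answers most of them.
curl -sI 'https://api.example.com/tickets?page=7'
# HTTP/1.1 200 OK
# Cache-Control: public, max-age=60
# Age: 42        <- a nonzero Age means the cache answered, not the origin
# Link: <https://api.example.com/tickets?page=8>; rel="next"
#
# Allow ?limit=37 and the URI space multiplies: page=7&limit=37 is a different
# cache key than page=7&limit=25, so the hit rate collapses.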

3) SLAs and Rate Limiting. If every box of chocolates has 25 chocolates, then it takes about the same time for Forrest Gump to eat one box of chocolates as another. So you can say "Forrest, eat that box of chocolates in 40 seconds". Similarly, you can say "Forrest, don't eat more than 15 boxes in an hour". If you don't assure that all boxes have the same number of chocolates, Forrest will find the ones that have 2 chocolates each and complain that you cut him off needlessly. And he'll be right. Homogeneous page sizes mean that the resource consumption and response time of one "hit" doesn't depend on which page or which client. To make reasonable statements about your capabilities you don't have to do statistics to factor out confounding variation. You just count and say things like "all pages come back in 500 ms". Well, almost. Somebody will fire up JMeter, put 1000 threads of whoop ass on your server, and go "hmm, you can't scale that forever". But then you go "hmmm, I thought we said our servers scale to handle 500 concurrent connections, why did you do 1000?" You get your O(1) bound in and you win.

4) HATEOAS > RPC. Ahem. Repeat after me: Dr. Fielding says all application state transitions must be driven by client selection of server-provided choices in hypertext. Dr. Fielding says that the descriptive effort in defining a REST API should be spent on media types and link relations and not on how clients can treat your URIs like some kind of remote procedure call. Really, he did say so. If Roy said it, you must obey. OK, appeal to authority is a fallacy. But, in fact, there are good reasons why HATEOAS is best that aren't traceable to the cult of Roy.


We didn't switch to REST from SOAP because we like four letter acronyms. The irony of ironies is that your fancy pants parameter, designed to give the client control, sits there laughing at you, because if you do REST right, you, Mr. Web Service Developer, STILL end up making the choice of what value to put in it. The client goes to some resource where he can see a link to the collection. He told you nothing about page size to get there. You HAVE to populate the link with no input from him. What value do YOU put in it? Play the evil architect dirge: Bwa-ha-hahahaha. Now you realize that you've been insane for all these years. All your efforts to give others control failed. Run for Congress, those guys suck at what you do.


Here though, clients don't want to understand what your stupid parameters mean. They don't want to read the documentation to figure out how to handle the error if they specify a number larger than you accept (you do return an error, right?). They want to follow links like "next", and ponder deep questions like "does that go forward or backward in time?". Or maybe they do want to call our beautiful REST API like it's a COBOL procedure. Some client devs just don't get it, even in 2012, and they want to couple to everything bad and be derisive toward "theory" and stuff that interferes with their agile ability to create brittle code laced with tribal knowledge. What should we do with these guys? We should give them no levers to lift, no knobs to turn. RPC is so SOAP, so old school, so lame. Send them back to it. Or better yet, make them write access objects and transfer objects for our NoSQL persistence stores. Haha.


To conclude, set the page size uniformly in either the media type definition or, gasp, in your service's config file. Having trouble picking the actual value!? Go with 15. No, 100. No, 25. Whatever. Nobody actually cares what it is anyway.

Sunday, August 21, 2011

Availability in the CAP Theorem is Not What You Think

I just realized that "Availability" in the CAP Theorem is not what I thought it was. My expectation was that if there is an explicit latency specification (e.g., return within 2 seconds), then that specification is part of the "agreed function" that has to be met for the service to be "up". Availability in the CAP theorem means something different.

I've been taking another look back at the CAP theorem, because I have to build a RESTful web service that supports US and European users, each with very low latency, and each with a pretty robust availability specification. If you look at the ITIL definition of availability, it's phrased as the percentage of the time a service can perform its agreed function. As I said, part of its "agreed function" is to return quickly.

In the real world, you measure availability with monitoring tools that poll on some frequency and verify that the service responds appropriately within some period of time. The monitoring tools open a thread, issue their network request(s), and have a timer that expires and declares the service "down". No monitoring tool will ever leak a thread by waiting indefinitely.
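In shell terms, a poll looks something like this sketch; the health URL and the 2-second budget are made up for illustration:

#!/usr/bin/env bash
# Poll a service the way a monitoring tool does: bounded wait, then a verdict.
# The endpoint and timeout below are assumptions, not any real service.
if curl -fs --max-time 2 'https://service.example.com/health' >/dev/null; then
  echo "$(date -u '+%FT%TZ') UP"
else
  echo "$(date -u '+%FT%TZ') DOWN"   # a timeout and an error both count as down
fi

Anyway, back to the CAP theorem.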

The definitive source of information about the CAP theorem is its proof, by Gilbert and Lynch. This is a short paper, and is approachable and useful if you aren't scared of proofs. In the section on Availability (2.2), they define it like so:
For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response. That is, any algorithm used by the service must eventually terminate. In some ways this is a weak definition of availability: it puts no bound on how long the algorithm may run before terminating, and therefore allows unbounded computation.
So taking 10 hours or 3 days to return the answer is "available" under this definition. This really is more of a halting condition.

This realization changes the way I think about the CAP theorem a little. The definition of "partition tolerance" contemplates working correctly in spite of message loss: "No set of failures less than total network failure is allowed to cause the system to respond incorrectly". But if there is no bound on how long the system can take, it can deal with network partitions by retrying each message exchange until it succeeds. I suspect the theorem is still true because this allows arbitrary delay in processing, and that delay breaks down consistency, which requires a total ordering of changes, each appearing to complete at a single point in time across the whole system.

The upshot of this is that I don't think anybody really cares whether a real-world system has the availability property. We generally want something both stronger and weaker: "most" of the time (e.g., 99.9%), we want some kind of fixed bound on latency. It's stronger because of the fixed bound, and weaker because we are allowed to fail some small percentage of the time. A real system probably only fails to meet such a response time bound when there is a failure, so we really are specifying "how rare is a partition".

This makes Abadi's PACELC approach all the more useful. A real-world availability criterion effectively tells us two things: what the latency spec is when we are not partitioned, and how reliable the system must be in terms of not having partitions. The PACELC approach has you trade latency vs. consistency in the happy path (unpartitioned) case, and when a partition happens, you trade availability vs. consistency.


Sunday, July 31, 2011

Feature Branches are Not Evil. In fact, they Rock.

There have been several notable blogs lately decrying the faults of feature branches. Both Martin Fowler and Jez Humble have blogged on this topic. Their position contravenes both the proven success of massive projects like the Linux kernel, which has made forking a way of life, and the tenets of lean thinking, which tell us to do value added tasks (here, merging) just in time. Their argument is that feature branches cause you to accumulate code destined for release in multiple places, and this fragmentation impedes the benefits of continuous integration, makes refactoring harder, and generally inhibits communication via code. These code quality assurance practices should be pushed upstream earlier, but premature integration is not the best way to do this.

Both authors appear to assume an enterprise development setting where all contributors are committers, and they don't explain why their advice rejects the standard practice of major open source communities, which wholesale ignore it. The rise of distributed SCMs such as git happened specifically to allow the fork-modify-merge development pattern, and the pull request has become an accepted best practice. Git was specifically designed to support the Linux kernel's development process, which uses forking on a massive scale.

I will argue that there are good reasons to use feature branches in both enterprise and open source settings and that there are good mitigations to the problems they discuss. I see having feature branches as a form of inventory, so that they should be minimized in both their duration and their quantity, but like inventory, they occur in the real world to mitigate risk and variation.

I recommend a model where we create feature branches based on "minimal marketable features" (MMFs). By definition, such a feature is atomic in that no proper subset of it has value to users. In a corporate setting, a team should follow the advice to "stop starting and start finishing" by allocating available developers first to in progress MMFs, until they are "full". Only when we can't do this should we pull a new feature from the queue. By following these two practices, we guarantee that feature branches that support MMFs are of minimal duration and multiplicity. I don't claim this alone addresses the issues raised by Fowler and Humble, but it will go a long way towards avoidance. You simply should not be working on epics, but instead break them down into MMFs, put them in your backlog, and work like hell to finish anything you start as quickly as possible -- not by "trying harder" but by managing the process to assure it's lean: do not start something new if you can instead make something in-flight finish sooner.

I'd also like to point out that, whether we work in an open source community or an enterprise setting, we really don't know that a feature will be successfully completed and incorporated into the product until it is. Software development involves risk. Half of software projects fail, and there is evidence to prove it. Sometimes we encounter technical difficulties. Sometimes "the business" changes its mind about what it wants. Sometimes there are re-organizations. Sometimes we try to innovate and fail. Sometimes the lead developer gets sent to jail for murdering his wife or gets hit by a bus. In open source settings, the maintainers get final say on what goes in, and they have to turn away contributions that don't pass muster. They try to accept contributions, but have to maintain their standards and design integrity. If you are the contributor, there is no absolute guarantee your work will be merged until it is.

Short of a feature totally failing, delay is a much more common risk. If we have two feature branches in flight, we simply don't know which one will ship first. We can estimate, predict, etc., but we can't become trapped by those predictions. When multiple features are in flight at once, we should remove all non-essential work from the value path and let whoever gets there first win. Then the other branches will have to deal with rebasing, but notice that the work of merging only impacts one of the branches. I'll talk more later about how to reduce the pain of losing the race.

If we apply "lean thinking", we want to move quality functions earlier, and do value added activities at the last responsible moment. The problem with the positions taken by Humble and Fowler is that they are treating a value added activity as a quality improving activity, so they try to push it upstream, doing it early and often. That's why they focus on refactoring, communication, and merge collision detection instead of focusing on minimizing the lead time to the next feature and total cycle time. Letting the first team to ship skip the merge helps both.

Suppose that we have features A and B, but deep into development on B we realize that there is a much better solution for B and that what we really want is C, so we scrap B entirely. This happens all the time in the real world, and it highlights that premature merging is a form of code inventory and a form of waste. It's bad enough that both teams had to deal with merging code from A and B, when the minimum waste solution would be to impact the cycle time of only one of them. But here, since B is cancelled, the merge ended up being non-value added work, and even worse it was actively harmful, because to ship A we have to back out the code for B. If you are a developer on A, you are pretty pissed off by this.

When you work on MMFs, which are worthless until done, the notion that feature branches accumulate code "destined for release" is an optimistic but unproven assertion. We obviously wish it was certain, but we also need to realize that shipping code that doesn't actually function is waste, especially if the quantity of such code grows over time. Dead code bloats and confuses our codebase, and is technical debt. I know that "feature flags" are all the rage now, but we need to be clear about what their benefits are. It's not so that we can release partial, non-working code, and hide it. This is a form of resource leak if we don't clean them up, especially when features are cancelled or delayed. I don't want 3000 feature flags piling up over the years. Feature flags mitigate risk associated with change by simplifying back outs when something goes wrong. More importantly, feature flags allow us to do limited real-world testing in the production environment instead of some expensive staging environment. Eliminating whole non-production environments is a whole lotta waste saved. They also give customers control by enabling pull, so it's their choice as to when features are activated. Splitting delivery and activation is really the key benefit. We should only deliver working features, a point Humble and Fowler agree with, but recall that if we work on MMFs there are no partial working features.

OK, so what about the difficulties with continuous integration, refactoring, and communication that Humble and Fowler are worried about? Suppose we end up with feature branches A and B in flight at the same time; how can we make it all work? I agree with Jilles van Gurp, who blogged about git best practices for when we are forced to deal with this. I see there being two main issues: communication and early conflict detection. Refactoring comes in two forms: either I'm refactoring to support my feature, or I'm paying down technical debt for its own sake. The latter, I would argue, should be treated like any other feature, and in particular, refactoring as technical debt cleanup should also follow the MMF model, so that I'm not doing epic refactoring (pun intended). If you follow this, I don't think whether a feature involves some refactoring or not really matters in the final analysis. Features A and B each introduce some changes, which might conflict. The only question is how you deal with it when they do.

I think the answer is relatively simple. I should create a special branch that automatically merges any live feature branches. Merge failures are reported back, and the teams should talk when they happen. Jenkins automates this, as described here. This integration branch is always expected to contain unstable development code. The key difference here is that I am trying to merge every time code is pushed to any feature branch. If one of my features has to be cancelled, I reset the integration branch based on the new set of in-flight feature branches (dropping the cancelled one). The key difference between this solution and what Humble and Fowler advocate is that I never expect this merge branch to be releaseable, but they do. When a merge conflict is detected this way, the teams have to talk about what to do. They have options.
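As a sketch, the automerge job could be as simple as the following; the feature/* branch layout and the integration branch name are assumptions, and the CI server would run it on every push:

#!/usr/bin/env bash
# Rebuild a throwaway integration branch by merging every live feature
# branch; a failed merge is a signal for the teams to talk, not a build to fix.
set -e
git fetch origin
git checkout -B integration origin/master   # reset it; we never release from it
for branch in $(git for-each-ref --format='%(refname:short)' 'refs/remotes/origin/feature/*'); do
  if ! git merge -m "automerge $branch" "$branch"; then
    echo "CONFLICT merging $branch -- get the teams talking" >&2
    git merge --abort
  fi
done

Because the branch is rebuilt from scratch each run, dropping a cancelled feature is just a matter of deleting its branch before the next run.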

So in summary, I think MMF style feature branches that "ship when done" work well and don't suffer the problems Fowler and Humble worry about if we use the right tooling to automerge and communicate.

Saturday, July 23, 2011

Mac Terminal Colors & Git Prompt

I always get slightly annoyed by the default color settings in terminals. Fedora had issues with this for a long time. When I got a Mac, I was rather disappointed to see that with a black background the standard color set was hard to read. Anyway, I dug into setting bash terminal colors and thought I'd share.


To colorize your terminal you need these settings in your shell. I put them in my .bash_profile:
export CLICOLOR=1
export LSCOLORS=GxFxCxDxBxegedabagaced

The CLICOLOR flag turns on color, and LSCOLORS picks the colors for various types of objects in your directory structure. LSCOLORS works slightly differently than on linux, as a Mac follows the BSD way. There are 11 color settings concatenated together here, where each is defined by a pair of characters. The first character is the text color, and the second is the background. A capital letter means bold (or enhanced, for backgrounds). The "x" character means default. The color map is a=black, b=red, c=green, d=yellow, e=blue, f=purple, g=cyan, h=white.

The 11 settings, in order, are: directory, symlink, socket, pipe, executable, block special, character special, executable with setuid, executable with setgid, directory writable by others with sticky bit, directory writable by others without sticky bit.
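Putting the mapping and the ordering together, the example LSCOLORS string above decodes pair by pair like this:

# LSCOLORS=GxFxCxDxBxegedabagaced decoded (capital = bold, x = default):
#  Gx  directory                     bold cyan on default
#  Fx  symlink                       bold purple on default
#  Cx  socket                        bold green on default
#  Dx  pipe                          bold yellow on default
#  Bx  executable                    bold red on default
#  eg  block special                 blue on cyan
#  ed  character special             blue on yellow
#  ab  setuid executable             black on red
#  ag  setgid executable             black on cyan
#  ac  dir, other-writable + sticky  black on green
#  ed  dir, other-writable           blue on yellow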

To control colorization of your bash prompt, we'll set the PS1 environment variable. This and other shell output don't use the same color controls as above, but instead use colorizing escape sequences. For example, \033[0;36m is cyan. These aren't very easy to remember, so I create a .colors file (see below) that I source in .bashrc to give nice names to all the colors, so I can say ${cyan} instead of \033[0;36m.

Next I set an alias that will display the colors and give me examples:
alias colors='{
  echo -e -n "${black}black ${Black}Black ${on_white}${BLACK}BLACK$off "
  echo -e -n "${red}red ${Red}Red ${on_yellow}${RED}RED$off "
  echo -e -n "${green}green ${Green}Green ${on_blue}${GREEN}GREEN$off "
  echo -e -n "${yellow}yellow ${Yellow}Yellow ${on_red}${YELLOW}YELLOW$off "
  echo -e -n "${blue}blue ${Blue}Blue ${on_green}${BLUE}BLUE$off "
  echo -e -n "${purple}purple ${Purple}Purple ${on_cyan}${PURPLE}PURPLE$off "
  echo -e -n "${cyan}cyan ${Cyan}Cyan ${on_blue}${CYAN}CYAN$off "
  echo -e -n "${white}white ${White}White ${on_purple}${WHITE}WHITE$off \n"
}'

On my black background, running this looks something like
btaylor@vancouver ~ $ colors
black Black BLACK red Red RED green Green GREEN yellow Yellow YELLOW blue Blue BLUE purple Purple PURPLE cyan Cyan CYAN white White WHITE

You can colorize any output you want. Here's what I do in .bashrc to colorize my bash prompt with git branch awareness. It's a good idea to use big red "PRODUCTION" on production systems.

source ~/.colors
function color_my_prompt {
    local user_and_host="\[${Yellow}\]\u@\h"
    local current_location="\[${Cyan}\]\w"
    local git_branch_color="\[${Red}\]"
    local git_branch='`git branch 2> /dev/null | grep -e ^* | sed -E  s/^\\\\\*\ \(.+\)$/\(\\\\\1\)\ /`'
    local prompt_tail="\[${Purple}\]$"
    local last_color="\[${off}\]"
    export PS1="$user_and_host $current_location $git_branch_color$git_branch$prompt_tail$last_color "
}
color_my_prompt

This will show the git branch in red when I'm in a git repository:
btaylor@vancouver ~/src/cookbook (master) $ 

A couple of references I adapted to make this: this stackoverflow.com post, and an arch linux wiki entry. Here's my .colors file:

btaylor@vancouver ~ $ cat .colors
# Reset
off='\033[0m'       # Text Reset

# Regular Colors
black='\033[0;30m'        # Black
red='\033[0;31m'          # Red
green='\033[0;32m'        # Green
yellow='\033[0;33m'       # Yellow
blue='\033[0;34m'         # Blue
purple='\033[0;35m'       # Purple
cyan='\033[0;36m'         # Cyan
white='\033[0;37m'        # White

# Bold
Black='\033[1;30m'       # Black
Red='\033[1;31m'         # Red
Green='\033[1;32m'       # Green
Yellow='\033[1;33m'      # Yellow
Blue='\033[1;34m'        # Blue
Purple='\033[1;35m'      # Purple
Cyan='\033[1;36m'        # Cyan
White='\033[1;37m'       # White

# Underline
_black_='\033[4;30m'       # Black
_red_='\033[4;31m'         # Red
_green_='\033[4;32m'       # Green
_yellow_='\033[4;33m'      # Yellow
_blue_='\033[4;34m'        # Blue
_purple_='\033[4;35m'      # Purple
_cyan_='\033[4;36m'        # Cyan
_white_='\033[4;37m'       # White

# Background
on_black='\033[0;40m'       # Black
on_red='\033[0;41m'         # Red
on_green='\033[0;42m'       # Green
on_yellow='\033[0;43m'      # Yellow
on_blue='\033[0;44m'        # Blue
on_purple='\033[0;45m'      # Purple
on_cyan='\033[0;46m'        # Cyan
on_white='\033[0;47m'       # White

# High Intensity
bLACK='\033[0;90m'       # Black
rED='\033[0;91m'         # Red
gREEN='\033[0;92m'       # Green
yELLOW='\033[0;93m'      # Yellow
bLUE='\033[0;94m'        # Blue
pURPLE='\033[0;95m'      # Purple
cYAN='\033[0;96m'        # Cyan
wHITE='\033[0;97m'       # White

# Bold High Intensity
BLACK='\033[1;90m'      # Black
RED='\033[1;91m'        # Red
GREEN='\033[1;92m'      # Green
YELLOW='\033[1;93m'     # Yellow
BLUE='\033[1;94m'       # Blue
PURPLE='\033[1;95m'     # Purple
CYAN='\033[1;96m'       # Cyan
WHITE='\033[1;97m'      # White

# High Intensity backgrounds
on_BLACK='\033[0;100m'   # Black
on_RED='\033[0;101m'     # Red
on_GREEN='\033[0;102m'   # Green
on_YELLOW='\033[0;103m'  # Yellow
on_BLUE='\033[0;104m'    # Blue
on_PURPLE='\033[0;105m'  # Purple
on_CYAN='\033[0;106m'    # Cyan
on_WHITE='\033[0;107m'   # White


Sunday, July 17, 2011

What is DevOps?

I've gotten really interested in the topic of DevOps lately, after I had an "aha!" moment where I realized that no amount of hope or "trying hard" will get around the fact that development wants change and operations wants stability. Most organizations have created a classic suboptimization problem where they've drawn their organization boundaries in such a way that no one is actually tasked with maximizing the total value delivered.

 "DevOps" purports itself to be the solution to this problem, so I wanted to take a look at this concept and try to figure out its merits. I'm going to try to answer a few basic questions: What is DevOps? What are its philosophical origins? Is there reason to think it can deliver on it's promise? Also, every new idea has its critics so I want to examine the leading criticisms of DevOps and see what useful ideas we can take away from them.

OK, so what is DevOps, in a nutshell? DevOps is a set of business practices that an IT organization uses to maximize the total value it delivers over time, considering both new functionality and reliability, availability, support, and the overall operations cost structure. I know, I know -- you're thinking "gee, that's a little vague". It is, because I haven't said what the "set of business practices" actually is. There are several specific ones, and I'll get to them in a little bit. I promise. But first, I want to cover the origins, so that we can see that the techniques come from a very principled place.

Philosophical Origins of DevOps

I'm a software engineer, and whenever I talk to my fellow IT guys, I often hear that software is special and unique. There is a lot of truth here, but it often stops us from looking at other arenas for ideas, like lean manufacturing. A lot of "agile development" can be derived by applying lean thinking to software. Whether the people that coined the term "agile" knew this consciously is an interesting historical question, but it's largely irrelevant. If somebody set off a memory bomb that erased the agile manifesto and all the praxis that sprang from it, we'd start off the next day and say "how can we eliminate waste from the software process, so we minimize everything but the direct creation of value as expressed by the voice of the customer?" In fact, I argue that lean thinking takes us a little farther than "agile software development", because, as I think the industry is now coming to grips with, software development is not the entire value stream in IT. DevOps is the continuation of lean and agile applied to the entire IT value stream.

Lean came from manufacturing - it's no secret. When I pitch lean to people in IT I often get the reaction "but we aren't assembling cars". This is a standard mental-model anti-pattern about lean. Lean applies to everything a business does, not just to manufacturing. Toyota didn't succeed with lean by just applying it to the assembly line. In fact, the real secret of applying lean to cars is that while you have to put a lot of effort into making a lean assembly line, eliminating waste in the new product introduction process is the killer advantage. Toyota's time to market with new models is what crushed the competition back in the day. The fact that their manufacturing floors had less inventory and shorter cycle times made their numbers look better, but people don't buy your cars because of your factory's excellent financials; they buy them because you get fresh thinking to market faster.

DevOps is the Manufacturing Engineering of IT

In manufacturing companies, the hard work of creating the assembly line is done by manufacturing engineering. Let's look at this idea a little more and you'll see that lean manufacturing engineering is really the key idea of the Toyota production system. These guys didn't wake up one day and say "let's set up some kanbans".  You can't wait for the car design to be complete and then figure out how to build it. You have to bring the manufacturing mindset into the engineering world. Software development is more akin to automotive engineering. Who is most like manufacturing? Well, it's IT operations - the sysadmins. They keep the factory running. They like stability and highly repeatable processes. In this analogy, DevOps plays the role in IT that manufacturing engineering plays in a manufacturing setting.

So if DevOps is manufacturing engineering applied to IT, what is it really all about? Well, DevOps people have to be involved in the development process, solving the problems that ops will care about. And they are involved in the ops process, fixing things that development did wrong, so that we don't institutionalize them. It's always better to do the former, but you won't know what the former is until you've done the latter. Lean is a continuous improvement journey; it understands that you have to eliminate waste one improvement at a time. DevOps is about engineering the software so that operations is lean. So finally, we are ready to look at the specific techniques DevOps brings to the table.

What's In the DevOps Bag of Tricks

DevOps techniques fall into two buckets: cultural and technical. You have to do both because it's an "area of the rectangle" kind of thing. You probably have to do some of the cultural things before you have any hope of implementing the technical things, as otherwise the pointy haired bosses descend on the enlightened engineers and ask them why they aren't doing what they should be doing, which is, of course, a question they don't really want answered.


The Cultural Factors DevOps Incorporates

There are a number of cultural conditions that DevOps leverages. It both requires and builds on these:
  1. Create a climate of continuous improvement. Stop thinking it's good to fix things over and over. Start thinking it's good to task people with prevention. Start trying hard to find and fix the bottleneck. Nothing else matters.
  2. Optimize the whole. Task somebody with both shipping new features and keeping operations smooth. These have to be techs who do stuff. Not just management. These people aren't devs and they aren't ops, they are both "DevOps" and neither. But they speak the language of both.
  3. Trust and Respect. Force ops and devs to spend time together. It's easy to bash people when they aren't in the room. The DevOps team from #2 will be making this happen because they will be owning things that cut across.
  4. Be obsessed with eliminating waste. A small team (6-10 people) should have 1 to 3 improvement projects in flight at a time, whose goal is to move the needle. Look at real results and real evidence. The team should prioritize and pick what to solve.
  5. Focus on recovery and prevention. Know your top cause or two of failure and work to prevent them. Deal with the rest by planning for failure. Focus on recovery time. Cut things in half like monitoring latency, app start, VM provisioning, etc...
  6. Build ops concerns into the software. Actually prioritize some developer resources to benefit operations over adding features for customers. Do it because it benefits customers in the long run. If this seems strange to you, you are afflicted with suboptimized thinking. Go talk to your ops. Life is not only about shipping features, but if you want to ship features faster, stop trying to ONLY ship features.
These cultural things are about breaking down the "throw it over the wall" way that software is traditionally delivered to operations and replacing it with a problem solving mentality, and people whose job success depends on results on both sides of the wall. Such people will be highly motivated to dismantle the wall. Help them.

As you solve problems and eliminate waste, you need to be focused on taking the reward by shortening your cycle time and reducing work in progress. The less of this you have, the less total waste when the bigwigs change direction on you. The biggest waste I've seen in IT is sudden "changes in direction", where projects that were going fine are shelved to free up people. If you ship quarterly, canceling something to go after the new shiny object wastes up to 25% of your organization's resources for the year. If you ship weekly, it hurts but you can deal with it. If you ship daily, nobody even notices.

So, don't ask how you can ship 10% faster. Ask how you can ship 10x more often. Of course, you won't be able to snap your fingers and instantly get there. There will be certain activities that by themselves take longer than 1/10 of your current delivery cycle. If you are on a two week cycle (10 business days), and you say you want daily releases, you might find you spend 2 days on QA and a day doing the actual release. The only thing that matters is cutting these things by an order of magnitude. Do not fire your QA and release people, that is not productive. Move the bulk of their work out of the critical path.

Think it can't be done? IMVU can ship 50 times a day. Etsy can ship 25 times a day. Flickr can ship 10+ times a day. They've solved these problems. Do what they do. Which brings us to the main course, in my view:

The Technical Practices of DevOps

Here's the contents of the technical bag of tricks that defines DevOps, at least as I see it in mid-2011:
  1. Infrastructure Automation. Use cloud and virtualization. Have standard images. No exceptions. Eliminate questions and "creativity" in the provisioning process. Use puppet and chef. Measure your provisioning time. Cut it to minutes or seconds.
  2. Standardized Runbooks. Each application and service that you build can't have its own story. Developers don't get to change how their app is started, what its installation looks like, where its logs go, where its configuration goes, or what container it deploys to. DevOps writes this once, and stuff that doesn't comply doesn't ship, because we adopt:
  3. Fully Automated Deployments. The app should be in one artifact, its configuration in another. The deployer takes one bit of information (the app/service name) and looks for updates in the one standard way. If they exist, they are pulled down and installed. One click deployments, then...
  4. Continuous Deployment. One click deployment is one click too many. Build a pipeline and when all the tests pass, no clicks happen and the code is promoted and installed.
  5. Advanced Test Driven Development. Not just unit and integration tests. I'm talking English-language Behavior Driven Development (e.g., Cucumber), including for your UIs. 100%, no exceptions. Have your quality/compliance team do audits to make sure. Even this is not enough:
  6. Behavior Driven Infrastructure. Use the test driven concept to loop back to infrastructure automation. Do not fill out request forms to get stuff. Ever, even if it's automated. Write Behavior Driven Infrastructure tests that deploy to your monitoring infrastructure and assert that your environments will be pingable, will be ssh'able, will be https'able, and will have the feature behavior defined by your BDD. Fail the test, provision, pass some / fail some, deploy, pass all tests, monitor forever. (See the sketch after this list.)
  7. Minimal Marketable Features (MMFs). If it's possible to split your feature, do so. When developers finish stuff, assign them to in-flight MMFs first until those are "full". Stop starting and start finishing. Only pull new features into WIP when forced to because a free developer can't help anything in flight go faster. Management can juggle the roadmap or backlog all they want until it's WIP. 
  8. Ship When Done. I've never understood timeboxed iterations. I call them calendar complacency. Many agile proponents haven't heard: Timeboxed Iterations are Dead.
  9. Runbook Automation - Take common failure modes and automate their responses. Have a socket leak slowly filling up your file handles? No!? Good. Monitor them and automate the bounce anyway. Have bad memory?!? No!? Good. Automate a from-scratch deployment anyway.
  10. Feature Flags and Dark Launches. Every environment besides production is waste. Get rid of it. Keep the virtual segregation of non-production code with feature flags. Prove it works in production before users see it with dark launches, not with an expensive "production-like" staging system. Turn broken stuff off by unflipping a bit.
  11. Perpetual Beta. Let customers control who can see "Beta" stuff. Call this user acceptance testing, so that you can get rid of the waste. Let internal customers pull value by controlling when "beta" ends for a feature. Deliver features fast enough so there is always something in Beta.
  12. Automated Recovery. When web server #3 has some issue, what should the response be? Spin up a new VM, redeploy the app, put it in the pool, and throw away #3. Hone your ability to do this quickly. Measure it in seconds.
  13. Continuous Delivery. Remove all human intervention in the pipeline between the writing of your feature acceptance criteria as a BDD/BDI test and its handoff to customers when the tests pass. Understand continuous delivery vs continuous deployment.
  14. Metrics. Measure stuff that matters. Time is money, so measure how long things take. MTTR is critical, so things like failover time, rebuild from scratch time, app start time, need to be measured. Performance is important, so latency and throughput, etc... should be quantified. There's only a couple other useful things: test code coverage is good, cyclomatic complexity of your code is good.
  15. Process Tooling - Have a single source control solution. Allow all techs to see everything in it, across all teams. Invest in CI environments, monitoring/BDI, and both runbook and infrastructure automation tools. DevOps should own administration of these, and be expected to use them to demonstrate waste elimination in the IT process. DevOps delivers metrics data, as above, and owns the plan and its execution to improve those metrics by leveraging software process tooling.
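As promised in item 6, here's a minimal sketch of a BDI-style smoke test. The host argument, ports, timeouts, and Linux-flavored ping flags are all assumptions; a real version would run from your monitoring infrastructure:

#!/usr/bin/env bash
# BDI smoke test: run it before provisioning (it should fail), provision,
# then let monitoring run it forever (it should pass).
host="${1:?usage: bdi_smoke.sh <host>}"
fail=0
ping -c 1 -W 2 "$host" >/dev/null 2>&1 || { echo "FAIL: not pingable"; fail=1; }
nc -z -w 2 "$host" 22 >/dev/null 2>&1  || { echo "FAIL: not ssh'able"; fail=1; }
curl -fks --max-time 5 "https://$host/" >/dev/null || { echo "FAIL: not https'able"; fail=1; }
[ "$fail" -eq 0 ] && echo "PASS: $host meets its baseline behavior"
exit "$fail"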
Does This Stuff Work? What about Criticism?

IMVU releases 50+ times a day, etsy.com releases 25 times a day, and Flickr releases 10+ times a day. These are big sites with lots of users and complicated use cases. These software teams have created really strong cooperation between their development and operations teams. They are probably running orders of magnitude more tests than you are in a fraction of the time, because they have focused on removing waste this way for years. So yes, these techniques work. All I've done here is itemize the techniques these guys say work, and explain how they are really just the application of lean thinking to the IT value stream.

However, like any buzzword in IT, "DevOps" has its critics. Some of the criticisms are standard nay-sayer cynicism. Some criticisms are legitimate warnings not to turn a good idea into a mindless checklist solution. Let's look at some of the common push-back on DevOps.

One criticism is that DevOps is an "elitist sysadmin club to rebrand an existing problem" (see Wikipedia for origins). Well, elitism is bad, OK. The fact that sysadmins might be involved is rather expected. Nothing in the cultural or technical bag of tricks above requires you to be an elitist sysadmin to be successful, or to be a member of any particular club. Quite the contrary, this bag of tricks is open to anyone seeking to maximize the value IT ships. More correctly though, DevOps may very well be rebranding an existing problem. The fact that half the software industry still ships monthly or longer probably means that the previous brand didn't work too well on them. As long as there is complacency about the gratuitous waste in IT, rebranding the solution makes sense. The point is that the solution -- apply lean thinking to eliminate your waste one project at a time -- works, and is not likely to be displaced by a better solution.

Another deflection seems to be "we don't need to ship that often". This is just the voice of mediocrity. Solving problems and delivering real improvements is hard, so why bother. These people will fight to avoid change, but not too hard, because they are lazy and true resistance takes work. The CIO and his VPs have to lead. If they are questioning the value of greater agility, either go work somewhere else or wait for them to feel the heat from above. Businesses are demanding faster change and IT can't deliver that with quarterly and monthly release cycles anymore. If they try, they get bruised when the business changes its mind on things and weeks or months of effort is wasted.

Another criticism is that the whole DevOps "movement" is here to sell "books, training, conferences, the whole bit", and that "your organization will not be fixed by some sunshine-up-your-ass methodology you read about in a blog or hear about at a conference". See Ted Dziuba's DevOps is a Poorly Executed Scam blog. There's some truth to the notion that any good idea quickly gets over-marketed by people who peddle buzzword compliance. On the other hand, it's really cynical to suggest that anything used to sell books, training, and conferences doesn't work. Should we stop reading books and stop going to conferences? That's silly. And not everybody is selling something. I'm not. Real people with common problems like to share what works with each other.

Improvement is hard. No expensive talking head in a suit can tell you what subset of the 15 technical techniques above solves the top reason your organization can't ship software faster. But the people that ship 10+ times a day got there doing things on this list. Only you can figure out how to apply it to your organization. So get started.

Fixing Git Commit Messages

Problem: Commits in git are immutable, so how do you recover if you make a mistake when you commit? For example, I forgot to put my task number, "ITSM-197" at the beginning of my commit message. I want to fix this before pushing to a remote repo.

btaylor@vancouver ~/src/change-service (ITSM-197) $ git log --oneline master..
87b9f6b fix response elements outside of method element
3123405 ITSM-197 change approver to approval-level
56ecbb4 ITSM-197 remove duplicate declaration of status attribute
7f10503 ITSM-197 fix change request state machine representation in xsd


Whoops, commit 87b9f6b didn't follow the convention. I fix this with "git commit --amend -m", which replaces the tip of the current branch with a new, modified commit. You can leave the -m option off to amend the commit's content instead: any staged changes are folded in, and your editor opens pre-filled with the existing message.

btaylor@vancouver ~/src/change-service (ITSM-197) $  git commit --amend -m "ITSM-197 fix response elements outside of method element"
[ITSM-197 c7310f8] ITSM-197 fix response elements outside of method element
 1 files changed, 6 insertions(+), 6 deletions(-)

btaylor@vancouver ~/src/change-service (ITSM-197) $ git log --oneline master..
c7310f8 ITSM-197 fix response elements outside of method element
3123405 ITSM-197 change approver to approval-level
56ecbb4 ITSM-197 remove duplicate declaration of status attribute
7f10503 ITSM-197 fix change request state machine representation in xsd

Solved. Note that the chain of commits from the master branch has been rewritten to end with my new commit c7310f8 with my ITSM-197 message added. Let's verify that these have the same tree:

btaylor@vancouver ~/src/change-service (ITSM-197) $ git cat-file commit c7310f8
tree f1c5c976a169c513800d9cd99a776957f503886e
parent 3123405682b942e4875399c66265caad42260d64
author Bryan Taylor <btaylor@nospam.com> 1310325458 -0500
committer Bryan Taylor <btaylor@nospam.com> 1310395725 -0500

ITSM-197 fix response elements outside of method element
 
btaylor@vancouver ~/src/change-service (ITSM-197) $ git cat-file commit 87b9f6b
tree f1c5c976a169c513800d9cd99a776957f503886e
parent 3123405682b942e4875399c66265caad42260d64
author Bryan Taylor <btaylor@nospam.com> 1310325458 -0500
committer Bryan Taylor <btaylor@nospam.com> 1310325458 -0500

fix response elements outside of method element

Yep, they both have tree f1c5c97. Note that the previous 87b9f6b commit still exists, but it's no longer in the commit chain from master to the tip.