January 2010 - dev2ops

Archive for January, 2010

Scheduled Downtime vs. Seamless Conversion

Lee Thompson / January 30th, 2010

I fly a lot. I did before I was a consultant, but as a consultant I depend heavily on air service. American Airlines effectively fired me as a customer last year when they dropped the proverbial “nerd bird” flight, the Austin to San Jose non-stop. On the last few nerd bird flights there was a lot of onboard discussion about Central Texas to Bay Area route options after the sudden termination notice American gave us. Most of the folks I spoke with had 1 to 2.5 million miles flown on AA. That is a huge amount of butt time in 500+ mph chairs!

When I’m on a plane I frequently work. But many times I schedule some personal downtime, have a gin-n-tonic, and watch a DVD. Everyone needs scheduled downtime. Most of you Dev2Ops readers are probably burning some of your personal scheduled downtime right now reading this post. I assume Dev2Ops readers don’t get much scheduled downtime for the online business infrastructure they support (if any!). So I was blown away when I got this email from the new transportation services vendor I hired…

First off, let me say jetBlue does a great job picking up poorly serviced travel customers like the recently terminated nerd bird flock, and their prices are just so reasonable. The in-flight TV is nice, so you can do things like get the latest news from Haiti. JetBlue is donating their services to get personnel and supplies in, which is just fantastic to see.

But now the eBusiness technologist part of my brain kicks in and says “WHAT!” An airline business running without a ticketing system for two days, intentionally? Scheduled downtime; are you crazy? I can only imagine what kind of bind jetBlue is in with their technical infrastructure. Surely the CEO signed off on such a project plan, so the issue must be nasty. I’m left with so many questions. Why can’t you phase in the new ticketing system by route and gradually obsolete the older system? Why can’t you do what Facebook describes as “Dark Launches”? Why can’t the new system run 5 minutes after the old system is powered off?

I’m asking the wrong questions.

You have to take your technologist hat off, put your business hat on, and ask different questions. What is the cost difference between a seamless conversion and a scheduled outage conversion? Would jetBlue have to raise airfares or ask the shareholders to suffer losses? Does the complexity of the requirements to implement a seamless conversion push the conversion out an extra year, hindering the business? Does the added complexity of a seamless conversion add tremendous risk to business operations? Having done numerous transaction system conversions in my career, I can easily say a seamless conversion is probably 4x the size of a scheduled outage conversion (or more). Minimizing complexity reduces risk by more than that 4x factor, because the relationship between risk and complexity is nonlinear! Case in point: if the business signed up for the added expense and schedule delay, what business impact would occur if the technology effort failed? I would imagine, given the above email, the risk was too great.

I’m about to head to the SFO airport and jump on a jetBlue flight. By that time the airline won’t have a ticketing system. Obviously, my travel schedule makes me personally interested in jetBlue’s success in this conversion!

BTW – I chose United non-stop to San Francisco and jetBlue non-stop on the way back. Good news… Alaska Airlines service to San Jose picks up in mid-March!

Please welcome dev2ops’s newest blogger…. Lee Thompson

Damon Edwards / January 30th, 2010

Lee Thompson is joining us as a dev2ops contributor.

Lee is currently a consultant specializing in development and operations practices for large-scale and mission-critical web applications. His current clients include household names in the banking, social networking, and e-commerce fields.

Previously, Lee was the Chief Technologist of E*TRADE Financial. To learn more about Lee, you might want to check out this interview we did a few months back.

Lee has seen the world from the Dev side, the Ops side, and everything in between. Alex and I are pleased to welcome him to the dev2ops community and look forward to his contributions.

How to measure the impact of IT operations on your business (Part 2)

Damon Edwards / January 22nd, 2010

Part 1: Putting a metrics/KPI program into place in 6 steps

Part 2: Identifying candidate KPIs to evaluate

In my first post in this series, I went through the six steps for putting into place a metrics/KPI program that measures the performance of your IT operations within the context of your business goals.

When consulting, this is usually the point where I stress that we have to work the process in order to come up with KPIs that mean something to your specific business. I explain that there is no such thing as one size fits all in this matter. Despite that, the very next question I’m usually asked is: “Can you tell us now what KPIs a company like ours should be measuring?”

Just providing a list of examples would probably send them off on the wrong course, chasing KPIs that were important to someone else’s business. Since figuring out what to measure can be as valuable as the actual measurement, I instead walk them through the following concepts to get them started on step #2 and step #3 of the process.

First, stop and consider what “measurement” really means

Measurement: a set of observations that reduce uncertainty where the result is expressed as a quantity

I lifted the above definition from measurement guru, Douglas W. Hubbard. However, if you noodle around in the academic writings on this topic, you’ll see that it’s a fairly accepted definition.

When looking for a way to measure something, keep this definition in mind. Whether it’s problem solving, allocating budget, or prioritizing your resources, reducing uncertainty gives you a decisive and valuable advantage. You don’t need to have absolute precision. A coarse swing at something is often going to be enough to get started reducing uncertainty and providing business value.

Don’t forget that not every measurement has to be expressed as a simple number (e.g. “137 occurrences” or “83.2% of the time”). You can measure things on an ordinal scale (e.g. “this is less than that” or “this gets 3 out of 4 stars”). You can use nominal measurements where you are only considering membership in a set (e.g. “this action is in category x, that action is in category y”). Yes/No questions are a valid kind of measurement. You should even consider using subjective methods of measurement (e.g. “do you feel this week was better than last week?”).
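To make the different kinds of measurement a bit more concrete, here is a minimal Python sketch of how each one might be summarized. The scale names, the `summarize` function, and the choice of summary for each scale are my own illustrative assumptions, not part of any standard or of the KPI process itself.

```python
from collections import Counter
from statistics import mean, median

def summarize(scale, observations):
    """Summarize observations in a way that suits their measurement scale.

    The scale names here are hypothetical labels chosen for this sketch.
    """
    if scale == "quantity":
        # Simple numbers (e.g. occurrences per day): an average is meaningful.
        return mean(observations)
    if scale == "ordinal":
        # Rankings or star ratings: order matters, distance does not,
        # so the median is a safer summary than the mean.
        return median(observations)
    if scale == "nominal":
        # Category membership only: report the most common category.
        return Counter(observations).most_common(1)[0][0]
    if scale == "yes_no":
        # Yes/No answers: report the fraction of "yes".
        return sum(1 for answer in observations if answer) / len(observations)
    raise ValueError(f"unknown scale: {scale}")
```

Subjective measurements (e.g. “was this week better than last week?”) can usually be recorded as yes/no or ordinal answers and summarized the same way.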

Also, don’t expect that every measurement will be made at the same time interval. Sometimes it makes sense to measure certain things on a daily basis. Sometimes it makes sense to measure other things on a quarterly basis.

No matter what type of measurement you end up employing, make sure that it is clear to everyone, even the casual observer, how and why you are measuring something. This is critical for gaining buy-in and avoiding gaming (which both seem like excellent topics for future posts in this series!).

Then use “The Four Buckets” as a guide to start looking for candidate KPIs

At the end of the KPI development process, you are going to be tracking a small set of KPIs that best measure the performance of your IT operations in its role supporting your business’s goals. But to get there, you need to start with a larger pool of candidate KPIs. In my experience, most useful measurements tend to fall into one or more of the following categories.

I call these “The Four Buckets”.

Again, keep in mind that at this stage you are looking to surface possible KPIs that will be fed into the rest of the process. The end result will only be a small subset of what you started with (5 – 10 at the most!).

1. Resource Utilization – How resources are allocated and how efficiently they are used. Usually we’re talking about people, but other kinds of resources can fall into this bucket as well.

How much time do developers and administrators spend on build and deployment activity?
How much productivity is lost to problems and bottlenecks? What is the ripple effect of that?
What’s the ratio of ad-hoc change or service recovery activity to planned change?
What’s the cost of moving a unit of change through your lifecycle?
What’s the mean time to diagnose a service outage? Mean time to repair?
What was the true cost of each build or deployment problem (resource and schedule impact)?
What percentage of Development-driven changes require Operations to edit/change procedures or edit/change automation?
How much management time is spent dealing with build and deployment problems or change management overhead?
Can Development and QA successfully deploy their own environments? How long does it take per deployment?
How much of your team’s time is spent recreating and maintaining software infrastructure that already exists elsewhere?
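As a rough illustration of how a couple of these questions could turn into numbers, here is a small Python sketch. The incident log, the change counts, and the choice to measure both durations from the start of the outage are all assumptions made up for this example; your own definitions may differ.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (outage started, cause diagnosed, service restored)
incidents = [
    (datetime(2010, 1, 4, 9, 0), datetime(2010, 1, 4, 9, 30), datetime(2010, 1, 4, 10, 0)),
    (datetime(2010, 1, 12, 14, 0), datetime(2010, 1, 12, 15, 30), datetime(2010, 1, 12, 16, 0)),
]

def mean_duration(durations):
    """Average a list of timedelta values."""
    return sum(durations, timedelta()) / len(durations)

# Mean time to diagnose: outage start until the cause is identified.
mttd = mean_duration([diagnosed - started for started, diagnosed, _ in incidents])

# Mean time to repair: outage start until service is restored (an assumed definition).
mttr = mean_duration([restored - started for started, _, restored in incidents])

# Hypothetical change counts for one period.
changes = {"planned": 40, "ad_hoc": 10}
ad_hoc_to_planned = changes["ad_hoc"] / changes["planned"]
```

Even a coarse log like this, kept by hand, is often enough to start reducing uncertainty about where the time goes.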

2. Failure Rates – Looking at how often processes, services, or hardware fail is a pretty obvious area of measurement.

What was the ratio of successful builds to failed or problematic builds?
What is the ratio of build problems due to poor code vs poor build configuration?
What was the ratio of successful deployments to failed or problematic deployments?
What is the ratio of deployment problems due to poor code vs poor deployment configuration or execution?
What is the mean time between failures?
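A minimal Python sketch of two of these measurements follows. The function names and the outcome labels are hypothetical, the success figure is expressed as a rate (successes over all attempts) rather than a strict success-to-failure ratio, and the time between failures is simplified to the gap between failure start times rather than true uptime.

```python
from datetime import datetime, timedelta

def success_rate(outcomes):
    """Fraction of builds or deployments that succeeded.

    `outcomes` is a list of labels; anything other than "ok" counts
    as failed or problematic (an assumption for this sketch).
    """
    return sum(1 for outcome in outcomes if outcome == "ok") / len(outcomes)

def mean_time_between_failures(failure_times):
    """Average gap between consecutive failure start times.

    Returns None when there are fewer than two failures to compare.
    """
    ordered = sorted(failure_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        return None
    return sum(gaps, timedelta()) / len(gaps)
```

Splitting the outcome labels further (e.g. by “poor code” vs “poor configuration”) would give the cause ratios mentioned above, provided someone records the cause at the time of the failure.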

3. Operations Throughput – The volume and rate at which change moves through your development to operations pipeline.

How long does it take to get a release from development, through testing, and into production?
How much of that is actual testing time, deployment time, handoff time, or waiting?
How many releases can you successfully deploy per period?
How many successful individual change requests can your operations team handle per period?
Are any build and deployment activities the rate-limiting step of your application lifecycle? How does that limit impact your business?
How many simultaneous changes can your team safely handle?
What is the business’s perceived “wait time” from code completion to production deployment of a feature?
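To show how the lead time question and its breakdown might be answered, here is a short Python sketch. The stage names, timestamps, and the `lead_time_breakdown` function are invented for illustration; the only idea taken from the list above is that whatever time is not spent in a named stage counts as waiting.

```python
from datetime import datetime

# Hypothetical timeline for one release: (stage, start, end)
stages = [
    ("testing", datetime(2010, 1, 4, 9), datetime(2010, 1, 6, 9)),
    ("handoff", datetime(2010, 1, 6, 9), datetime(2010, 1, 6, 15)),
    ("deployment", datetime(2010, 1, 8, 9), datetime(2010, 1, 8, 15)),
]
code_complete = datetime(2010, 1, 4, 9)
in_production = datetime(2010, 1, 8, 15)

def lead_time_breakdown(code_complete, in_production, stages):
    """Return hours spent per stage, plus the remainder as 'waiting'."""
    total_hours = (in_production - code_complete).total_seconds() / 3600
    hours = {name: (end - start).total_seconds() / 3600 for name, start, end in stages}
    # Time not accounted for by any stage is treated as waiting.
    hours["waiting"] = total_hours - sum(hours.values())
    return hours

breakdown = lead_time_breakdown(code_complete, in_production, stages)
```

In this made-up release, waiting is nearly as large as testing, which is the kind of finding that tends to get a business owner’s attention.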

4. Agility – This looks at how quickly and efficiently your IT operations can react to changes in the needs of your business. This can include change driven by internal or external business pressures. There is often considerable overlap with bucket 3; however, this bucket is focused more on changing/scaling processes than it is on the throughput of those processes once in place. (Of course, you can always argue that all four buckets play some role in enabling a business to be more “agile”.)

How quickly can you scale up or scale down capacity to meet changing business demands?
What’s the change management overhead associated with increasing/decreasing capacity? What’s the risk?
How quickly could you adapt your build and deployment systems to automate any new applications or acquired business lines, and what would it cost?
What would it cost you to handle an x% growth in the number of applications or business lines (direct resource assignment plus any attention drain from other staff)?
Could your IT operations handle an x% growth in the number of applications or business lines? (i.e. could it even be done?)
