Archive for September, 2007

Can your operations survive the “toss test”?

Damon Edwards / 

I’ve heard this a couple of times and it still makes me chuckle. The best part is that it is actually a great litmus test. If anyone knows who came up with this, let me know.

How to tell if you’ve automated enough of your operations…

1. Grab any machine, rip it out of the rack, and throw it out of the window. Can you automatically re-provision your systems and return the affected application services to their previous state in minutes? (no cheating by failing over to a standby cluster or alternate facility)

2. Grab any engineer and throw him or her out of the same window. Can your operations proceed as normal?
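The spirit of the test is that every host should be rebuildable from a declared specification by an idempotent, automated process, so losing a machine (or a person) costs only the time to re-run the build. Here is a minimal sketch of that idea; the spec contents, the `provision` function, and the in-memory "host" are hypothetical illustrations, not a real tool.

```python
import copy

# Hypothetical desired-state spec for one class of host.
HOST_SPEC = {
    "packages": ["httpd", "memcached"],
    "services": ["httpd"],
    "files": {"/etc/app.conf": "pool_size=32\n"},
}

def provision(host, spec):
    """Drive the host to the declared state, step by step, idempotently."""
    for pkg in spec["packages"]:
        host.setdefault("packages", set()).add(pkg)   # install if missing
    for path, content in spec["files"].items():
        host.setdefault("files", {})[path] = content  # write the config file
    for svc in spec["services"]:
        host.setdefault("running", set()).add(svc)    # ensure the service is up
    return host

# "Toss" the host: start from bare metal and rebuild from the spec alone.
replacement = provision({}, HOST_SPEC)

# Idempotence check: re-running the procedure changes nothing.
snapshot = copy.deepcopy(replacement)
assert provision(replacement, HOST_SPEC) == snapshot
```

The point of the sketch is the shape, not the toy steps: because the full state lives in `HOST_SPEC` rather than in anyone's head or on any one disk, the replacement host is built from the spec with no human improvisation.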

EDIT: Thanks to a tip from Kris, we now know that the inventor of this test was Steve Traugott.

Hard Reality: SaaS – Efficiency = Failed Business Model

Damon Edwards / 

As consultants we often see situations where software as a service (SaaS) and e-commerce companies’ business model expectations are felled by cold hard technical realities.

The naive business point of view usually goes something like this (simplified for illustration): revenues will rapidly grow at an exponential rate (through organic growth, traffic acquisition, or adding new business lines) while costs will slowly grow at a steady linear rate. Sounds great, doesn’t it? Despite protests from the technology group that those aren’t sound fundamentals, executives surprisingly forge ahead with this kind of thinking when drawing up their business plans and forecasts.

But as things unfold, they face the reality that costs are growing just as rapidly as revenue and operational complexity is bogging down the business. In some situations the compounding complexity gets so bad that they hit an inflection point where costs and complexity block further growth. In a few situations, such as companies lucky enough to have the “oversized” economics of a hit service, this surprising growth in costs can be masked by tremendous profit margins. But for the rest of the world, this inability to get (or widen) the separation between the revenue and cost curves ends up taking much of the sheen off of the SaaS and e-commerce business models.

The knee-jerk reaction is the obvious one: tell the technology group to cut its costs! If they haven’t already done so, the first move tends to involve strategies like migrating to open source software on commodity hardware or moving their expensive “resources” (i.e., their people) offshore. These moves often provide the big slashing cuts that give a CFO instant gratification. However, their weakness lies in the fact that they are essentially a one-shot weapon. They don’t fix the fundamental problem: costs are still on the wrong trajectory.

To get to the heart of the problem, the focus needs to be placed on making operational efficiency a core competency. Without these efforts, including the extensive implementation of formalized automation, the long-term success of most SaaS and e-commerce business models will fall into question. And this effort, which doesn’t come without expense, needs to be applied up front. We see too many cases where automation and efficiency are mere afterthoughts to development and marketing… until it’s too late and the business is already under dangerous strain.

Polling dev2ops – How do you maintain system configuration files?

Alex Honor / 

The dev2ops system administration folks in the trenches spend a good deal of their time maintaining the environments that support the development, test, and production instances of their business services.
An essential aspect of environment management is maintaining each host’s configuration to support the application. Everyone has a preferred approach and toolset. Which one do you use, and why?
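One common answer to this poll, sketched minimally: keep a single version-controlled template per config file and render it with per-environment values, rather than hand-editing each host’s copy. The template text, file contents, and variable names below are made up for illustration.

```python
from string import Template

# One template, kept under version control, shared by all environments.
APP_CONF = Template("db_host=$db_host\nlog_level=$log_level\n")

# Per-environment values live in one place instead of scattered edits.
ENVIRONMENTS = {
    "dev":  {"db_host": "db.dev.internal",  "log_level": "debug"},
    "prod": {"db_host": "db.prod.internal", "log_level": "warn"},
}

def render(env):
    """Render the config for one environment; the output is what gets deployed."""
    return APP_CONF.substitute(ENVIRONMENTS[env])
```

The design choice this illustrates is that a host’s config becomes a pure function of (template, environment), so a diff between dev and prod is a diff of two small value dictionaries rather than two drifting files.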

Please don’t forget to leave a comment about your response!

Polling dev2ops

Alex Honor / 

When you think about the multitude of companies that conduct business via software as a service, you’d expect to find a diverse range of underlying business models: CRM, e-tailing, social networking, advertising, etc. But if you’re not in the trenches with the dev2ops folks, you might be amazed at the variety of processes and mechanisms these companies employ to manage their application environments.

Lots of research and mature best practice has developed in areas like security, availability, and scale. Because these areas are more established and the problems better understood, one can rely on their proven methods and exploit their commodity solutions. There is much less agreement about how IT should technically manage software-as-a-service environments, especially ones that change often and are complex, multi-layer, multi-site online software systems.

Sure, there are established methods for scheduling and planning application change, as evidenced by the various trouble ticket, bug tracking, and issue management tools out there. There are also emerging IT service management practices and paradigms from IT governance initiatives such as ITIL. These tools and methodologies mostly focus on “people management” and are essential for coordinating human activity and workflow. What is surprising is the lack of consensus on “how” changes are technically performed and “what” is used to perform them.

After all, the dev2ops folks in the trenches know it boils down to a series of technical steps: people and/or scripted tools distribute various files, set environment configurations, import data, alter stored procedures, synchronize content repositories, check and/or modify permissions, stop and start processes, and so on. It is at this level of change implementation where such an astounding variety of tools and procedures can be witnessed.
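Those kinds of steps can be expressed as one ordered, scripted procedure instead of a grab bag of per-team manual actions. A toy sketch of that shape, where each step is a function operating on a simulated in-memory environment (a real runner would shell out to the same commands an operator types by hand; all names here are invented):

```python
# Each step is small, named, and ordered; the list IS the procedure.
def stop_processes(env):
    env["running"] = False

def distribute_files(env):
    env["release"] = env["staged_release"]   # push the staged build live

def set_permissions(env):
    env["perms_checked"] = True              # check and/or fix ownership

def start_processes(env):
    env["running"] = True

CHANGE_PROCEDURE = [stop_processes, distribute_files,
                    set_permissions, start_processes]

def apply_change(env):
    """Run every step in order; the procedure itself is the documentation."""
    for step in CHANGE_PROCEDURE:
        step(env)
    return env

env = apply_change({"running": True, "staged_release": "r42"})
```

Once the steps live in one shared list like this, different teams and environments can reuse the same procedure instead of each inventing their own.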

How changes are implemented, and with what, can differ even between groups within the same company. Developers may use developer tools to update running applications in test environments while operations uses a completely separate set of tools and procedures to maintain production environments. Businesses that operate more than one application might see this range of practices multiplied. The worst case is a different process and set of tools for each combination of application, team, and environment.

Why do we see such a variety of approaches used to change application environments? Does each group need to invent their own procedures and toolsets? Are environments and applications so unique that there must be a customized and optimized solution for each case? Can one establish some basic approaches and paradigms that can be shared across organizational teams, even businesses?

To help answer these questions and better understand why folks choose from such a diverse array of approaches and tools, we here at dev2ops will conduct a series of polls. But we need your help in the form of participation. We’ll come up with the first round of polls and rely on you to respond and post about your selections. You can also send us survey ideas, and if that ground hasn’t already been covered, we’ll post those, too.
During the course of the polling, we’ll look for common patterns and pitfalls; some will be obvious to some folks, others novel. In the end, we can look for where there is common agreement and compare our own current practices against it.

So, look out for the polls and vote!

Change automation logs: a key tool for resolving outages

Damon Edwards / 

Todd Hoff at High Scalability wrote an interesting piece on what and how to log. His position is that you essentially need to “log everything all the time”. However, curiously missing from his list of what “everything” means is the full detail on how application release/update procedures impacted the environment.

This is a problem we see all too often: extensive system logging efforts, but no visibility into the change management processes. Without the complete picture you end up studying symptoms and guessing at the root event, rather than quickly identifying the root event and spending your time on a solution. Under the pressure of a significant outage, you can’t overestimate the value of having the right tools at your disposal.

From my more detailed comment to Todd’s post:

Info you can’t get from normal system and application logs:
1. When did the application change?
2. What was changed? What are all of the code, data, and content assets related to that change?
3. Exactly what procedures were run to produce the change? Who ran the commands? What variables/inputs did the procedures use?
4. What nodes did those procedures touch?
5. What commands can I run to immediately put everything back into a last known good state? (often through a “roll-forward” rather than a true “roll-back” procedure)
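The five questions above could be answered by having every deployment emit one structured record. A hedged sketch of what such a record might look like; the field names and all the example values (operator, release numbers, hostnames, timestamp) are invented for illustration, not from any particular tool.

```python
import json

def log_change(when, what, procedure, operator, inputs, nodes, rollback):
    """Serialize one change event as a JSON line for the change log."""
    record = {
        "when": when,            # 1. when did the application change?
        "what": what,            # 2. which code/data/content assets changed?
        "procedure": procedure,  # 3. exactly what was run...
        "operator": operator,    #    ...and by whom?
        "inputs": inputs,        #    ...with what variables/inputs?
        "nodes": nodes,          # 4. which nodes did the procedure touch?
        "rollback": rollback,    # 5. how do we get back to last known good?
    }
    return json.dumps(record)

# Example entry with made-up values.
entry = log_change(
    when="2007-09-14T03:12:00Z",
    what=["app.war", "schema-v7.sql"],
    procedure="deploy.sh --release r42",
    operator="dedwards",
    inputs={"release": "r42"},
    nodes=["web01", "web02"],
    rollback="deploy.sh --release r41",
)
```

Because the record names the exact rollback (or roll-forward) command, question 5 is answered before the outage starts rather than reconstructed during it.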

The common perception is that it just isn’t possible or practical to collect this kind of data in an automated and authoritative manner. It is possible, but it depends on the correct choice of build, deployment, and configuration management tooling.