Can your operations survive the “toss test”?


Damon Edwards / 

I’ve heard this a couple of times and it still makes me chuckle. The best part is that it is actually a great litmus test. If anyone knows who came up with this, let me know.

How to tell if you’ve automated enough of your operations…

1. Grab any machine, rip it out of the rack, and throw it out of the window. Can you automatically re-provision your systems and return the affected application services to their previous state in minutes? (no cheating by failing over to a standby cluster or alternate facility)

2. Grab any engineer and throw him or her out of the same window. Can your operations proceed as normal?

EDIT: Thanks to a tip by Kris from, we now know that the inventor of this test was Steve Traugott from

Hard Reality: SaaS – Efficiency = Failed Business Model

Damon Edwards / 

As consultants we often see situations where software as a service (SaaS) and e-commerce companies’ business model expectations are felled by cold hard technical realities.

The naive business point of view usually goes something like this (with obvious simplification for illustrative purposes): Revenues will rapidly grow at an exponential rate (either through organic growth, traffic acquisition, or adding new business lines) and costs will slowly grow at a steady linear rate. Sounds great, doesn’t it? Despite protests from the technology group that those aren’t sound fundamentals, executives will surprisingly forge ahead with this kind of thinking when drawing up their business plans and forecasts.

But as things unfold they face the reality that costs are growing just as rapidly as revenue and operational complexity is bogging down the business. In some situations the compounding complexity actually gets so bad that they hit an inflection point where costs and complexity can block further growth. In a few situations, such as companies that are lucky enough to have the “oversized” economics of a hit service, this surprising growth in costs can be masked by tremendous profit margins. But for the rest of the world, this inability to get (or widen) the separation between the revenue and cost curves ends up taking much of the sheen off of the SaaS and e-commerce business models.

The knee-jerk reaction is the obvious one: tell the technology group to cut its costs! If they haven’t already done so, the first move tends to involve strategies like migrating to open source software on commodity hardware or moving their expensive “resources” (i.e. their people) off-shore. These moves often provide the big slashing cuts that give a CFO instant gratification. However, their weakness lies in the fact that they are essentially a one-shot weapon. They don’t fix the fundamental problem. Costs are still on the wrong trajectory.

To get to the heart of the problem, the focus needs to be placed on making efficiency of operations a core competency. Without these efforts, including the extensive implementation of formalized automation, the long-term success of most SaaS and e-commerce business models will fall into question. And this effort, which doesn’t come without expense, needs to be applied upfront. We see too many cases where automation and efficiency are mere afterthoughts to development and marketing… until it’s too late and the business is already under dangerous strain.

Polling dev2ops – How do you maintain system configuration files ?


Alex Honor / 

The system administration dev2ops folks in the trenches spend a good deal of their time maintaining environments supporting the development, test and production instances for their business services.
An essential aspect of environment management is maintaining the host’s configurations to support the application. Everyone has their preference as to the approach and toolset they use. Which one do you use and why?

Please don’t forget to leave a comment about your response!

Polling dev2ops

Alex Honor / 

When you think about the multitude of companies that conduct business via software as a service, you’d expect to find a diverse range of underlying business models: CRM, e-tailing, social networking, advertising, etc. But if you’re not in the trenches with the dev2ops folks, you might be amazed at how diversely these companies manage their application environments and the variety of processes and mechanisms they employ.

A lot of research and mature best practice has developed in areas like security, availability and scale. Because these areas are more established and the problems better understood, one can rely on their proven methods and exploit their commodity solutions. There is much less agreement about how IT should technically manage software as a service environments, especially those that change often and are complex, multi-layer, multi-site online software systems.

Sure, there are established methods around scheduling and planning application change, as evidenced by the various trouble ticket systems, bug trackers, and issue management tools out there. There are also emerging IT service management practices and paradigms from IT governance initiatives, such as ITIL. These tools and methodologies mostly focus on “people management” and are essential to coordinate the human activity and workflow. What is surprising is the lack of consensus on “how” changes are technically performed and “what” is used to perform the changes.

After all, the dev2ops folks in the trenches know it boils down to a series of technical steps: people and/or scripted tools distribute various files, set environment configurations, import data, alter stored procedures, synchronize content repositories, check and/or modify permissions, stop and start processes, etc. It is at this level of change implementation that such an astounding variety of tools and procedures can be witnessed.

How and what implements change can even differ between groups within the same company. Developers may use developer tools to update running applications in test environments while operations uses a completely separate set of tools and procedures to maintain production environments. Further still, businesses that operate more than one application might see this range of practices multiplied. The worst case is one where there is a different process and set of tools for each combination of application, team, and environment.

Why do we see such a variety of approaches used to change application environments? Does each group need to invent their own procedures and toolsets? Are environments and applications so unique that there must be a customized and optimized solution for each case? Can one establish some basic approaches and paradigms that can be shared across organizational teams, even businesses?

To help answer these questions and better understand why folks choose from such a diverse array of approaches and tools, we here at dev2ops will conduct a series of polls. But, we need your help in the form of participation. We’ll come up with the first round of polls and rely on you to respond and post about your poll selections. You can also send survey ideas to and if that ground hasn’t already been covered, we’ll post those, too.
During the course of the polling, we’ll look for common patterns and pitfalls, some obvious to some folks and novel to others. In the end, we can look for where there is common agreement and compare our own current practices to those.

So, look out for the polls and vote!

Change automation logs: a key tool for resolving outages


Damon Edwards / 

Todd Hoff at High Scalability wrote an interesting piece on what and how to log. His position is that you essentially need to “log everything all the time”. However, curiously missing from his list of what “everything” means is the full detail on how application release/update procedures impacted the environment.

This is a problem we see all too often: extensive system logging efforts but no visibility into the change management processes. Without the complete picture you spend your time studying symptoms and guessing at the root event, rather than quickly identifying the root event and spending your time on a solution. Under the pressure of a significant outage, you can’t overestimate the value of having the right tools at your disposal.

From my more detailed comment to Todd’s post:

Info you can’t get from normal system and application logs:
1. When did the application change?
2. What was changed? What are all of the code, data, and content assets related to that change?
3. Exactly what procedures were run to produce the change? Who ran the commands? What variables/inputs did the procedures use?
4. What nodes did those procedures touch?
5. What commands can I run to immediately put everything back into a last known good state? (often through a “roll-forward” rather than a true “roll-back” procedure)
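To make the five questions above concrete, here is a minimal sketch of what a single change log entry could capture. It is only an illustration, not the output of any particular tool: the `ChangeRecord` fields, the `record_change` and `changes_touching` helpers, and the sample values (`deploy-webapp`, `web01`, and so on) are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json


@dataclass
class ChangeRecord:
    """One entry in a hypothetical change automation log."""
    changed_at: str         # 1. when the application changed (ISO 8601, UTC)
    assets: list            # 2. code, data, and content assets in the change
    procedure: str          # 3. the procedure that was run
    run_by: str             # 3. who ran the commands
    inputs: dict            # 3. variables/inputs the procedure used
    nodes: list             # 4. nodes the procedure touched
    restore_commands: list  # 5. commands to get back to a last known good state


def record_change(log, **fields):
    """Append a timestamped ChangeRecord to an in-memory log and return it."""
    entry = ChangeRecord(
        changed_at=datetime.now(timezone.utc).isoformat(), **fields
    )
    log.append(entry)
    return entry


def changes_touching(log, node):
    """Return the records that touched a node, most recently recorded first."""
    return [entry for entry in reversed(log) if node in entry.nodes]


if __name__ == "__main__":
    log = []
    record_change(
        log,
        assets=["webapp-1.4.2.war", "schema-migration-17.sql"],  # made-up assets
        procedure="deploy-webapp",                                # made-up name
        run_by="alice",
        inputs={"version": "1.4.2"},
        nodes=["web01", "web02"],
        restore_commands=["deploy-webapp --version 1.4.1"],       # roll-forward
    )
    # During an outage on web01, the first question is "what changed here?"
    for entry in changes_touching(log, "web01"):
        print(json.dumps(asdict(entry), indent=2))
```

The point of the sketch is that the record is written by the automation itself at the moment the procedure runs, so the answers are authoritative rather than reconstructed from memory after the fact.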

The common perception is that it just isn’t possible or practical to collect this kind of data in an automated and authoritative manner. It is, but it depends on the correct choice of build, deployment, and configuration management tooling.

Dev sees the world one way while Ops sees it a different way


Damon Edwards / 

There is an old saying that if all you have is a hammer, everything begins to look like a nail.

Much of the dev2ops problem comes from Dev seeing the world one way while Ops sees it a different way. When it comes to developing, deploying, and supporting what the business owners would call an application, these two world views often spectacularly collide.

To the dev folks, their view of the world is all about the build (and its related dependencies) along with data/content requirements (schema, standing data, catalog/user data, web assets, etc…). They see the application from the inside looking out, fully aware of its internal components and inner workings. The lifecycle they care about begins with business requirements, continues through a set of related builds, and then on to an integrated running service. Their goal is to then promote this service from one dev and test environment to another until it becomes someone else’s problem (that’s usually the ill-fated handoff to ops). Supporting, and in many ways enforcing, this point of view are the day-to-day Dev tools: software version control systems, build tools, IDEs, requirements management tools, etc.

To the ops folks, their view of the world is all about the physical architecture, the network, and the box. Applications? Ops doesn’t want to know about the inner workings of the applications. They see applications from the outside looking in, as just another set of files that are part of the puzzle they manage. Environments? Sets of distinct boxes wherever possible (sometimes divided by datacenter, VLAN, racks, etc.). The lifecycle Ops cares about takes a much different form than the one that Dev cares about. Quality of service, capacity, and availability are the driving factors for Ops. Ops establishes the needed hardware and software platform profiles in a lab setting and then uses those profiles to build and refresh the different environments they control (generally staging, production, and disaster recovery). The preferred Ops toolsets, usually EMS systems (Tivoli, Opsware, OpenView, OpenNMS, BladeLogic, etc.), support and enforce this point of view.

Unfortunately, aligning these two world views is just as difficult as aligning their respective toolsets.

It is all too common that users (and by extension, vendors) from both groups want to force their tools and point of view on the other group. But no matter which direction the management mandate ultimately comes from, the truth lies just beneath the surface. Dev wants nothing to do with the Ops tools and Ops wants nothing to do with the Dev tools. They see little value in the other group’s tooling of choice because they think it will add complexity to their lives and won’t help them to complete their core jobs. Because of these dramatically differing points of view, the conversations in the trenches will often revolve around how it’s “the other group’s fault” and why “they just don’t get it”.

Fundamentally, the Dev and Ops groups are at odds because their primary concerns differ. Dev doesn’t want to know about networks and boxes. In most situations Dev doesn’t even really care where something is deployed until they get a late night phone call from Ops to debug a mysterious problem. Ops, on the other hand, just wants “operations ready” application releases and is frustrated that it never seems to get them.

Sadly, too much corporate blood has been shed over a collision of differing points of view that are both equally valid.
