Archive for the ‘News’ Category

Checklists: the most unsexy way to save millions

Damon Edwards / 

The New Yorker has a great article on the success of using checklists to tame extremely complex systems.

The primary example used in the article is intensive care units in hospitals. Anywhere you see the term “intensive care” substitute “data center” and anywhere you see a name of a medical procedure substitute the name of a technical procedure and the lessons are essentially the same.

What are the lessons?

1. Where checklists have been formalized and rigidly enforced (as a means of documenting and enforcing best practices), millions of dollars have been saved and many deaths (the ultimate “system outage”) have been avoided.

2. The concept of checklists is so simple and unsexy that their awesome saving power is often overlooked. Admit it, your inner geek yawns just thinking about checklists.

How can checklists immediately improve IT operations?

First, agree on your best practices and document them. Second, strictly enforce the rule that all operations activities must follow those procedures. Third, record the completion of each step of the procedure for troubleshooting and analysis.
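Those three steps can be sketched in code. Below is a minimal, purely illustrative checklist runner (all names are hypothetical, not from any real tool): the documented procedure is the list of steps, enforcement means steps must be completed in order, and each completion is timestamped for later analysis.

```python
import time


class Checklist:
    """Minimal checklist runner: enforces step order, records completions."""

    def __init__(self, name, steps):
        self.name = name
        self.steps = list(steps)   # the agreed, documented procedure, in order
        self.completed = []        # (step, timestamp) audit trail

    def complete(self, step):
        # Enforcement: refuse any step that isn't the next one in the procedure.
        expected = self.steps[len(self.completed)]
        if step != expected:
            raise RuntimeError(f"Out of order: expected {expected!r}, got {step!r}")
        self.completed.append((step, time.time()))

    def is_done(self):
        return len(self.completed) == len(self.steps)


# Usage: a hypothetical pre-deployment checklist.
cl = Checklist("deploy", ["verify backup", "notify on-call",
                          "run deploy script", "smoke test"])
for step in ["verify backup", "notify on-call", "run deploy script", "smoke test"]:
    cl.complete(step)
print(cl.is_done())  # True
```

The audit trail in `completed` is the key part: it turns "Bob said he did it" into a record of what was done and when.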

Sounds like such common sense, doesn’t it? If it is, then why do most IT operations fail at implementing such a simple culture of orderly change management?

Book Review: The Visible Ops Handbook


Damon Edwards / 

Operational excellence in IT always seems to be an elusive goal. The attempts you’ll see often range from “magic bullet” technology projects that rarely deliver on expectations to the addition of crushing bureaucracy that is quickly circumvented and rendered ineffectual.

With these thoughts in mind, I was leery when I picked up a copy of The Visible Ops Handbook. Wow, was I ever surprised. The Visible Ops Handbook is a compact and highly effective prescription for achieving operational excellence. It won’t get you all the way to the promised land but it will send you down the path on solid footing.

The approach is not about implementing new technology. It’s not about ivory tower bureaucracy. The Visible Ops Handbook is about bringing reliability, accountability, and predictability to your operations through a commonsense based process that doesn’t require heroic discipline or unrealistic political capital to implement.

Who should buy this book? The short answer is “everyone”. For a longer answer I’ll borrow a passage from the book’s introduction:

  • Organizations that have change management processes, but view these processes as overly bureaucratic and diminishing of productivity. There must be more to change management than bureaucracy, good intentions and scarcely attended meetings.
  • Organizations where, deep down, everyone knows that people circumvent proper processes because crippling outages, finger-pointing, and phantom changes run rampant.
  • A “cowboy culture” where seemingly “nimble” behavior has promoted destructive side effects. The sense of agility is all too often a delusion.
  • A “pager culture” where IT operations believes that true control simply is not possible, and that they are doomed to an endless cycle of break/fix triggered by a pager message at late hours of the night.
  • An environment where IT operations and security are constantly in reactive mode with little ability to figure out how to free themselves from fire fighting long enough to invest in any proactive work.
  • Organizations where both internal and external auditors are on a crusade to find out whether proper controls exist and to push madly for implementing new ones where they are not in place.
  • Organizations where IT understands the need for controls, but does not know which controls are needed first.

Yes, they are talking about you.

It’s a short read (100 pages including several appendices), so buy one for everyone in your department. Available in paperback from Amazon or as a PDF from

*Note: the full title is “The Visible Ops Handbook: Implementing ITIL in 4 Practical and Auditable Steps”. In my opinion, the fact that ITIL is in the title is a bit misleading. There are some sidebar discussions that draw connections between the Visible Ops process and ITIL, but this is first and foremost a book about how to succeed in operations. I suspect the ITIL connection was made for marketing reasons. Don’t let it taint your opinion before you read the book.

a bellwether for new open source models?

Damon Edwards / 

Twitter’s legendary outages have driven a significant number of influential bloggers and pundit types to the new service. It is interesting not just because it’s a Twitter clone but because it’s an Open Software Service.

An Open Software Service is defined as a service:

1) Whose data is open as defined by the open knowledge definition with the exception that where the data is personal in nature the data need only be made available to the user (i.e. the owner of that account).

2) Whose source code is:
   1. Free/Open Source Software
   2. Made publicly available.

For the majority of users of consumer services like Twitter, Facebook, or GMail, whether or not the service is open probably seems inconsequential. However, when it comes to enterprise web services this could be a very interesting trend. The open data part is likely a non-starter, but the open source aspect opens some interesting doors.

Run it yourself, have someone run it for you, or even more interesting… some combination of the two. The opportunities for real innovation under this model are fascinating. In traditional open source software, “adding value” generally meant you added special features or provided timely code updates for a fee. Under this new model, “adding value” is all about the managed services and network effects that you can provide to end users.

I’ll be eagerly watching to see if they can make a going concern out of being a service provider when anyone can run the service for themselves.

If a CDM, SMI-S, CMDB, DASH, CMI falls in Santa Clara… will anyone hear it?

Damon Edwards / 

I’ve been asked a number of times lately if I’m going to the Management Developers Conference taking place this November in Santa Clara. The title sure sounds like something right up my alley. Well, at least it did until I checked out the agenda. In a nutshell, this is a conference of vendors talking about the latest in management standards efforts.

I’ve commented before on how the vendor-backed standards efforts are out of touch with what is going on in the IT trenches. This is yet another example. What happened to the concept that standards start with consensus on the ground and mature from there? The ivory tower approach has been a historical failure in our space, so why not surprise us all and try a different approach this time?

In every enterprise that contends with more than a handful of servers and applications you’ll find a “management developer” of some sort. Amazingly, software vendors and standards bodies appear to have little regard for those developers’ opinions when they meet in their ivory towers to dictate the next round of vendor sports standards. How many of those real management developers would feel that this conference or these standards efforts (or any of the previous failed efforts that these current efforts are repeating) are relevant to their day-to-day lives?

Built for operations (update 1)

Damon Edwards / 

We’ve previously touched on the trend of operations having an impact on application architecture. Up to this point, the shift towards being “built for operations” manifested itself as subtle organic changes that differed from organization to organization. If you stood back far enough, you could see it as an unmistakable trend, but there hasn’t been a common driving force.

The rise of Amazon Web Services, specifically EC2, is a remarkable force that could result in a sea change in the average developer’s assumptions. For example, why do you need persistent local storage on any one machine? In EC2, if you shut off a machine you lose everything on it that isn’t part of the template image used to instantiate it. I can’t get that instance back but I can instantly launch a dozen clones from the same “birth state”. Whoooah… that’s just a little bit different now, isn’t it?

Local writes are lost? Servers are completely built from templates? Launch fully operational clones with the push of a button? The implications of these three simple concepts alone are enough to blow a lot of people’s mental gaskets.
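To make those three concepts concrete, here is a toy model of the template/instance relationship. This is purely illustrative (no real EC2 API calls; all class and field names are invented): local writes live only on the instance, termination destroys them, and any number of clones can be stamped from the same immutable “birth state”.

```python
import copy
import uuid


class TemplateImage:
    """Immutable 'birth state' that instances are stamped from (like an AMI)."""

    def __init__(self, files):
        self.files = dict(files)

    def launch(self, count=1):
        # Every clone starts from the identical template state.
        return [Instance(self) for _ in range(count)]


class Instance:
    def __init__(self, template):
        self.id = uuid.uuid4().hex[:8]
        self.disk = copy.deepcopy(template.files)  # local, ephemeral storage

    def write(self, path, data):
        self.disk[path] = data  # exists only on this instance

    def terminate(self):
        self.disk = None  # local writes are gone; only the template survives


image = TemplateImage({"/etc/app.conf": "v1"})
a, = image.launch()
a.write("/var/log/app.log", "hello")
a.terminate()  # that instance and its local writes are unrecoverable

# ...but a dozen clones of the same birth state are one call away:
clones = image.launch(12)
print(len(clones), clones[0].disk["/etc/app.conf"])  # 12 v1
```

The design consequence is the interesting part: anything worth keeping has to live off-instance or be baked into the template, which is exactly the mental shift the post describes.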

Everyone gives Amazon props for cheap on-demand infrastructure hosting. Perhaps Amazon should get a bit more credit for pushing the art of systems architecture and management forward in a very public and massively appealing way.

Use the word “process” and confusion ensues

Damon Edwards / 

In my last post about the River of News for Monitoring concept, process played a central role. In various conversations I’ve had since writing that post, it’s become clear to me that the word “process” is a tricky and overloaded word. There are lots of processes whirling through an enterprise. There are business processes (customer transactions, supplier transactions, human resources, finance, etc.). There are application processes (pretty self-explanatory). Then there are IT processes. My view of IT processes is that they are the actions that transform your IT assets and their related environments. (Note: obviously you could use an ITIL-like definition born from a standards body at this point, but I’m looking for a simple set of buckets to use without causing further debate)

There are plenty of tools that track and examine business processes. There are plenty of tools that track and examine application processes. However, when it comes to IT process, the available tooling is quite thin. Sure, there are tools (e.g. ticketing systems, bug trackers, approval workflows, etc.) that track the HUMAN aspect of your IT processes, but they give you very little visibility into what actually took place at the system or application level. To make matters worse, under their fancy dashboards, most of these systems rely almost entirely on a human to tell them what was done or observed. In today’s complex and highly distributed environments, it’s almost impossible to get an accurate picture of what really took place using these faulty or outdated techniques.

Skeptical? Give the status quo a test. Walk into any sizable IT operations shop and ask them to do two things:

1. Show you all of the deployment activity, with the context that those activities occurred in, that took place in [pick some slice of their environment] between [pick two arbitrary points in time]. This doesn’t mean things like “Bob said he completed the steps of this process” or “Jane said she ran this set of scripts”. I’m referring to evidence of the actual technical activities that took place across the boxes.

2. Show you the entire lifecycle of [pick an arbitrary application package], from when it was built and packaged to all of its deployments throughout Dev, QA, and Prod environments.
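Neither question is hard to answer if the tooling records machine-level events as they happen. Here is a minimal sketch of such an audit log (all names are hypothetical, not from any real product): question 1 becomes a filter by environment and time window, and question 2 becomes a filter by package.

```python
from dataclasses import dataclass


@dataclass
class DeployEvent:
    time: float    # when the action actually ran on the box
    host: str      # which machine
    env: str       # Dev, QA, Prod, ...
    package: str   # the artifact being acted on
    action: str    # "build", "deploy", "rollback", ...


class AuditLog:
    def __init__(self):
        self.events = []

    def record(self, event):
        self.events.append(event)

    # Question 1: all activity in one environment between two points in time.
    def activity(self, env, start, end):
        return [e for e in self.events
                if e.env == env and start <= e.time <= end]

    # Question 2: the entire lifecycle of one package, across all environments.
    def lifecycle(self, package):
        return sorted((e for e in self.events if e.package == package),
                      key=lambda e: e.time)


log = AuditLog()
log.record(DeployEvent(1.0, "build01", "Dev",  "app-1.2", "build"))
log.record(DeployEvent(2.0, "qa03",    "QA",   "app-1.2", "deploy"))
log.record(DeployEvent(3.0, "web07",   "Prod", "app-1.2", "deploy"))
print([e.action for e in log.lifecycle("app-1.2")])  # ['build', 'deploy', 'deploy']
```

The crucial property is that events are recorded by the automation itself, not typed in after the fact by Bob or Jane.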

I would be shocked if they had this information readily available. In most cases, they couldn’t produce this information at all. For many companies these systems are their source of revenue generation, their “factory floor” if you will, and they can’t answer these simple questions. In any other industry this would be wholly unacceptable.

This is the situation we are working on changing.
