
Archive for the ‘DevOps’ Category

Clouds, Virtualization, and Continuous Deployment all share the same Achilles heel



Damon Edwards / 

Lately there are three “hot trends” that we regularly get asked about in our day jobs as web operations consultants:


  • Cloud Computing (meaning elastic computing resources paid for on-demand)
  • Virtualization in Production (meaning using virtual machines for purposes other than development or QA)
  • Continuous Deployment (meaning the ability to automatically deploy and test a full environment after each and every build that comes out of a Continuous Integration driven build process)

There is a common thread that ties all of these together — Fully Automated Provisioning**. You can’t achieve the full benefit of any of these advances without Fully Automated Provisioning.


In this previous post, we covered the reasons why efforts to harness the power of cloud or virtualized infrastructure will fail without Fully Automated Provisioning.

Continuous Deployment suffers from a similar weakness. If you don’t have Fully Automated Provisioning in place, the efforts required to provision your applications and sort out the resulting problems will outweigh whatever benefits you set out to gain.

IT automation may not be the sexiest field. However, IT automation (and specifically Fully Automated Provisioning) is the necessary foundation that lets you continually reap the benefits of the latest headline-grabbing initiatives.

**To read about what the criteria are for achieving Fully Automated Provisioning, check out this blog post and whitepaper.

Looking for “cross-over people” at O’Reilly Velocity 08



Damon Edwards / 

In the comments section of a previous post, Berkay nicely summed up why it is so difficult to solve the development to operations problem:

“There are very few people that have the crossover skills. Developers who have operations experience/knowhow and operations people who have development/deployment experience is rare. Further there are organizational silos enforcing this divide.”

The observed gap between development personnel and operations personnel is a subject we’ve touched on before. Much of the success of running an efficient business based on online services depends on closing this gap.


The first step to closing the development to operations gap is getting everyone talking and establishing a common vocabulary. Any event that promotes these types of discussions is a good thing in our book.

O’Reilly’s new conference, Velocity, should be a good forum to hold these conversations. Alex and I will be attending Velocity and will also be participating as exhibitors (with ControlTier). The bulk of the conference agenda is focused on infrastructure design and management rather than application deployment and management, but the presence of both is a good sign.

If you are attending the conference, we look forward to meeting you. If you aren’t registered yet, you can use the code “vel08js” for 20% off.

Package-Centric Application Release Methodology: What is it?



Alex Honor / 

Spend some time in the trenches at a software as a service (SaaS) or e-commerce company and you’ll find a prevailing opinion that traditional release management processes don’t keep up with the rapid pace of application changes and rising system complexity. SaaS and e-commerce environments are the ultimate example of IT moving from being a supporting cost center to the actual source of revenue production. For SaaS and e-commerce companies, the application release process is what governs their “factory floor” and this process needs to be run with as much predictability, measurability, and automated efficiency as any modern manufacturing process. To make this governing process run smoothly you need to hand off well-defined units of change from one part of the process to the next, and this is where a package-centric application release methodology comes into play. The package-centric application release methodology augments traditional change management to help implement and execute change.

Traditional vs. Package-Centric Methodologies
Traditional release management methodologies are dominated by human process definition and organizational workflows that govern how change tasks should be controlled and who should carry them out. Toolsets to support these traditional methodologies are centered around task approval, auditing, and planning. In other words, a traditional release management methodology is all about managing people and their activities. These human-oriented workflows are important, but on their own they do little to solve the inefficiencies and complexities that plague technicians in SaaS and e-commerce companies when the time comes to execute the procedures that carry out those changes. Plainly put, you can get a good handle on what all your people are doing, but you don’t get a good handle on how their changes are technically performed.

In contrast, a package-centric release management methodology focuses on how to facilitate collaboration of decentralized groups to coordinate the implementation of change. Under this methodology, the product of each team is released as a standard unit, one that contains instructions on how to apply the change as well as any relevant content needed for the change. These standardized units of change enable an automated change management approach that rationalizes and coordinates changes that originate from disparate teams, each of which can affect the greater application service operation. In other words, package-centric release management methodology focuses on format, modularity, mechanisms, and technical workflow, all in the context of the larger integrated application.

The basic element of the package-centric methodology is, not surprisingly, the package. The package concept prescribes a standard unit of distribution as well as standardized methods for installation and removal. The essential benefit of formal packaging is the provision for changes to be predictably migrated, done, and undone. A package also carries with it essential information like what version it is, who made it, and what it depends on. Ultimately, packages become the common currency of change within the IT organization.
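To make the idea concrete, here is a minimal Python sketch of what such a unit might look like. Everything in it is an assumption made for illustration: the Package class, the make_package helper, and the dictionary standing in for a target host are hypothetical names, not the format of any real packaging tool. It only shows the shape described above: metadata (version, author, dependencies) plus standard install and remove methods.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical model of a target host: item name -> installed version.
State = Dict[str, str]


@dataclass(frozen=True)
class Package:
    """A unit of change: metadata plus standard install/remove procedures."""
    name: str
    version: str
    author: str
    depends_on: Tuple[str, ...]
    install: Callable[[State], None]
    remove: Callable[[State], None]


def make_package(name, version, author, depends_on=()):
    """Build a package whose procedures simply record a version on the host."""
    def install(state):
        state[name] = version

    def remove(state):
        state.pop(name, None)

    return Package(name, version, author, tuple(depends_on), install, remove)


host = {}
config = make_package("web-config", "1.4.2", "ops-team", depends_on=["apache"])

config.install(host)          # the change is done...
after_install = dict(host)
config.remove(host)           # ...and predictably undone
after_remove = dict(host)

print(config.name, config.version, config.author, config.depends_on)
print(after_install, after_remove)
```

The point of the sketch is that whoever applies the change only ever calls install or remove; the details of how the change is made stay inside the package.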

What belongs in a package?
One often imagines a release to be a distribution of an entire application. In reality, for online systems, change happens at a much more granular level. One typically observes several kinds of changes released to live environments:

  • Code: Application files executed by the runtime system. This could be compiled objects or interpreted script.
  • Platform: Files that comprise the runtime layer. This is generally server software like Apache, JBoss, Oracle, etc.
  • Content: Non-executed files containing information. These could be media files or static text.
  • Configuration: Files defining the structure and settings of an online service. These could be files for configuring the runtime system or the application.
  • Data: Files containing data or procedures for defining data. These could be database schema dumps, SQL scripts, .csv files, etc.
  • Control: Configuration and procedures consumed by management frameworks.

Within the package-centric paradigm, each type of change noted above is bundled, distributed and executed via a package.


Standard vehicle of change
The package serves as a vehicle of change, both as a unit of transmission between teams and as a standard means to effect the change within the context of the running application service. The package concept helps cope with rising complexity by encapsulating change-specific procedures and underlying tools behind the package’s standard interfaces. In an object-oriented fashion, the package provides the standard interfaces and encapsulation needed to run efficient operations. Teams no longer have to understand the internal structure and procedures in order to apply another team’s change.

Scaling up
Because packages establish a standard interface to distribute and execute change, an IT organization can more readily scale up operations. Knowledge transfer no longer needs to be done between individuals but can instead be explicitly specified within the package. By declaring the essential knowledge and process within the package, teams can more quickly execute change by avoiding repeated inter-personal communication. Because changes are packaged, they can be stored in a repository to facilitate auditing and reporting. Package-centric change also makes it possible to leverage a generalized automation layer to rationally execute changes en masse. Automation is key to gaining the efficiency needed to run any SaaS or e-commerce application.
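As a rough sketch of what a generalized automation layer could do with such packages, the Python below applies a batch of changes in dependency order and backs them all out if one step fails. The apply_all function, the batch layout, and the package names are hypothetical, chosen only for illustration. It also simplifies rollback to calling each package's remove procedure, where a real system would more likely restore the previously installed version.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def apply_all(packages, state):
    """Install packages in dependency order; undo the batch if a step fails.

    `packages` maps a name to (dependencies, install_fn, remove_fn).
    Dependencies that are not part of the batch are assumed to be installed.
    """
    graph = {name: set(deps) for name, (deps, _install, _remove) in packages.items()}
    order = [n for n in TopologicalSorter(graph).static_order() if n in packages]
    applied = []
    try:
        for name in order:
            _deps, install, _remove = packages[name]
            install(state)
            applied.append(name)
    except Exception:
        # Back out in reverse order so dependents are removed first.
        for name in reversed(applied):
            packages[name][2](state)
        raise
    return order


def versioned(name, version):
    """Return (install, remove) procedures that record a version on the host."""
    def install(state):
        state[name] = version

    def remove(state):
        state.pop(name, None)

    return install, remove


def broken_install(state):
    raise RuntimeError("simulated install failure")


batch = {
    "app-code": (["platform", "app-config"], *versioned("app-code", "2.0")),
    "app-config": (["platform"], *versioned("app-config", "2.0")),
    "platform": ([], *versioned("platform", "1.0")),
}

host = {}
order = apply_all(batch, host)
print(order, host)

# Same batch, but the last package fails to install: the host ends up untouched.
bad_batch = dict(batch)
bad_batch["app-code"] = (["platform", "app-config"], broken_install, lambda state: None)
failed_host = {}
try:
    apply_all(bad_batch, failed_host)
except RuntimeError:
    pass
print(failed_host)
```

Because every package exposes the same install and remove interface, the automation layer needs no knowledge of what any individual change contains.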

The human workflow emphasized by traditional release management methodology is an important component of governing the business service’s “factory floor”. The package-centric methodology picks up where the activity planning and coordination leave off by helping technicians distribute and apply change.

In future posts I’ll discuss the various aspects, benefits, and examples of the Package-Centric Release Methodology.

Can your operations survive the “toss test”?



Damon Edwards / 

I’ve heard this a couple of times and it still makes me chuckle. The best part is that it is actually a great litmus test. If anyone knows who came up with this, let me know.

How to tell if you’ve automated enough of your operations…

1. Grab any machine, rip it out of the rack, and throw it out of the window. Can you automatically re-provision your systems and return the affected application services to their previous state in minutes? (no cheating by failing over to a standby cluster or alternate facility)

2. Grab any engineer and throw him or her out of the same window. Can your operations proceed as normal?

EDIT: Thanks to a tip by Kris from, we now know that the inventor of this test was Steve Traugott from

Dev sees the world one way while Ops sees it a different way



Damon Edwards / 

There is an old saying that if all you have is a hammer, everything begins to look like a nail.

Much of the dev2ops problem comes from Dev seeing the world one way while Ops sees it a different way. When it comes to developing, deploying, and supporting what the business owners would call an application, these two world views often spectacularly collide.

To the dev folks, their view of the world is all about the build (and its related dependencies) along with data/content requirements (schema, standing data, catalog/user data, web assets, etc…). They see the application from the inside looking out, fully aware of its internal components and inner workings. The lifecycle they care about begins with business requirements, continues through a set of related builds, and then on to an integrated running service. Their goal is to then promote this service from one dev and test environment to another until it becomes someone else’s problem (that’s usually the ill-fated handoff to ops). Supporting, and in many ways enforcing, this point of view are the day-to-day Dev tools: software version control systems, build tools, IDEs, requirements management tools, etc.

To the ops folks, their view of the world is all about the physical architecture, the network, and the box. Applications? Ops doesn’t want to know about the inner workings of the applications. They see applications from the outside looking in, as just another set of files that are part of the puzzle they manage. Environments? Sets of distinct boxes wherever possible (sometimes divided by datacenter, VLAN, racks, etc…). The lifecycle Ops cares about takes a much different form than the one that Dev cares about. Quality of service, capacity, and availability are the driving factors for Ops. Ops establishes the needed hardware and software platform profiles in a lab setting and then uses those profiles to build and refresh the different environments they control (generally staging, production, and disaster recovery). The preferred Ops toolsets, usually EMS systems (Tivoli, Opsware, OpenView, OpenNMS, BladeLogic, etc.), support and enforce this point of view.

Unfortunately, aligning these two world views is just as difficult as aligning their respective toolsets.

It is all too common that users (and by extension, vendors) from both groups want to force their tools and point of view on the other group. But no matter which direction the management mandate ultimately comes from, the truth lies just beneath the surface. Dev wants nothing to do with the Ops tools and Ops wants nothing to do with the Dev tools. They see little value in the other group’s tooling of choice because they think it will add complexity to their lives and won’t help them complete their core jobs. Because of these dramatically differing points of view, the conversations in the trenches will often revolve around how it’s “the other group’s fault” and why “they just don’t get it”.

Fundamentally, the Dev and Ops groups are at odds because their primary concerns differ. Dev doesn’t want to know about networks and boxes. In most situations Dev doesn’t really care where something is deployed until they get a late night phone call from Ops to debug a mysterious problem. Ops, on the other hand, just wants “operations ready” application releases and is frustrated that it never seems to get them.

Sadly, too much corporate blood has been shed over a collision of differing points of view that are both equally valid.

dev2ops goes wrong: familiar story in the trenches



Alex Honor / 

Being a consultant I get a wide exposure to real life dev2ops experiences at both large and small companies and usually get called in the aftermath of a bad deployment. Businesses of course accomplish dev2ops in their own ways but run into trouble as their applications and operations become more complicated. I always like to start by listening to the key dev2ops individuals to get the story from the person on the ground. These key dev2ops guys recount something like this:

The big application upgrade was scheduled for 8pm but ended up being at 11, due to some last minute bugs found in QA. After the fixes were committed to the version control system, new software builds were created. These builds were a bit different from what had been discussed in the pre-deployment planning meetings. They also included some new configuration rules and relied on some manual tweaks outside of the normal procedure.


Last minute changes were understandable, since the development team was rushing to meet a project deadline and had been working within an aggressive and compressed time frame. Unit development had been moving along steadily but the last stage integrations were problematic and relied on senior developers to work out kinks in order to get the necessary disparate builds working in QA.

By 12am, the production updates were well underway. There were a lot of machines and files to distribute to them, along with various system commands that needed to be run. The normal update process is semi-automated but still very intensive. One must be very careful when performing these updates: one misstep can slow down the whole upgrade process. After all the binary packages were installed, the SQL scripts executed, and the manual tweaks performed, all parts of the application were restarted to bring the whole site onto the new code and configuration. Subsequent testing of the application uncovered that some new features did not work. The cause was not apparent and suggested faulty application code. A key developer diagnosed the issue, finding that the hastily written manual procedures missed a few configuration file changes. The missed edits were applied, the application was restarted again, and this time testing confirmed all new features were working. By 2am, all user load was directed to the general production environment, and everybody involved with the upgrade went home. The night shift in operations would monitor the application and let everyone know if any issues arose.

At 7am, users started logging into the application and began conducting transactions. As more logged in, random errors began to occur. Users began reporting error messages in their web browsers. Operations noticed load spiking on the production machines. The application’s quality of service continued to deteriorate to the point where the problem was escalated to the key individuals involved during the night deployment.

By 7:30am, operations was troubleshooting the problem within the system and network layers while development was logged into several production machines looking at application log files. At 8am, the whole application appeared to degrade into a giant, unresponsive CPU and network consuming monster. System administrators observed full process tables. Network admins discovered unresponsive ports. Developers read off application stack traces. All able hands on deck were requested to join a bridge conference call and visibility of the problem had made its way to the “C” level management ranks. Everyone began asking each other what might be the root cause. Was it a change in the new software or some other possibly unrelated change?

The development group was not the only busy team during the past few weeks. The operations team was making various updates to the infrastructure, some in preparation for the big application update, others for good proactive maintenance. These changes (new firewall rules, operating system updates, security patches, and web server configurations) were scheduled and implemented at different times and none had a negative impact. But each one would now be suspect.

At 9am, system administrators and developers were logged into production machines, furiously hacking configuration files, undoing bits and pieces of the new application, and restarting the application’s server processes. Caution was thrown to the wind, as those attempting to resolve the problem made ever more daring steps to restore service. Application response sporadically improved but the whole system did not become stable.

Finally, at 9:30, management made the dreaded yet inevitable call: back out the new application code and bring the site back to how it worked the day before. Naturally, everyone was reluctant to attempt reverting to the old versions because it’s a tricky and intensive procedure even under normal circumstances. On the other hand, application quality of service was so bad, what would be the real user impact?

Around noon, after two and a half hours of mammoth effort, the whole production environment was running on the old versions and the site was operating normally. Five hours of business service had been interrupted. Postmortem meetings held by management revealed that the planned upgrade was affected by an incompatibility: a newly patched system library applied to production machines differed from the version used on development and QA machines. Further, during the troubleshooting, someone discovered that several machines appeared to have incorrect configuration files. Finally, the analysis highlighted that backing out an application change was extremely difficult, error prone, and took much too long.

I have heard this kind of story at big and small companies, from people who subscribe to formal methodologies and from those who eschew them. The severity and scope of the problems vary: in some cases there is little to no user impact, while in others it’s a total site outage.


Anyone who has experienced a nightmare like this knows it is quite hair-raising, a good cause of burnout, and certainly a test of morale and team play in any IT organization. Reflecting on the chain of events, one realizes that the problem does not just boil down to a lack of project management, administrative policy, or technology. There is, however, an obvious lack of coordination, in the most general sense of the word. How an IT organization aligns itself to directly support the dev2ops process is certainly a factor in avoiding deployment nightmares.
