
Archive for the ‘Q&A’ Category

Video Q&A: Aaron Peterson and Kevin Gray on self-healing infrastructure


Damon Edwards / 

At LISA 2010, I caught up with Aaron Peterson (Opscode) and Kevin Gray (Dyn) after they gave a very interesting presentation/demo called “DevOps Gameday”.

From the title, I think a number of attendees were expecting to see the standard Dev to Ops promotion/deployment of code that is so common to the DevOps discussion. Instead the presenters (Opscode, Zenoss, Dyn Inc.) focused on what happens when you have a failure after the code has been deployed. This demo was about self-healing infrastructure… breaking a multi-node system and having it heal itself.

Of course, this kind of canned demo isn’t all that new in the vendor world. However, what is very interesting about their efforts is they want to capture the best practices required to do it and share the code with the world through their combined project (hosted on GitHub). 

If they fulfill the mission of their open project, it’s exactly the kind of “here is how you can do what the big players do” sharing that is good for our industry. 

 

Videos from DevOps Day 2010 panels!


Damon Edwards / 

InfoQ.com has posted the videos they recorded at DevOps Day USA 2010. You can watch six of the seven panels now on the InfoQ.com site. There was a production problem with the seventh panel (“DevOps outside of WebOps”); if it can be fixed, that panel will be posted as well. InfoQ decided that the lightning talks didn’t fit into their format, so they have sent my co-organizer, Andrew Shafer, the raw video and he’s going to look into posting them himself.

You can also download audio only versions (.mp3)

Here are the links to the 6 panels…

Your mileage may vary: Experiences and lessons learned facing DevOps problems in the IT trenches (even if they weren’t calling it DevOps!). The good, the bad, the surprises, and ideas for the future.
Stefan Apitz – LinkedIn
Ernest Mueller – National Instruments
Dan Nemec – SilverPop
Burzin Engineer – Shopzilla
Kevin Rae – PowerReviews
moderator: Andrew Shafer
http://www.infoq.com/presentations/your-mileage-may-vary
 
 
 
Infrastructure as code: Automation is essential to DevOps. The infrastructure as code concept drives many of today’s cutting edge automation techniques. What is it all about? Where are its limitations?
Theo Schlossnagle – OmniTI
Luke Kanies – Puppet Labs
Adam Jacob – Opscode
Erik Troan – rPath
moderator: Patrick Debois
 
 
 
Changing culture to enable DevOps: Changing tools is easy when compared to changing people and processes. How can we cultivate an organization’s culture to identify and solve DevOps problems?
John Allspaw – Etsy
Lee Thompson – DTO Solutions
Israel Gat – The Agile Executive
Lloyd Taylor – Netelder Associates
moderator: Andrew Shafer
http://www.infoq.com/presentations/changing-culture-to-enable-DevOps
 
 
Does the Cloud need DevOps? Does DevOps need the Cloud?: Examining the role that cloud technologies can play in solving DevOps problems and the role that DevOps solutions can play in getting the most out of cloud technologies.
James Urquhart – Cisco
Adrian Cole – Jclouds
Justin Dean – Shopzilla
Joe Arnold – Cloudscaling
moderator: John Willis
http://www.infoq.com/presentations/does-Cloud-need-DevOps
 
 
 
DevOps requires visibility: monitoring, testing, and performance: Examining the (often overlooked) role of monitoring and testing techniques in solving DevOps problems.
Jyoti Bansal – AppDynamics
Gareth Bowles – Appscio
Matt Ray – Zenoss
Eishay Smith – kaChing
Javier Soltero – SpringSource
moderator: Damon Edwards
http://www.infoq.com/presentations/DevOps-requires-visibility
 
 
Making the business case: We know that solving DevOps problems improves your business operations and improves the bottom line, but how do you explain that to your CEO or CFO? How do you get the executives to buy in and invest in DevOps solutions?
Kurt Milne – IT Process Institute
Jay Lyman – The 451 Group
Rolf Andrew Russell – ThoughtWorks
Jody Mulkey – Shopzilla
moderator: Damon Edwards
http://www.infoq.com/presentations/Making-the-business-case

 

EDIT: The recording for the seventh panel was rescued from technical oblivion and is now live!…

DevOps outside of Web Operations: Much of the public discussion about DevOps focuses on Web Operations. This panel is about taking the lessons of DevOps to other types of IT.
Adam Fletcher – ITA Software
Gene Kim – Tripwire
Michael Stahnke –
James Turnbull – Puppet Labs
John Willis – Opscode
moderator: Patrick Debois
http://www.infoq.com/presentations/DevOps-outside-Web-Operations

DevOps Cafe Podcast now available on iTunes!

Damon Edwards / 

John Willis (johnmwillis.com, @botchagalupe on Twitter, and VP of Services at Opscode) and I (@damonedwards on Twitter and President of DTO Solutions) have started a new podcast. We call it the DevOps Cafe. The name is a take on the popular Cloud Cafe series that John used to do.

Our primary goal for the DevOps Cafe podcast? Explore the emerging fields of DevOps and Agile Operations. While you couldn’t stop John and me from adding our own commentary and experiences if you tried, this will primarily be an interview-driven show. We are going to seek out the people in the trenches who are pioneering these new trends and bring them directly to you on a regular basis.

Our secondary goal? To have some fun.

The first two episodes are now available:

Episode 1 – Guest: Lindsay Holmwood (DevOps Days Down Under organizer; Cucumber-Nagios and Flapjack developer)

Episode 2 – Guest: John Allspaw (VP of Technical Operations at Etsy; Frequent public speaker and author)

 

 

http://devopscafe.org

 

 

Q&A: Continuous Deployment is a reality at kaChing



Lee Thompson / 

Update: KaChing is now called Wealthfront! Their excellent engineering blog is now http://eng.wealthfront.com/

kaChing invited me over to their Palo Alto office last week and I sat down with Pascal-Louis Perez (VP of Engineering & CTO) and Eishay Smith (Director of Engineering) to talk shop.  

I learned about continuous deployment, business immune systems, test randomization, external constraint protection, how code can be thought of as “inventory”, and that the ice cream parlor across the street from kaChing has great cherries! Below is a transcript of our chat.

Note: Tomorrow night (May 26), kaChing will be presenting on Continuous Deployment at SDForum’s Software Architecture & Modeling SIG at LinkedIn’s campus!

 

                  
Pascal-Louis Perez, VP of Engineering & CTO
Eishay Smith, Director of Engineering

 

Lee:
Eric Ries, possibly Continuous Deployment’s biggest advocate, is a tech advisor for kaChing and he’s done a few startups himself! Eric mentions Continuous Deployment (CD) within a concept called the Lean Startup. It’s possibly the best business reason I’ve heard to do continuous deployment, giving you more iteration capability to get the product right in the customer’s perspective.  You’re doing a startup with kaChing and blogged about your CD a few weeks ago.  It’s great to be here today to learn a bit more about your CD implementation!

Pascal:
Yeah, and I think that Eric really made it popular at IMVU, which has probably one of the most famous continuous deployment systems, with a roughly 20-minute, fully automated commit-to-production cycle. Very early on Eric and I started spending some time together.  He was kaChing’s tech advisor almost from the start.  He helped drive the vision of engineering and clearly conveyed why you would care about continuous deployment and why it’s such a natural step from agile planning to agile engineering. I think many people confuse agile planning with agile engineering.  Agile engineering is the ability to ensure your code is correct quickly and achieve very quick iteration, like three minutes or five minutes.  You commit; you know the system is okay.

Lee:
Before you forget about what you just changed. 

Pascal:
Right.  Exactly.  So continuous deployment in that sense is just one more automation step that builds on top of testing practices, continuous testing, continuous build, very organized operational infrastructure, and then continuous deployment is kind of the cherry on top of the cake, but it requires a lot of methodology at every level.

Lee:
Testing, you talked about that a lot in the blog that testing is a major, major part of continuous deployment.

Eishay:
Without fast tests, and a lot of them, continuous deployment is very hard to achieve in practice.

Lee:
And unit versus integration testing both obviously I would think?

Eishay:
Yeah.  We focused mostly on the unit.  We test almost every part of our infrastructure, so integration tests are also important but not as much as unit tests.

Lee:
We were talking offline a little bit about the large number of tests and that you randomize the order in which they are run.

Eishay:
We randomize them after every commit, and we have small commits.
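
As an aside, randomizing test order per commit can be sketched in a few lines. This is a minimal illustration and not kaChing's implementation: the function name, the test names, and the idea of seeding the shuffle from a commit identifier are all assumptions made for the example. Seeding matters because a failure that only shows up in one ordering can then be replayed.

```python
import random


def shuffled_test_order(test_names, seed):
    """Return the tests in a pseudo-random order that the same seed reproduces."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    order = list(test_names)   # copy: leave the caller's list alone
    rng.shuffle(order)
    return order


# Hypothetical test names and a hypothetical commit identifier used as the seed.
tests = ["test_prices", "test_orders", "test_login", "test_reports"]
order_a = shuffled_test_order(tests, seed="commit-abc123")
order_b = shuffled_test_order(tests, seed="commit-abc123")
```

Running with a different seed on every commit tends to surface tests that silently depend on state left behind by an earlier test.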

Pascal:
And we’re trunk stable, so everybody develops on trunk and the software is stable at every point in time.

Lee:
And the unit tests are with the implementation?

Pascal:
Absolutely.

Lee:
And then the integration tests, or is that a separate package or application?

Pascal:
No, it’s all together. Some of the integration tests are part of the continuous build, and I think the other integral part to continuous deployment is what Eric Ries calls the immune system, which is basically automated monitoring and automated checking of the quality of the system at all times. One of the things in our continuous deployment system is we release one machine, we let it bake for some time, then we release two more, we let that bake for some time, and then we release four more.
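
The "one machine, then two more, then four more" pattern Pascal describes can be sketched as follows. This is an illustration under stated assumptions, not kaChing's deployment code: `deploy`, `healthy`, and `bake` are hypothetical callables standing in for the real release step, the immune-system check, and the waiting period, and the host names are made up.

```python
def rollout_batches(hosts, first_batch=1):
    """Split hosts into batches of 1, 2, 4, ... hosts (doubling each step)."""
    batches, size, start = [], first_batch, 0
    while start < len(hosts):
        batches.append(hosts[start:start + size])
        start += size
        size *= 2
    return batches


def staged_rollout(hosts, deploy, healthy, bake):
    """Deploy batch by batch; stop as soon as any released host looks unhealthy.

    deploy, healthy and bake are hypothetical callables supplied by the caller.
    Returns (succeeded, hosts_released_so_far).
    """
    released = []
    for batch in rollout_batches(hosts):
        for host in batch:
            deploy(host)
            released.append(host)
        bake()  # e.g. wait while monitoring collects data on the new version
        if not all(healthy(h) for h in released):
            return False, released  # caller decides how to roll these back
    return True, released


# Seven made-up hosts: released as batches of 1, 2 and 4.
hosts = [f"app{i}" for i in range(1, 8)]
ok, released = staged_rollout(
    hosts,
    deploy=lambda host: None,
    healthy=lambda host: True,
    bake=lambda: None,
)
```

Because old and new versions run side by side between batches, this only works with the forward/backward compatibility discussed in the next exchange.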

Lee:
So the software is compatible with being co-deployed with new version versus old version?

Pascal:
Yeah. Forward/Backward compatibility is a must at every step.

Eishay:
We need to do self-tests of the services. One service starts and has a small self-test. It checks itself. If it fails its own self-test, whether because it’s not configured properly, it can’t talk with its peers, or for any other reason, then it will roll back.

Lee:
So you have a bunch of assert statements that once the software boots it has these things that are critical for it to run and it checks it?

Eishay:
Right.

Pascal:
For instance, the portfolio management system starts. Can it get prices for Google, Apple, and IBM? If it can’t, it shouldn’t be out there.
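
A startup self-test of the kind Eishay and Pascal describe might look roughly like this. It is a sketch only: the function names, the `rollback` callback, and the price lookup are hypothetical, and the quotes are placeholders rather than market data. The three ticker symbols correspond to the companies Pascal names.

```python
class SelfTestFailed(Exception):
    """Raised when a freshly started service fails one of its own checks."""


def run_self_test(checks):
    """Run every (name, check) pair; raise if any check returns False."""
    failed = [name for name, check in checks if not check()]
    if failed:
        raise SelfTestFailed(", ".join(failed))


def start_service(checks, rollback):
    """Return True if the service may stay up; otherwise call rollback(reason)."""
    try:
        run_self_test(checks)
    except SelfTestFailed as exc:
        rollback(str(exc))
        return False
    return True


def price_checks(get_price, symbols=("GOOG", "AAPL", "IBM")):
    """Build one check per symbol: can the service fetch a price for it?"""
    return [(f"price:{s}", lambda s=s: get_price(s) is not None) for s in symbols]


# Placeholder quotes, not real market data.
fake_prices = {"GOOG": 1.0, "AAPL": 1.0, "IBM": 1.0}
reasons = []
started = start_service(price_checks(fake_prices.get), rollback=reasons.append)
```

If any lookup comes back empty, `start_service` returns False and hands the failing check names to `rollback`, which is where a real system would take the new version out of rotation.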

Lee:
Yeah. If IBM changed symbol, which it hasn’t since it’s been public, but if it did then you’d have to go in there and change that assertion, but that you would know that almost immediately.

Pascal:
Yeah. We actually have – digressing a bit into the realm of financial engineering – our own symbology numbering scheme to avoid those kinds of dependencies on external references. Our own instrument ID. It’s kaChing’s instrument ID. Everything is referenced that way. At a high level, we really try from the start to protect ourselves from external constraints and having external conversions between the external domains and our view of the world at the boundary, at the outsets of the system so that within our system everything can be as consistent and as modern as we want. I’ve seen many systems where external constraints were impacting the core and making it very hard to iterate.
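
The idea of converting external symbols to an internal instrument ID only at the system boundary can be illustrated like this. The class and method names are invented for the sketch, the numeric ID is arbitrary, and "IBMX" is a made-up symbol used purely to show a rename; nothing here reflects kaChing's actual scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstrumentId:
    """Internal identifier; the core of the system only ever sees this."""
    value: int


class SymbolBoundary:
    """Translates between external ticker symbols and internal IDs at the edge."""

    def __init__(self):
        self._by_symbol = {}
        self._by_id = {}

    def register(self, instrument_id, symbol):
        """Map an ID to a symbol, replacing any earlier symbol for that ID."""
        old_symbol = self._by_id.get(instrument_id)
        if old_symbol is not None:
            del self._by_symbol[old_symbol]
        self._by_id[instrument_id] = symbol
        self._by_symbol[symbol] = instrument_id

    def to_internal(self, symbol):
        return self._by_symbol[symbol]

    def to_external(self, instrument_id):
        return self._by_id[instrument_id]


boundary = SymbolBoundary()
ibm = InstrumentId(1)                # arbitrary internal ID for the example
boundary.register(ibm, "IBM")
boundary.register(ibm, "IBMX")       # hypothetical symbol change
```

After the rename, only the mapping table changed; anything inside the system that stored `InstrumentId(1)` is unaffected.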

Lee:
Which makes for a tightly coupled system. So you’re trying to make the coupling looser between computing entities.

Pascal:
Yeah.

Lee:
There’s just a lot of stuff we’ve talked about just in three minutes. Two simple words: continuous deployment, and a lot of investment in technical capabilities underneath that. A lot of investment in build automation. A lot of investment in how the application is architected such that it can be co-deployed and co-resident with multiple versions. You have integration testing, unit testing, scale and capacity in running those tests a lot and randomizing, and then you went into monitoring, right? So I mean to do a continuous deployment system in an organization that is pretty large, that would probably be a pretty big transform, but since you guys are doing it right from the get-go, right from the start…

Pascal:
I think it would be extremely hard for an architecture, for a company culture that is not driven by test-driven development to then shift gears and decide “In one year we’re going to do continuous deployment. We need to hit all those milestones to get there.” It’s a very difficult company culture to shift. Everything in our engineering processes is geared towards having a stable system at every version, having multiple versions in production, having no down time, database upgrades that are always forward and backward compatible. It’s really ingrained in many things, and to be able to help engineers work at every level, to be able to achieve all the different little things that seamless continuous deployment requires is quite hard, which is why obviously doing that from the start is much easier. I think it would be very hard to get a large project and shift it to doing CD.

Eishay:
TDD, or test-driven development, is the foundation. It’s at the core. If continuous deployment is the cherry on the top, then TDD is the base. Without that you just can’t…

Pascal:
And something that one of our engineers, John Hitchings, commented about was our view of testing. He was saying, “At my previous company I would be doing testing to make sure I didn’t make any blatant mistakes, but at kaChing I do testing to make sure that I’ve really documented all of my feature and that I’m protecting my feature from people who are coming next week and changing it.”

Lee:
Is that behavior driven development?

Pascal:
It’s really writing specs as tests. If I write a feature and then I go on vacation, anybody in the company should be able to go and change it. I need to be able to document it in code sufficiently well to enable my peers to come and change it in a safe way. The tests aren’t there to help me. The tests are there to protect my little silo of code. It’s a very different approach on testing.

Eishay:
We don’t have any dark code areas in our system. Anyone can get into our system and do major refactoring with the confidence that if the tests pass then it’s okay. Of course he has to test anything he adds, but I can be pretty confident that I can move classes from one place to the other or change behaviors and if the tests pass it’s good. At other companies I used to work at, there were a lot of places where the developer who wrote a piece of code had left the company or was working on something else, and that part of the code was locked. Nobody can touch it, and it’s very scary.

Lee:
Fragile. Yeah. Let’s talk about confidence. So I walked in, it looked like you kicked off a build and that was running. Do you think it’s probably already in production right now while we’re sitting in the conference room talking about it?

Eishay:
Oh yeah. We say five minutes, but in practice it’s probably more like four minutes from when I commit something until it’s out there. It could be very typical that I do three or four commits, change 20 lines of code here and there, and oh, it’s in production. It’s not uncommon that we have 20, 30 releases of services a day.

Lee:
That’s incredible. I think most organizations are happy with a week or every other week and you’re doing it 20 times a day. That’s something to really be proud of, and by the way, I was really glad to see David Fortunato’s blog. There’s a lot to be proud of. That’s just a great post.

Eishay:
Thank you.

Lee:
It’s really good.

Pascal:
We’re talking a lot about the technical aspects, which are a lot of fun, but at the business level we’ve been able to launch kaChing Pro from idea to having clients using it in one month, and kaChing Pro is essentially Google Analytics meets Salesforce for investment managers. It has full stats. It allows the manager to have a kind of CRM system to manage all of his clientele, a private storefront to onboard new clients, and basically a little bank interface where their customers can come in, log in, see their brokerage statements, and trade. Being able to turn the gears so quickly on a product is really key.

Lee:
Which is the key part of the lean startup idea that Eric Ries was bringing up. The business value that the continuous deployment does, that was the best written documentation of why to do this. I’ve been thinking about continuous deployment type principles from a technical perspective to help engineers do their job. Lean Startup documents why to do CD from a business perspective. I’m glad you brought that up. That’s a really good point.

Eishay:
For instance, one of our customers contacted Jonathan, one of our business guys, and told us that something didn’t make sense in our workflow and it felt like we needed to change it now. He called the customer back after 15 minutes or half an hour and asked them how it looks right now. We can immediately change things and deploy them and check how the market reacts or the customers react.

Lee:
The way I’ve usually seen this done is that the customer has a discussion with a business person and a developer, and the developer runs a prototype next version and shows it to the customer and the customer likes it and then it takes four weeks or longer to get it into the production system. The way you’re doing it where it’s checked in, looks good to you, you put it out there and the customer is going on to a live site and seeing it.

Eishay:
Sometimes we have “experiments”, which is a term coined by Google. We’ll test full features with a select group, just like you would A/B test parts of the site… except it is for full features!

Pascal:
I’ll give you an example. I think one of the misunderstandings about pushing code to production is that pushing code is equivalent to a release, when really those two things are completely disconnected. Pushing code is, well, I have inventory in my Subversion. I need to get that inventory out in the store in front of customers, versus a release, which is unveiling a new aisle. The aisle can be there in the store, it’s just not going to yield.

So what we’ll very often do is we’ll have the next generation of our website but you need to have a little code to be able to see it, or your user needs to be put in a specific experiment and we show that website to you, very similar to Google’s homepage being shown to 2% of its user base. You basically have the two versions of the website running at the same time and just showing it selectively. So we can, before a PR launch, have only the reporters look at the live website on KaChing.com with all of the new hype, but everybody else doesn’t see it and then when the marketing person is happy they just flip the switch and it goes public.

Lee:
Is that on separate servers and separate application or is it the same server?

Pascal:
No, same server. It just selectively decides: user 23, you’re in that experiment. Here, I’ll show you that website.
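
The "same server, selectively shown" mechanism can be sketched as a per-user experiment check. This is an assumed design for illustration, not kaChing's code: the function names, the experiment name, the hash-based bucketing, and the allowlist parameter are all hypothetical. User 23 is borrowed from Pascal's example.

```python
import hashlib


def in_experiment(user_id, experiment, percent, allowlist=frozenset()):
    """Decide whether a user sees an experiment.

    Users on the allowlist always see it; everyone else is placed in a stable
    bucket from 0 to 99 derived from a hash of the experiment name and user ID.
    """
    if user_id in allowlist:
        return True
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


def homepage_for(user_id, reporters, percent=0):
    """Pick which version of the site to render for this request."""
    if in_experiment(user_id, "new-homepage", percent, allowlist=reporters):
        return "new"
    return "current"


reporters = {23}                       # user 23 is in the experiment
before_launch = homepage_for(23, reporters)
everyone_else = homepage_for(24, reporters)
after_switch = homepage_for(24, reporters, percent=100)  # "flip the switch"
```

Hashing keeps each user in the same bucket across requests, so raising `percent` from 0 to 2 to 100 widens the audience without anyone flipping back and forth between versions.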

Lee:
There’s a book called Visible Ops and it says that 80 percent of your problems in operations are changes. This can cause a conflict between your operations staff and your development staff because the change is met with great suspicion. Maybe resistance, but it’s definitely going to be a point of scrutiny. Testing is probably the biggest thing that gives you the confidence that the change isn’t going to kill you. And if it did fail, you are missing a test.

Eishay:
And we’re not this type of company. We don’t have ops operations.

Lee:
I would argue you’re ops, but yeah. [Laughs]

Eishay:
We also don’t have a standard QA. Since we deploy all the time, there’s no person that looks at the code after every point.

Lee:
I think you built it – the console that you showed me functions as QA. QA runs the test and signs off on the results and you’re basically automating that function.

Pascal:
QA is classically associated with two functions. There’s making sure you’ve built per spec, specs being human readable, only humans can do that part. But then once the spec is fully understood and disambiguated then that part can be automated. I think many people kind of mix the two and don’t automate QA. You should clearly separate the two: attaining a spec that is fully understood, and then making sure checking against it is fully automated. Then you can have humans do the interesting thing, like the product manager saying, “This flow does look like what I had in my mind”, and that viewpoint was not encoded into a test. So this part can never be automated, because it comes from a human brain and needs to be understood by a human brain.

Lee:
What I think I’ve learned today is in order to do continuous release and continuous deployment you have to have continuous testing. It sounds to me like if you don’t have that you’re not gonna have the confidence.

Eishay:
It drives a fundamental culture of thinking, engineering culture; which means that the engineer who writes the code, he knows that there’s no second tier of QA persons who will check that the small feature change is now good. He has to know that he must write all the tests himself to fully cover any feature change in places. At places that have formal QA, I’ve seen people change something they didn’t fully test because they know someone will later on have a second look at this feature or this change.

The first time a feature is released you need to have a person, probably the product manager, look at it and see if it’s right, but afterwards they’ll never look at it again ‘cause they’ll assume the engineer wrote the tests. Of course the software changes all the time because we refactor, we extract services to another machine. We do all sorts of stuff, but having no other person to do the QA makes us as the engineers do better testing, and better tests.

Q&A: Ernest Mueller on bringing Agile to operations



Damon Edwards / 

 

“I say this as somebody who about 15 years ago chose system administration over development.  But system administration and system administrators have allowed themselves to lag in maturity behind what the state of the art is. These new technologies are finally causing us to be held to account to modernize the way we do things.  And I think that’s a welcome and healthy challenge.”

-Ernest Mueller

 

 

I met Ernest Mueller back at OpsCamp Austin and have been following his blog ever since. As a Web Systems Architect at National Instruments, Ernest has had some interesting experiences to share. Like so many of us, he’s been “trying to solve DevOps problems long before the word DevOps came around”!

Ernest was kind enough to submit himself to an interview for our readers.  We talked at length about his experiences bringing Agile principles to the operations side of a traditional large enterprise IT environment. Below are the highlights of that conversation. I hope you get as much out of it as I did.

 

Damon:
What are the circumstances that led you down the path of bringing the Agile principles to your operations work?

Ernest:
I’ve been at National Instruments for, oh, seven years now.  Initially, I was working on and leading our Web systems team that handled the systems side of our Web site.  And over time, it became clear that we needed to have more of a hand in the genesis of the programs that were going out onto the Web site, to put in those sorts of operational concerns, like reliability and performance.

So we kind of turned into a team that was an advocate for that, and started working with the development teams pretty early in their lifecycle in order to make sure that performance and availability and security and all those concerns were being designed into the system as an overall whole.  And as those teams started to move towards using Agile methodologies more and more, there started to become a little bit of a disjoint.

Prior to that when they had been using Waterfall, we aligned with them and we developed what we called the Systems Development Framework, which was kind of systems equivalent of a software development lifecycle to help the developers understand what it is we needed to do along with them.  And so we got to a point where it seemed like that was going very well.  And then Agile started coming in and it started throwing us off a little more.

And the infrastructure teams – ours and others – more and more started to become the bottleneck to getting systems out the door.  Some of that was natural because we have 40 developers and a team of five systems engineers, right?  But some of that was because, overall, the infrastructure teams’ cadence was built around very long-term Waterfall.

As technologies like virtualization and Cloud Computing started to make themselves available, we started to see more why that was because once you’re able to provision your machines in that sort of way, a huge – a huge long pole that was usually on the critical path of any project started falling out because, best case, if you wanted a server procured and built and given to you, you’re talking six weeks lead time.

And so, to an extent there was always – I hate to say an excuse, but a somewhat meaningful reason for not being able to operate on that same sort of quick cycle cadence that Agile Development works along.  So once those technologies started coming down and we started to see, “Hey, we actually can start doing that,” we started to try it out and saw the benefits of working with the developers, quote, “their way” along the same cadence.

 

Damon:
When you first heard about these Agile ideas – that developers were moving towards these short sprints and iterative cycles – were you skeptical or concerned that there was going to be a mismatch?  What were your doubts?

Ernest:
Absolutely, it was very concerning because when folks started taking up Agile there was less upfront planning, design and architecture being done.  So things that they needed out of the systems team that did require a large amount of lead time wouldn’t get done appropriately.  They often didn’t figure out until their final iteration that, “Oh, we need some sort of major systems change.”  We would always get into these crunches where they decided, “Oh, we need a Jabber server,” and they decided it two weeks before they were supposed to go into test with their final version. It was an unpleasant experience for us because we felt like we had built up this process that had gotten us very well aligned, from just a relationship point of view, between development and operations with the previous model.  And this was coming in and “messing that all up.”

Initially there were just infrastructure realities that meant you couldn’t keep pace with that.  Or, well, “couldn’t” is a strong word.  Historically, automation has always been an afterthought for infrastructure people.  You build everything and you get it running.  And then, if you can, go back and retrofit automation onto it, in your copious spare time. Unless you’re faced with some sort of huge thousand server scale deal because you’re working for one of the massive shops.

But everyplace else where you’re dealing with 5, 10, 20 servers at a time, it was always seen as a luxury, to an extent because of the historical lack of automation and tools but also to an extent just because you know purchasing and putting in hardware and stuff like that has a long time lag associated with it.  We initially weren’t able to keep up with the Agile iterations and not only the projects, but the relationships among the teams suffered somewhat.

Even once we started to try to get on the agile path, it was very foreign to the other infrastructure teams; even things like using revision control, creating tests for your own systems, and similar were considered “apps things” and were odd and unfamiliar.

 

Damon:
So how did you approach and overcome that skepticism and unfamiliarity that the rest of the team had toward becoming more Agile?

Ernest:
There’s two ways.  One was evangelism.  The other way – I mean I [laughter] hesitate to trumpet this as the right way.  But mostly it was to spin off a new team dedicated to the concept, and use the technologies like Cloud and virtualization to take other teams out of the loop when we could.

These new products that we’re working on right now, we’ve, essentially, made the strategic decision that since we’re using Cloud Computing, that all the “hardware procurement and system administration stuff” is something that we can do in a smaller team on a product-by-product basis without relying on the traditional organization until they’re able to move themselves to where they can take advantage of some of these new concepts.

And they’re working on that.  There’s people, internally, eyeballing ITIL and, recently, a bunch of us bought the Visible Ops book and are having a little book club on it to kinda get people moving along that path.  But we had to incubate, essentially, a smaller effort that’s a schism from the traditional organization in order to really implement it.

 

Damon:
You’ve mentioned Agile, ITIL, and Visible Ops.  How do you see those ideas aligning? Are they compatible?

Ernest:
I know some people decry ITIL and see process as a hindrance to agility instead of an asset. I think it’s a problem very similar to that which development teams have faced.  We actually just went through one of those Microsoft Application Lifecycle Management Assessments, where they talk with all the developers about the build processes and all of that.  And it ends up being the same kind of discussion.  So things like using version control and having coordinated builds and all this.  They are all a hindrance to a single person’s velocity, right?  Why do it if I don’t need it?

You know if I’m just so cool that I never need revision control for whatever reason, [laughter] then having it doesn’t benefit me specifically.  But I think developers have gotten more used to the fact that, “Hey, having these things actually does help us in the long term, from the group level, come out with a better product.”

And operations folks for a long time have seen the value of process because they value the stability.  They’re the ones that get dinged on audits and things like that, so operation folks have seen the benefits of process, in general.  So when they talk about ITIL, there’s occasional people that grump about, “Well, that will just be more slow process,” but realistically, they’re all into process. [laughter] It all just depends what sort and how mature it is.

How to bridge ITIL to Agile [pause] — What the Visible Ops book has tried to do in terms of cutting ITIL down to priorities — What are the most important things that you need to do?  Do those first, and do more of those.  And what are the should do’s after that?  What are the nice to haves after that?  A lot of times, in operations, we can map out this huge map of every safeguard that any system we’ve worked on has ever had.  And we count that as being the blueprint for the next successful system.  But that’s not realistic.

It’s similar to when you develop features.  You have to be somewhat ruthless about prioritizing what’s really important and deprioritizing the things that aren’t important so that you can deliver on time.  And if you end up having enough time and effort to do all that stuff, that’s great.

But if you work under the iterative model and focus first on the important stuff and finish it, and then on the secondary stuff and finish it, and then on the tertiary stuff, then you get the same benefit the developers are getting out of Agile, which is the ability to put a fork in something and call it done, based on time constraints and knowing that you’ve done the best that you can with that time.

 

Damon:
You mentioned the idea of bringing testing to operations and how that’s a bit of a culture shift away from how operations traditionally worked. How did you overcome that? What was the path you took to improve testing in operations?

Ernest:
The first time where it really made itself clear to me was during a project where we were conducting a core network upgrade on our Austin campus and there were a lot of changes going along with that.  I got tapped to project manage the release.

We had this long and complex plan where the network people would bring the network up, and then the storage team would bring the storage up, and the Unix administrators would bring all the core Unix servers up.  It became clear to me that nobody was actually planning on doing any verification of whether their things were working right.

We’d done some dry runs and, for example, the Unix admins would boot all their servers and wander off, and half of their NFS mounts would hang.  And they would say, “Well, I’m sure, three hours later, once the applications start running, the developers testing those will see problems because of it, and then I’ll find out that my NFS mounts are hanging, right?”  [laughter].

And always being eager to not disrupt the people up the chain, if at all possible, that answer aggrieved me.  I started talking to them about it and that’s when it first became clear to me that the same unit test and integration test concerns are just as applicable to infrastructure folks as to application folks.  For that release, we quickly implemented a bunch of tests to give us some kind of idea what the state of the systems was – ping sweeps from all boxes to all boxes to determine if systems or subnets can’t see each other, and NFS mount check scripts distributed via cfengine to verify that mounts weren’t hanging.  The resulting release process and tests have been reused for every network release since because of how well and quickly they detect problems.
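
To make the two checks mentioned above concrete, here is a minimal Python sketch of a ping sweep and an NFS mount check. It is an illustration only, not the scripts Ernest's team wrote: the function names, the timeouts, and the Linux-style `ping -c 1 -W` flags are all assumptions.

```python
import subprocess
import sys
from typing import Callable, Iterable, List


def ping(host: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo with the system `ping` (Linux-style flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def ping_sweep(hosts: Iterable[str],
               probe: Callable[[str], bool] = ping) -> List[str]:
    """Return the hosts that did NOT answer the probe."""
    return [host for host in hosts if not probe(host)]


def mount_responds(path: str, timeout_s: float = 5.0) -> bool:
    """Return True if `path` can be stat'ed within the timeout.

    The stat runs in a child process because a hung NFS mount can block
    the caller indefinitely; the parent only waits `timeout_s` seconds.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort: a child stuck on a hard mount may ignore the kill.
        proc.kill()
        return False
```

Run from every box against the full host list, `ping_sweep` gives the "all boxes to all boxes" view; `mount_responds` would be called once per expected mount point. The `probe` parameter exists so the sweep logic can be exercised without touching the network.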

It’s difficult when there’s not a – or at least we didn’t know of anybody else who really had done that.  You know, if you go out there and Google, “What’s an infrastructure unit test look like,” you don’t get a lot of answers.  [laughter].  So we were, to an extent, experimenting trying to figure out, well, what is a meaningful unit test?  If I build a Tomcat server, what does it mean to have a unit test that I can execute on that even before there’s “business applications” applied to it?

I would say we’re still in the process of figuring that out.  We’re trying to build our architecture so that there are places for those unit tests, ideally, both when we build servers, but we’d like them to be a reasonable part of ongoing monitoring as well.
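
One possible answer to the Tomcat question, sketched in Python under stated assumptions: before any application is deployed, a freshly built server should at least accept TCP connections on its HTTP connector port (8080 is Tomcat's default) and answer a plain GET without a 5xx error. The function names and the pass criteria are hypothetical choices for illustration, not something described in the interview.

```python
import http.client
import socket
from typing import List, Optional


def port_open(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def http_status(host: str, port: int, path: str = "/",
                timeout_s: float = 3.0) -> Optional[int]:
    """Return the HTTP status for GET path, or None if no valid reply."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout_s)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        return None
    finally:
        conn.close()


def check_app_server(host: str, port: int = 8080) -> List[str]:
    """Return a list of failure messages; an empty list means the box passes."""
    target = f"{host}:{port}"
    if not port_open(host, port):
        return [f"{target} is not accepting TCP connections"]
    status = http_status(host, port)
    if status is None:
        return [f"{target} did not answer an HTTP GET /"]
    if status >= 500:
        return [f"{target} answered GET / with server error {status}"]
    return []
```

Because the check returns messages rather than raising, the same function could be called once at build time and again on a schedule, which matches the goal of using these tests for ongoing monitoring too.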

 

Damon:
How are you writing those “unit tests for operations”? Are they automated? If so, are you using scripts or some sort of test harness or monitoring tool?

Ernest:
So right now the tests are only scripted.  We would be very interested in figuring out if there’s a test harness that we could use that would allow us to do that.  You can kind of retrofit existing monitoring tools, like a Nagios or whatever to run scripts and kinda call it a test harness.  But of course it’s not really purpose-built for that.
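
As a rough picture of what that retrofit could look like, the Python sketch below wraps a set of scripted checks so their result follows the Nagios plugin convention: one status line on stdout and an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The wrapper name, the check labels, and the choice to report every failure as CRITICAL are assumptions made for illustration.

```python
from typing import Callable, Iterable, Tuple

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

Check = Tuple[str, Callable[[], bool]]


def run_as_plugin(name: str, checks: Iterable[Check]) -> Tuple[int, str]:
    """Run (label, check) pairs and return (exit_code, status_line).

    Each check is a zero-argument callable returning True on success.
    A check that raises is reported as UNKNOWN, since the test itself
    is broken and says nothing about the system under test.
    """
    failed = []
    try:
        for label, check in checks:
            if not check():
                failed.append(label)
    except Exception as exc:
        return UNKNOWN, f"{name} UNKNOWN - check raised {exc!r}"
    if failed:
        return CRITICAL, f"{name} CRITICAL - failed: {', '.join(failed)}"
    return OK, f"{name} OK - all checks passed"
```

A script used as the monitoring command would print the returned status line and pass the code to `sys.exit`. This works, but as noted above the monitoring tool only sees pass or fail per script, so it is still not a purpose-built test harness.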

 

Damon:
Any tips for people just starting down the DevOps or Agile Operations path?

Ernest:
Well, I would say the first thing is try to understand what it is the developers do and understand why they do it.  Not just because they’re your customer, but because a lot of those development best practices are things you need to be doing too.  The second thing I would say is try to find a small prototype or skunkworks project where you can implement those things to prove it out.

It’s nearly impossible to get an entire IT department to suddenly change over to a new methodology just because they all see a PowerPoint presentation and think it’s gonna be better, right?  That’s just not the way the world works.  But you can take a separate initiative and try to run it according to those principles, and let people see the difference that it makes.  And then expand it back out from there.  I think that’s the only way that we’re finding that we could be successful at it here.

 

Damon:
Why are DevOps and Agile Operations becoming a hot topic now?

Ernest:
I would say that one of the reasons this is becoming a much more pervasively understood topic is virtualization and Cloud Computing.  Now that provisioning can happen much more quickly on the infrastructure side, it’s serving as a wake-up call and people are saying, “Well, why isn’t it?”

When we implemented virtualization, we got a big VMware farm put in.  And one of the things that I had hoped was that that six-week lead time of me getting a server was gonna go down because of course, “in VMware you just click and you make yourself a new one,” right?  Well, the reality was you would still put in a request for a new server, and it would still have to go through procurement because, you know, somebody needed to buy the Red Hat license or whatever.

And then you’d file a request, and the VMware team would get around to provisioning the VM, and then you’d file another request and the Unix or the Windows administration team would get around to provisioning an OS on it.  And it still took about a month, right, for something that when the sales guys do the VMware demo, takes 15 minutes.  And at that point, because there wasn’t the kind of excuse of “we had to buy hardware” left, it became a lot more clear that no, the problem is our processes.  The problem is that we’re not valuing Agility over a lot of these other things.

And in general, we infrastructure teams specifically organized ourselves almost to be antithetical to agility.  It’s all about reliability and cost efficiency, which are also laudable goals, but you can’t sacrifice agility at their altar (and don’t have to).  And I think that’s what a lot of people are starting to see.  They get this new technology in their hand, and they’re like, “Oh, okay, if I’m really gonna dynamically scale servers, I can’t do automation later.  I can’t do things the manual way and then, eventually, get around to doing it the right way.  I have to consider doing it the right way out of the gate”.

I say this as somebody who about 15 years ago chose system administration over development.  But system administration and system administrators have allowed themselves to lag in maturity behind what the state of the art is. These new technologies are finally causing us to be held to account to modernize the way we do things.  And I think that’s a welcome and healthy challenge.

 

Videos: Jesse Robbins, Ezra Zygmuntowicz, Colleen Smith at Cloud Connect 2010

Damon Edwards / 

Here’s another round of “3 Questions” interviews that I shot at Cloud Connect 2010 in San Jose, CA on March 17, 2010.
 
Jesse Robbins (Opscode / Chef), Ezra Zygmuntowicz (Engine Yard), and Colleen Smith (Symantec) were asked:
1. What brought you to Cloud Connect?
2. What aspect of the Cloud excites you the most these days?
3. Wildcard question!…
 
Jesse Robbins is the CEO of Opscode and one of the creators of Chef.
Wildcard question: How does “infrastructure as code” unlock the promise of the Cloud?

 

Ezra Zygmuntowicz is a Senior Fellow and Co-Founder of Engine Yard.
Wildcard question: What tooling changes are needed to make DevOps a reality?  

 

Colleen Smith is an Information Technology Architect at Symantec.
Wildcard question: How will Clouds impact the culture of internal enterprise IT?

 

Thanks to all for playing along!
