October 17th, 2013
Velocity NY, the inaugural New York instantiation of the popular Velocity conference, took place from 2013-10-14 -- 2013-10-16. I attended on the two conference days, and here are my notes.
The conference program is here.
All available speaker slides are here.
All keynotes and ignite talks are here.
The conference opened with a number of keynotes. The sponsored keynotes were about as bland and boring as you might expect, but the others made up for them.
First up, Fred Wilson (@fredwilson) gave eight rules for "managing both computer systems and people systems" (you can read his blog entry listing all rules). He talked a bit about Twitter and its integration with Summize, and how he observed Twitter's stability improve as a result of ongoing measuring and re-architecting of subsystems. He then drew the management analogy: implement employee feedback, measure the results, and re-architect the people system.
One comment in his keynote triggered at least some critical responses: Fred asserted that it's easier to train a technical person to be a good manager (his example: Chad Dickerson, who moved from being CTO to becoming CEO of Etsy) than to train a manager to be (sufficiently) technical.
(I personally happen to disagree in that either approach is fraught with peril or dangerous assumptions. Both professional fields require a certain talent, and for somebody to excel in either, one needs more than just "training".)
Watch his keynote here.
"I'd rather have a hole in my organization than an asshole."
"Blameless post-mortems are the key to learning from a tech-ops crisis."
Richard Cook's keynote "Resilience In Complex Adaptive Systems: Operating At The Edge Of Failure" was what I consider to be one of the best talks of the conference. This does not come as a surprise for those of you who have read his paper "How Complex Systems Fail" or who have seen his talk from last year's Velocity conference.
He notes that all of our systems are always operating at or near capacity, and describes how a system's operating point moves in between the boundaries of economic failure, acceptable performance, and acceptable workload. Each of these boundaries exerts pressure on the operating point, causing it to exhibit a sort of Brownian motion.
Most importantly, he points out that we should not be surprised when a system fails, but rather we should continuously be surprised that it works at all. We only figure out where the accident boundary is when we cross it; then we move back; then we cross it again until we're comfortable operating outside of that boundary, thus moving the margin, leading to normalization of deviance, redefining what is acceptable.
Watch his keynote here.
"I'm kind of an operator; I'm an anesthesiologist."
"The thing that amazes you is not that your system goes down sometimes. It's that it's up at all!"
"What we're interested in is why our systems sometimes don't fail."
"If you have to have a meeting to discuss an important topic you know that you've failed."
Verisign's sponsored keynote "Don't Compromise Security for Performance" was unfortunately canceled. I would have liked to see more security-focused content.
While there was nothing particularly new or groundbreaking here, it was interesting to hear how a small online community dealt with the stress of Hurricane Sandy. The issue of data center redundancy and how to recover from losing access to your hardware certainly makes for a good war story. One of the perhaps more interesting issues they ran into was that having their DNS TTLs set to a full day didn't help in standing up a failover solution (in their DBA's kitchen, no less).
It's also noteworthy that sometimes during an outage one may have to just step back and accept that one cannot do anything. Instead of frantically trying to chase dead ends, perhaps get some sleep.
Slides are here.
For me, the real lesson was that no matter what, things can always get worse:
(And if, for whatever reason, you're interested in my Sandy story, you can find it here.)
Alexis Lê-Quôc (@alq) gave a talk entitled "Alerting: More Signal, Less Noise, Less Pain", which dealt with the rather common problem of drowning in unimportant alerts or pages. His opening pitch: "Who here carries a pager? Who was paged today? Who was paged for something 'routine' or unimportant?"
His approach to bringing sanity to your pager involved categorizing useless alerts into the three categories "too frequently", "odd hours", and "always the same". He would then group the alerts by signature, rank by occurrence/frequency and graph them. This yielded a quantified image of alerts, allowing him to better adjust just what exactly should require a (human) response.
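A rough sketch of what that grouping and ranking might look like, in Python. The alert records, their field layout, the choice of the check name as the "signature", and the 08:00-20:00 definition of normal hours are all my own assumptions for illustration; none of this is taken from the scripts published with the talk.

```python
from collections import Counter
from datetime import datetime

# Hypothetical alert records: (ISO timestamp, host, check name).
# Made-up sample data, not data from the talk.
ALERTS = [
    ("2013-10-14T03:12:00", "web01", "disk_usage"),
    ("2013-10-14T03:40:00", "web02", "disk_usage"),
    ("2013-10-14T11:05:00", "web01", "disk_usage"),
    ("2013-10-14T14:30:00", "db01", "load_average"),
    ("2013-10-15T02:15:00", "db01", "load_average"),
    ("2013-10-15T16:00:00", "web03", "cert_expiry"),
]


def signature(alert):
    """Collapse an alert to its signature (assumed here: just the check name)."""
    _timestamp, _host, check = alert
    return check


def is_odd_hour(timestamp, start=8, end=20):
    """True if the alert fired outside the assumed start..end working hours."""
    hour = datetime.fromisoformat(timestamp).hour
    return not (start <= hour < end)


def rank_by_signature(alerts):
    """Return (signature, count) pairs, most frequent first."""
    return Counter(signature(a) for a in alerts).most_common()


def odd_hour_counts(alerts):
    """Count, per signature, the alerts that fired at odd hours."""
    return Counter(signature(a) for a in alerts if is_odd_hour(a[0]))


if __name__ == "__main__":
    for sig, count in rank_by_signature(ALERTS):
        print(f"{sig:15} total={count} odd_hours={odd_hour_counts(ALERTS)[sig]}")
```

The ranked counts are the kind of thing one would then graph; a signature that pages often, at odd hours, and always in the same way is a likely candidate for tuning or removal.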
Slides, scripts, and data from his talk are here.
One take-away: use of R to process and visualize seems to be increasingly frequent. Perhaps worth getting comfortable with.
Steven Murawski (@stevenmurawski), Nick Craver (@Nick_Craver), and George Beech (@GABeech) from StackExchange presented a number of tools they use in their talk "Building For Operations". I'm somewhat fascinated by StackExchange, not so much because of the sites they run, but rather because they are a big Windows shop and thus entirely foreign to me. Their tooling tends to reflect this environment which seems simultaneously obvious and yet fascinating to me.
In this talk, they presented "opserver", their monitoring system and dashboard. (I was surprised to find it running on the public internet.) This seems to provide a wealth of information, though it also exhibits a strong bias toward implementing your own solution over adopting anything already available. (This concept is not very foreign to me, of course.)
Slides are here.
StackExchange "opserver": https://github.com/opserver/Opserver
Dave Zwieback (@mindweather) spoke on "Conditions Of Failure: Building Antifragile Systems And Organizations". Opening his talk, the AV system failed, and he was left starting out without his slides. In my opinion this actually made his talk more compelling, as I've come to consider presentation slides increasingly as mere distractions, making it actually harder for people to listen. (These problems were eventually overcome.)
Dave presented the idea of "antifragility" -- the property of a system that actually benefits from stress -- leaning on Nassim Taleb's book "Antifragile". He then drew parallels between DevOps and the human side of operations and made the important point that adding automation can make your organization more fragile (by allowing you to forget just how things actually work underneath).
Mark Imbriaco (@markimbriaco) presented "ChatOps: Augmented Reality for Ops". In it, he described how GitHub uses Chat (Campfire, in their case) for distributed collaboration and Hubot for a surprising number of operational tasks.
From within the chat any engineer can spin up new AWS instances, summon emoji pictures, change the routing tables of the border routers, configure switches, or mitigate a DDoS.
All of that without any sort of authorization, it seems. That is, if you can get on the chat (or onto 37signals' servers), you can own GitHub. As much as I like chat bots, this kind of blows my mind.
"Are there permissions built in?"
The second day opened with Dan Kaminsky (@dakami) and Zane Lackey (@zanelackey)'s keynote "Delivering Security: Faster, Better, Cheaper", which focused on applying the same tools we already use for operational visibility to the area of security and vulnerability discovery.
Dan began the keynote by outlining how, in part based on Zane's work at Etsy, he has come around to accepting fast iteration, continuous delivery and operationally agile approaches as one of the most important security mechanisms. He explained that in moving forward fast, we can remain one step ahead of the attackers as the infrastructure becomes a game where we set the rules, making it significantly harder for the bad guys to get in.
Zane then gave a number of examples by which Etsy was able to measure and baseline a number of operational variances to detect possible attacks or attack vectors, such as access of dead code, SQLi attempts and the like. As an example, he cited an occurrence of all of the following items on which one might alert: SQL syntax errors in the logs, sudden appearance of DAB tables (which would not normally be included in valid queries), and a sudden spike in outgoing traffic / response. Regular operational monitoring might catch each one of these, but only the correlation of these events would trigger a security incident, as that would reflect an attack chain.
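To make the correlation idea concrete, here is a minimal Python sketch of alerting only when all of the signals show up close together. The signal names, the ten-minute window, and the event format are hypothetical choices of mine; this is not Etsy's tooling, just one way such a rule could be expressed.

```python
from datetime import datetime, timedelta

# Hypothetical names for the three signals mentioned in the keynote.
ATTACK_CHAIN = {
    "sql_syntax_error",            # SQL syntax errors in the logs
    "unexpected_table_reference",  # tables that valid queries never touch
    "response_size_spike",         # sudden spike in outgoing traffic
}


def attack_chain_detected(events, window=timedelta(minutes=10)):
    """Return True if every signal in ATTACK_CHAIN occurs within one window.

    `events` is an iterable of (datetime, signal_name) pairs. Each event is
    tried as the start of a window; an incident is flagged only if all
    signals are seen before that window closes.
    """
    ordered = sorted(events)
    for i, (start, _signal) in enumerate(ordered):
        seen = {signal for ts, signal in ordered[i:] if ts - start <= window}
        if ATTACK_CHAIN <= seen:
            return True
    return False


if __name__ == "__main__":
    t0 = datetime(2013, 10, 15, 12, 0)
    events = [
        (t0, "sql_syntax_error"),
        (t0 + timedelta(minutes=2), "unexpected_table_reference"),
        (t0 + timedelta(minutes=5), "response_size_spike"),
    ]
    print(attack_chain_detected(events))
```

Any one of these signals alone would remain a routine operational alert; only the combination within the window is escalated as a possible security incident.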
I was very happy to see a security-focused keynote at this conference (though I had wished for more security-related content throughout), and hope that we will see more collaboration amongst the information security and web operations fields.
Watch the keynote here
"Continuous deployment makes a game where you set the rules. You do not owe your attackers a fair playing field."
Phil Dibowitz (@ThePhilD) from Facebook talked about "Scaling systems configuration at Facebook", covering in some detail how Facebook uses Chef for configuration management of many multi-K sized clusters while allowing operational engineers to not have to be aware of all the implementation details.
Neat as the detailed abstraction of the sysctl.conf example is, I found the talk slim on meaningful takeaways. "Move fast and break things" is somewhat easier when you can throw a prototype system at 10K machines, and I'm sure the problems solved are interesting, but it was hard to draw directly applicable lessons.
Facebook's chef-utils are available on GitHub.
Joshua Hoffman (@oshu), now of SoundCloud and previously with Tumblr, presented the entertaining story of "Hipster", a "fictional New York City based social blogging platform". In it, he recounted how this company moved from a hosted environment to its own datacenter and, through careful planning, was able to flip the switch without tipping off its hosting provider and with no visible changes to the user (albeit with notable downtime of the service).
Joshua gave a talk on "Scalable System Operations" at Velocity 2012, covering some of the tools this fictional company may or may not have used.
Alexei Rodriguez (@alexeirrm) presented "From Artisanal to Mechanized: Operations at Evernote". It's always interesting to see how different popular companies run their ops shop, but for me the most interesting aspect of the talk was mentioned a bit more in passing: Evernote stood up a second, and completely independent operational presence in China. I briefly chatted with Alexei after his talk and suggested that this should be a stand-alone talk at a future conference. The logistics behind and reasoning for deploying your own infrastructure and the requirements for a separate legal entity operating in China, how they might deal with the government and the legal ramifications would be fascinating. I'm hoping Alexei will pick this up...
Slides are here.
Mike Fiedler (@mikefiedler) gave a very nice culturally focused talk entitled "Where do We Go From Here?" He focused on the job title of "System Administrator", how it developed and how we want to move this profession forward. His 5 Steps to Self Preservation in this field are:
Slides are here.
And finally, Andrew Clay Shafer (@littleidea) gave what I consider to be the other best talk of the conference: "There Is No Talent Shortage". He focused on the need for organizations to encourage and support learning, to enable people to progress professionally as well as human beings. He also noted the need for ambition, the desire to "build the future". As an example, he gave three stone cutters: the first states "I'm paid to cut stones."; the second: "I use special techniques to shape stones in an exceptional way, here let me show you." The third one says "I build cathedrals."
People need a sense of purpose, have to find their cathedral. Software is no longer the crux of the business, learning (and enabling people to learn) is.
This talk was inspiring in the way it looked beyond "your job". Andrew encouraged everybody to not only do the right thing, but be the right people.
Slides are here. (There are a few previous versions of that talk at slideshare as well.)
The video of this talk is now available here.
"You are either building a learning organization, or losing to someone who is."
"People who aren't striving to learn have reached their own Nash Equilibrium."
"The organizations that build the future become 'graduate studies' in the skills they require to do so."
"Learning cannot be something that happens outside of the process. Learning is the point of the process."
And that concluded Velocity New York 2013 for me. Richard Cook's opening keynote and Andrew Clay Shafer's final talk nicely book-ended the conference with some excellent material. Velocity is a great conference, with great speakers, and the inaugural New York version did it justice. I necessarily missed out on a number of great talks, and I would have liked to see a Twitter presence here.
Listening to many skilled speakers, I've once more noticed a number of conference attendee and speaker peculiarities on which I may reflect in more detail another time.
Velocity New York was a great conference, but as much as I enjoyed it, I once again have to note that there is a very distinct echo chamber effect: many of the talks kept reinforcing what I expect to be most attendees' already solid beliefs rather than presenting groundbreaking new ways forward. The speaker and organizing committee makeup also feels at times like a bit of a tightly knit club: if I heard correctly, there were several speakers who were also program committee members, which I would have thought to be a conflict of interest. I'm looking forward to a more diverse program next year -- and you, yes you, could help make it so!