November 3rd, 2017
The following is a write-up of my talk "The Razor's Edge - Cutting Your TLS Baggage", given at O'Reilly Security 2017 on November 1st, 2017. The slides are available from slideshare or here; a video recording of the talk is available here.
This is a TLS talk. But not the super-exciting talk about a shocking (!) vulnerability or the break-through finding and presentation of a brand-new method of proxying your private key or anything cool like that. This is going to be a boring talk on the boring banality of fiddling with the various bits and knobs to get your serving stack to actually do TLS correctly and to strengthen your TLS posture.
If this puts you to sleep, don't worry: I already lied to you. The TLS bits were just to get you in here; in reality, this is a culture talk about how to effect change in a large organization. And because I know that 80% of you are busy scrolling through the Twitters while sitting here anyway, I'm gonna play a little Pop-Up Videos and run a bunch of unrelated tweets alongside the talk. Follow me on Twitter, vote Quimby!
I like boring. Boring is good. Boring technology, boring practices, boring routine. (Oddly enough BoringSSL is too exciting for me, so we stuck with boring old OpenSSL.) Boring slides, boring fonts, boring job...
Over the last 18 months or so, I've been working on unifying the HTTPS serving stacks of a boring old company named Yahoo. This warrants mentioning to provide some context for the scope of the problem: Yahoo has a lot of everything: lots of users, lots of different properties, lots of acquisitions, lots of page views, lots of data being streamed from lots of places. And yes, Yahoo also has lots of different software with lots of vulnerabilities, creating a lot of attack surface.
Ah, I can almost hear somebody say "you know what Yahoo has a lot of, too? Security incidents and compromises!" To which I say two things: (1) having some of the more visible compromises in the world behind you can be used in your favor, for example when you want to get executive buy-in for large scale initiatives to protect your users. Make lemonade. And: (2) so do you.
No, seriously, you can't smugly proclaim that everybody else is either compromised or doesn't know it but not conclude that this applies to you as well. Are you currently aware of your company having been hacked? No? Well, then I guess you fall into the second category.
Anyway, looking at our data centers and POPs, that's a lot of attack surface right there. One of the things where we were hurting a bit was all the variance in how our traffic was served. Some properties might serve directly from the data center, some might serve from our internal CDN, some from our internal Platform as a Service (PaaS), some from any possible combination thereof.
Over the last few years, we've frequently used this graphic illustrating the attacker life cycle to help explain to the rest of the company what our top priorities are:
You all probably know this, right? The attacker life cycle, showing recon, initial compromise, lateral movement, privilege escalation, and exfiltration. These five stages also give you your key priorities. Work on those, and you're on the right track. The other thing to keep in mind is, of course, your threat model. At Yahoo, we are in the fortunate position to no longer have to argue that certain nation state attackers are strictly within our threat model. We have enough users and push enough traffic to have this fairly big target on our backs, and we believe with reason that our adversaries will use the sort of attacks against e.g. weakly configured TLS services that you hear about in articles accompanied by an evil hax0r wearing a ski mask. (Know your enemy!)
In this talk, we will focus primarily on stage one and two of the attacker life cycle. In particular, our objective was to unify the ingress traffic serving stacks, to strengthen our TLS posture, and thereby to reduce our attack surface and to significantly raise the cost of attack for our adversaries. Over the course of now almost 18 months, we have made great progress here and...
Wait, what? 18 months!? Why on earth did this take so long? Whatever happened to "move fast and break things"? Well, it's not like our motto is "move slow and try not to break things even though you're still gonna", but given our large infrastructure's legacy and history, it turns out that we couldn't just prescribe our desired settings and call it a day.
We had to overcome assumptions about how our services work that had been made over 23 years, and convince a number of teams to give up control of their ingress service. With this in mind, we began outlining our initial requirements, accounting for exceptions, a slow migration path, and leaving room for incremental improvements.
Fundamentally, we wanted to reach a point where all traffic coming into Yahoo was served in a consistent manner, using a very limited set of OS and software versions, a common set of ciphers and protocols, and using certificates that meet the same common requirements. This would allow us to more easily make changes going forward, to push out patches or upgrades, to add or drop ciphers, and so on.
One important thing to call out here is that we have very specific, technical requirements (actual ciphers, session ticket key rotation mandates, etc. etc... booooring) as well as general, tactical requirements (also boring, but no, wait, actually, those are the interesting ones). If you want to accept end-user traffic from the internet at Yahoo, you need to meet all of the requirements shown in the slide.
So far, so good. Hopefully this all makes sense. The tactical requirements seem like no-brainers, so why do we bother calling them out?
We found that calling out the "obvious" explicitly has proven to be tremendously useful in future-proofing our stacks. Things that are obvious to you and me may not be obvious to others. And we really wanted people to explicitly sign off on these basic requirements so that we can actually hold them to compliance here. Without being explicit, we'd soon enough face an argument about whether a property should continue to run FreeBSD 4.11.
Now looking at the requirements, we also had an idea about what we wanted our stacks to do specifically, from a TLS point of view, anyway. Or rather, we had an idea about what we did not want them to do. But specifying that is effectively a blacklist, and you all know how well those work when it comes to security. So we went with a whitelist instead: instead of saying "you can't use RC4", we say "here's the full list of ciphers in the specified order that you must offer".
Now TLS is a complex beast, and figuring out which ciphers you want to offer in which order involves making tradeoffs between user clients you wish to support and the risk to your users from you supporting crappy, old, insecure ciphers and protocols. So to determine what we prescribe, what we want, we first had to understand what we have.
To the metrics chopper!
A pre-cursor to this program was an in-depth analysis of all certificates, SSL and TLS protocols, cipher capabilities, key-exchanges, etc. found across all our systems, including internally and externally exposed systems, browser- and service-oriented services, and everything in between. We found all sorts of weird things, which led me to conclude that we're effectively looking at a mirror image of the internet.
Any sufficiently large infrastructure is indistinguishable from the internet
Now as I said, this includes internal systems, non-http systems, proprietary vendor products, IoT thingies, ... the whole shebang. Certificates and stacks exposed to the internet and serving consumer traffic were quite a bit better here, but nevertheless, this painted a fairly bleak picture and really underscored the need to unify our TLS story.
But that data only tells one side of the story, too. It's our inside view. We also need to know what our users are doing, so we started logging the TLS ciphers negotiated by our clients in the different markets. Even bucket sampling here has proven to be incredibly useful in identifying whether or not we can drop a given cipher or protocol.
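This kind of sampled analysis boils down to a simple aggregation; here is a minimal sketch (function names and the drop threshold are illustrative, not our actual tooling):

```python
from collections import Counter


def cipher_share(sampled_ciphers):
    """Return the fraction of sampled handshakes per negotiated cipher."""
    counts = Counter(sampled_ciphers)
    total = sum(counts.values())
    return {cipher: n / total for cipher, n in counts.items()}


def safe_to_drop(share, cipher, threshold=0.0001):
    """A cipher becomes a candidate for removal once its negotiated
    share falls below the (illustrative) threshold -- remembering that
    at scale even a tiny fraction can be millions of users."""
    return share.get(cipher, 0.0) < threshold
```

Even with heavy sampling, the relative shares are usually stable enough to answer the "can we drop this yet?" question.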
Here we see the TLS protocols supported by our stacks. Note the persistent use of TLS 1.0 here (and just about everywhere on the internet); TLS 1.0 was defined in 1999, isn't all that great, and is verboten for PCI environments, yet it's still supported just about everywhere.
Note also the long, long tail on the SSL deprecation. Even though we declared SSL forbidden right after POODLE, it really only started to drop later in 2015, when we had begun the initial push for unifying the edge.
This is where the problems of scale come in: One in a million is next Tuesday.
Declaring SSLv3 dead and forbidden seems like an easy thing to do from an information security perspective, but if that had resulted in dropping 30% of our users and revenue, then we really couldn't have done it. And that's the thing about scale: if you have a lot of everything, then even less than one percent can still be millions of users. What's more, TLS clients are deployed unevenly across global markets, and in some geographical areas, there are ancient clients unable to speak modern ciphers or protocols, and you have to decide if you're willing to cut them out completely in order to not allow the rest to be dragged down.
The lowest common denominator really is very low. And at scale, you can always go lower. It's a limbo contest! For example, there are TVs sitting in people's homes that were produced in, say, 2009, and that can only talk TLS1.0 with AES128-SHA1, and that we have contractual obligations to support for several years into the future...
But so we looked at what clients negotiate, what vulnerabilities there are in which protocols and ciphers, how difficult they are to exploit, and then derived a cipher spec that we want to prescribe:
In general, we prefer ciphers that provide forward secrecy over those that don't; we prefer GCM over CBC; we prefer faster "good enough" ciphers over slower, stronger ones. But we're also missing a lot of good things:
Up until recently, Apache Traffic Server, which Yahoo donated to the Apache Foundation, and which we use heavily, did not have support for multiple certificate chains (and there are still problems when using OCSP stapling in these circumstances), which means that we could not serve ECDSA certificates for those clients that support it. We hope to be able to do so soon.
Similarly, we are also still running OpenSSL 1.0.1, meaning we are missing out on ChaCha20/Poly1305; again, we hope to be able to move to OpenSSL 1.1.x soon.
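The preference rules above (forward secrecy first, GCM before CBC, faster "good enough" ciphers ahead of slower, stronger ones) can be expressed as an ordered OpenSSL cipher string. The following is a hypothetical sketch, not our actual production list:

```python
import ssl

# Hypothetical whitelist illustrating the stated preferences: ECDHE
# (forward secrecy) first, GCM before CBC, AES128 before AES256.
# This is NOT the actual production cipher spec.
CIPHER_WHITELIST = ":".join([
    "ECDHE-RSA-AES128-GCM-SHA256",
    "ECDHE-RSA-AES256-GCM-SHA384",
    "ECDHE-RSA-AES128-SHA256",
    "AES128-GCM-SHA256",
    "AES128-SHA256",
])

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
# Raises ssl.SSLError if none of the named ciphers are available.
ctx.set_ciphers(CIPHER_WHITELIST)
```

Because this is a whitelist, anything not named -- RC4, export ciphers, anonymous key exchanges -- is simply never offered.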
But I also lied to you. The ciphers I mentioned are not actually the ones we serve; we also allow DES-CBC3 ciphers, which, after Sweet32 really should be disabled...
Alas, it turns out that there are certain load balancers that -- in the revisions we have certified for production -- only support DES-CBC3 for healthchecking. Which means that we either need to healthcheck different configs from the ones in production, or enable those ciphers. As I said, the lowest common denominator is low...
(Note that we are not only prescribing the set of ciphers, but we also prescribe the order of ciphers. We are monitoring and will complain if we encounter services using the same ciphers but in a different order. Similarly, we also have requirements around the use of Forward Secrecy, such as custom DH params and session ticket key rotation. One of the tools we use for this is cipherdiff, to check and identify discrepancies in the ordering of the ciphers.)
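The general probing approach can be sketched in a few lines -- this is in the spirit of cipherdiff, not the actual tool, and all names here are illustrative. We repeatedly connect offering the full remaining candidate set; a server that enforces its own ordering picks its most preferred cipher first each time:

```python
import socket
import ssl


def probe_order(host, candidates, port=443):
    """Derive a server's cipher preference order by repeated probing."""
    order = []
    remaining = list(candidates)
    while remaining:
        ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
        ctx.check_hostname = False          # probing, not authenticating
        ctx.verify_mode = ssl.CERT_NONE
        try:
            ctx.set_ciphers(":".join(remaining))
        except ssl.SSLError:                # no remaining name resolves
            break
        try:
            with socket.create_connection((host, port), timeout=5) as sock:
                with ctx.wrap_socket(sock, server_hostname=host) as tls:
                    chosen = tls.cipher()[0]
        except OSError:                     # refused, timeout, handshake failure
            break
        if chosen not in remaining:         # e.g. a TLS 1.3 suite was picked
            break
        order.append(chosen)
        remaining.remove(chosen)
    return order
```

The resulting list can then be diffed against the prescribed order to flag services that offer the right ciphers in the wrong sequence.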
But ciphers are really only one part of the equation. To serve HTTPS, we also need to look at the properties of our certificates (which CAs to use; names, domains and wildcards; validity; signature algorithm; type of certificate etc.), as well as the various security related headers and enabling certain security protections statically in the browsers where that is possible.
(One of the tools we use to verify compliance with our certificate requirements is certdiff.)
As before, most of these seem like a no-brainer. As before, it's useful to be explicit. Our biggest win here was the push for shorter validity. The longer a certificate is valid, the higher the probability that it will be compromised.
Looking at the data over the last two years, we see that we have actually made some progress, although that is only visible over the long term. And this is one of our biggest wins. We're not where we want to be, but the trend is clearly encouraging: we went from average certificate validity of over 2 years to average validity of less than a year!
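The checks behind these numbers are not complicated; a minimal sketch (the 180-day policy maximum and helper names are illustrative):

```python
from datetime import datetime, timedelta


def exceeds_max_validity(not_before, not_after, max_days=180):
    """True if a certificate's validity window is longer than policy allows."""
    return (not_after - not_before) > timedelta(days=max_days)


def average_validity_days(windows):
    """Average validity, in days, over (not_before, not_after) pairs."""
    total = sum((na - nb).days for nb, na in windows)
    return total / len(windows)
```

Run continuously over your certificate inventory, this is enough to both enforce the maximum and plot the long-term trend.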
But 6 months validity is not the end goal. We want to go much, much shorter. But we can't do that until we have suitable automation in place. So with 6 months, we are intentionally introducing a little bit of pain so as to get us to prioritize automation.
You see, with a 1 year validity, a painful renewal process can be absorbed by the properties (with some muttering and cursing), and will likely get them to request 2 or 3 year valid certs. With 6 months, we have enough pain to have developers complain, but not so much pain that we have a revolt.
So the 6 months validity is an incremental step towards the end goal. This incremental approach is something we have also learned from product development; you all have probably seen this illustration of the minimum viable product, in which we make small, incremental changes in order to reach our final objective.
In information security, we have an analogous model: the minimal viable protection. We so often try to secure all the things, and as infosec nerds, we can trivially poke holes into any defense mechanism and snidely call it "ridiculous" or "laughable". And it's true that in cryptography any minor flaw can be disastrous, but remember one of Shamir's Three Laws of Security: cryptography is typically bypassed, not penetrated.
So it's ok to make things better, even if we don't make things perfect. Incremental changes that slowly raise the bar are not just ok, but the only realistic approach to increasing security across the board. Learn to defend against Red Guy before you try to take on Mr. Burns.
And if you end up "only" defending against Wile E. Coyote, that's ok, too. Not everybody has to be -- or can be -- defending against nation state attackers. Know your threat model.
Alright, so we have all our requirements, but that sure sounds like a lot of work. How do we get there? For starters, we need support from the executive team. If security is not a priority set from the top down, then you are crippled in your effectiveness. Information security is not a single department's job; we are cat herders: we need to nudge other teams, often telling them to do work they do not necessarily see the value in. So step 1 is to get this buy-in from your execs.
But wait -- how the hell do you get buy-in from execs? Well, I'm glad I asked! Step 0 is to get pwned. (Sounds crazy, right?) Most executives are happy to "accept the risk" until they are shown (on the front page of the NY Times) that the risk is real.
And we already established that all y'all are compromised, so this step is already done for you, how convenient. Now you just have to convince your execs...
One of the most important lessons we've learned here (over and over again) is that incremental changes get us to the finishing line much faster than trying to jump to the goal right away.
Not only are smaller changes easier to accept for other teams, you are also building a culture of constant change, of moving forward. The first change you ask for will be hard, but once things are moving, you can easily add additional (small) requirements; once developers are used to making changes here, they are perfectly able and willing to make other changes. Newton's Laws of Motion do apply in cyber space!
And incremental change is the only way we can make progress. Perfect is the enemy of the good -- we will never be able to achieve 100% security, but we can always do better.
We specifically did not aim for our end-goal, our ideal configuration; had we done so, we wouldn't have made any changes or progress. But at the same time, it is important to not stagnate, to not call the accomplishments you've made "good enough". We always want to move forward, to improve. By building a culture of incremental changes, we actually can.
One of our main dilemmas in information security is that we come in and tell others what to do. And they usually have little incentive to comply. Getting other teams to do work they have not prioritized is tricky. So you need to sell something. The old carrot, not the stick. Carrots work much better. But beware: your carrot is not their carrot. What you see as a win (increased security) makes no practical difference to other teams. So figure out how to sell them your work.
One thing that's worked for us: Deferred responsibility. Or rather, we sell not having to do the work. Instead of having 200 properties all trying to configure TLS correctly and then chase updates when the next heartbleed drops, we offer them peace of mind. "Get behind this approved cluster and sleep well. Let somebody else do it!"
That turned out to be a pretty juicy carrot: the ability to shed the responsibility of keeping track of all ingress stack requirements, to let developers and service owners do what they are paid to do, what they enjoy, what they do best, to focus on their service. The responsibility of implementing all these requirements is now carried by the small number of edge ingress stacks configured by experts in this area. That is, we're offering them the most valuable reward possible: time.
The other thing that turned out to be crucial for our success was patience. Or, more specifically, delayed gratification (on our part) vs instant gratification on our developers' part.
We often compromised with the end-goal of moving forward (you see a theme, here: incremental changes, make things better, move forward). That is, we might allow properties to onboard their service even if they don't meet all requirements if they commit to fixing things Real Soon Now. That would require a tracking ticket that is assigned to an actual human, has a reasonable deadline, and is actively watched until that deadline is met.
We are infosec: we are in it for the long run. Short wins are nice, but the real impact is in the long term, so patience, young skywalker.
With these approaches, we feel that we are now in a very good position to effect actual change, to make progress, to improve our security posture. We are reaping the benefits of a lot of work of a lot of different teams. Profit!
With general adoption in place, it became easy to expand the requirements, to make changes that were (reasonably) quickly implemented by the different stack owners and getting important security protections to our users much faster. Since last year, we:
HSTS is a particularly big win, in my book. We'd been intending to enable this for years, but always got pushback from people afraid of breaking third party content (of which we have our fair share). But looking back, I think we also went about it the wrong way, trying to get it onto the main domains with 'includeSubDomains' set, because the thought of setting it individually across the hundreds of domains we are serving seemed too daunting.
Turns out, once we have central ingress stack requirements and services, setting this for most domains becomes much easier, and it is being rolled out across more and more properties at this time. Getting to e.g. yahoo.com with 'includeSubDomains' now seems a lot more feasible, and we may get there yet.
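With a central place to set the header, an incremental rollout becomes a matter of ratcheting up one value at a time; a sketch, with illustrative stage values:

```python
def hsts_header(max_age, include_subdomains=False, preload=False):
    """Build a Strict-Transport-Security header value."""
    value = f"max-age={max_age}"
    if include_subdomains:
        value += "; includeSubDomains"
    if preload:
        value += "; preload"
    return value


# An illustrative staged rollout: ratchet max-age up, then widen scope
# to subdomains once metrics show nothing breaks.
ROLLOUT = [
    hsts_header(300),                                # canary
    hsts_header(86400),                              # one day
    hsts_header(31536000),                           # one year
    hsts_header(31536000, include_subdomains=True),  # full coverage
]
```

Starting with a short max-age keeps any mistake recoverable: the pinning of HTTPS expires quickly if you have to back out.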
The last item -- HPKP -- of course is a bit funny. On the one hand, deploying it turned out to be quite the bleeding edge (HPKP-RO is only supported by Chrome and Opera) and of course just last week Google announced that they plan on deprecating HPKP altogether (which is another discussion for another time), but when that time comes, dropping the header again will be much easier for us to do here, too.
So, cool story, right? Just follow these few steps and you're done. Profit. Right? I'm afraid I will have to disappoint you. I'm sorry, we are infosec, our work is never done. Instead, you have to rinse and repeat, only hopefully you won't repeat at step 0...
We are paranoids. We fight for the users. We cannot get 100% security, but we can always do better. We have to continually improve. Which is exactly what we're doing, and with this practice of incremental changes in place, we are planning to move forward consistently and swiftly:
So coming to a close here: this talk wasn't really about TLS at all. It was about effecting change. We are building a new standard, new requirements. If nobody follows the new rules (because they're new), then they're not a standard. That seems like a Catch-22, but you can will it into existence. Call it a standard, and (with the aforementioned executive support), it will become one. "Build it, and they will come."
Now you can't fake being a requirement altogether, but you can ask for adoption going forward. One way to make this happen is by way of chokepoints and defaults. You need a place where you can apply your changes effectively -- in this case, it was the ingress layer, where we control network access. Another really effective chokepoint may be your Continuous Integration / Continuous Deployment platform. Another one may be your configuration management system.
You need to provide safe defaults. If you have a Chef recipe or a Puppet manifest that gets applied everywhere, this is where you belong, this is where you need to hook in. Provide a safe, secure, and easily expandable base configuration for these services, and people will build upon it, inheriting the correct settings.
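The pattern of a secure, inheritable baseline can be sketched abstractly -- keys, values, and function names below are illustrative, not our actual requirements:

```python
# A centrally maintained secure baseline that property configs inherit
# from and may extend, but not weaken.
SECURE_BASELINE = {
    "protocols": ["TLSv1.0", "TLSv1.1", "TLSv1.2"],
    "hsts": "max-age=31536000",
    "session_ticket_rotation_hours": 12,
}

# Security-critical keys that individual properties may not override.
PROTECTED_KEYS = {"protocols", "session_ticket_rotation_hours"}


def effective_config(overrides):
    """Merge property overrides onto the baseline, refusing attempts to
    weaken security-critical settings."""
    illegal = PROTECTED_KEYS & overrides.keys()
    if illegal:
        raise ValueError(f"cannot override: {sorted(illegal)}")
    return {**SECURE_BASELINE, **overrides}
```

Whether this lives in a Chef recipe, a Puppet manifest, or elsewhere, the point is the same: the secure setting is the default and the path of least resistance.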
Give me a razor and a place to stand, and I can cut the world. I mean, lever, and move. Your chokepoints are the place to stand, your defaults are the lever. Small changes have huge impact.
Engineers love to move forward. Everybody likes to build shiny new things, nobody likes to maintain old crap and fiddle with existing settings. Exploit that! It's much easier to get your security requirements in -- and get buy-in! -- early on. So begin by outlining your requirements for newly deployed services. This should get you adoption and critical mass so that you can then, sneakily, go back and say "oh, by the way, let's fix the old stuff, too".
Of course it should go without saying that you have to have your own house in order. Lead by example! You cannot have your own services in an insecure configuration, or your credibility is zilch. What's more, you need to provide the details, the information your teams need, give your teams the tools they need. In our example here, this means that we have to provide detailed sample configurations of the approved stacks, including examples of certificate procurement, key management, deployment, web server and proxy configuration with ciphers, setting headers etc.
But don't try to fix everything yourself, don't do the work for the other teams. People value something they have worked on more than something they are given (this is known as the IKEA effect). They will take more pride in keeping their stacks up to date and safe when they were the ones to make the changes. Offer them autonomy. Guide them, but do not force them.
In HTTPS and TLS, deploying changes can have rather significant consequences. Think of HSTS or HPKP, of dropping a cipher and making it impossible for millions of users to use your service.
So measure what you're about to break. Ensure you have the metrics to show you what you need to know. Visibility into your stacks and your clients' capabilities is key. Shuffle the priority of your ciphers to see how many clients end up negotiating the one you wish to drop; set up report endpoints (for e.g. Expect-CT, HPKP, CSP, ...), measure, fiddle, measure again.
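A report endpoint need not be fancy to be useful; here is a minimal sketch of one that accepts browser policy violation reports (path, port, and handling are all illustrative -- production would sample, aggregate, and alert on spikes):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class ReportHandler(BaseHTTPRequestHandler):
    """Accept JSON violation reports POSTed by browsers (e.g. via a
    CSP or Expect-CT report-uri directive)."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            report = json.loads(self.rfile.read(length))
        except ValueError:
            report = {}
        # Aggregate here; alert on spikes before ratcheting a policy
        # from report-only to enforcing.
        print(json.dumps(report))
        self.send_response(204)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the sketch quiet


# Usage: HTTPServer(("", 8080), ReportHandler).serve_forever()
```

Running a policy in report-only mode against an endpoint like this for a few weeks tells you what would have broken before you break it.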
We are playing with sharp razor blades here, so have some bandaids ready. Have a roll-back plan. Introduce features slowly, so you don't have to roll back. (Once you cause harm and have to roll back, it will be ever so much harder to move forward again!)
But above all, be transparent in what you're doing. I know, this can be difficult for people in information security, what with our default stance of "need to know" and "least privilege". However, you can't get people in other teams on your side if they don't understand what you're doing and, perhaps more importantly, why you're doing it.
So communicate clearly, succinctly, effectively. Don't patronize, but don't overwhelm non-security experts. There's no need to send out the 67 page NIST recommendation on TLS configuration; you can probably boil it down into a two page document. And make sure to show people what problems you're having: you will be surprised how frequently people on other teams can help you once they know what the problem is. Engage those local champions!
In closing: TLS should be boring, and these are some of the exciting ways in which we tried to be boring. If nothing else works for you, try to remember these take-aways: we can't reach 100% security, but we can always do better. We owe it to our users to continuously improve our security posture, to move forward, little by little. If you encourage others to take responsibility with your guidance, change is possible, even in as large and diverse an environment as Yahoo.
(Shout-outs and thanks to all these teams: Apache Trafficserver, Google's Chrome Team, the Mozilla Observatory, SSLLabs, the Yahoo Edge Team, and all Paranoids everywhere. You're all in my cool book. Be paranoid, fight for the user. Thank you.)