For a while now, I've been collecting a lot of information about our hosts: fairly general information such as the Operating System (OS) type and version, what kind of configuration management (CM) system each host is running, what add-on packages are installed, and so on. This provides us with a wealth of rather interesting information that is (not entirely surprisingly) frequently at odds with what our central source of truth believes to be the case.
Performing this data gathering regularly (using scanmaster and sigsh) over the course of several years by now, I've quickly learned that data collection is trivial, but data interpretation is hard. Or rather: complex. Simply compiling a bunch of information is useless unless you know a number of things about the way the information was collected and what non-obvious factors might influence its interpretation.
To illustrate the point, let us look at our progress in migrating our hosts to (a) a specific OS (and the distribution of the major versions); (b) a specific configuration management system; (c) a given package management system version. The actual numbers (or names) here are not really relevant, but suffice it to say that these migrations need to be done on a company-wide scale, which at Yahoo! means across a lot of hosts in a lot of different environments.
When I joined Yahoo! in 2006, there were two dominant unix-like operating systems in use. At that time, there was an effort underway to move away from one and towards the other, with the rationale being that using just one OS would make deployment and maintenance of services, software and everything else significantly easier. (Let us ignore the certainly interesting discussion surrounding the validity of this argument and how it compares to the risk of monoculture, discussed in more detail here.)
Here's how we're doing today:
This graph shows the number of hosts running either one of the two dominant operating systems over the last two years or so (there are hosts running a different OS, but the percentage is too insignificant to be of interest here). The red line is the OS that we want all hosts to be running, so we would hope to see it constantly (and rapidly) increasing while the other line should be in a constant (and steep) decline.
In reality, we find that the "green" number is hardly declining at all, even though the "red" numbers are increasing. Now it appears that between 04/01/2010 and 07/01/2010 there was a significant jump: clearly, many more "red" hosts were added, so why didn't the "green" number go down?
Well, migration of an operating system is hard. It is actually rather rare that a given host is rebuilt and the "red" OS installed in place of the "green" OS. Instead, it is much more likely that hosts are left alone until they die, while any new hosts deployed are given the new "standard" (ie "red") OS. If that were the case, then it would seem that during this period a lot of new hosts were deployed -- notably more than during other periods. Perhaps we had a new colo open up, a new product rolled out, a new company acquired?
Well, it turns out: none of the above. The reason the numbers jump here is much simpler, but entirely unobvious. In fact, from looking at the data, you couldn't possibly know it: during these months, the scanning system was extended to include a large number of hosts that were previously not scanned at all. That is, all of a sudden a larger total number of hosts was included in the scan, but that did not mean that new hosts were added or that any noticeable migration occurred. Without being aware of this little piece of information, you'd be likely to jump to false or at least misleading conclusions.
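One way to guard against this particular trap is to report each OS as a fraction of the hosts actually scanned that day, rather than as an absolute count. Here is a minimal sketch of the idea; the dates, names, and counts are invented toy data, not the actual scanmaster output:

```python
# Toy data: (scan date, hosts counted per OS). In the second scan the
# coverage grew, but the "green" population itself did not change.
scans = [
    ("2010-04-01", {"red": 4000, "green": 6000}),
    ("2010-07-01", {"red": 7000, "green": 6000}),
]

shares_by_date = {}
for date, counts in scans:
    total = sum(counts.values())  # hosts actually reached in this scan
    shares_by_date[date] = {os: n / total for os, n in counts.items()}

for date, shares in shares_by_date.items():
    print(date, {os: f"{s:.1%}" for os, s in sorted(shares.items())})
```

With shares instead of raw counts, a jump in scan coverage no longer masquerades as a migration: "green"'s share drops when the denominator grows, even though its absolute count is flat, which is a prompt to ask why the total changed.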
Now let's look at the distribution of operating system versions. We will, of course, see the same jump in hosts around the same dates, but we also see that some migration is possible. Here's a graph showing the OS versions for the "red" OS, which (in our environment, currently) comes in three major OS versions. (Breakdown by minor OS versions is entertaining, but meaningless in this discussion.)
The blue numbers represent the major OS version that we would like all hosts to be running; the green numbers are the previous, by now rather old and soon-unsupported major version. (The red numbers are an insignificant number of ancient older versions.) The good news: we are clearly making progress in getting more hosts to run the up-to-date version (blue), but once again we are not seeing a real correlating decline in the older version.
Here, we can actually observe migration/updates based on hardware being retired: (most) new hardware gets installed with the up to date version, and older versions just slowly die off (eventually, hopefully). If that holds true, then we should soon see a further steep increase in blue numbers and, after a certain tipping point, an eventual steep decline in green numbers. (More on such a distribution further down.)
For our other ("green") OS, the major versions are broken down into three widely used versions and a couple of oddball numbers:
Here, the green numbers are an entirely ancient OS version, the blue numbers a very old version, and the magenta numbers the currently up-to-date version of the OS. (Technically, even that version is old, but let's pretend it's up to date; it is the one we would want the hosts running this OS to be running.) In this case, we find that we have a very nice, constant (albeit very slow) decline in the ancient version, and a meaningful correlation between the rise in the desired version and the drop in the older version (over the last few months).
Upgrades between these versions appear to be easier here; perhaps the difference between running the "green" or "blue" version and the "magenta" version is not as big as the difference between the "green" and "blue" versions in the other OS?
As another piece of information not visible from the data, but one that helps explain the results once you know it, consider that an upgrade to the "blue" version of the first OS was tied to a change of configuration management system (more on that below), which of course means that adoption must necessarily be slower. Users do not want to simultaneously upgrade their OS and get in bed with a stranger, even though the migration path from one version to the other is not actually significantly more complicated than in the second case.
We have a number of configuration management systems in use, most of them developed in-house. Eventually, it was determined that we'd like to have one CM system to rule all hosts (and in the darkness bind them). However, migration towards this system is complicated by the fact that it uses a different model for file management (exclusive ownership / overrides versus merging local changes, for the most part) and information from the various sources of truth is collected in a slightly different manner. That is, changing your running, working and generally happy boxes to use this system carries a certain risk.
Let's look at the distribution of our CM systems over time:
"Red" is the system we want all hosts to run. It was developed in early 2009 and finally had notably penetration around a year later. "Yellow" (itself derived/forked off "blue"'s codebase) was the predominant and most widely deployed system. Looking at the graph, it seems that"blue" systems were initially migrated to the "red" system at a steady pace, but again the graph is not telling the whole story. In fact, the "blue" systems happen to belong to a (very large) group that was in the process of being retired/replaced; we would have seen the same decline in those systems, even if there was no directive to move towards the "red" system.
Secondly, even though "red" was constantly (though slowly) gaining in numbers, up until a few months ago there was no noticeable drop in any of the other systems! Even now, the four or five different systems crawling along the bottom of the graph have seen no meaningful change in deployment numbers whatsoever. But every couple of months, those lines appear to drop down to or near zero, then go back up again -- what's up with that? Well, again, you wouldn't know it by looking at the graph, but it just so happens that the credentials used to access the hosts are changed approximately every six months. The more widely deployed configuration management systems automatically pick up this change, but the handful of "less important" systems at the bottom require manual intervention. That is, the scans are running but can't reach any of those hosts, I notice the drop in numbers and open a ticket with the system owners to push new credentials, after which the scan can reach them again. Another unobvious pattern explained by circumstantial knowledge.
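A scan aggregator can make this failure mode visible by tallying unreachable hosts as their own bucket instead of silently dropping them. The sketch below uses invented record fields, not the real scanmaster format; the point is that a credential rotation then shows up as an "unreachable" spike rather than a phantom decline in deployment numbers:

```python
# Toy scan results; "reachable" is False when the scan cannot log in
# (e.g. stale credentials), so no CM system can be identified.
results = [
    {"host": "a1", "reachable": True,  "cm": "yellow"},
    {"host": "a2", "reachable": True,  "cm": "red"},
    {"host": "b1", "reachable": False, "cm": None},
    {"host": "b2", "reachable": False, "cm": None},
]

counts = {}
unreachable = 0
for r in results:
    if not r["reachable"]:
        unreachable += 1  # count separately; don't let these deflate any CM bucket
        continue
    counts[r["cm"]] = counts.get(r["cm"], 0) + 1

if unreachable:
    print(f"warning: {unreachable} hosts unreachable; per-CM counts undercount them")
print(counts)
```

A line drops to zero while the unreachable bucket spikes by the same amount: that's a scan problem, not a decommissioning.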
Next: what's the deal with the "pink" and "brown" lines? They came out of nowhere (with "pink" being even more widely deployed than "yellow"!), then merged rather suddenly; that seems rather odd. Well, the graph is giving us numbers for what we ask for, nothing more, nothing less. If our data classification is not entirely correct, then we get misleading graphs: the "pink"/"brown" systems are actually identical (ie a combination of two CM-related technologies), but were initially categorized as separate systems. It just so happens that the "pink" component was also found (albeit disabled) on other systems. The scans were updated to look for "pink"/"brown" (separately) in January of 2011, and the miscategorization was recognized and corrected in September, explaining the sudden drop. Again, this change has no relation to the constant increase of the "red" numbers...
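Once such a miscategorization is recognized, the historical numbers can be re-aggregated under a canonical name so old and new scans become comparable. A minimal sketch, with an invented alias table and toy counts:

```python
# Hypothetical alias table: "pink" and "brown" turned out to be the same
# system, so both map to one canonical name for reporting purposes.
CANONICAL = {"pink": "pink-brown", "brown": "pink-brown"}

# Toy history: before the fix the two halves were counted separately;
# after it, scans already report the combined name.
history = {
    "2011-01-01": {"pink": 500, "brown": 300, "red": 200},
    "2011-09-01": {"pink-brown": 750, "red": 400},
}

merged = {}
for date, counts in history.items():
    merged[date] = {}
    for name, n in counts.items():
        canon = CANONICAL.get(name, name)  # unknown names pass through unchanged
        merged[date][canon] = merged[date].get(canon, 0) + n
```

Re-plotting from the merged series removes the artificial "sudden merge" in the graph, since both periods now count the same thing under the same label.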
So how are we making progress? The answer is: very slowly. For the most part, "red" only increased in numbers due to new deployments. That is, for almost two years there was no migration happening at all, only the addition of a new system! In the last few weeks, however, we have finally seen people moving from the "yellow" system to the "red" system -- it seems we have reached critical mass, a point where running the new system is actually more common and people have an incentive to make their existing systems match the new systems (rather than just letting them die).
Finally, a look at how updates to the package management system are rolled out. That is, we are looking at version numbers of the package manager (not different package management systems):
Here, "blue" was the predominant version, "brown" means "no version of this package manager at all"; "magenta" is a version that contained a number of fatal flaws and it was decided it should be abandoned. The system underwent some major overhaul and the "turquois" is the next major version, which we want to have deployed everywhere. ("yellow" is an edge case we can ignore in this context.)
Here, we actually have a graph that is not hiding anything from us. Migration to "turquoise" is happening at a steady pace -- again, the increase is steeper than the decline in the others since many new systems are spun up with the new version, but existing systems are updated much more slowly. This graph also illustrates how much easier it is to push the update of a single, well-contained system (the package manager) compared to something as complicated as a configuration management system, an OS upgrade or (as noted above) a combination of both at the same time.
I look at these graphs as I generate them on a monthly basis, and I find it interesting how they illustrate the difficulty of making a global change on the scale of Yahoo!'s environment. Progress, initially imperceptible, happens at an excruciatingly slow pace. I believe there are a few reasons for that: on the one hand, Yahoo! is very conservative when it comes to making such changes. That is, stability is valued over agility. On the other hand, the environment just happens to be so diverse, with so many different products involved, that any change must take a long time.
Being the one collecting the data, I happen to have a much better understanding of what the simple graphical representations actually mean than most other people who look at them. And it has taught me that collecting data is trivial, but data analysis is complex. Collected data is meaningless if the surrounding factors are not known or understood, and that is something I'm afraid many people overlook. Instead, they (unconsciously) interpret the data to mean what they want it to. So... Know thy unknowns!
October 21, 2011