Writing software is hard. Fortunately, or so we choose to believe, there are a lot of skilled people who do this for a living, holding degrees in software engineering and following industry best practices. This belief doesn't make writing software any easier, but it sometimes feels like it does. Or should. Anyway, for the time being, let us pretend that "real" software projects are usually written by teams of software developers, who abide by certain guidelines, coding standards, etc.
System Administrators and other engineers who do not view their primary job as being measured by mundane (and ultimately meaningless) criteria such as "lines of code written" or "bugs fixed" tend to write a lot of software, too. Often times the software they write consists of little helper tools, glue scripts, and small programs that are mainly used to interface with other, more complex software systems or to munge data into a new format.
These system tools rarely consist of more than just a few hundred, sometimes perhaps up to a few thousand lines of code. In other words, they're rather small, and most of the time we think of these tools as trivial, even as we take pride in the solutions we've come up with. But because we do not consider these system tools to be "real" software projects, we tend to quickly jot them down, script them. (The name "script" almost seems to imply "no error checking" and "works for me, who knows if it'll work for you".)
But therein lies a fallacy. Simple does not mean unimportant. If our little glue script is actually useful, it will quickly sprout new features, adjustments for other environments, be used by other people, integrated into routine tasks and eventually relied upon. I've argued before: software is alive; it grows and ultimately escapes your control. Scripts grow into programs, which in turn grow into infrastructure components or standalone software products.
In my class on System Administration, I now dedicate almost two complete lectures to the topic of writing system tools. In these lectures, I preach some of the basic software engineering principles that we expect developers to apply to their large scale projects. (The lectures are based in part on a talk I have given a few times before -- and the talk based in part on the lectures -- entitled Building Better Tools.)
But students are a rather receptive audience for this message (well, the good ones, anyway). In practice, I have found that even experienced system engineers eschew these methods or principles (while simultaneously paying lip service to them), as they fall into the "it's just a quick script" trap. We are notoriously lazy, and once you have a working tool, why bother polishing it and developing it into a more finished product? As much as perfect can be the enemy of the good and as much as we have to be cautious of creeping features, in this case "good enough" acts as the main inhibitor and... simply isn't.
Within the context of developing system tools, I usually differentiate amongst three distinct phases or categories: scripts, programs, and software products.
I view the act of "scripting" as more or less a proof of concept phase, a solution to a problem that we don't anticipate to be used by other people. The very language we use ("throw together a quick script", "let me whip up a script") suggests as much. The script we produce ends up making lots of assumptions: about the user's environment, about the intended usage, about the input, and so on. All of these assumptions are implicit, however.
Scripts are frequently written in shell (Bourne shell, usually, or "in bash" -- more on how that parlance rubs me the wrong way another day), though of course similar code can be written in any language.
Scripts tend to be used for very simple tasks, and they often times rely heavily on the environment. That is, they expect to find certain helper tools in the user's PATH, assume the ability to write to the current working directory or have access to certain files, for example. Most of the time, they are really only suitable for use by the person who wrote them -- they may work for other people, but often times they also break down when the various assumptions made do not hold any longer.
The most important aspect of any actually useful script is that it invariably evolves into a larger program. Almost every time I "whip up a script", I later on end up wishing I had developed it more carefully. Software is alive. Scripts grow new features; we make changes to no longer rely on the assumptions we could make when we were the only user; we fix bugs and increase robustness. And all of a sudden, our little scripts grow up to become...
In the evolution of software development, programs are the mudskippers, the next step after the wonky-eyed flounder that is the script. It's what naturally happens when your scripts are actually used -- they become useful.
Frequently, a program makes use of basic frameworks or common toolkits, software libraries or modules (think import antigravity). In contrast to a script, a program provides a more consistent interface, accounts (hopefully) for differences in the environment and is actually able to handle more than just trivial or simple tasks. Programs range from a few hundred to a few thousand lines of code; they may consume or provide an API to a service and interface with various other components without human interaction.
(At this time it might be worth noting that I use the terms "script" and "program" in a rather language agnostic way. That is, you can very well "script" using perl, python or ruby, and, yes, you can actually create complex programs using shell.)
Programs and scripts are, for the most part, what many of us Systems Engineers, System Administrators and other operationally targeted developers are creating. We put tools together that range from trivial to moderately complex. The target audience for our tools are people like us, our peers, possibly (depending on the size of the environment) other people within our organization but rarely, if ever, outsiders with entirely different environments or requirements. (Even if we open source our tools, we frequently remain our only consumers; on the flip side, we are willing to deploy into production code that we are not willing to stand by in public (see Open Source at Netflix) -- but that is another story and shall be told another time).
Most of our tools are used to bind other components together, to function as "glue"; on a few rare occasions, however, we may find ourselves maintaining a given tool for more and more users, ultimately realizing that we ended up owning a full-fledged "software product".
Within the confines of this blog post, let us ignore full-fledged software projects that have, from the beginning, been conceived as such and have received funding and manpower. How does a piece of software evolve beyond the state of being a mere "program"?
Once a piece of software has grown sufficiently, has attracted enough and more importantly sufficiently diverse users, we may realize that what we produced so far really only was a prototype. A useful prototype that helped us actually understand the problem it was trying to solve, but a prototype nonetheless.
At this stage, we start to build out an actual "software product", with a clear vision of features, requirements, measurable goals, specifications etc. etc. Usually, this stage of development requires a dedicated team that is able to not only carry out the development itself, but that can provide the necessary support and maintenance of the product throughout its life cycle (commonly estimated to be ~75% of the total cost of ownership -- I've argued that operational support makes up the remaining 75%).
Again, the focus of this blog entry is on Systems Engineers and Administrators who (if your organization does things the right way) may play a role in the design of the application, but who are not the primary developers. I mention this third stage of software evolution only for completeness's sake.
Alright. So. How do we build tools? Over the years and with some experience in small and large scale environments, I have come to the conclusion that the overwhelming majority of Systems Engineers and Administrators remain "stuck" (largely unconsciously) in scripting mode, even when they are programming. This is particularly obvious when you review any given software vendor's install scripts, smaller open source projects or, even better, your very own tools and helpers.
Portability is frequently ignored: in most cases, the author can assume a specific environment largely under their control. The user interface is unpolished: all expected users besides the author are assumed to be "experts" or to follow "here, run this command" instructions. The software is distributed and installed by manually downloading it and/or stashing it into ~/bin, and supporting documentation is non-existent.
These issues may seem like pedantic pet peeves, but I firmly believe that we should write our own tools according to the same quality expectations we hold other system software to. (My background in this area is the NetBSD Project, where a focus on quality and clean design, both in code and user interface, remain the defining features. Participation in this project has taught me a number of very important lessons -- but that, too, is another story and shall be told another time.) Ideally, I would like all my tools to be suitable for inclusion in an operating system's default image. This ideal forces me to focus on simplicity and quality and is something I've found to not necessarily be on the radar of many systems engineers.
Viewing a tool as an inherent component of a complete system changes how you perceive a number of rather basic features or issues. Some of them include:
From a technical point of view, the single most useful quality of any script or program is idempotency. At the same time, a lot of our system tools fail in this regard. All too many tools perform one-time actions that cannot be repeated (without failure) or only if all conditions are just right. Idempotency in turn implies predictability and graceful failure. In fact, a lot of different "best practices" or other technical features of any given program or script can ultimately be summed up with this requirement:
Your tool should only and always cause a defined set of outcomes. Running it repeatedly will either always fail or always succeed.
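As a minimal sketch of what this means in practice, consider a few common shell operations rewritten to be idempotent; the directory, link and config setting names here are entirely hypothetical:

```shell
#!/bin/sh
# A minimal sketch of idempotent operations: each command leaves the
# system in the same end state no matter how often it is run.  The
# paths and the 'frobnicate' setting are hypothetical examples.

dir="${TMPDIR:-/tmp}/mytool.$$"
link="${dir}.link"
config="${dir}/config"

# mkdir(1) fails if the directory already exists; mkdir -p does not.
mkdir -p "$dir"

# ln -s fails on the second invocation; -f replaces the link, and -n
# keeps an existing link to a directory from being followed.
ln -sfn "$dir" "$link"

# Blindly appending a line duplicates it on every run; guard it.
grep -q '^frobnicate=yes$' "$config" 2>/dev/null || \
    echo 'frobnicate=yes' >>"$config"
```

Run this twice and the end state is identical; run the naive versions twice and the second run either fails or corrupts the result.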
To illustrate this requirement, consider the trivial example of error checking (or rather, the lack thereof). When we "whip up a quick script", we tend to run a series of commands without explicitly checking whether or not the previous command succeeded. A common offender in this regard is changing the current working directory and then performing operations on files specified via relative paths. Other examples include proceeding to operate on variables that have been assigned the output of commands that may very well have failed.
In most cases, when the author runs the tool, everything will go more or less according to plan, but when executed in another environment, things will break. Files may not be readable (leading to either open permissions being applied or the tool being run with elevated or changed privileges), commands may not be found (yet the script continues), etc. etc.
The return code or output of any function or command that can fail needs to be checked.
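A sketch of what that looks like for the cd example above; the function name and the log file layout are invented for illustration:

```shell
#!/bin/sh
# A sketch, not anybody's production code: every command that can
# fail is checked before its result is relied upon.

remove_newest_log() {
    dir=$1

    # An unchecked cd that fails would leave us operating on files
    # in whatever directory we happened to be in.
    cd "$dir" || { echo "cannot cd to $dir" >&2; return 1; }

    # Command substitutions fail quietly; verify the result before
    # passing it on to rm(1).
    newest=$(ls -t ./*.log 2>/dev/null | head -1)
    if [ -z "$newest" ]; then
        echo "no log files in $dir" >&2
        return 1
    fi
    rm -- "$newest"
}
```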
No, this isn't news. And if asked, most people will nod and agree and then move on to ignore this rule entirely. More generally speaking, we want all our tools to
Fail early, fail explicitly!
This means that errors should be detected early, handled gracefully and quickly reported to the user rather than unintentionally causing surprising changes to the system. Silencing errors is almost always a terrible choice, unless the errors are expected and handled appropriately. Again, this is not a new insight, but it is easily dismissed when one "just wants to get one's job done".
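One way to make failing early the path of least resistance is a small error helper that is called whenever a precondition does not hold; again only a sketch, with a hypothetical 'process' stage standing in for real work:

```shell
#!/bin/sh
# A sketch of "fail early, fail explicitly": detect problems up
# front and report them on stderr instead of silently continuing.

err() {
    echo "error: $*" >&2
    exit 1
}

process() {
    input=$1
    # check all preconditions before doing any work
    [ -n "$input" ] || err "no input file given"
    [ -r "$input" ] || err "cannot read $input"
    # ... the actual work would go here ...
    echo "processed $input"
}
```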
Since we just wrote this little script to scratch our particular itch, we know perfectly well how to use it. For the most part, it probably doesn't accept any command-line flags and one has to know the correct way to invoke the tool -- how many arguments, in which order -- by heart. And so we tell the next person: "To frob the hobknobbin, just run ./hfrob mumble file strunz. If you want it to stroke the grunzbugen, run ./hfrob grunz file instead."
And what do we do when we run somebody else's script and it inevitably fails? We look at the source and attempt to figure out what on earth the tool actually expects from us. The funny thing is: in a few weeks or months, when we want to use our own tool, we have to do the exact same thing for ourselves.
So yes, even for a little helper tool, please do implement basic command-line option parsing and a terse usage option.
While at it, also add an option to print out what the tool is doing as it's doing it (I'm partial to multiple -v flags to increase verbosity) and a flag to run the tool without making any changes (-d for "debug" or "don't", for example). This makes debugging and understanding somebody else's tool so much easier! As an added benefit, your tool then no longer needs to print diagnostic messages by default -- output that makes post-processing annoyingly complicated. That is, only print useful and desired output: don't bother the user with pointless "now doing X" or "still running Y" messages (though using syslog(3) to log such messages dependent on a specified log level is, in turn, good practice!).
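Even in a Bourne shell script, getopts makes all of this nearly free. A sketch, reusing the hypothetical hfrob tool; the flags, messages and argument handling are illustrative only:

```shell
#!/bin/sh
# A sketch of basic option handling for the hypothetical 'hfrob'
# tool: -h prints a terse usage, -v increases verbosity, -d performs
# a dry run without making changes.

usage() {
    echo "Usage: hfrob [-dhv] command file" >&2
}

main() {
    verbosity=0
    dryrun=0
    OPTIND=1
    while getopts dhv opt; do
        case $opt in
            d) dryrun=1 ;;
            h) usage; exit 0 ;;
            v) verbosity=$((verbosity + 1)) ;;
            *) usage; return 1 ;;
        esac
    done
    shift $((OPTIND - 1))

    [ $# -eq 2 ] || { usage; return 1; }

    # diagnostics go to stderr, and only when asked for
    [ "$verbosity" -gt 0 ] && echo "frobbing '$2' via '$1'" >&2
    if [ "$dryrun" -eq 1 ]; then
        echo "dry run: not frobbing $2"
        return 0
    fi
    # ... the actual frobbing would go here ...
    echo "frobbed $2"
}
```

A real script would end with `main "$@"`; splitting the logic into a function also makes the tool trivially testable.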
This, then, brings us to the other major headache I've found in many custom system tools: rather than operate on stdin and stdout, we write tools that require an input file and generate data to either an output file, submit it to a web server, send it via email or who knows what. This is a terribly inflexible approach, and of course a violation of one of the tenets of the Unix Philosophy:
"Write programs to handle text streams, because that is a universal interface."
Again, not new! And engineers will be quick to dismiss and ridicule other tools they encounter that violate this principle, yet the next time they write a program to process data, they will start by opening an input file and end by sending an email alert as the result.
Not operating on stdin and generating output to stdout (and, of equal importance, sending error messages to stderr and not stdout) makes the tool not only rigid and less suitable for extension, it makes it impossible to use the tool as a filter (think sort(1), uniq(1), ...) and it significantly complicates debugging and testing.
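A sketch of the classic filter pattern in shell; the tr(1) invocation stands in for whatever the real transformation would be:

```shell
#!/bin/sh
# A sketch of a filter: read the named files or, if none are given,
# stdin; write results to stdout and error messages to stderr.

frob() {
    # the (hypothetical) transformation; here: upper-case the input
    tr '[:lower:]' '[:upper:]'
}

frobfilter() {
    if [ $# -eq 0 ]; then
        frob
        return $?
    fi
    for f; do
        if [ ! -r "$f" ]; then
            echo "cannot read $f" >&2
            return 1
        fi
        frob <"$f"
    done
}
```

Such a tool composes naturally in a pipeline with sort(1), uniq(1) and friends, and testing it is as simple as feeding it sample input on stdin.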
When you write a new tool, even if it's just a little script, please consider whether it can operate as a filter: read from stdin, write results to stdout, and send error messages to stderr.
When we write a new tool, we usually start out by, well, writing the new tool. We script, we program, we hack, we run, we debug and finally we have a working prototype -- only, we don't call it a "prototype", but rather the final product. Either way, what we don't have is any documentation for our tool.
Once again, all System Administrators and Engineers will likely happily agree that having a manual page for any tool they have not written themselves is wonderful if not mandatory. But how many do you know who actually routinely write manual pages for all their tools?
The reason we have so few accurate manual pages for our custom tools goes back to not being used to writing software for people outside our own group; much like the lack of a -h command-line option, we assume all possible users will know how to use the tool. In addition, writing a manual page is much less exciting than writing code, and if we only start writing it after the fact, then it's easy to understand how it feels unimportant or a nuisance.
For this reason, I advocate starting the development process with the manual page. That is, before you write any code, Write The Fine Manual. This approach helps me clarify just what exactly I want to write, and more specifically how the tool will interface with the user. As a habit, I include an EXAMPLES section with common invocations. This is tremendously helpful for the end-user later on and again helps me clarify how the tool is to be used.
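To illustrate, writing the manual first need not be a big undertaking. A minimal mdoc(7) skeleton for the hypothetical hfrob tool from the earlier example (all names and flags illustrative) might start out like this:

```roff
.Dd July 16, 2012
.Dt HFROB 1
.Os
.Sh NAME
.Nm hfrob
.Nd frob the hobknobbin
.Sh SYNOPSIS
.Nm
.Op Fl dhv
.Ar command file
.Sh DESCRIPTION
The
.Nm
tool frobs the hobknobbin described in the given
.Ar file .
.Sh EXAMPLES
To frob the hobknobbin:
.Dl hfrob mumble file
```

Filling in the DESCRIPTION before writing any code forces the interface questions (which flags? which arguments, in which order?) to be answered up front.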
Furthermore, I am also strongly in favor of writing an actual manual page; generating *roff from inline comments may be better than nothing, but the end result has a distinctly unpolished feel to it, in my opinion. What's more, I've found that keeping code and documentation in one place is not actually as conducive to keeping the documentation up to date as is commonly argued. (Remember, we're talking about smallish system tools here, not large-scale software projects with complex API documentation generated from the specially formatted comments around the various code entry points.)
You may notice that this idea goes back to the concept of thinking about your tool as being suitable for inclusion in an Operating System distribution. If you include a manual page, you also get the benefits of a manual page index and ease of accessibility -- something that is lost entirely, for example, by formatting help into a pager from the executable at run-time.
"But wait!" you may say, "A manual page for my little script seems total overkill." -- to which I must reply "fallacy of scripting". Software is alive. Your script will grow more complex, others will use it, you will forget how to use it in a few weeks or months, and then everybody will go back to sifting through the source to understand what it does.
Guess what, echo(1), true(1), and yes(1) all have manual pages. Your tool is likely to be significantly more complex than those.
Every tool intended to be run by others -- no matter how small -- deserves a manual page.
Though somewhat tangential to program documentation, also see my previous blog entry about writing system documentation.
Finally, the last pain point I wish to address in this overly lengthy blog entry is how we tend to deploy system tools. For the class of software that we have been talking about here, I've found it to be rather uncommon to have actual software packages. Instead, these scripts are often times stored in a shared filesystem space or other users are advised to check them out from a commonly accessible source repository and just run them from there.
This, once more, exhibits the implicit assumption that every possible user has an identical setup. Software dependencies -- and even simple tools nowadays pull in a slew of add-on modules (be they python, perl, ruby, node or who knows what) -- are ignored completely; if multiple files are required, they might be installed with a rudimentary Makefile that copies them into a hardcoded /usr/local prefix (or /opt -- see virtually every software vendor's script), and uninstallation is effectively impossible if one does not happen to remember which files were copied where.
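Even where a full package seems like overkill, a few lines of make(1) can fix both the hardcoded prefix and the impossible uninstall. A sketch, with the tool name once more hypothetical:

```make
# Allow the user (or a package build) to override the prefix.
PREFIX?=	/usr/local
BINDIR=		${PREFIX}/bin
MAN1DIR=	${PREFIX}/man/man1

install:
	install -d ${DESTDIR}${BINDIR} ${DESTDIR}${MAN1DIR}
	install -m 755 hfrob ${DESTDIR}${BINDIR}/hfrob
	install -m 644 hfrob.1 ${DESTDIR}${MAN1DIR}/hfrob.1

uninstall:
	rm -f ${DESTDIR}${BINDIR}/hfrob ${DESTDIR}${MAN1DIR}/hfrob.1
```

Honoring DESTDIR also means that wrapping the tool into an actual package (RPM, pkgsrc, etc.) later on becomes almost trivial.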
Like writing a manual page, writing the necessary glue for a software package (a .spec file for RPM or the suitable Makefile for pkgsrc, etc.) is not very exciting. Once we have completed a tool, we are too excited to use it and to have others use it to go through the tedious exercise of packaging it for distribution. GitHub has actually exacerbated this situation: it's so easy to just throw some code up on GitHub and tell people to simply clone the repository that we easily forget that this does not a proper software distribution make.
And yes, I do believe that every tool intended to be run by others -- no matter how small -- deserves to be packaged.
As you can tell, I could go on: how open sourcing tools is not enough; what participation in an Open Source project can teach you about software engineering (and social studies); how people equate "bash scripting" with "shell scripting" (see also: Unix? What Unix? This is Linux!); how various languages implement their own software packaging system (node npm, ruby gems, python eggs) and make consistent deployment even more difficult. But all of these are stories to be told another day.
When you write system tools, consider how they might fit into a coherent and consistent system. If it feels like a good fit, like a native tool, then I think you're in a pretty good spot. And creating such tools is ultimately much more satisfying than "whipping up a script".
July 16, 2012