GNU Options are Awesome

Generally speaking, when I write in C or some other language that uses command line flags or POSIX-style interfaces, I try to stay away from spec extensions. There are numerous reasons behind this, including but not limited to acceptance, portability and stability. However, there is one exception to this rule. I love GNU option parsing.

At the command line, there are two general ways to pass options to a program. The first, and arguably the simplest, is through single-character short options. These are used like so:

grep -lrE "(extended)* regex"

As you can see, a single dash marks the beginning of an option. This matters grammatically, because it lets the options parser trivially determine whether an argument is actually an option or simply a filename. For instance, grep should find no options at all in the following:

grep "something special" file

Another thing worth mentioning is how multiple options were combined into one argument for brevity. I could have written the first example as the following:

grep -l -r -E "(extended)* regex"

But that example contains a large amount of duplicated information. I do not have to add extra dashes to get my meaning across, so why should I? Instead, many short options can be combined into one. There is one exception: if a flag requires an argument, whatever text comes next is parsed as that argument, so such a flag has to come last in its group. Because of a quirk in how parsing works, the argument can either be attached directly to the flag or be separated from it by whitespace, with the same meaning:

sed -ibu 's/bad/good/g' file
sed -i bu 's/good/bad/g' file

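Note: The “-i” flag is a GNU extension, but it is widely adopted and, frankly, I don’t understand how the POSIX designers left it out. It’s practically a necessary feature. Be aware that GNU sed treats the backup suffix as an optional argument, which only works in the attached form (“-ibu”); the separated form shown above relies on the argument being required, as it is in BSD sed.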

That is all well and good but, at times, short flag names can hinder expressiveness. Take the following hypothetical example of a webserver startup. Which invocation is clearer to you?

httpserve -p80 -l0.0.0.0 -faB ./public
httpserve --port=80 --listen=0.0.0.0 --fork --append --bar ./public

In this instance, I think a balance has to be struck, and some of those options are perfectly good as short options. The fact remains that short options can severely hinder expressiveness and readability, particularly in long pipelines or shell scripts. What you just saw in the second command is one feature of the GNU libc implementation that I think is an absolute stroke of design genius.

The GNU designers wished to add a feature where an option could be written as a full word, rather than a single character. This introduces a certain amount of complexity into the C programming (unbounded strings are always a pain to deal with), but the addition is by no means severe. This sounds all well and good - so why didn’t they simply do what Go has done and implement flags like the following:

httpserve -port 80 -listen 0.0.0.0 -fork -append -bar ./public

The answer is that a parser which supports grouped short options cannot tell the argument “-port” apart from the separate flags “-p”, “-o”, “-r” and “-t”. In order to support both short and long options, as I think is a good idea, something had to distinguish a long option from a short one. So, what did they choose? They chose an already established identifier which had a very clear meaning that would not be confusing for the parser: “--”. If the character after the leading dash is another “-”, hand the argument over to the long option parser. Otherwise, parse it as regular short options. Simple!

Now, this does introduce a small problem. “--” is an already established identifier, as I mentioned earlier. You may not have known this, but placing “--” on a UNIX program’s command line forcibly terminates option parsing. This makes it possible to manipulate files or pass literal arguments which begin with a leading dash. Otherwise, there would be no straightforward way to delete a file which is simply called “-file” or even just “-f”. The correct way to do this is to use rm like the following:

rm -- -f

This will delete a file with the name “-f”, rather than place rm in force mode. The way that the GNU option parser decides whether this token ends parsing or begins a long option is simply whether anything follows it: a “--” that stands alone as its own argument (that is, one followed by whitespace on the command line) ends option parsing, while a “--” followed directly by more characters begins a long option. This means that long option names cannot begin with a space character, but - realistically - how likely was that to begin with? This is why I think that the choice of using the “--” token for this purpose was a stroke of genius.

Passing arguments to long options is a similar challenge. We obviously can’t pass them without any kind of delimiter, as the nature of a long option requires that the parser treat any non-whitespace characters after the “--” as part of the option name. The designers decided to instead use a simple equals sign ("=") as the delimiter. Failing that, plain whitespace works again. This is expressive, programmatic and declarative - which fits all the stated goals of introducing long options.

In summary, I believe that the presence of these long options is a stroke of design genius. Although long options are a tool to be used lightly and implemented sparingly, I still believe that they are a fantastic piece of design. Their usage, in my opinion, should be similar to that of the git command line tool, which exposes most things as long options and provides short options as shortcuts to them.

This is one of the few GNU extensions that I use in most of my programs and will continue using for a very long time.

Ethan Marshall

A programmer who, to preserve his sanity, took refuge in electrical engineering. What an idiot.


First Published 2022-05-22
