Tuesday, March 8, 2011

Documenting code - An alternative to Doxygen/Javadoc and others...

Documenting how things work is quite important, especially in environments where collaborative work is involved (such as software engineering) or when people who were not involved in the development later need to use the result.

One of the most popular ways to do so is using a tool called "Doxygen", which I believe was popularised by the C++ community, for whom such a tool is probably strictly necessary to get a grip on the codebases involved (and which also mandated the creation of UML and other such tools/methodologies which are highly regarded by software engineers). Another one, though tied to a single language, is "JavaDoc", Java's counterpart, whose comment style Doxygen largely shares.

However, while I'm not going to dispute the value of such documents (though IMO they should not be a crutch upon which bad code can be hung), I do take issue with some of the details of their workings, in particular, the syntax they use.

Doxygen Flaws - [1] Ugly Double-tag comments
First things first, Doxygen/Javadoc work by parsing the contents of comments in the source file. However, these are not ordinary comments... rather, they need to be started with a double-tag, for example in C:
/* This is a normal block of comments
 * which continue onto a second line
 */

/**
 * But this is an ugly Doxygen comment
 */

Having implemented a few parsers already, I can say that I understand why this implementation was chosen, though I believe the same could be achieved with a more intelligent implementation that reads ordinary comments.

---------------
Doxygen Flaws - [2] Needless markup and syntax
Secondly, there is a lot of IMO needless markup syntax, which perhaps is necessary for the parsers used there to make themselves useful. For example, the following toplevel tags are fundamentally braindead:
/** \file blender/editors/animation/keyframes_general.c
 *  \ingroup edanimation
 */
As you can see, this has been taken from a source file in the Blender source code, post-Doxygen-module-tagging. The \file tag should really not exist, just like the irritating $Id:$ keyword expansion tags that CVS popularised and drilled into the minds of some SVN refugees (for the record, I believe that these are more of a nuisance than useful, and I do not support the addition of this sort of tagging to files). This is information that should already be available to the documentation engine when it reads the file (how else does it see the stuff inside? Telepathic inference?! Wire tapping?!), which it should be able to easily make use of.

Admittedly, the \ingroup tag needs a bit more thought about what should replace it, though it's likely that the current approach is still too heavy-handed.

---------------
Doxygen Flaws - [3] Needless markup which creates visual confusion
Leading on from the above, sometimes the usage of markup tags (so that Doxygen realises that something it wants is present) actually adds a lot more visual noise, without contributing that much to readability. In fact, I'd go as far as saying that in some cases, it makes things worse.

Let's look at a few examples:
/**
 * \file my_source_file.c
 * \author Joshua Leung
 */
Even in this first example, the problem may not be that clear, but it's certainly clearer if/when we look at this without the tags being highlighted (as above). Looking at this, it's not totally clear where the tag ends, and the content starts.

Furthermore, the problems get much worse when we start using more complicated tags which take an argument (a name) followed by a description, as below:
/**
 * This is my fancy function. It does funky stuff.
 * 
 * \param a a string representing the fancy name we use  
 *        when nothing else works
 * \param theta a float that represents the spin of
 *        the earth
 * \return the magic constant for the world
 */
Take a good long look at those lines starting with \param.

Scanning those lines for useful information is tricky and error-prone. It looks ugly and confusing (e.g. "a a string" - WTF? a typo?!), as there isn't any visual way to break down the syntax into quickly readable units.

Furthermore, look at the number of keystrokes needed to type out that tyranny, eating away into "80 character" linespace (I don't particularly agree with those who religiously stick by it, dicing their code into tiny chains of staggered crimpets over 3-4 lines, though I agree that excessively long lines aren't that nice at times either). It should be blatantly obvious to any coder reading this comment in the code, with the function signature below it, that these entities are parameters. But what this syntax doesn't show is what exactly those parameters get up to...

Also, for languages such as Python, where variables do not have any explicit types, it can often be useful to be able to specify the datatypes of variables in a consistent way; something which this syntax makes difficult to find.

For the purpose that this syntax serves, it really doesn't do much service to those that the code should be read most by: humans. Code is ultimately written by us to be read by other humans later; machines just process the code as a mere necessity for making use of it.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There must be a better way!
...and there is. But before going any further, I'll also make a brief mention of yet another system: reStructuredText. Looking at this again now after quite some time, it does seem closer to many of the documentation formatting principles I've been developing (and will soon mention), and indeed there are some really nice aspects to the markup format it uses. However, my main gripe is that, after the good start they've made, it does not go far enough in specifying a good, usable standard format for code documentation - specifically, for documenting the inputs/outputs of functions.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
My Proposed Code Documenting Scheme
This is still very much a WIP scheme, with some holes left to fill (I'll fill them as I find the need). Having said that, I've been making extensive use of it in code that I've written over the past while, most notably throughout the source code for Duality SVN, though in other places too. Some day when I've formalised it enough, I'll probably get around to coding a proper parser for it to generate the fancy docs that people like to have.

Ok, so what are the rules?

Let's start with top-level stuff, such as the "author" tags we saw earlier. Here's an example:
/* This file contains gizmos for world domination
 * Author: Joshua Leung
 * Contributor(s): Joe Bloggs, Bob Blob
 */
Notable features of this syntax are:
1) Out with the superfluous info - the filename and path of the file don't need to be mentioned. That is the job of whatever engine ultimately parses the file to work out for itself, or for you to find out by just checking the titlebar/tabbar of your text editor
2) Tags are delimited in a fairly standard manner - using a colon of course. This creates a visual separation between the tags and the content. We can clearly see which is which, and using this formatting is already second-nature to many people anyways.
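
As a rough illustration of how little machinery the colon convention needs, here is a minimal Python sketch that pulls "Tag: value" pairs out of a header comment like the one above. The helper name parse_header_tags, and the rule that a tag is a run of letters (optionally ending in "(s)") directly before a colon, are assumptions made for this sketch and not part of any existing tool.

```python
import re

# Assumed tag shape: letters, optionally followed by "(s)", then a colon.
TAG_LINE = re.compile(r"^([A-Za-z]+(?:\(s\))?):\s*(.*)$")

def parse_header_tags(comment):
    """Split a C-style header comment into (summary text, dict of tags)."""
    summary, tags = [], {}
    for raw in comment.splitlines():
        # Strip the comment decoration: "/*", "*/" and the leading "*"
        line = raw.strip()
        line = re.sub(r"^/\*+", "", line)
        line = re.sub(r"\*+/$", "", line)
        line = line.lstrip("*").strip()
        if not line:
            continue
        match = TAG_LINE.match(line)
        if match:
            tags[match.group(1)] = match.group(2).strip()
        else:
            summary.append(line)
    return " ".join(summary), tags

header = """/* This file contains gizmos for world domination
 * Author: Joshua Leung
 * Contributor(s): Joe Bloggs, Bob Blob
 */"""

summary, tags = parse_header_tags(header)
print(summary)
print(tags)
```

Note that the filename never appears in the comment; a real tool would already know it from the path it opened.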

Now for another crucial example - documenting a function. We'll now re-express that example we used earlier:

/* This is my fancy function. It does funky stuff.
 * 
 * < a: (string) the fancy name we use  
 *       when nothing else works
 * < theta: (float) the spin of the earth
 * > returns: (float) the magic constant for the world
 */
Let's dissect this a bit:
1) Out with the verbose tags - this is a lean+mean format, so we can just use a compact form for those commonly used aspects. In place of these, we get straight on with the business of describing and documenting our parameters, by adding some additional information about what they're used for (more on this later)
2) Parameter names are clearly separated from the description text by a colon. Do not underestimate the power of having clearly delimited text to read... it makes it much easier to see what you're looking for
3) Types are clearly indicated in a consistent way (in round brackets, like this little bit of text). This is mainly useful for languages where types aren't specified, such as Python or C-preprocessor macros.
4) Description text can clearly be seen - Only parameters which do anything which may be hard to guess really need to be documented with description text. If you're just going to end up writing generic robotic blabber (i.e. "value to store"), then skip it... any competent coder will understand
5) The "returns" label is just used to indicate that the output is the function's returned value and not just another pointer parameter (see later). I've highlighted this in the snippet above, since this is one of those aspects that I'm still debating about a bit - without it, things look a bit odd, but then again it is a sign of verbosity that we don't really need. Anyways, it should always be the last line in the comment block.
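
The layout described above is regular enough that a single pattern can pick it apart. The Python sketch below is only an illustration under my own assumptions: the names parse_params and PARAM_LINE are invented, the direction symbols are mapped to "in"/"out"/"inout" labels of my choosing, and any line that does not start with a symbol is treated as a continuation of the previous description.

```python
import re

# symbol, name, optional "(type)", then free description text
PARAM_LINE = re.compile(r"^(<>|<|>)\s*(\w+):\s*(?:\(([^)]*)\)\s*)?(.*)$")
DIRECTIONS = {"<": "in", ">": "out", "<>": "inout"}

def parse_params(comment):
    """Return one dict per documented parameter in a block comment."""
    params = []
    for raw in comment.splitlines():
        # Drop the "/*", " * " and "*/" decoration around the text
        line = raw.strip().lstrip("/*").rstrip("*/").strip()
        match = PARAM_LINE.match(line)
        if match:
            symbol, name, type_name, text = match.groups()
            params.append({
                "direction": DIRECTIONS[symbol],
                "name": name,
                "type": type_name,
                "description": text.strip(),
            })
        elif line and params:
            # Continuation line: belongs to the previous parameter
            params[-1]["description"] += " " + line
    return params

doc = """/* This is my fancy function. It does funky stuff.
 *
 * < a: (string) the fancy name we use
 *       when nothing else works
 * < theta: (float) the spin of the earth
 * > returns: (float) the magic constant for the world
 */"""

for param in parse_params(doc):
    print(param)
```

The colon and the round brackets do all the work here, which is the same property that makes the lines quick for a human to scan.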

Now, for a little discussion of those little symbols before the parameter names.
"<" - This signifies an "input" parameter, just like with input/output redirection in the shell. An alternative way of remembering this, is to think that this arrow points into the "hose" (i.e. the starting characters on the lines of the block comment forming a chain like a hose), meaning that this variable pumps something in.
">" - This signifies an "output" parameter. This is just like the opposite of the previous entry, as (using that analogy) the arrow is leaving the "hose", so it represents something going out.

In C (or C++, but we don't speak of that disgrace here), pointers provide some interesting capabilities, but also some challenges, since we may not be able to immediately see from a function definition what may happen with such parameters. Never fear... this notation helps remove any ambiguity.
"<" - Can be used to signify that some pointer is only used to provide some data as input to the function. That is, the data pointed to won't get modified to store some of the outputs of the function.
">" - Can be used to specify that some pointer is used to allow multiple outputs from the function in addition to the function's return values
"<>" - Both symbols can be used to mean that a parameter supplies input to the function when it is first read, and then later it is where output gets piped out again for some reason or another. In general, this is probably not such good practice for simple variables, though it may be more commonplace + acceptable for pointers to structs which get modified as well as containing some variables which affect the result.

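The same markers carry over to languages without pointers, where a mutable argument plays the role of the pointer. As a hypothetical sketch in Python (the function filter_readings and all of its parameters are invented purely for illustration), the comment makes it plain which arguments are only read, which are filled in, and which are both:

```python
# Split readings into those within range and those outside it.
#
# < readings: (list of float) raw values to check
# < limit: (float) largest absolute value still considered valid
# > rejected: (list of float) receives the out-of-range values
# <> stats: (dict) running counters; "seen" is read, then updated
# > returns: (list of float) the readings within range
def filter_readings(readings, limit, rejected, stats):
    accepted = []
    for value in readings:
        if abs(value) <= limit:
            accepted.append(value)
        else:
            rejected.append(value)  # output through a mutable argument
    stats["seen"] = stats.get("seen", 0) + len(readings)
    return accepted

rejected = []
stats = {"seen": 10}
accepted = filter_readings([0.5, 3.0, -0.2, -7.5], 1.0, rejected, stats)
print(accepted, rejected, stats)
```

Since Python signatures carry no types, the bracketed type notes are doing real work here rather than repeating what the code already says.
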
Let's see these things in practice then:
/* A silly contrived example
 * ! This may modify the generator state
 *
 * <> rng: (GeneratorState) 
 * > s1: (BigNum) part of seed vector
 * > s2: (BigNum) part of seed vector
 * > s3: (BigNum) part of seed vector
 * < (offs): (BigNum) offset to generator state after reading 
 * < (errBuff): (ErrorBuffer) junk basket
 * > returns: (bool) whether a seed was successfully retrieved
 */
bool RNG_getSeedVector(GeneratorState *rng, BigNum *s1, BigNum *s2, BigNum *s3, BigNum offs=0, ErrorBuffer *errBuff=NULL)
{
    if (rng == NULL) return false;
    *s1 = rng->buf[0];
    *s2 = rng->buf[1];
    *s3 = rng->buf[2];
    if (offs) RNG_setOffset(rng, offs, errBuff);
    return true;
}
This example is really junk-code, but does show some more interesting aspects to this documenting system that I wanted to point out.
1) "!" symbol can be used on a line immediately following the initial description of a function to specify some cautionary notes that should be heeded (e.g. things that should be done before calling the function, or warnings about side-effects).
2) Bracketing parameter names with "(...)" basically means that that parameter is an "optional" parameter - you can pass a value if you want, but you don't have to.
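
Both markers are easy to detect mechanically. The Python sketch below assumes the comment decoration has already been stripped from each line; the helper name summarise_block and the shape of the returned dict are my own inventions for illustration, and the "returns" line is deliberately left out of the parameter lists.

```python
import re

OPTIONAL = re.compile(r"^(<>|<|>)\s*\((\w+)\):")  # e.g. "< (offs): ..."
REQUIRED = re.compile(r"^(<>|<|>)\s*(\w+):")      # e.g. "> s1: ..."

def summarise_block(lines):
    """Collect "!" warnings plus required and optional parameter names."""
    summary = {"warnings": [], "required": [], "optional": []}
    for line in lines:
        line = line.strip()
        if line.startswith("!"):
            summary["warnings"].append(line[1:].strip())
            continue
        optional = OPTIONAL.match(line)
        required = REQUIRED.match(line)
        if optional:
            summary["optional"].append(optional.group(2))
        elif required and required.group(2) != "returns":
            summary["required"].append(required.group(2))
    return summary

block = [
    "A silly contrived example",
    "! This may modify the generator state",
    "<> rng: (GeneratorState)",
    "> s1: (BigNum) part of seed vector",
    "< (offs): (BigNum) offset to generator state after reading",
    "< (errBuff): (ErrorBuffer) junk basket",
    "> returns: (bool) whether a seed was successfully retrieved",
]

print(summarise_block(block))
```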

Future extensions/details to be fleshed out:
- Relational links/tags, such as labelling functions as being "helper" functions for such-and-such routine.
- Labelling which interface some functions are implementing (for OO-stuff), which is useful for identifying what isn't being overridden :)
- Imposing a formalised system of decorating chunks of code so that ordering of code within a file matters (related code is placed together), such that this can feature in the generated docs
- Actually implementing an intelligent parser for this that works on C-like and Python code (and/or is flexible for extension to other languages), and is able to generate HTML doc pages from this for browsing later. A key ability of this parser would be automatically inferring necessary information from the code instead of having this needing to be explicitly spelled out in doc comments using tags.
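
For Python code at least, part of that inference is already within reach: the standard library's inspect module exposes parameter names and default values directly. A minimal sketch, assuming a hypothetical infer_params helper and an invented example function, of how optional parameters could be bracketed automatically rather than marked by hand:

```python
import inspect

def infer_params(func):
    """Return (name, is_optional) pairs read straight from the signature."""
    result = []
    for name, param in inspect.signature(func).parameters.items():
        # A parameter with a default value can be omitted by the caller
        is_optional = param.default is not inspect.Parameter.empty
        result.append((name, is_optional))
    return result

# Invented example, loosely mirroring the seed-vector function above
def get_seed_vector(rng, offs=0, err_buff=None):
    return rng is not None

for name, is_optional in infer_params(get_seed_vector):
    # Optional parameters get the "(name)" bracketing from the scheme
    print("(%s)" % name if is_optional else name)
```

Anything recovered this way is one less thing the doc comment has to spell out, which leaves the comment free for the parts only a human can supply: direction and intent.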

1 comment:

  1. I like the proposal on in/out params. I really feel like doxygen is just bogus distraction to coding, although I like the idea that it's trying to achieve. Thanks for the article!
