C SOURCE CODE STYLE, MAINTAINABILITY, AND SOFTWARE METRICS

90% of all expenditures on software during its life cycle involve software 
maintenance costs.  This is true because business requirements change. For 
example, changes in data and amounts of data cause unforeseen bugs. Or just 
plain performance bogging.  Companies love to reduce the overhead 
associated with source code change. If you don't believe me, check out how 
much contract maintenance is out-sourced to countries with really low pay 
scales. Low pay equals reduced maintenance cost.

The other side of this coin: Programmers love clever code, especially in 
the name of optimization.  Plus, programmers view non-programmers as 
absolutely incompetent to comment on any aspect of code, clever or not. 

For us as programmers, this means that if we are creating brand-new code, 
it is almost guaranteed that someone else will have to change the code 
later on.  The programmer changing the code will have to understand it 
first or risk making a bad fix. This article looks at things that make code 
easy to read, understand, and maintain, plus some other things that do just 
the opposite.

Please note:

Although this short article presents material about coding style, it is not 
meant to be a dogmatic statement of how to write C source.  You will find 
the words "rule of thumb", not "you must".  A rule of thumb is a rough 
guideline.

Importantly, this article is not about optimization. Programmers get so 
very wrapped around the axle on performance, that they worry about things 
they think might possibly be a concern.  Before they are a concern. 
Compilers produce the most efficient machine code when they work on simple, 
straightforward source code. Some compilers, for example the C compiler on 
HPUX 11.0, actually set optimization levels down when encountering really 
cute code. 

Back to the concepts at hand.

Because businesses are vitally interested in paying for maintainable code,
this puts a constraint on anyone working on that code, from creator to
maintainer. For C source code development, composing code in a way that
allows the Next Programmer to easily comprehend your code is a real
requirement.  Not imaginary.  Real world shops find ways to get rid of
coders who routinely play cowboy programmer and write deliberately obscure
code.

Researchers have been working since the 1970s to determine the factors
that help and harm programmer comprehension of all types of source
code (Zuse et al, 1989).

In order to come up with a standardized way of dealing with the problem of 
comprehending source code and changing it safely, investigators have 
developed the discipline of software metrics.  Halstead (later summarized 
into Halstead 1977) started the whole idea of software metrics based on 
analyzing static code.  Following that, McCabe (McCabe 1976) developed a 
software metric called the cyclomatic complexity index that has turned out 
to have very real correlation to maintainability of code.

You can download a free software metrics package that generates a lot of
different metric values at:

http://www.lysator.liu.se/c/metre-v2-3.html (Metre v2.3).

Or visit

http://www.mccabe.com

if you want to read about commercial software that generates metric values. 
The IEEE has published a whole handbook on the subject of Software Quality 
Metrics. (IEEE 1998)

Managers act aggressively on the results of standardized metrics testing. 
Largely, because they know it has been proven to save money. A lot of other 
valid software metrics are out there, but let's limit the scope to McCabe's 
metric to keep this part of the discussion short.

The basic idea of the cyclomatic complexity metric centers around a coder 
being able to fully comprehend the primary source code, the data models, 
and control flow - the very stuff they are about to change.  Cyclomatic 
complexity values usually range from a low of around 5 to above 50. The 
index may be reported for a module, a function, or a whole suite of 
applications.

From T. Capers Jones' book "Applied Software Measurement", 1979:

        "Empirical studies reveal that programs with cyclomatic
        complexities of less than 5 are generally considered simple and
        easy to understand. Cyclomatic complexities of 10 or less are
        considered not too difficult. When the cyclomatic complexity is
        more than 20, the complexity is perceived as high. When the
        McCabe number exceeds 50, the software for practical purposes
        becomes untestable."

The last sentence is one to note.  It means that we can change code with a
high complexity value, but it is not possible to prove we fixed it.  Nor
can we prove that we fixed one thing and did not break another.

Interpreting the cyclomatic index in more practical terms:

                  Cyclomatic         Probability
                  Complexity         of Bad Fix (%)
                  ---------------------------------
                  less than 10        5
                  20-30              20
                  greater than 50    40
                  approaching 100    60

Is this complexity index the end-all and be-all in software development? 
No. It just seems to work fairly well.  Also, it relates to the current 
topic because it attempts to determine whether or not the code can be fully 
understood and maintained by humans.
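
As a rough illustration of where the number comes from: for a single
function, the index can be counted as the number of decision points plus
one. The function below is made up for this article (rate_band() and its
thresholds are assumptions, not from any real system); the comments count
the decisions.

```c
/* Hypothetical example: classify an interest rate into a band.
 * Decision points: if (1), else if (2), else if (3).
 * Cyclomatic complexity = 3 decisions + 1 = 4.
 */
int rate_band(double rate)
{
    int band = 0;              /* 0 = invalid rate */

    if (rate < 0.0)            /* decision 1 */
    {
        band = 0;
    }
    else if (rate < 5.0)       /* decision 2 */
    {
        band = 1;
    }
    else if (rate < 10.0)      /* decision 3 */
    {
        band = 2;
    }
    else
    {
        band = 3;
    }
    return band;
}
```

Some tools also count each && and || as a decision, so the value reported
for the same function can differ from one metrics package to another.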

Most style-related factors (these all affect source code comprehension) can
be grouped like this:

1. CONSISTENT GOOD CODE LAYOUT
2. USEFUL COMMENTS
3. HIGHLY READABLE AND MEANINGFUL VARIABLE NAMES
4. CONTROL FLOW THAT IS "UNDER CONTROL"
5. FUNCTION LINE COUNT IS "FUNCTIONAL"

The cyclomatic complexity index is derived to some degree from all of these
areas. That should come as no surprise.


CONSISTENT GOOD CODE LAYOUT

Often, articles about C source code style will cite the prestigious
"Recommended C Style and Coding Standards", which coders know as the
"Indian Hill Guide". This guide is an attempt to bring style suggestions
together in one place. The most recent white paper is here:

http://www.psgd.org/paul/docs/cstyle/cstyle.htm

You may well ask, "Why bother talking about style if it's already a done
deal?"  The point we need to go over is not that the paper exists, nor that
you as a coder must adhere to every word.  The real message is how to
take away a reasonable viewpoint about code styles and their impact on
development and maintenance.  As we have seen, there are serious reasons to
consider style as important as correctness, because style ultimately
affects the correctness of code as modified by maintenance coders.

First, you must comply with style standards as managers choose to inflict
them on staff.  With a little research, perhaps you can educate dogmatic,
uninformed management about poorly chosen styles and their effect on the
bottom line. Show them they can save money on maintenance with a better
approach.

Secondly, there WILL BE a next programmer.  That next programmer could even
be yourself several years later.  If you decide to use an inconsistent,
obscure style it will find ways to come back to haunt you.

A quick story is in order here. We had a coder who was a Unix shell script
guru, i.e., a cowboy programmer.  And he tried to keep his guru status by
writing needlessly hard to understand scripts.  When he came back for a
stint as a consultant with us, he was quickly hoist on his own petard.  He
did not stay long because he could not decipher his own code.

Let's consider style and formatting.

The really bad part about all this layout stuff is dogma. Dogma means an 
unchanging belief. There are coders who have inflexible views about code 
layout or have been coding for so long that changing layouts on them causes 
them to have decreased comprehension (Sheil 1981).  They spend more time 
fussing to themselves about the bracket placement than reading, it seems. 
Let's sidestep the whole dogma problem for a while.

First and foremost: setting off subordinate code blocks with consistent
indentation is required for readability (Miara et al, 1983). This applies
no matter what style you adopt.  Or what language you code.

The space count or tab setting to indent all blocks should always be the 
same.  Don't use four spaces in one place and two in another. Research 
indicates that the highest programmer comprehension is in code with three-
space indentation, very closely followed by four-space indentation. With 
zero-space indentation or with six-space or larger, comprehension drops 
dramatically (Miara et al, 1983).

Here is an example with three-space indentation, using a begin-end pair
block:

if(XXXX)
   {             // { is the begin block marker
   XXXXXXXXX
   if(XXXXX)
      {
      XXXXXX
      }
   }             // } is the end block marker


A rule of thumb:  if you have indented your module so many times that your
code line is bumping tails with the right screen margin, you have a serious
problem with control flow. Or even program design. Refactor your code.  Do
not reduce the space count in the indentations.
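
A minimal sketch of that kind of refactoring follows. The loan_t record
and both function names are invented for this example. The first version
marches toward the right margin; the second makes the same checks but
rejects bad input early, so the code stays near the left margin.

```c
#include <stddef.h>

/* Hypothetical record type, invented for this sketch. */
typedef struct
{
    const char *customer;   /* customer name, may be NULL */
    double      amount;     /* loan amount                */
    int         term;       /* term in months             */
} loan_t;

/* Deeply nested: each test pushes the real work further right. */
int loan_ok_nested(const loan_t *loan)
{
    int retval = 0;            /* 1 = record is usable */

    if (loan != NULL)
    {
        if (loan->customer != NULL)
        {
            if (loan->amount > 0.0)
            {
                if (loan->term > 0)
                {
                    retval = 1;
                }
            }
        }
    }
    return retval;
}

/* Refactored: same tests, one level of indentation. */
int loan_ok_flat(const loan_t *loan)
{
    if (loan == NULL)
    {
        return 0;
    }
    if (loan->customer == NULL)
    {
        return 0;
    }
    if (!(loan->amount > 0.0))     /* same test as above, negated */
    {
        return 0;
    }
    if (loan->term <= 0)
    {
        return 0;
    }
    return 1;
}
```

Whether early returns are acceptable is a shop decision. Another option is
to move the inner tests into a helper function and call it.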

Blank lines between functions, data declarations, and some other code 
elements improve readability as well. According to research, somewhere in 
the range of 8%-15% blank lines leads to the best comprehension.  More 
than 16% blank lines reduces comprehension significantly (Gorla et al, 
1990).

Code line lengths greater than 80 characters also reduce comprehension. 
(Gorla et al, 1990) You could try 75 characters per line if an 80 character 
line offends you, because slightly shorter line lengths work just as well. 
It's just going beyond 80 that is the issue. The researchers did 
evaluations not on screens but on paper.

Honor the 80th! Or the 75th. This means set a right margin of say, 80 
characters, and never go past it. It is possible to have more than 80 
characters in a single line of reasonable code.  Wrap it so it can be read 
easily.  Follow indentation rules on the wrapped lines.

Example:

Bad:

printf("%-20s %-20s %-20s %-10s %03d\n","Customer","Location","Settlement...

Better (note this is four-space indentation):

printf("%-20s %-20s %-20s %-10s %03d\n",
    "Customer",
    "Location",
    "Settlement Values",
    "Last Est",
    Page_number);

Also better:

char *report_line_format="%-20s %-20s %-20s %-10s %03d\n";

printf(report_line_format,
    "Customer",
    "Location",
    "Settlement Values",
    "Last Est",
    Page_number);


Programmers often give names to source code format styles, like "K&R", or 
block indent. Feel free to use these names to impress other people. 
Those labels do not mean much because they are not universal. So for now, 
let's stick with these more standard terms:

pure block layout
emulated block layout
begin-end pair block layout


In C source code, the block identifiers are the open and close braces: { }.
You already knew that.

Below you will find the three goals that good layout should have. They are 
in their order of importance. All three contribute to improved 
comprehension. The nice part about this is that when we follow #1, then #2 
follows almost automatically, as does #3. (McConnell 2004)

1. THE LAYOUT SHOULD ACCURATELY SHOW THE LOGICAL STRUCTURE OF CODE.

2. THE LAYOUT SHOULD CONSISTENTLY SHOW THE LOGICAL STRUCTURE OF THE CODE.

3. THE LAYOUT SHOULD IMPROVE READABILITY AND COMPREHENSION.
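
A small sketch of goal #1, using an invented function name. The misleading
version is shown inside a comment so it stays out of the build; the second
version lays the code out the way the compiler actually sees it.

```c
/* Misleading layout: the indentation claims both statements belong
 * to the if, but only the first one does.
 *
 *     if (flag)
 *         count++;
 *         count++;
 */

/* Accurate layout: braces and indentation agree with the logic. */
int count_accurate(int flag)
{
    int count = 0;             /* number of increments made */

    if (flag)
    {
        count++;               /* runs only when flag is set */
    }
    count++;                   /* always runs */
    return count;
}
```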

This is a pure block layout example, one with a block start and block end.
It has four-space tabs. Languages like PL/SQL and Visual Basic work well
with pure blocks because they have built in block start and block end
lexemes like BEGIN, END IF, END LOOP.

BEGIN
    XXXXXXXXXXXXXXX;
    XXXXXXXXXXXXXXX;
END;

We can have the open brace as an extension of the block beginning.  The 
close brace is the end block marker and set off by itself.  Note the "}" 
end block marker isn't as clear in a readability sense as END IF. END IF is 
very clear, because it says what type of block it ends.  The "}" character 
can be the end of all kinds of blocks, from anonymous to function blocks. 
Here we emulate pure blocks in C:

if(){
    XXXXXXXXXXXXXX;
    XXXXXXXXXXXXXX;
}


Here is the begin-end pair layout, a C example:

if()
    {
    XXXXXXXXXXXXXX;
    XXXXXXXXXXXXXX;
    }

This modified begin-end pair layout is what you will often see in GNU source,
which is like the layout espoused by the Indian Hill Guide:

if()
{
    XXXXXXXXXXXXX;
    XXXXXXXXXXXXX;
}

Notice all of these layouts use consistent white space to set off logical
blocks.

The most important idea: research your style, find what works for your
shop, define it clearly, then STICK TO IT.  Most of all stick to it.

If you feel that using particular elements from different layout styles is
good, then try it. If you're in a big shop, meet with other coders in your
shop. See what they think.  It's better to have buy-in ahead of time,
otherwise you personally risk the stigma of "inflicting" style guidelines
on programmers. Even if a manager were to make the decision.

CODE COMMENTS

Comments are not a matter of personal preference.  Or dogma.  Useful 
comments have a positive impact on maintaining code.  The McCabe metric is 
influenced by a Halstead metric that detects comments.

Let's consider something like this (from GNU source):

/*
 * My personal strstr() implementation that beats most other algorithms.
 * Until someone tells me otherwise, I assume that this is the
 * fastest implementation of strstr() in C.
 * I deliberately chose not to comment it.  You should have at least
 * as much fun trying to understand it, as I had to write it :-).
 */

My opinion is that this bit of text explains the real value of comments far
better than anything else ever could.

As a small digression, this bit of comment-ary also brings to light a basic
problem that coders have: they feel that the code they develop is personal
property.  Others must not touch.  Obviously this is not true, but how many
times have you heard  "Your code broke ...."?  If you want to work happily
in the IT world, you will need to drop the "my code, your code" outlook.

Research indicates that comments do, in fact, increase programmer 
comprehension.  Rather than cite research papers, let's consider a code 
fragment. Give yourself about 15 seconds to look at it.  No cheating, now. 
In general, the folks who research these kinds of things feel that a 
competent programmer should be able to fully comprehend four or five lines 
of C code in a few seconds. Test yourself: What does the code do? How long 
did it take you to get it?  There is no way you can comprehend this in 
20 ms, so don't try to claim you did.

#include <string.h>

char *func(char *s){
    char *p=s,sw,*q=strchr(s,'\0');
    if (*p){
        while (--q > p){
            sw = *q;
            *q = *p;
            *p = sw;
            ++p;
        }
    }
    return s;
}

Okay.  Now if we refactor a little, pick a little better function name, and
add comments:

#include <string.h>

/*******************************************************
* str_rev()
* return a reversed string
* swap characters until we hit the middle of the string
* argument char *source - string to reverse
* original string is lost
********************************************************/

char *str_rev(char *source){
    char *p=source;                /* start of source string */
    char *q =strchr(source,'\0');  /* end of source string   */
    char swap;                     /* temp storage           */
    if (*source)
    {
        while (--q > p)            /* decrement q first, then compare */
        {
            swap = *q;
            *q = *p;
            *p = swap;
            ++p;
        }
    }
    return source;
}

Which one is clearer?  It's a no-brainer to conclude that the second 
example explains what the code does, while you have to work through the 
first example yourself to understand.  And work through it again 
ten months later when the code comes around again.

Please notice that not every line has a comment.  Multiline comments are
above the function, not sprinkled around in code.  Wall-to-wall comments
are not helpful.

Let's look at how the comments at the top are set up.

There is a one line description of what the function does, followed by a
statement about how the function works. You may need to comment arguments
as well. You should also add a one-liner describing side effects if there
are any.  And, for a utility function that may have lots of arguments or
return complex values, a sample usage section is a great idea. You may also
be required to document changes to a function. If so, put the reasons for the
change up here, too.

Do all of these kinds of comments need to be there all the time for every
function? No.

The one line descriptive sentence always wants to be there, along with
other things as needed.  If you cannot write a short sentence that
correctly describes what the function does you are in serious trouble,
design wise. A reasonable function should not be so convoluted that it
requires a paragraph to describe what it does.  If it uses a bizarre
algorithm or it is hard to decipher because of optimization changes, then
the description of the algorithm might be justifiably long-winded. But what
the function itself does should not be a "War and Peace" sized
discussion, like this paragraph is starting to become.

Comments in the body of the function are usually limited to variable
declarations or steps you think are important or are unclear.  Place the
comment right on the same line as the code you are talking about.  Long
dissertations belong elsewhere, mostly at the top of the function.

The point of this little exercise is not to prove how fast you personally
can read code in decipher-this-code mode.  Rather, it is how comments make
it much easier to figure out a block of code.  And be refreshed faster when
you re-read the same code sixteen months later.

On the other hand, bad comments can really reduce readability, particularly 
in the body of a function. Excessive or pointless comments actually get in 
the way. I chose asm, not C, for this example. Did you ever see asm code 
like this?

MOV AL, BL      ; move BL to AL

This is pointless commenting, because if the programmer cannot read the
line and know all there is to know about MOV, then the coder should switch
to a language he does know.  Comments are not in production code to
instruct wannabe programmers.

On the other hand:

MOV AL, BL      ; move BL to AL to start processing string section $TEXT

This is useful because it says why we are doing the MOV.
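
The same contrast in C, as a minimal sketch. count_pages() and
LINES_PER_PAGE are invented for this example.

```c
/* Hypothetical helper, invented for this sketch. */
#define LINES_PER_PAGE 60

int count_pages(int line_count)
{
    int pages = 0;             /* pages needed for the report */

    /* Pointless: repeats what the statement already says. */
    pages = line_count / LINES_PER_PAGE;   /* divide line_count by 60 */

    /* Useful: says why the statement is there. */
    if (line_count % LINES_PER_PAGE != 0)
    {
        pages++;               /* a partial last page still prints */
    }
    return pages;
}
```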

Continuing the line of thought about comments getting in the way, let's
look at comments used to deactivate code. Sigh. 

Some shops require that any code cut from a source file be commented
out, not removed from the source. In compliance with this
directive, programmers sometimes use preprocessor statements to do the same
thing.  The preprocessor trick is just programmer conceit, as I see it. C
code with a bunch of #if 0 directives becomes even trickier to read.

Why trickier?  The #endif statement can close any conditional directive 
(#if, #ifdef, #ifndef), not just the #if 0.  Therefore, it becomes 
unnecessarily hard to tell when you have read past the end of a big chunk 
of dead code, because your end delimiter and your comprehension depend 
solely on #endif.

On the other hand, the */ lexeme is used for end of comment, not a range of 
things like #endif.  So end of comment does not immediately have the same 
potential for mass confusion.  But it doesn't help "set off" all the lines 
of dead code in between the start and end comment markers.

Someone skimming code fast may miss the end of either one of the code 
block removal methods mentioned above. Whichever one you choose, above 
all, you want to be consistent. Don't mix and match "#if 0 #endif" blocks 
with commented out blocks.  Consider not using "#if 0 #endif" at all.

If your boss requires you to remove C source using directives, okay.
You have to do it that way.  But be sure to set it off so it can be seen as
removed.  See an example below.  Otherwise, just use comments to get rid of
dead code if you are required to do so.  And set them off too, to improve
readability.

What is setting off?  It is the addition of line-by-line visual clues. The 
issue with both of these code removal methods is that unless there is a 
flag ON EVERY LINE REMOVED to indicate it was removed, either method has 
the potential drawback of seriously degrading comprehension. Because it can 
be hard to see what ends where. You can run a utility such as stripcom 
to strip endless and pointless comments from C source and leave it in a 
more readable form. stripcom is part of the Metre v2.3 download. Be aware 
it removes useful comments, too.

When you look at these examples below, imagine that 40% of the lines in a 
single 150 line function are commented out. To get the next line of live 
code there are so many intervening comment blocks you have to skip over a 
whole screen of junk just to see the next real line of code. When single 
code blocks span several screens or pages, coder comprehension goes down. 
When you're doing that kind of skimming you need those extra visual clues 
to help find the next "real" line. Adding something to flag each internal 
inactive code line is helpful. It's a kind of comment as well.  It says 
"ignore me". Some coders use "**" in the left margin. Or some other 
distinctive combination of characters that won't confuse the compiler.
Or another coder.

Bad:


#if 0
for(i=0;i<10;i++)
{
   printf("Rate %2d: %10s\n",i,rate_name[i]);
}
#endif

A little better:


#if 0
** for(i=0;i<10;i++)
** {
**    printf("Rate %2d: %10s\n",i,rate_name[i]);
** }
#endif

Not so great, either:

/* comment out jmc 12/17/2003
for(i=0;i<10;i++)
{
   printf("Rate %2d: %10s\n",i,rate_name[i]);
}
* commented out by jmc */

Better, with begin-end visual clues:

/* commented out ************** jmc 12/17/2003
** for(i=0;i<10;i++)
** {
**    printf("Rate %2d: %10s\n",i,rate_name[i]);
** }
******************************* jmc 12/17/2003 */

Any trick you can use to make each and every commented-out line visible 
will help.  Be consistent. Having extended visual clues, the extra "***", 
for the start and end of a block like this is not a bad choice either, as 
in the last example.
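
If you are stuck with #if 0, one more clue that may help is a comment on
both the #if 0 and the #endif, so the reader can tell which #endif closes
the dead block. This is only a sketch; print_rates() and the rate_name
data are invented so the example compiles.

```c
#include <stdio.h>

/* Hypothetical data, invented so the sketch compiles. */
static const char *rate_name[3] = { "Prime", "Fixed", "Variable" };

int print_rates(void)
{
    int i = 0;                 /* loop counter            */
    int printed = 0;           /* live lines printed      */

#if 0   /* REMOVED ************** jmc 12/17/2003 */
**  for(i=0;i<3;i++)
**  {
**     printf("Rate %2d: %10s\n",i,rate_name[i]);
**  }
#endif  /* REMOVED ************** jmc 12/17/2003 */

    for (i = 0; i < 3; i++)
    {
        printf("%s\n", rate_name[i]);
        printed++;
    }
    return printed;
}
```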

If you are not required to leave the dead code in situ, consider flagging
the spot where the dead code used to live:

/* code removed jmc 02/03/2001 see SNIP 13 */

Then give the code a decent burial. Create a code cemetery beneath all the 
live code. Cut and paste the dead code down into the cemetery and 
give it a tombstone. Like this:

/********************************
*
*  Code cemetery
*
**********************************/

/* SNIP 13 from function derive_the_loan_total() */

/* commented out ************** jmc 02/03/2001
** for(i=0;i<10;i++)
** {
**    printf("Rate %2d: %10s\n",i,rate_name[i]);
** }
******************************* jmc 02/03/2001 */

This option requires cutting and pasting, but leaves the live code free 
from clutter while meeting the requirement: comment out the old code.

And then there are pointlessly cute requirements for comments, like making
neat boxes out of them.  All this accomplishes is to add extra keystrokes
for every programmer who touches the code.  Lining up the comment "*/" on
the right margin is a pain in the butt, unless your editor has box
operations, like UltraEdit.  Doing it in some editors can be painful.  Or
cause yet another trip to the high-caffeine beverage dispenser.

Pointless, but cute:

/******************************************************/
/*                                                    */
/*   This is a nice way to waste programmer time      */
/*   creating cutesy comments.                        */
/*    If you don't believe me cut and paste this box  */
/*    and then try to change all the text inside      */
/*    to a different length. Count your keystrokes.   */
/*                                                    */
/******************************************************/

This cute comment style does not detract from readability, but it does 
increase coder griping.  If you need grumpier coders in your shop then 
enforce it, by all means.

Same effect with less formatting effort:

/********************************************************
*
*  If you have to have cute boxes, try this format
*  instead.  It's a lot easier to edit and reformat
*  Try the same kind of reformat you did above.
*  Count your keystrokes.  This box wins!
*
********************************************************/

Misuse of comments is a big problem in some shops.  Don't let yours be one
of them.

Back to useful comments.

When we refer to useful comments, we mean comments that add to reading
comprehension.  We don't want comments that are just the opposite, like
the very first strstr() example in this section. Included in the same hate
list are some of the junk comments discussed above.

Comments should definitely appear in C header files.  And they are 
desirable, believe it or not.  Consider using the same rule set you have 
defined for comments everywhere, in all modules and header files.  Header 
files are code, in spite of what a few coders seem to think, i.e., that 
comments never go in header files.
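
As a sketch of what that can look like, here is a hypothetical header for
the str_rev() function from earlier in this section, followed by the
module it describes. The file names and the include guard name are
assumptions made for the example.

```c
/* ---- str_rev.h (hypothetical header) ---- */
#ifndef STR_REV_H
#define STR_REV_H

/*******************************************************
* str_rev()
* reverse a string in place and return it
* argument char *source - string to reverse, must be writable
* side effect: original contents of source are lost
********************************************************/
char *str_rev(char *source);

#endif /* STR_REV_H */

/* ---- str_rev.c (the module the header describes) ---- */
#include <string.h>

char *str_rev(char *source)
{
    char *p = source;                 /* start of source string */
    char *q = strchr(source, '\0');   /* end of source string   */
    char swap;                        /* temp storage           */

    if (*source)
    {
        while (--q > p)               /* decrement q first, then compare */
        {
            swap = *q;
            *q = *p;
            *p = swap;
            ++p;
        }
    }
    return source;
}
```

The coder who only ever opens the header still learns that the argument is
overwritten, which is the one thing a caller has to know.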

USEFUL MODULE COMMENTS ARE MOSTLY BOOKKEEPING

1. Simple change history information. Place detail above the function.
2. Version control information (also see ident section below)

USEFUL COMMENTS PROVIDE THE FOLLOWING INFORMATION IN A BLOCK
IMMEDIATELY ABOVE THE FUNCTION:

1. First sentence: what the function does
2. What arguments, if any, the function takes
3. What it returns, if anything
4. Important side effects, if any
5. Sample usage, maybe
6. If required, the reason for each change
   in the function goes here, along with coder name and date.

How you choose to format your comments is up to you, but it must be
consistent.
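
Here is one way a block with all six items might look. clamp_int() is a
made-up function, and the layout of the block is only a suggestion.

```c
/*******************************************************
* clamp_int()
* force a value into the range low..high
* arguments:
*    value - number to test
*    low   - smallest value allowed
*    high  - largest value allowed
* returns: value, or the nearest limit when value is out of range
* side effects: none
* usage:
*    page = clamp_int(page, 1, 999);
* changes:
*    (none yet; coder name, date and reason would go here)
********************************************************/
int clamp_int(int value, int low, int high)
{
    int retval = value;        /* value after range check */

    if (value < low)
    {
        retval = low;
    }
    else if (value > high)
    {
        retval = high;
    }
    return retval;
}
```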

C99 allows the // for one-line comments, which is really easy. You don't 
accidentally leave behind a dangling "/*"  start of comment block thingy 
(the coder's technical term for a lexeme) to cause compilers to vomit later 
on.

Some C compilers will not accept // as a comment delimiter, especially 
older ones. The // is part of C99 (ISO/IEC 9899:1999), so C99-compliant 
compilers will deal with it.
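
A sketch of the dangling comment problem, with invented function names.
Sometimes the compiler complains about an unclosed comment. In the case
below it stays quiet, and the code is simply wrong. The example assumes a
compiler that accepts C99 comments.

```c
/* Hypothetical fee calculation, invented for this sketch. */
int fee_with_block_comments(void)
{
    int fee = 0;

    fee += 1;      /* base fee: the close of this comment is missing
    fee += 10;        so this line is swallowed up to the next close */
    return fee;    /* returns 1, not 11 */
}

int fee_with_line_comments(void)
{
    int fee = 0;

    fee += 1;      // base fee
    fee += 10;     // surcharge
    return fee;    // returns 11
}
```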

FOR THE CODE INSIDE THE FUNCTION BLOCK, CONSIDER THESE AS USEFUL COMMENTS:

1. Description of the variables or their use
2. Plain language explanation of "tricky code" in algorithms.


Inside the function block, useful comments are usually on the line of code 
they reference. Sometimes a really complex coding trick will require 
several lines of explanation. A good rule of thumb is to place a bigger 
comment block above the code it references. If the comment is really large 
it goes above the function, maybe with a simple reference left on the line 
telling the reader to look for it:

/* see note above */

How large is too large a comment block to put in the code section?  When
you cannot easily gloss over the comments to see the flow of the code. When
the comments are in the way in other words.

An example of an interesting application of this sort of mega-comment is
the discussion of "magic bits" in an implementation of memchr() in
memchr.c for GNU glibc-2.3.3.  However, the author does go on at great
length, right in the middle of the function, to the detriment of reading
the code, so you can evaluate whether that little trick is something you
want or not.

Download glibc-2.3.3-20031202.tar.bz2 from:

http://www.fr.linuxfromscratch.org/view/lfs-cvs/chapter05/glibc.html

The use of an in() function like the one below is also referenced in the 
control flow section, where the in() idea is discussed further.  This 
approach is helpful when you have to compare long lists of variables that 
are messy-to-compare data types, and you want the Boolean statement to 
read as a one-liner, without getting a headache.

Here is an example of an in() function, with a commenting style that is
middle of the road:

/* example code: in() ********************/
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

typedef
enum
{
   CHARPTR=0,
   DBL=1
} dtype_t;

/*****************************************
*  compare doubles to see if they are equal
*  return 1= equal
*         0= not equal
******************************************/

int floating_pt_equal(double a, double b)
{
    int retval=0;           /* return value 1 or 0    */
    double test=0.;         /* store difference of a & b */
    double epsilon=.00001;  /* arbitrarily small number */

    test=a-b;
    if(test<0)              /* absolute value of test */
    {
        test*=(-1.);
    }
    if(test < epsilon)      /* if test is close to zero, a & b are equal */
    {
        retval=1;
    }
    return retval;
}

/******************************
* in(void *source, int num, dtype_t my_datatype,...)
* find source in a list of values
* source = value to find
* num = number of arguments to search
* my_datatype = datatype to use for comparison
* returns positional parameter which was found
* i.e., if third parm is a match, retval=3
* when none found returns zero.
* usage:
* result=in(my_char_ptr,3,CHARPTR,"HI", "WHEN", "IF");
* result=in(&mydouble_variable,4,DBL,1.123, 4.2, 1.3, 1.4);
*********************************************/

int in(void *source, int num, dtype_t my_datatype,...)
{
    int retval=0;                  /* return value 0 or 1...num */
    int counter=0;                 /* track positional parameter */
    char *test_ptr=NULL;           /* for char* compares        */
    double test_double=0.;         /* for double compares       */
    va_list argptr;                /* arguments                 */

    va_start(argptr,my_datatype);  /* begin variable argument processing*/
    for( ;num;num--)               /* for each optional argument */
    {
        counter++;
        switch(my_datatype)        /* process by datatype */
        {
            case CHARPTR:
                test_ptr=va_arg(argptr, char *);
                if(strcmp((char *)source, test_ptr)==0)
                {
                    retval=counter;
                }
                break;
            case DBL:
                test_double=va_arg(argptr, double);
                if( floating_pt_equal(test_double, *(double *)source)==1 )
                {
                    retval=counter;
                }
                break;
            default:
                break;
        }
        if(retval>0)               /* explicit early loop exit */
        {
            break;
        }
    }
    va_end(argptr);
    return retval;
}
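
To see what the one-liner buys you, compare a hand-written chain of
strcmp() calls with the same test written against a list. str_in() below
is a hypothetical, string-only cousin of in(), cut down so the sketch
stands on its own; the two is_keyword functions are also invented.

```c
#include <stdarg.h>
#include <string.h>

/* str_in(): hypothetical string-only version of in().
 * Returns the 1-based position of the match, 0 when none found. */
int str_in(const char *source, int num, ...)
{
    int retval = 0;                /* 0 or 1...num               */
    int counter = 0;               /* track positional parameter */
    va_list argptr;                /* arguments                  */

    va_start(argptr, num);
    for (counter = 1; counter <= num && retval == 0; counter++)
    {
        if (strcmp(source, va_arg(argptr, char *)) == 0)
        {
            retval = counter;
        }
    }
    va_end(argptr);
    return retval;
}

/* Hand-written chain: every added keyword makes the test longer. */
int is_keyword_chain(const char *word)
{
    return strcmp(word, "HI") == 0
        || strcmp(word, "WHEN") == 0
        || strcmp(word, "IF") == 0;
}

/* One-liner: the candidates read as a list. */
int is_keyword_in(const char *word)
{
    return str_in(word, 3, "HI", "WHEN", "IF") != 0;
}
```

As with any variable argument function, the compiler cannot check the
count or the types of the optional arguments, so the num value has to
match the list.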

Sometimes a complete algorithm description of weird code is appropriate. Be
sure to put it in the top comment block, because if the code is weird, the
explanation will probably be long-winded.

Change histories, while they are comments, are primarily useful to 
determine what has happened to the code, and what source version you are 
dealing with. Version histories and change histories frequently get out of 
hand. Strip out all these kinds of bookkeeping comments periodically to 
keep the code cleaner. Removing previous change history comment markers 
with every major release of code is also something to consider.

Instead of filling up comment areas with babble about versions, you can 
choose to place version comments in the code so it shows both in source 
code and in an executable image file (compiled exe). Consider this a sort 
of "you get twice as much for the same cost" deal.

If you elect this method you can determine what versions of your modules 
were linked to create the executable file in front of you.  Which is often 
very useful when you have to determine why production code fails.  Or even 
if you have the right version of code.

Let's use the str_rev module from above as an example.  Assume we create a
C module file that only has our str_rev code in it, with one added code
line near the top:

#include <string.h>

static char *str_rev_code=
"@(#)ACOM $Header:/main/str_rev.c v1.21 02 Apr 2003 11:55:30 jmcnama $";

/*******************************************************
* str_rev()
* return a reversed string
* swap characters until we hit the middle of the string
* argument char *source - string to reverse
********************************************************/

char *str_rev(char *source){
    char *p=source;                /* start of source string */
    char *q =strchr(source,'\0');  /* end of source string   */
    char swap;                     /* temp storage           */
    if (*source)
    {
        while (--q > p)            /* decrement q first, then compare */
        {
            swap = *q;
            *q = *p;
            *p = swap;
            ++p;
        }
    }
    return source;
}

When this module is compiled and linked into a project, the Unix what 
command looks for the "@(#)" marker and the ident command looks for 
$Keyword: ... $ patterns such as our $Header string; each displays the 
text it finds. DOS and Windows versions of what and ident are also 
available if you care to google around for one.

An ident display looks like:

/dev1/general/exe/bigfile:
    $Revision: 92453-07 linker crt0.o B.11.33 020617 $
    $Header:/main/bilp.c     1.21   02 Apr 2003 11:55:18   psmith  $
    $Header:/main/astdf.c     1.1   30 Jan 2002 11:17:18   pjohns  $
    $Header:/main/arpfe.c     1.1   30 Jan 2002 11:11:44   paster $
    $Header:/main/str_rev.c     1.21  02 Apr 2003 10:55:30 jmcnama $
    $Header:/main/orac2.c     1.2   24 Jul 2002 20:44:36   bsmith  $
    $Header:/main/lib/com.c   1.2   24 Jul 2002 20:45:08   bjones  $
    $Header:/main/lib/orac.c   1.1   31 Jan 2002 11:25:06   pjones  $
    $Header:/main/lib/prnt.c   1.1   30 Jan 2004 17:26:40   psmith  $

The Revision: line came from the linker. Many versions of ld on Unix boxes
do this.  We created the other lines using our "ident strings" trick.

If you update the ident string (or have software do it for you every time 
the code is checked into a code management system like cvs) then you can 
identify what modules make up the extant code.  You can spot version 
mismatches right away. This is more useful long term than just slapping a 
quick comment into version history.
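
One caveat, offered as a sketch rather than a rule: some optimizing
compilers may discard a static string that nothing references, and then
ident has nothing to find. A simple way around that, assuming you do not
mind one tiny extra function per module, is to hand the string back from an
accessor. The name str_rev_version() is made up for this illustration; the
string is the one from the str_rev example above.

```c
/* ident string for this module, same text as the example above */
static const char str_rev_code[] =
"@(#)ACOM $Header:/main/str_rev.c v1.21 02 Apr 2003 11:55:30 jmcnama $";

/* hypothetical accessor: because the string is referenced here, the
   compiler normally keeps it, and the program can also log it */
const char *str_rev_version(void)
{
    return str_rev_code;
}
```

A side benefit is that the program can print its own module versions in a
log file at startup.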

VARIABLES AND VARIABLE NAMES

This section also applies in large part to naming functions.  So please
keep that in the back of your mind.

You have probably seen something like this exercise before. Try to
read what this sentence says - read as quickly as you can:

HLEP I AM A PRSIONER IN A COOIKE FATCORY.

You read it, right?  The spelling has problems, on purpose.

When reading Western alphabetic text, we humans use space to delimit words, 
then the first and last letter of a word as the first pass at identifying 
the word. Most people have no trouble decoding HLEP into HELP, even if the 
spelling gives us heartburn. When reading, we use the first and last letters 
of a word to comprehend the word.  We also try to figure out words based on 
the context of the word in the sentence with the same lack of regard for 
spelling.

Next, we look for the presence (not always position) of letters inside the 
word. This is probably how you were able to decode FATCORY into FACTORY. If 
neither of these efforts solves the word puzzle, the last thing we use to 
decode a word is the absolute position of the letters inside the word.

The reason we humans can read the garbage sentence above is because the
words are all English words, the words all start and end with the correct
letters, and all have the right letters inside, a bit scrambled.
This may come as a surprise, too, but programmers are human. Very human. So
human programmers reading source code use the same decoding tricks, too.

Let's make it messier. Suppose you have two nonsense words, buon and boun. 
Decoding these words requires stepping through all of our little decoding 
tricks above. However, since they are not real words, we can't shortcut the 
procedure by making inferences about what the words ought to be from 
context in a sentence. They are pretty much nonsense.

In dealing with only two nonsense words, maybe shortcuts might work, but 
with umpteen little ugly nonsense words, forget it. We have to struggle all 
the way through each one. The more stepping through each word we are 
forced to do, especially for nonsense words, the more likely we will 
falsely decode the word. Or get lazy about it.  

If there are dozens of confusing words like this, then the decoding process
really slows down and becomes unreliable. In other words, we become
confused, angry, hungry, or simply decide it's time for another high-
caffeine drink. None of these reactions increase reading comprehension.

We use confusing nonsense words all the time in code. We call them variable
names. Function names.  Poorly selected names introduce confusion at the
point in reading exactly where it is least needed.

Here are some variable names from a huge (11500 lines of C) code module:

char *ubpbpfm_cust_code=NULL;
char *uzpbpfm_cust_code=NULL;
char *ubzbpfm_cust_code=NULL;
char *uapopen_cust_code=NULL;
char *uabopen_cust_code=NULL;

These are hard to tell apart, aren't they?

When debugging and tuning some old code, I found several places where the
wrong variable name from the above list was used.  Should this be a
complete surprise?  The u... part of the variable names are database table
names. The programmer was trying to self-document the code, a good idea.
The problem is, the table names are from a really old application that
limits table names to seven characters.  And all of the tables start with
"u". Plus, all the table names have a lot of characters in common.

The biggest problem is that the differences are buried deep inside the 
first "word" and those first words are not part of any human language. 
They are complete garbage. While reading crud like this, errors creep in. 
Rule of thumb: Avoid long lists of extremely similar, nonsense names.

If you run a linter on old or new code, which you should do, how many times
have you ever had lint flag something like this - or just plain seen it
yourself with no help?

for(i=5;1<9;i++)
{
    printf("Value: %-10s\n",array[i]);
}

I found this gem embedded in a very large code block in an excessively huge
C program (~40K lines in one file, 20K in another copybook include C file).
Don't ask why a copybook, please.

Obviously, years of processing and testing never hit this loop.  Had it 
ever been run, it would have crashed and been fixed: the condition 1<9 is 
always true, so the loop never terminates, i runs off the end of the array, 
and the program eventually segfaults. This example arose from a different 
point of view, but is another version of the same problem: inability to 
discriminate between elements in the code. In some editor fonts, the 
character i looks like the number 1; the same problem occurs with the 
character l and the number 1.  I'm assuming a poor choice of fonts caused 
this problem. 

The difference between 1 and i did not stand out in the editor during 
development. Correct perception of the variable name by the coder is 
everything. If the coder can't see it and the compiler doesn't care, it's 
all over but the tears.
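
A hedged sketch of one defensive habit: give the loop bounds names and use
a counter name longer than one letter, so a slip like 1 for i is harder to
type and easier to see. The names FIRST_ROW, LAST_ROW, row, and
print_rows() are invented for this example, and the bounds 5 and 9 are
simply taken from the broken loop above.

```c
#include <stdio.h>

enum { FIRST_ROW = 5, LAST_ROW = 9 };   /* assumed bounds, from the loop above */

/* print array[FIRST_ROW] .. array[LAST_ROW-1]; returns the number printed */
int print_rows(const char *array[])
{
    int row = 0;
    int printed = 0;

    for (row = FIRST_ROW; row < LAST_ROW; row++)
    {
        printf("Value: %-10s\n", array[row]);
        printed++;
    }
    return printed;
}
```

Typing LAST_R0W or rwo by mistake gives a compile error instead of a loop
that quietly runs forever.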

Needless to say, this code burp does not create warm fuzzy feelings about
the overall development and QA process that let it slip by, either.  That's
another topic, though.


So we come down to two rules of thumb:

1. LIMIT THE NUMBER OF "UNDOCUMENTING" VARIABLE NAMES, KEEP THEM FOR
POINTERS AND LOOP COUNTERS.

Variable names that coders expect for this use are p and ptr for a pointer
and i (and maybe j and k) for integer loop counters.

Examples:

char *p=NULL;
char *ptr=NULL;
for(int i=0;i<10;i++)

Names like this and the ever-present func() are really awful for function 
names.  Just in case you were wondering.

2. INSIDE A FUNCTION TRY TO FIND MEANINGFUL AND DIFFERENT LOOKING VARIABLE
NAMES.

Variable names in the range of 4-9 characters, without internal separators,
give the highest comprehension scores (Sheil 1981), so avoid long names
without separators. Examples of bad choices:
residualcheckcount
resultcheckcount

Separators help a lot with long names.

Now that separators have come up, an explanation of them is due. In terms 
of the actual format of the names, the use of underscores or capital 
letters to separate words increases readability. These are separators.  The 
original separators are the hyphens: "-" in COBOL.

If you choose to use separators, then consider whole words in variables, 
not a lot of "abbrvtns". Separators let you employ longer and more 
meaningful names that are still easy to discern.

Examples:

harder to read: runningtotal, sumofsqrs
easier to read: RunningTotal, sum_of_squares

Older compilers may enforce really short variable names (seven characters), 
so the separator trick is out for them.

The very most important thing in choosing variable names, call it a rule of 
thumb "plus" is:

THE VARIABLE NAME REFLECTS ACCURATELY WHAT IT REPRESENTS

sum1, sum2, x1, x2 are not a good choice. This is because we can see
these variables probably hold intermediate values, but what intermediate
value are they holding?

Remember, you may have inherited code from one of the folks who think it is
okay to write 1400 line functions. You are the next person. You as Next
Programmer get to dig around through hundreds of lines of code to figure
out what these variables really store.

One bad thing you may discover during your dig is that the variables store 
different things at different points in the code. Because their naming is 
close to anonymous all-purpose cubby holes, coders will park almost 
anything in them that suits their fancy. With no feelings of guilt.

Preventing this kind of confusion is another good reason to use highly 
descriptive names - a programmer is not likely to get lazy and store 
returned check fee totals in a variable named SumOfVariances.  x2 may be a 
different story.


Examples of good and bad name choices:

Purpose                          Good                   Bad
-------------------------------- ---------------------- -----------------
Store the running total of       NSFRunningTotal        NSFsum1, tmptotal
NSF charges for the report

Sum of squares                   sum_of_squares         sumsq, sum_sqs, x1
                                 SumOfSquares
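
To make the table concrete, here is a small sketch of the second row in
use. The function signature and the parameter names values and value_count
are assumptions for illustration; the point is only that each name says
what it holds, with i kept for the loop counter as in rule 1.

```c
#include <stddef.h>

/* returns the sum of the squared values */
double sum_of_squares(const double *values, size_t value_count)
{
    double running_total = 0.0;
    size_t i = 0;

    for (i = 0; i < value_count; i++)
    {
        running_total += values[i] * values[i];
    }
    return running_total;
}
```

Nobody is likely to park a check fee total in running_total here, because
the name would then be an obvious lie.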

Another answer for reducing variable name confusion is the use of Hungarian 
Notation.  Some coders hate it, others love it.

Hungarian Notation is an attempt to have the name of the variable indicate 
its data type. The method places a leading wart on the variable name to 
identify the data type.  A char variable named 
    AdjustmentFlag 
becomes 
    chAdjustmentFlag
-- the leading ch designates a char datatype.    

If you opt for this method, use it consistently everywhere.

There isn't a lot of research to support the notion of increased 
comprehension value with Hungarian Notation -- other than the 
differentiating effect the warts (warts are the prepended doodads) lend to 
two variables with similar names:

szCheckCount
intCheckCount

For languages like Visual Basic where certain data types like Variant are 
chimeras and can be any data type at any time, Hungarian Notation helps 
comprehension of code.  We might expect carryover to other languages 
that support similar features. 

Finally, in terms of confusing variable names, beware of creating global 
variables, especially those having the same name as local variables.  As an 
example consider the miserable little code fragment below.  What will be 
the output?  Don't you just love these wonderful and confusing multiple 
declarations of i?

miserable.c:

#include <stdio.h>

int i=0;
//........... assume  lots of lines of code

int myfunc(int cost)
{
    int i=1;
    // ......... assume 20 lines of code here
    if(cost)
    {
        int i=2;
        // ....assume 50 lines of code are here.
        cost=1;
        cost+=i;
    }
    return cost;
}

int main()
{
    int j=0;
    for(j=-10;j<11;j++)
    {
        printf("%d\n",myfunc(j) );
    }
    return 0;
}

Well, to answer the question, you have to know which "i" is in scope at the 
point where cost is assigned a value inside myfunc(). In this case it's the 
innermost i, with a value of two. So, the function returns either 0 or 3. 
The answer isn't hard to see in this example, which is why it's here. This 
is not the case in real life where things like this can get horribly ugly.
Horribly fast.

Suppose the if block was 120 lines down from the function variable 
declaration of "i", and that there were lots of intervening lines in the if 
block. In this less amusing case, you have to search around to find out 
which "i" is in scope, assuming you remember that the if block declaration 
of the "i" variable is also there. Ugh. Too much name confusion.  Avoid 
it.

Generally, try to minimize the use of global variables, and when you need 
to use them, be sure to have meaningful names for them, and do not 
accidentally create global names that are also embedded 2280 lines below 
inside a function.  This means that p, ptr, and their friends are not good 
candidates for global variable names. 

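Some shops require so-called mangled names for globals, that is: global 
variable names must start with an underscore character. Be aware that the 
C standard reserves file-scope identifiers beginning with an underscore for 
the implementation, so this convention can collide with compiler or library 
names.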

Example:

char *_charBadDebt=NULL;

Another proposed workaround for this problem of name confusion is to adopt
Hungarian Notation and use prefixes like gbl for variables that have global
scope.

Example:

int gblIntCount=0;

But how do you handle the naming of all of the "i"'s in the example above?
No answer -- don't copy the miserable sample code.
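
One partial answer, offered as a sketch: give each of the three variables a
name that says what it is for, so no declaration hides another. The names
below (gbl_call_count, base_fee, surcharge) are invented for illustration,
since the miserable fragment never says what its i's mean, and the sketch
puts the two outer variables to work so they are not dead weight. Compilers
can help find the problem, too: gcc and clang accept a -Wshadow option that
warns when a local declaration hides another variable.

```c
int gbl_call_count = 0;        /* was the file-scope i */

int myfunc(int cost)
{
    int base_fee = 1;          /* was the function-level i */

    gbl_call_count++;
    if (cost)
    {
        int surcharge = 2;     /* was the block-level i */

        cost = base_fee;
        cost += surcharge;
    }
    return cost;
}
```

The function still returns either 0 or 3, but now there is no question
about which variable is in scope at any line.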

CONTROL FLOW

The way in which you decide to code program logic, or control flow, has a
profound impact on readability.  You can create many different control flow
structures, all logically correct, and all very different looking.  Some
are hellish to maintain, others heavenly.

Speaking of Hell, let's take a turn toward the negative. An example:

"Not one of us unhates non-positives in non-simple statements."

Huh?  How about "None of us loves negatives in complex statements".  What 
is the point of this drivel?  Sentences that are all negatives are hard to 
comprehend in English and other languages, like C and BASIC, too.  This is 
exactly like those Boolean statements filled with not logic that also drive 
us insane.  

Positive Boolean statements are much easier to understand and therefore are 
less prone to mis-reading.  And mis-using. They also present clearly what 
is being tested rather than the fuzzy set of conditions in "everything else 
possible in the universe except this".

Here is an example code fragment:

if(!status)
{
    complain();
}
else
{
    write_good_record();
}

While the fragment above can be understood, this "reverse" is even clearer:

if(status)
{
    write_good_record();
}
else
{
    complain();
}


While speaking ill of not logic, we should look at null if statements. 
These bad boys are akin to the same nasty group of problem Boolean 
statements as our not-friends above. Here is an example straight from 
production code, with comments removed to protect the guilty.

To reduce future email, do you suppose that I put the goto in there?  The 
answer is "no".  Do I espouse lots of goto's as well?  No. Can goto's ever 
be used?  Yes, in special situations.  For examples on both sides of the 
goto equation see: Knuth 1974.

Awful code:

if (*out_of_cycle_ind &&
    (ask_out_of_cycle_ind[0]=='Y' ||
     ask_out_of_cycle_ind[0]=='N') )
    ;          /* <- this is a null if :(  */
else
    goto parms_not_valid;

The reason this mess presents problems is that you are taking action on a 
set of not results. In this example, the condition that translates into a 
goto is "the set of all things not tested for". Which values those are, in 
case you don't see it right away, is impossible to know from this code 
fragment. Worse, things not tested for may remain forever unspecified.  A 
linter will flag this kind of construct for you. A long series of 
constructs like this is hard to debug.  And to understand.  And easy to break.

If you find that absolutely the only possible solution is to keep the null 
if line - the line containing only a semi-colon - then consider creating a 
dummy define or function, call it DO_NOTHING, so that you can explicitly 
state that you want to do nothing at all.  It is far too easy for the Next 
Programmer to miss the single ; line, especially when the if(...) section 
is overgrown. Finding a way to avoid the if null construct is the better 
choice by far.

Example DO_NOTHING, less awful:

void DO_NOTHING(void)
{
    return;
}
// ..............

if (some set of conditions)
    DO_NOTHING();
else
    goto naughty_answers;

The first code fragment was taken from a long series of Boolean statements 
that validate user input. There were 14 of these wonderful null if tests 
all in a row, just like the first one: a null if statement and a goto 
with subsequent else if statements.  This presents a double whammy for 
maintenance - null ifs embedded in a long series of Booleans -- all blocked 
together.
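
As a sketch of the alternative, the null if can be turned around so the
code states the accepted values and acts on them. Everything named here is
hypothetical (is_yes_or_no(), check_cycle_parm(), the PARM_OK and
PARM_NOT_VALID codes), and the test is simplified to a single indicator
string because the production fragment above does not show its
declarations.

```c
#include <stddef.h>

enum { PARM_NOT_VALID = 0, PARM_OK = 1 };   /* hypothetical result codes */

/* positive test: exactly the values we accept, nothing else */
static int is_yes_or_no(const char *indicator)
{
    return indicator != NULL &&
           (indicator[0] == 'Y' || indicator[0] == 'N');
}

/* no empty statement and no goto: the accepted case is spelled out */
int check_cycle_parm(const char *indicator)
{
    int result = PARM_NOT_VALID;

    if (is_yes_or_no(indicator))
    {
        result = PARM_OK;
    }
    return result;
}
```

Fourteen validations in a row then become fourteen short, named tests
instead of fourteen empty statements.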

Negative exclusionary logic is testing for conditions not explicitly tested 
for but implicitly assumed to exist.  This kind of test is bad news.  Most 
especially in writing secure C code.  ALL user input is poisonous.  When 
deciding whether to accept it, always test for exactly what to include. 
Never test what to exclude. Do not test with the assumption that you can 
spot all the bad stuff.  Most particularly, never assume that because the 
data passes some set of negative tests it is okay. Assume 
it is not okay instead. Nice guys finish last in hacker wars.
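
A minimal sketch of testing for what to include, assuming the input is a
customer code made only of upper case letters and digits, at most eight
characters long. The rule itself, the length limit, and the name
is_valid_cust_code() are all assumptions made up for this example.

```c
#include <stddef.h>
#include <string.h>

#define MAX_CUST_CODE_LEN 8    /* assumed limit for this example */

/* returns 1 only when every character is on the allowed list */
int is_valid_cust_code(const char *input)
{
    static const char allowed[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    size_t length = 0;

    if (input == NULL)
    {
        return 0;
    }
    length = strlen(input);
    if (length == 0 || length > MAX_CUST_CODE_LEN)
    {
        return 0;
    }
    /* strspn counts the leading characters that are in the allowed set */
    return strspn(input, allowed) == length;
}
```

Nothing in the function tries to list bad characters. Anything not
explicitly allowed is rejected, including things nobody thought of.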

Next, let's take a look at refactoring overweight Booleans into a table, a 
switch statement, or an in() code construct like the in() code above.

Sometimes one programmer sets up a condition, then others get hold of it 
later and add more. This kind of chaining of Boolean statements can get out 
of hand:

if ( (printed_date[3]=='J' && printed_date[4]=='A' && printed_date[5]=='N')
  || (printed_date[3]=='F' && printed_date[4]=='E' && printed_date[5]=='B')
  || (printed_date[3]=='M' && printed_date[4]=='A' && printed_date[5]=='R'))

If you like this style with lots of Booleans, you are in the minority.  Too
much is plain overwhelming, and it becomes too easy to generate code faults
in changes later on. Plus, it does nothing for improving understanding of
the code.

We can refactor it to something simpler:

if( strncmp(&printed_date[3], "JAN", 3)==0 ||
    strncmp(&printed_date[3], "FEB", 3)==0 ||
    strncmp(&printed_date[3], "MAR", 3)==0 )

If the statement involves a larger number of Booleans consider a table.  Here
is the above "if" code fragment refactored to test for the month names JAN-
JUN using a table:

const char months[6][4]={ "JAN", "FEB", "MAR", "APR", "MAY", "JUN"};
int found =0;
int i=0;

for(i=0; i<6; i++)
{
    if(strncmp(&printed_date[3], months[i], 3)==0)
    {
        found=1;
        break;
    }
}
if(found==1)  /*explicitly test found for the exact value*/
{
    // do something
}


While this involves more lines of code, it is easy to maintain because it 
reduces confusion arising from a gaggingly long series of Boolean 
operations. The early break condition for the loop is explicitly stated, 
instead of being buried inside the for loop header. The buried trick is 
what your CS 395 instructor would have loved:

for(i=0, found=0; i<6 && !found; i++)

Buried equals obscure. The question you want to answer is: Which style is 
more readable? The question you do not need to answer: Which style is more 
likely to please my CS 395 instructor?  

For the month example as currently defined, a switch() is not a 
reasonable choice largely because we are not comparing integer values.  But 
anytime you see line after line of Booleans, always think of a switch as 
the way to make control flow clearer.  

For this same problem, a slightly more advanced approach might be to create
an in()-like function for hard to compare data types, that uses variable
arguments so you can code something like this:

if( in(month_name, 6, CHARPTR, "JAN","FEB","MAR","APR","MAY","JUN")>0 )
{
    // do something
}

Or:

switch(in(month_name, 6, CHARPTR, "JAN","FEB","MAR","APR","MAY","JUN") )
{
    case 1:
        tax_processing();
        january_processing();
        break;
    case 2:
        february_processing();
        short_month_processing();
        break;
    case 6:
        june_processing();
        end_fiscal_year();
        break;
    default:
        month_end_processing();
        break;
}

The idea is to create something that is more readable and flexible.

Come up with other approaches, this is just a small example.  Implementations 
of the SQL language support an IN operator. Programmers familiar with the 
SQL IN() construct will find it a no-brainer to use.  You can come up with 
a quick in() alternative, such as using strstr() to search the month names. 
The only advantage of the in() example is that it can handle several clumsy 
data types fairly easily.  The disadvantage is that it requires extra code.
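
Here is a rough sketch of the strstr() alternative just mentioned. It
packs the month names into one string, four columns per name, and turns
the position of the match into a month number. The function name
month_number() and the packed-string layout are assumptions for this
example only.

```c
#include <string.h>

/* returns 1 for JAN .. 6 for JUN, 0 when the first three characters
   of name are not one of those month abbreviations */
int month_number(const char *name)
{
    static const char months[] = "JAN FEB MAR APR MAY JUN";
    char key[4];
    const char *hit = NULL;

    if (name == NULL || name[0] == '\0' || name[1] == '\0' || name[2] == '\0')
    {
        return 0;                       /* fewer than three characters */
    }
    memcpy(key, name, 3);
    key[3] = '\0';
    if (strchr(key, ' ') != NULL)
    {
        return 0;                       /* a blank could match across two names */
    }
    hit = strstr(months, key);
    if (hit == NULL)
    {
        return 0;
    }
    return (int)((hit - months) / 4) + 1;   /* each name plus blank is 4 wide */
}
```

A call such as month_number(&printed_date[3]) could then feed a switch()
the same way the in() example does. It is quick to write, but it only
works for fixed-width strings.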

For testing lots of integer choices, switch() is the best choice.

It is time to consider refactoring the code when:

1. YOU SEE NULL IF BLOCKS
2. YOU FIND NEGATIVE LOGIC
3. THERE ARE INTERMINABLE BLOBS OF IF-THEN-ELSE BLOCKS
4. YOU SEE if(...) WITH A LOT OF BOOLEAN OPERATORS


FUNCTION SIZE - NUMBER OF LINES IN A FUNCTION

A rule of thumb: three editor screens worth of lines of code is a
reasonable upper limit for functions.

In general, as functions get larger, software faults increase (McCabe 1976).
This is not saying three screens of code is a magic number, but code faults
are increased both with lots of small functions and one giant function. See
the comments and reference below on small functions.

Three screens probably works out to between 100 and 140 lines for an upper
limit, unless you use an ancient IDE like the one with TURBO C v3.0.

Is there a lower limit?  Probably functions with two or three lines are
possible, though not necessarily a good idea.

When two, three, and four line functions become worth considering is when 
there are repeated code lines.  Code repeated over and over in different 
places. The repeats could be in one function or sprinkled around throughout 
the code.  Coders frequently use macros for things like this.  Macros have 
some other interesting and possibly negative effects, but are suitable. 
This is part of the coder paradigm "Don't Repeat Yourself" or DRY. Do 
something in only ONE code section, and do it well.  This applies across 
huge applications as well.
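
One of those negative effects, sketched here with made-up names: a
function-like macro pastes its argument text wherever the parameter
appears, so an argument with a side effect can run more than once.
LARGER(), larger(), next_reading(), and the two counting functions are all
hypothetical.

```c
#define LARGER(a, b)  ((a) > (b) ? (a) : (b))   /* classic macro form */

static int readings_taken = 0;

/* pretend sensor read: counts how many times it was called */
static int next_reading(void)
{
    readings_taken++;
    return 10;
}

/* the same operation as a small function: arguments are evaluated once */
static int larger(int a, int b)
{
    return (a > b) ? a : b;
}

/* returns how many times next_reading() ran for one macro use */
int calls_made_by_macro(void)
{
    readings_taken = 0;
    (void)LARGER(next_reading(), 5);    /* next_reading() appears twice */
    return readings_taken;
}

/* returns how many times next_reading() ran for one function call */
int calls_made_by_function(void)
{
    readings_taken = 0;
    (void)larger(next_reading(), 5);
    return readings_taken;
}
```

With the macro, next_reading() runs once for the comparison and again for
the result. With the function it runs once.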

You can also consider placing those repeating lines into a little function.
The result is localized code for future changes. 

When you consolidate code into a macro or a function, a correct change 
means modifying the one code block instead of scouring hundreds of lines 
looking for similar code to modify. DRY idea. It may also reduce the 
overall reading complexity of the calling function's source.  Comprehension 
improves because we coders tend to more readily identify fairly short pieces 
of redundant code, rather than long redundant code sections. (Sheil 1983)

For example, if we had a function Capitalize() called in several places, it 
would be easier to recognize than a bunch of copies of a loop that does the 
same thing. This is really true when the variable names vary a lot from 
loop to loop. Some of these same arguments can be made for macros, except 
that coders tend not to document macros the way functions get documented. 
Who knows why? Maybe because macros tend to be in header files, which some 
coders view as non-source code. Plus, various compilers have varying limits 
for the number of arguments for macros.  And you cannot debug a macro with 
a lot of debuggers.  Functions can be debugged.
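
A sketch of what such a Capitalize() might look like, assuming it means
"upper-case the whole string in place". The text above does not define the
function, so that behavior is a guess made for illustration.

```c
#include <ctype.h>
#include <stddef.h>

/* upper-cases text in place and returns it */
char *Capitalize(char *text)
{
    char *p = text;

    if (text != NULL)
    {
        while (*p != '\0')
        {
            /* the cast keeps toupper() defined for characters above 127 */
            *p = (char)toupper((unsigned char)*p);
            p++;
        }
    }
    return text;
}
```

Five copies of that loop, each with different variable names, are five
chances to get the cast or the terminator test wrong. One function is one
chance.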

There is a down side to having lots of small functions. The presence of 
many small functions increases fault density in software, having the same 
nasty effect on fault density that really big functions have.  While this 
seems contradictory, the fact is that function interfaces are a source of 
faults in software. (Withrow 1990)  The other issue with loads of small 
functions is finding bugs.  The cause of the problem may be in a function 
called way back there somewhere in the stack frame, several function calls 
before the function that actually shows the problem.  In a large function 
this does not occur as often.

For these reasons, a middle of the road approach on function size seems to 
be appropriate.

A rule of thumb: WHEN CODING THE EXACT SAME SET OF OPERATIONS, WHENEVER
FUNCTION LINE COUNT INCREASES OR THE NUMBER OF SMALL FUNCTIONS INCREASES, THEN
FAULT DENSITY GOES UP.

In plain terms, you increase the probability that your code will break or
become unmaintainable if you have really long functions or lots of tiny
functions.

Maybe a different approach for signaling function size limits is in order.
Let's try a geeky sick joke. You've heard the one about death being
Mother Nature's way of telling you to slow down?  So, here are some
indicators that your function needs to be refactored into smaller chunks.

The three signs that your code is dying:

1. BRACKET MATCHING INSIDE THE FUNCTION IS DIFFICULT OR YOU MUST HAVE
COMMENTS AT THE ENDS OF BLOCKS TO BE ABLE TO SEE WHERE A BLOCK ENDS.

Or worse, to see where a block starts.  This can happen even with
consistent layout -- in huge functions.

Having automagic bracket matching in the editor helps. But when the end of 
a block is so far from the start that you cannot find it easily, you may 
want to consider pulling code out from the inside of the block and placing 
it into a separate function.  The function call overhead is minuscule 
compared to the headache and extra cost you incur when you screw up the 
code and it hoses production data. Think of creating a function at this 
point as a pink-slip preventative if you are a performance freak. That way 
you can justify to yourself doing something that has more long term benefit 
than shedding 750ms of runtime for a job with a 20 minute runtime. You 
probably already know that inlining by the compiler may undo your fear of 
increased overhead, anyway.

2. CODE HAS SO MANY INDENTATIONS THAT THE LINES ARE BUMPING HEADS WITH THE
RIGHT MARGIN.

Do not change the indentation spacing -- that is not the problem.

This is often the result of lots of if-then-else nesting or lots of nested
loops.

If the case is endless if-ing and switch-ing, you have a control flow 
issue. Usually, these kinds of issues can be resolved with a table, or 
maybe an in()-like function.

If you have loops nested five or six levels deep consider the possibility
that you have a program design problem.  Some types of math problems may
require this kind of nesting.  Unless the code is solving tensors think
about how you got where you are. It probably really is time to consider a
change in design.  Or at least algorithms.

3. THERE ARE SO MANY VARIABLES DECLARED IN THE FUNCTION YOU HAVE TROUBLE
RECALLING THEM ACCURATELY.

With the exception of a function that reads all of the columns of a badly 
designed database table, most functions don't require dozens and dozens of 
variables. Requiring lots of variables is usually a bad sign.  A sign that 
says you are trying to do way too many different, and probably unrelated, 
things in a function.  If you are reading a rotten data model, consider 
packing all of those nasties in one big struct or a logical group of 
structs.
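
A sketch of the struct idea, using an invented customer table. None of
these column names or sizes come from a real schema; they only show how a
pile of loose variables can be grouped by meaning.

```c
/* hypothetical columns from a wide customer table, grouped by meaning */
struct cust_address
{
    char street[41];
    char city[31];
    char postal_code[11];
};

struct cust_balance
{
    double current_due;
    double past_due;
    double nsf_running_total;
};

struct cust_record
{
    char cust_code[9];
    struct cust_address address;
    struct cust_balance balance;
};

/* one parameter replaces a dozen loose variables */
double total_owed(const struct cust_record *cust)
{
    return cust->balance.current_due +
           cust->balance.past_due +
           cust->balance.nsf_running_total;
}
```

The function that reads the table fills one cust_record, and everything
downstream takes a single pointer instead of a long argument list.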

So, at this point maybe you've decided you want to think about function 
redesign to trim down an overweight function. Let's do a quick version of 
Function ReDesign 101.

There is a concept in function redesign - policy functions are unto 
themselves; do-it-well functions just do it.  Not understanding this 
concept often is the cause of giant functions. For example, when a single 
function is trying to obtain data, validate it and then summarize 
it all in the same code section, you create hard to manage functions.

A function should do one thing really well. If a requirement is policing 
(let's use validating input data as the scenario here) there should be a 
separate calling function whose main job is policy. The separate policy 
enforcement function decides what is or is not allowed. Does the program 
exit or just log an error when there is a problem?  Are numeric characters 
okay? Can a line have more than 128 characters?  The policy functions know
all the answers to these questions.

An example might be:

Let's assume that we have some functions in a report module. The report 
reads through gigabytes of garbage, sorts through it all and prints summary 
information. In the code there is a function called control_read_data() to 
check raw input for our report. It mostly calls read_on_a_stream() and 
looks at what that function sends back. control_read_data() tells 
read_on_a_stream() how big the buffer is and what the handle for the stream 
is.  When it finds good stuff it calls summarize_report_data().

read_on_a_stream() is a happy little function that nulls out the buffer,
reads the stream, and checks stream errors and handles them. It does not care
what it just read. Or what the stream is. It just returns a pointer to the
defined-size buffer it reads data into, along with an EOF/error flag which
it sets or clears.

control_read_data() is designed to make decisions based on data it gets
from read_on_a_stream(). It does care deeply about what is or is not in the
data. And when it sees the EOF flag is set it returns to its caller.  On
errors, it follows rules to decide what to do.

control_read_data() got its base rule set from big_brother().  It applies
this rule set to sort out input data supplied by read_on_a_stream().

summarize_report_data() finds a category for the data it just got, then 
adds it to a subtotal.


The model is:

 big_brother() -provides base policy,

 control_read_data() -traffic cop to control reading input
                     data and summarizing input data,

 read_on_a_stream() -just I/O and I/O error checking,
 
 summarize_report_data() -sum by category

If you combine all the wonderfulness these functions provide into one giant
function, you lose. You've just seen what I mean by lose:  Increased fault
density, lower readability, and lower maintainability.
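
The model can be sketched in code. This is only an illustration under
several assumptions: the stream is a line-oriented FILE *, the policy is
reduced to one yes/no rule about digits, a record's category is its first
letter A-Z (ASCII assumed), and each record adds 1 to its subtotal. The
helper is_acceptable() and all of the types are invented; only the four
function names come from the description above.

```c
#include <stdio.h>
#include <string.h>

#define BUF_SIZE 128                     /* assumed line limit */

enum read_status { READ_OK, READ_EOF, READ_ERROR };

struct policy { int allow_digits; };     /* are numeric characters okay? */

double category_subtotal[26];            /* one subtotal per category A-Z */

/* big_brother() -provides base policy */
struct policy big_brother(void)
{
    struct policy rules;

    rules.allow_digits = 0;
    return rules;
}

/* read_on_a_stream() -just I/O and I/O error checking */
char *read_on_a_stream(FILE *stream, char *buffer, size_t size,
                       enum read_status *status)
{
    memset(buffer, 0, size);
    if (fgets(buffer, (int)size, stream) != NULL)
    {
        *status = READ_OK;
    }
    else
    {
        *status = ferror(stream) ? READ_ERROR : READ_EOF;
    }
    return buffer;
}

/* summarize_report_data() -sum by category (first letter, ASCII assumed) */
void summarize_report_data(const char *record)
{
    category_subtotal[record[0] - 'A'] += 1.0;
}

/* policy check: a record must start with A-Z, digits only if allowed */
static int is_acceptable(const char *record, const struct policy *rules)
{
    if (record[0] < 'A' || record[0] > 'Z')
    {
        return 0;
    }
    return rules->allow_digits || strpbrk(record, "0123456789") == NULL;
}

/* control_read_data() -traffic cop; returns records accepted, -1 on error */
int control_read_data(FILE *stream, const struct policy *rules)
{
    char buffer[BUF_SIZE];
    enum read_status status = READ_OK;
    int accepted = 0;

    read_on_a_stream(stream, buffer, sizeof buffer, &status);
    while (status == READ_OK)
    {
        if (is_acceptable(buffer, rules))
        {
            summarize_report_data(buffer);
            accepted++;
        }
        read_on_a_stream(stream, buffer, sizeof buffer, &status);
    }
    return (status == READ_EOF) ? accepted : -1;
}
```

Notice that read_on_a_stream() never looks at what it read, and
control_read_data() never touches fgets(). Changing the I/O or changing
the rules means changing one short function, not hunting through one
giant one.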

You may have seen the ideas in this article before today. The goal is to 
give you a more organized idea of layout styles for C source code, so you 
can set up your own style. You also should be able to easily identify 
problems in source code that make it hard to read, understand, and change. 
Even when it is your own absolutely perfect code.  You now know how to make 
choices to avoid problems proactively.

You will note that all the research in the bibliography is old.  We have 
known all this good stuff for a long time.  We just never seemed to do 
anything with it. Anything practical, that is. That is the reason this 
article was put together. There are very good general books on this topic, 
but nothing with just a C slant.


Bibliography


Gorla N., A. C. Benander, B. A. Benander 1990. "Debugging Effort Estimation
Using Software Metrics", IEEE Transactions on Software Engineering, vol.
SE-16, No. 2, pp 221-231 1990.

Halstead M. H. 1977. "Elements of Software Science", Elsevier
North-Holland, Inc., New York, New York.

IEEE Std 1061-1998, IEEE Standard for a Software Quality Metrics Methodology.

Knuth, Donald, 1974. "Structured Programming with go to Statements",
Classics in Software Engineering, ed Edward Yourdon, Englewood Cliffs NJ,
Yourdon Press 1979.

McConnell Steve, 2004. "Code Complete", Microsoft Press, Redmond WA 2004.

Miara, R. J., et al 1983, "Program Indentation and Comprehensibility",
Communications of the ACM 26 No. 11, pp 861-7 November, 1983.

Sheil, B. A. 1981, "The Psychological Study of Programming.", Computing
Surveys 13, No. 1 pp 101-120 (March) 1981.

McCabe, Tom, 1976. "A Complexity Measure", IEEE Transactions on Software
Engineering, vol. SE-2, No. 4 pp 308-320, December, 1976.

Withrow, Carol, 1990. "Error Density and Size in Ada Software.", IEEE
Software, Vol. 7, No. 1, pp. 26-30, January, 1990.

Zuse, H., P. Bollman, 1989. "Software Metrics: Using Measurement
Theory to Describe the Properties and Scales of Static Software Complexity
Metrics", ACM SIGPLAN Notices Vol. 24, No. 8 (August 1989).
