cpplib: Cleanup of option parsing

Tue Feb 8 13:09:00 GMT 2000

On Wed, Feb 09, 2000 at 12:46:46AM +0900, Neil Booth wrote:
> This patch implements strict option checking and is somewhat cleaner
> than the previous one.

This looks like the right idea, on a first read.  I have a few nits to
pick - see below.  Also, this can't go in until your copyright
assignment clears.  Until that happens, could you please focus on
testing?  There are some suggestions for how to do that on the webpage.

> The only thing it doesn't support that the old one does is multiple
> arguments after the -d option.  That capability was never documented
> anyway so it probably is not a problem.  If the old method is reqd a
> fix is easy.

I'm afraid the old method is required.  -d is also used by the
compiler proper, and it's expected to take a string of letters.  You
should treat the letters as an argument to -d, or something like
that.  Note also that you need to ignore -d suboptions that you don't
recognize, because they might be aimed at the compiler proper.

> +enum opt_code
> +{
> +  opt_fleading_underscore, opt_fno_leading_underscore,
> +  opt_fpreprocessed, opt_fno_preprocessed, opt_I, opt_o, opt_D,
> +  opt_pedantic, opt_pedantic_errors, opt_traditional, opt_trigraphs,
> +  opt_plus, opt_remap, opt_H, opt_v, opt_w, opt_h, opt_g, opt_dollar,
> +  opt__help, opt__version, opt_U, opt_A, opt_C, opt_E, opt_P,
> +  opt_include, opt_imacros, opt_iprefix, opt_iwithprefix,
> +  opt_iwithprefixbefore, opt_isystem, opt_idirafter, opt_ifoutput,
> +  opt_lang_c, opt_lang_cplusplus, opt_lang_c89, opt_lang_objc,
> +  opt_lang_objcplusplus, opt_lang_asm, opt_lang_fortran, opt_lang_chill,
> +  opt_std_c89, opt_std_c99, opt_std_c9x, opt_std_gnu89, opt_std_gnu99,
> +  opt_std_gnu9x, opt_std_iso9899_1990, opt_std_iso9899_199409,
> +  opt_std_iso9899_1999, opt_std_iso9899_199x,
> +  opt_Wtrigraphs, opt_Wno_trigraphs, opt_Wcomment, opt_Wcomments,
> +  opt_Wno_comment, opt_Wno_comments,
> +  opt_Wtraditional, opt_Wno_traditional, opt_Wundef, opt_Wno_undef,
> +  opt_Wimport, opt_Wno_import, opt_Werror, opt_Wno_error, opt_Wall,
> +  opt_nostdinc, opt_nostdincplusplus, opt_stdin_stdout,
> +  opt_M, opt_MD, opt_MG, opt_MM, opt_MMD,
> +  opt_dD, opt_dI, opt_dM, opt_dN
> +};

Please make this be in the same order as the cl_options table.
Otherwise updating it will be impossible.  Also, capitalize OPT, set
the first one to zero, and put an N_OPTS constant at the end.

> +struct cl_option
> +{
> +  char* opt_text;

const char *

> +  size_t opt_len;
> +  int takes_arg;
> +  enum opt_code opt_code;
> +  const char* missing_element;
> +};

I notice that the missing_element slot is used if and only if the
takes_arg flag is true.  Please merge those two elements.  That saves
memory, and prevents mistakes.  Also, put all the pointers before all
the integers - this saves padding space on platforms with 64-bit
pointers and 32-bit ints.  In this context, opt_len should be an
unsigned int; I know I said we need to use size_t more consistently,
but in this context, saving space is more important.  (Pretend it's a
size_t in the code, though.)

If you sort the opts enum correctly, the code will be equal to the
position in the table, and so could theoretically be removed, but I
think that will be too confusing and/or fragile.

> +/* This list MUST be sorted */
> +struct cl_option cl_options[] =

const struct cl_option

Yes, this does make a difference: the table will then go into the
read-only data segment.

> +  DEF_OPT("E",                         0, opt_E, 0),

Remove -E.  It should never get through to the preprocessor anymore.

> +  DEF_OPT("P",                         1, opt_P, 0),

-P does not take an argument.

> +  DEF_OPT("dD",                        0, opt_dD, 0),
> +  DEF_OPT("dI",                        0, opt_dI, 0),
> +  DEF_OPT("dM",                        0, opt_dM, 0),
> +  DEF_OPT("dN",                        0, opt_dN, 0),

See above.

> +static int parse_option(input)
> +     const char* input;
> +{
> +  unsigned int md, mn, mx, n_opts;
> +  size_t opt_len;
> +  int comp;
> +
> +  mn = 0;
> +  mx = n_opts = sizeof(cl_options) / sizeof(cl_options[0]);

Personal style rule: never apply sizeof to an object, only a type.
Use the N_OPTS constant mentioned above, instead.  (And then you don't
need the n_opts variable either.)

> +  while (mx > mn)
> +    {
> +      md = (mn + mx) / 2;
> +    
> +      opt_len = cl_options[md].opt_len;
> +      comp = strncmp(input, cl_options[md].opt_text, opt_len);
> +    
> +      if (comp == 0)
> +	{
> +	  /* Tricky.  If it takes an argument, and we have more text,
> +	     we may match a later or earlier option instead, for
> +	     example iwithprefix[before].  If it takes no arguments
> +	     and we have no more text, we must have matched. */
> +	  if (input[opt_len] == '\0')
> +	    return md;
> +	  if (cl_options[md].takes_arg)
> +	    {	/* Scan for an exact match */
> +	      for (mn = 0; mn < n_opts; mn++)
> +		if (!strcmp(input, cl_options[mn].opt_text))
> +		  return mn;
> +	      return md;
> +	    }
> +	  comp = 1;	/* Continue search */
> +	}
> +      if (comp > 0)
> +	mn = md + 1;
> +      else
> +	mx = md;
> +    }
> +
> +  return -1;
> +}

Under what circumstances can we hit the /* Continue search */ line but
not return -1?  If this cannot happen, you should pull that entire if
block out of the inner loop.

This does a binary search, but falls back to a linear search when it
matches on an option that takes an argument.  In normal use, -D, -I,
and -U are way more common than any others, and they all take args.  I
believe the problem you're trying to solve only comes up with the -i
switches, which are rarely used.  In other words, you've pessimized
the common case.  Try making "-i" the option and all its suboptions
arguments; we can live with one if-else-if block down in
cpp_handle_option.

There's also the character set issue.  GCC works on EBCDIC platforms
(e.g. i370-ibm-mvs), but your table is in ASCII collation order.  It
may be that for the subset we're looking at, EBCDIC collation order is
the same as ASCII, but I don't know that for sure.  (I'm particularly
concerned about '-' '-+' and '-$'.)  If this is not true, we need to
fix it somehow.  Two possibilities that come to mind are qsort()ing
the table at runtime on EBCDIC platforms, and using a plain linear
search and reordering options by frequency of occurrence.

zw