GPP - Generic Preprocessor

Version 2.1a - (c) Denis Auroux 1996-2002

gpp has a new maintainer, Tristan Miller (psychonaut@nothingisreal.com) and a new web page, http://www.nothingisreal.com/gpp/.

This version 2.1a is a dead branch in the gpp code tree; it is being released to avoid the loss of some unreleased patches to version 2.1. Its features will eventually be merged into the new code maintained by Tristan Miller. From now on, all feature requests, bug reports and patches should be submitted to him.

You can still download version 2.1a or, perhaps better, version 2.1, the stable version that was used as the starting point for the new code tree. New versions are available from the web page at http://www.nothingisreal.com/gpp/, and the direct download link for the latest version is http://www.nothingisreal.com/gpp/gpp.tar.gz.


DESCRIPTION

gpp is a general-purpose preprocessor with customizable syntax, suitable for a wide range of preprocessing tasks. Its independence from any programming language makes it much more versatile than cpp, while its syntax is lighter and more flexible than that of m4.

gpp is targeted at all common preprocessing tasks where cpp is not suitable and where no very sophisticated features are needed. So that it can process text files and source code in a variety of languages equally efficiently, the syntax used by gpp is fully customizable. The handling of comments and strings is especially advanced.

Initially, gpp only understands a minimal set of built-in macros, called meta-macros. These meta-macros allow the definition of user macros as well as some basic operations forming the core of the preprocessing system, including conditional tests, arithmetic evaluation, and syntax specification. All user macro definitions are global, i.e. they remain valid until explicitly removed; meta-macros cannot be redefined. With each user macro definition gpp keeps track of the corresponding syntax specification so that a macro can be safely invoked regardless of any subsequent change in operating mode.

In addition to macros, gpp understands comments and strings, whose syntax and behavior can be widely customized to fit any particular purpose. Internally comments and strings are the same construction, so everything that applies to comments applies to strings as well.


SYNTAX

  gpp [-{o|O} outfile] [-I/include/path] [-Dname=val ...]
      [-z|+z] [-x] [-m] [-C|-T|-H|-P|-U ... [-M ...]]
      [-n|+n] [+c<n> str1 str2] [+s<n> str1 str2 c] 
      [-c str1] [-nostdinc] [-nocurinc] [-curdirinclast] 
      [-warninglevel n] [-includemarker str] [infile]


OPTIONS

gpp recognizes the following command-line switches and options:


SYNTAX SPECIFICATION

The syntax of a macro call is the following: it must start with a sequence of characters matching the macro start sequence as specified in the current mode, followed immediately by the name of the macro, which must be a valid identifier, i.e. a sequence of letters, digits, or underscores ("_"). The macro name must be followed by a short macro end sequence if the macro has no arguments, or by a sequence of arguments initiated by an argument start sequence. The various arguments are then separated by an argument separator, and the macro ends with a long macro end sequence.

In all cases, the parameters of the current context, i.e. the arguments passed to the body being evaluated, can be referred to by using an argument reference sequence followed by a digit between 1 and 9. Macro parameters may alternately be named (see below). Furthermore, to avoid interference between the gpp syntax and the contents of the input file a quote character is provided. The quote character can be used to prevent the interpretation of a macro call, comment, or string as anything but plain text. The quote character "protects" the following character, and always gets removed during evaluation. Two consecutive quote characters evaluate as a single quote character.
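The quote-removal rule above can be modeled in a few lines. The following Python sketch is only an illustration and is not taken from the gpp sources; the function name remove_quotes is hypothetical, the quote character is assumed to be '\' as in default mode, and a trailing quote character with nothing to protect is assumed to be dropped.

```python
def remove_quotes(text: str, quote: str = "\\") -> str:
    """Drop each quote character and keep the character it protects."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == quote and i + 1 < len(text):
            # The protected character is copied verbatim, even if it is
            # itself a quote character (two quotes give one quote).
            out.append(text[i + 1])
            i += 2
        elif text[i] == quote:
            # Assumption: a trailing quote character protects nothing.
            i += 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(remove_quotes("\\BLAH(urf)"))  # BLAH(urf)
print(remove_quotes("a \\\\ b"))     # a \ b
```

The sketch covers only the removal step; it does not model how the quote character prevents a macro call from being recognized.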

Finally, to facilitate proper argument delimitation, certain characters can be "stacked" when they occur in a macro argument, so that the argument separator or macro end sequence are not parsed if the argument body is not balanced. This allows nesting macro calls without using quotes. If an improperly balanced argument is needed, quote characters should be added in front of some stacked characters to make it balanced.
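The effect of stacked characters on argument delimitation can be sketched as follows. This is a simplified Python model, not gpp's actual parser: the name split_arguments is hypothetical, the separator is assumed to be a single character, and the defaults mimic the parenthesized calls used in the default-mode examples of this document.

```python
def split_arguments(body, separator=",", stack="(", unstack=")", quote="\\"):
    """Split an argument list at separators that occur at nesting depth 0."""
    args, current, depth = [], [], 0
    i = 0
    while i < len(body):
        ch = body[i]
        if quote and ch == quote and i + 1 < len(body):
            # A quoted character never counts as a stack character or separator.
            current.append(body[i:i + 2])
            i += 2
            continue
        if ch in stack:
            depth += 1
        elif ch in unstack and depth > 0:
            depth -= 1
        if ch == separator and depth == 0:
            args.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    args.append("".join(current))
    return args


print(split_arguments("concat(foo,bar),baz"))  # ['concat(foo,bar)', 'baz']
print(split_arguments("a\\(b,c"))              # ['a\\(b', 'c']
```

In the second call the quoted "(" is not stacked, so the argument stays balanced and the separator is honored.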

The macro construction sequences described above can be different for meta-macros and for user macros: this is e.g. the case in cpp mode. Note that, since meta-macros can only have up to two arguments, the delimitation rules for the second argument are somewhat sloppier, and unquoted argument separator sequences are allowed in the second argument of a meta-macro.

Unless one of the standard operating modes is selected, the above syntax sequences can be specified either on the command-line, using the -M and -U options respectively for meta-macros and user macros, or inside an input file via the #mode meta and #mode user meta-macro calls. In both cases the mode description consists of 9 parameters for user macro specifications, namely the macro start sequence, the short macro end sequence, the argument start sequence, the argument separator, the long macro end sequence, the string listing characters to stack, the string listing characters to unstack, the argument reference sequence, and finally the quote character. As explained below, these sequences should be supplied using the syntax of C strings; they must start with a non-alphanumeric character, and in the first five strings special matching sequences can be used (see below). If the argument corresponding to the quote character is the empty string, that functionality is disabled. For meta-macro specifications there are only 7 parameters, as the argument reference sequence and quote character are shared with the user macro syntax.
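As a reading aid, the nine user-macro parameters and the seven meta-macro parameters can be written down as two records. The Python sketch below is purely illustrative: the class and field names are hypothetical, and the sample values are a guess read off the default-mode examples in this document (calls such as concat(FOO,BAR), argument references such as #1, quote character '\'), not values checked against gpp itself.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class UserMode:
    macro_start: str
    short_macro_end: str
    arg_start: str
    arg_separator: str
    long_macro_end: str
    stack_chars: str
    unstack_chars: str
    arg_reference: str
    quote_char: str  # empty string disables quoting


@dataclass(frozen=True)
class MetaMode:
    # The argument reference sequence and quote character are shared
    # with the user-macro syntax, so only seven parameters remain.
    macro_start: str
    short_macro_end: str
    arg_start: str
    arg_separator: str
    long_macro_end: str
    stack_chars: str
    unstack_chars: str


# Illustrative values only (an assumption, see the paragraph above).
sample_user_mode = UserMode("", "", "(", ",", ")", "(", ")", "#", "\\")

print(len(fields(UserMode)), len(fields(MetaMode)))  # 9 7
```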

The structure of a comment/string is the following: it must start with a sequence of characters matching the given comment/string start sequence, and always ends at the first occurrence of the comment/string end sequence, unless it is preceded by an odd number of occurrences of the string-quote character (if such a character has been specified). In certain cases comment/strings can be specified to enable macro evaluation inside the comment/string: in that case, if a quote character has been defined for macros it can be used as well to prevent the comment/string from ending, with the difference that the macro quote character is always removed from output whereas the string-quote character is always output. Also note that under certain circumstances a comment/string specification can be disabled, in which case the comment/string start sequence is simply ignored. Finally, it is possible to specify a string warning character whose presence inside a comment/string will cause gpp to output a warning (this is useful e.g. to locate unterminated strings in cpp mode). Note that input files are not allowed to contain unterminated comments/strings.
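The "odd number of string-quote characters" rule can be illustrated with a short search routine. This Python sketch is a simplified model rather than gpp's implementation; the name find_end is hypothetical, and macro evaluation inside the comment/string and the warning character are not modeled.

```python
def find_end(text, start, end_seq, string_quote=""):
    """Return the index where the closing end sequence begins, or -1.

    `start` is the index just past the comment/string start sequence.
    """
    pos = text.find(end_seq, start)
    while pos != -1:
        # Count the run of string-quote characters just before the candidate.
        run = 0
        if string_quote:
            j = pos - 1
            while j >= start and text[j] == string_quote:
                run += 1
                j -= 1
        if run % 2 == 0:
            return pos  # not protected: the comment/string ends here
        pos = text.find(end_seq, pos + 1)
    return -1


s = "'It\\'s a /*string*/ !'"
print(find_end(s, 1, "'", "\\"))  # 21, the final quote; the one at index 4 is protected
```

An end sequence preceded by an even run of string-quote characters (including none) closes the comment/string, since those characters pair up with each other.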

A comment/string specification can be declared from within the input file using the #mode comment meta-macro call (or equivalently #mode string), in which case the number of C strings to be given as arguments to describe the comment/string can be anywhere between 2 and 4: the first two arguments (mandatory) are the start sequence and the end sequence, and can make use of the special matching sequences (see below). They may not start with alphanumeric characters. The first character of the third argument, if there is one, is used as string-quote character (use an empty string to disable the functionality), and the first character of the fourth argument, if there is one, is used as string-warning character. A specification may also be given from the command-line, in which case there must be two arguments if using the +c option and three if using the +s option.

The behavior of a comment/string is specified by a three-character modifier string, which may be passed as an optional argument either to the +c/+s command-line options or to the #mode comment/#mode string meta-macros. If no modifier string is specified, the default value is "ccc" for comments and "sss" for strings. The first character corresponds to the behavior inside meta-macro calls (including user-macro definitions since these come inside a #define meta-macro call), the second character corresponds to the behavior inside user-macro parameters, and the third character corresponds to the behavior outside of any macro call. Each of these characters can take the following values:

Important note: any occurrence of a comment/string start sequence inside another comment/string is always ignored, even if macro evaluation is enabled. In other words, comments/strings cannot be nested. In particular, the 'Q' modifier can be a convenient way of defining a syntax for temporarily disabling all comment and string specifications.
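How the three positions of a modifier string map to evaluation contexts can be sketched as a simple lookup. The Python below is only an illustration: the function name and the context labels "meta", "user" and "text" are hypothetical, and the meaning of each individual modifier character is deliberately left out.

```python
def modifier_for(context, modifiers="ccc"):
    """Pick the modifier character that applies in the given context."""
    if len(modifiers) != 3:
        raise ValueError("a modifier string has exactly three characters")
    # Position 0: inside meta-macro calls (including #define bodies).
    # Position 1: inside user-macro parameters.
    # Position 2: outside of any macro call.
    index = {"meta": 0, "user": 1, "text": 2}[context]
    return modifiers[index]


print(modifier_for("meta"))         # c
print(modifier_for("user", "sss"))  # s
print(modifier_for("text", "QQQ"))  # Q
```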

Syntax specification strings should always be provided as C strings, whether they are given as arguments to a #mode meta-macro call or on the command-line of a Unix shell. If command-line arguments are given via another method than a standard Unix shell, then the shell behavior must be emulated, i.e. the surrounding "" quotes should be removed, all occurrences of '\\' should be replaced by a single backslash, and similarly '\"' should be replaced by '"'. Sequences like '\n' are recognized by gpp and should be left as is.
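For callers that do not go through a Unix shell, the emulation described above amounts to a single left-to-right pass. The following Python sketch shows one way to do it; the function name is hypothetical and the code is not part of gpp.

```python
def emulate_shell_quoting(arg):
    """Turn a C-string argument into what gpp would receive from a shell."""
    # Remove the surrounding double quotes, if present.
    if len(arg) >= 2 and arg[0] == '"' and arg[-1] == '"':
        arg = arg[1:-1]
    out = []
    i = 0
    while i < len(arg):
        pair = arg[i:i + 2]
        if pair == "\\\\":
            out.append("\\")  # two backslashes become one
            i += 2
        elif pair == '\\"':
            out.append('"')   # escaped double quote becomes a plain one
            i += 2
        else:
            out.append(arg[i])  # everything else, e.g. \n, is left as is
            i += 1
    return "".join(out)


print(emulate_shell_quoting('"\\n"'))    # \n  (backslash and n, for gpp to interpret)
print(emulate_shell_quoting('"\\\\"'))   # \
print(emulate_shell_quoting('"\\""'))    # "
```

Scanning in one pass, rather than applying two global replacements in turn, avoids reinterpreting a backslash that was itself produced by an earlier substitution.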

Special sequences matching certain subsets of the character set can be used. They are of the form '\x', where x is one of:

Moreover, all of these matching subsets except '\w' and '\W' can be negated by inserting a '!', i.e. by writing '\!x' instead of '\x'.

Note an important distinctive feature of start sequences: when the first character of a macro or comment/string start sequence is ' ' or one of the above special sequences, it is not taken to be part of the sequence itself but is used instead as a context check: for example a start sequence beginning with '\n' matches only at the beginning of a line, but the matching newline character is not taken to be part of the sequence. Similarly a start sequence beginning with ' ' matches only if some whitespace is present, but the matching whitespace is not considered to be part of the start sequence and is therefore sent to output. If a context check is performed at the very beginning of a file (or more generally of any body to be evaluated), the result is the same as matching with a newline character (this makes it possible for a cpp-mode file to start with a meta-macro call).
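The context check for a start sequence beginning with '\n' can be modeled as follows. This Python sketch is an illustration under simplifying assumptions, not gpp's matching code: the function names are hypothetical, and only the beginning-of-line check is modeled, with the start of the body treated like a newline as described above.

```python
def line_start_context_ok(body, pos):
    """True if a start sequence beginning with a newline check may match at pos."""
    # The very beginning of the body counts as if preceded by a newline.
    return pos == 0 or body[pos - 1] == "\n"


def find_line_start_matches(body, rest):
    """Positions where `rest` (the sequence after the context check) matches."""
    hits = []
    pos = body.find(rest)
    while pos != -1:
        if line_start_context_ok(body, pos):
            hits.append(pos)  # the newline itself is not consumed
        pos = body.find(rest, pos + 1)
    return hits


text = "#define A 1\nx #define B\n#undef A\n"
print(find_line_start_matches(text, "#"))  # [0, 24]
```

The "#" in the middle of the second line is skipped because the character before it is a space, not a newline.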

Two special syntax rules have been added in version 2.1. First, argument references (#n) are no longer evaluated when they are outside of macro calls and definitions. However, they are no longer allowed to appear (unless protected by quote characters) inside a call to a defined user macro; the current behavior (backwards compatible) is to remove them silently from the input if that happens.

Second, if the end sequence (either for macros or comments) consists of a single newline character, and if delimitation rules lead to evaluation in a context where the final newline character is absent, gpp silently ignores the missing newline instead of producing an error. The main consequence is that meta-macro calls can now be nested in a simple way in standard, cpp and Prolog modes.


EVALUATION RULES

Input is read sequentially and interpreted according to the rules of the current mode. All input text is first matched against the specified comment/string start sequences of the current mode (except those which are disabled by the 'i' modifier), unless the body being evaluated is the contents of a comment/string whose modifier enables macro evaluation. The most recently defined comment/string specifications are checked for first. Important note: comments may not appear between the name of a macro and its arguments (doing so results in undefined behavior).

Anything that is not a comment/string is then matched against a possible meta-macro call, and if that fails too, against a possible user-macro call. All remaining text undergoes substitution of argument reference sequences by the relevant argument text (empty unless the body being evaluated is the definition of a user macro) and removal of the quote character if there is one.

Note that meta-macro arguments are passed to the meta-macro prior to any evaluation (although the meta-macro may choose to evaluate them, see meta-macro descriptions below). In the case of the #mode meta-macro, gpp temporarily adds a comment/string specification to enable recognition of C strings ("...") and prevent any evaluation inside them, so no interference of the characters being put in the C string arguments to #mode with the current syntax is to be feared.

On the other hand, the arguments to a user macro are systematically evaluated, and then passed as context parameters to the macro definition body, which gets evaluated with that environment. The only exception is when the macro definition is empty, in which case its arguments are not evaluated. Note that gpp temporarily switches back to the mode in which the macro was defined in order to evaluate it, so it is perfectly safe to change the operating mode between the time when a macro is defined and the time when it is called. Conversely, if a user macro wishes to work with the current mode instead of the one that was used to define it, it needs to start with a #mode restore call and end with a #mode save call.

A user macro may be defined with named arguments (see #define description below). In that case, when the macro definition is being evaluated, each named parameter causes a temporary virtual user-macro definition to be created; such a macro may only be called without arguments and simply returns the text of the corresponding argument.

Note that, since macros are evaluated when they are called rather than when they are defined, any attempt to call a recursive macro causes undefined behavior except in the very specific case when the macro uses #undef to erase itself after finitely many loop iterations.

Finally, a special case occurs when a user macro whose definition does not involve any arguments (neither named arguments nor the argument reference sequence) is called in a mode where the short user-macro end sequence is empty (e.g. cpp or TeX mode). In that case it is assumed to be an alias macro: its arguments are first evaluated in the current mode as usual, but instead of being passed to the macro definition as parameters (which would cause them to be discarded) they are actually appended to the macro definition, using the syntax rules of the mode in which the macro was defined, and the resulting text is evaluated again. It is therefore important to note that, in the case of a macro alias, the arguments actually get evaluated twice in two potentially different modes.


META-MACROS

These macros are always pre-defined. Their actual calling sequence depends on the current mode; here we use cpp-like notation.

The key to gpp's flexibility is the #mode meta-macro. Its first argument is always one of a list of available keywords (see below); its second argument is always a sequence of words separated by whitespace. Apart from possibly the first of them, each of these words is always a delimiter or syntax specifier, and should be provided as a C string delimited by double quotes (" "). The various special matching sequences listed in the section on syntax specification are available. Any #mode command is parsed in a mode where "..." is understood to be a C-style string, so it is safe to put any character inside these strings. Also note that the first argument of #mode (the keyword) is never evaluated, while the second argument is evaluated (except of course for the contents of C strings), so that the syntax specification may be obtained as the result of a macro evaluation.

The available #mode commands are:


EXAMPLES

Here is a basic self-explanatory example in standard or cpp mode:
  #define FOO This is
  #define BAR a message.
  #define concat #1 #2
  concat(FOO,BAR)
  #ifeq (concat(foo,bar)) (foo bar)
  This is output.
  #else
  This is not output.
  #endif
Using argument naming, the concat macro could alternately be defined as
  #define concat(x,y) x y
In TeX mode and using argument naming, the same example becomes:
  \define{FOO}{This is}
  \define{BAR}{a message.}
  \define{\concat{x}{y}}{\x \y}
  \concat{\FOO}{\BAR}
  \ifeq{\concat{foo}{bar}}{foo bar}
  This is output.
  \else
  This is not output.
  \endif
In HTML mode and without argument naming, one gets similarly:
  <#define FOO|This is>
  <#define BAR|a message.>
  <#define concat|#1 #2>
  <#concat <#FOO>|<#BAR>>
  <#ifeq <#concat foo|bar>|foo bar>
  This is output.
  <#else>
  This is not output.
  <#endif>
The following example (in standard mode) illustrates the use of the quote character:
  #define FOO This is \
     a multiline definition.
  #define BLAH(x) My argument is x
  BLAH(urf)
  \BLAH(urf)
Note that the multiline definition is also valid in cpp and Prolog modes despite the absence of a quote character, because '\' followed by a newline is then interpreted as a comment and discarded.

In cpp mode, C strings and comments are understood as such, as illustrated by the following example:

  #define BLAH foo
  BLAH "BLAH" /* BLAH */
  'It\'s a /*string*/ !'
The main difference between Prolog mode and cpp mode is the handling of strings and comments: in Prolog, a '...' string may not begin immediately after a digit, and a /*...*/ comment may not begin immediately after an operator character. Furthermore, comments are not removed from the output unless they occur in a #command.

The differences between cpp mode and default mode are deeper: in default mode #commands may start anywhere, while in cpp mode they must be at the beginning of a line; the default mode has no knowledge of comments and strings, but has a quote character ('\'), while cpp mode has extensive comment/string specifications but no quote character. Moreover, the arguments to meta-macros need to be correctly parenthesized in default mode, while no such checking is performed in cpp mode.

This makes it easier to nest meta-macro calls in default mode than in cpp mode. For example, consider the following HTML mode input, which tests for the availability of the #exec command:

  <#ifeq <#exec echo blah>|blah
  > #exec allowed <#else> #exec not allowed <#endif>
There is no cpp mode equivalent, while in default mode it can be easily translated as
  #ifeq (#exec echo blah
  ) (blah
  )
  \#exec allowed
  #else
  \#exec not allowed
  #endif
In order to nest meta-macro calls in cpp mode it is necessary to modify the mode description, either by changing the meta-macro call syntax, or more elegantly by defining a silent string and using the fact that the context at the beginning of an evaluated string is a newline character:
  #mode string QQQ "$" "$"
  #ifeq $#exec echo blah
  $ $blah
  $
  \#exec allowed
  #else
  \#exec not allowed
  #endif
Note however that comments/strings cannot be nested ("..." inside $...$ would go undetected), so one needs to be careful about what to include inside such a silent evaluated string. In this example, the loose meta-macro nesting introduced in version 2.1 makes it possible to use the following simpler version:
  #ifeq blah #exec echo -n blah
  \#exec allowed
  #else
  \#exec not allowed
  #endif
Remember that macros without arguments are actually understood to be aliases when they are called with arguments, as illustrated by the following example (default or cpp mode):
  #define DUP(x) x x
  #define FOO and I said: DUP
  FOO(blah)
The usefulness of the #defeval meta-macro is shown by the following example in HTML mode:
  <#define APPLY|<#defeval TEMP|<\##1 \#1>><#TEMP #2>>
  <#define <#foo x>|<#x> and <#x>>
  <#APPLY foo|BLAH>
The reason why #defeval is needed is that, since everything is evaluated in a single pass, the input that will result in the desired macro call needs to be generated by a first evaluation of the arguments passed to APPLY before being evaluated a second time.

To translate this example into default mode, one needs to resort to parenthesizing in order to nest the #defeval call inside the definition of APPLY, but needs to do so without outputting the parentheses. The easiest solution is

  #define BALANCE(x) x
  #define APPLY(f,v) BALANCE(#defeval TEMP f
  TEMP(v))
  #define foo(x) x and x
  APPLY(\foo,BLAH)
As explained above, the simplest version in cpp mode relies on defining a silent evaluated string to play the role of the BALANCE macro.

The following example (default or cpp mode) demonstrates arithmetic evaluation:

  #define x 4
  The answer is:
  #eval x*x + 2*(16-x) + 1998%x

  #if defined(x)&&!(3*x+5>17)
  This should be output.
  #endif
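
The arithmetic in this example can be checked by hand. The Python lines below simply redo the two computations with ordinary integer arithmetic; they are a cross-check of the numbers, not a model of how gpp evaluates expressions.

```python
x = 4

# x*x + 2*(16-x) + 1998%x  =  16 + 24 + 2
answer = x * x + 2 * (16 - x) + 1998 % x
print(answer)  # 42

# defined(x) holds, and 3*x+5 is 17, which is not greater than 17,
# so the negated comparison is true and the #if branch is taken.
condition = not (3 * x + 5 > 17)
print(condition)  # True
```
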
To finish, here are some examples involving mode switching. The following example is self-explanatory (starting in default mode):
  #mode push
  #define f(x) x x
  #mode standard TeX
  \f{blah}
  \mode{string}{"$" "$"}
  \mode{comment}{"/*" "*/"}
  $\f{urf}$ /* blah */
  \define{FOO}{bar/* and some more */}
  \mode{pop}
  f($FOO$)
A good example where a user-defined mode becomes useful is the gpp source of this document (available with gpp's source code distribution).

Another interesting application is selectively forcing evaluation of macros in C strings when in cpp mode. For example, consider the following input:

  #define blah(x) "and he said: x"
  blah(foo)
Obviously one would want the parameter x to be expanded inside the string. There are several ways around this problem:
  #mode push
  #mode nostring "\""
  #define blah(x) "and he said: x"
  #mode pop

  #mode quote "`"
  #define blah(x) `"and he said: x`"

  #mode string QQQ "$$" "$$"
  #define blah(x) $$"and he said: x"$$
The first method is very natural, but has the drawback of being lengthy and neutralizing string semantics, so that having an unevaluated instance of 'x' in the string, or an occurrence of '/*', would be impossible without resorting to further contortions.

The second method is slightly more efficient, because the local presence of a quote character makes it easier to control what is evaluated and what isn't, but has the drawback that it is sometimes impossible to find a reasonable quote character without having to either significantly alter the source file or enclose it inside a #mode push/pop construct. For example any occurrence of '/*' in the string would have to be quoted.

The last method demonstrates the efficiency of evaluated strings in the context of selective evaluation: since comments/strings cannot be nested, any occurrence of '"' or '/*' inside the '$$' gets output as plain text, as expected inside a string, and only macro evaluation is enabled. Also note that there is much more freedom in the choice of a string delimiter than in the choice of a quote character.

Starting with version 2.1, meta-macro calls can be nested more efficiently in default, cpp and Prolog modes. This makes it easy, for example, to make a user version of a meta-macro, or to increment a counter:

  #define myeval #eval #1

  #define x 1
  #defeval x #eval x+1


ADVANCED EXAMPLES

Here are some examples of advanced constructions using gpp. They tend to be pretty awkward and should be considered as evidence of gpp's limitations.

The first example is a recursive macro. The main problem is that, since gpp evaluates everything, a recursive macro must be very careful about the way in which recursion is terminated, in order to avoid undefined behavior (most of the time gpp will simply crash). In particular, relying on a #if/#else/#endif construct to end recursion is not possible and results in an infinite loop, because gpp scans user macro calls even in the unevaluated branch of the conditional block. A safe way to proceed is for example as follows (we give the example in TeX mode):

  \define{countdown}{
    \if{#1}
    #1...
    \define{loop}{\countdown}
    \else
    Done.
    \define{loop}{}
    \endif
    \loop{\eval{#1-1}}
  }
  \countdown{10}

Another example, in cpp mode:

  #mode string QQQ "$" "$"
  #define triangle(x,y) y \
   $#if length(y)<x$ $#define iter triangle$ $#else$ \
   $#define iter$ $#endif
  $ iter(x,*y)
  triangle(20)

The following is an (unfortunately very weak) attempt at implementing functional abstraction in gpp (in standard mode). Understanding this example and why it can't be made much simpler is an exercise left to the curious reader.

  #mode string "`" "`" "\\"
  #define ASIS(x) x
  #define SILENT(x) ASIS()
  #define EVAL(x,f,v) SILENT(
    #mode string QQQ "`" "`" "\\"
    #defeval TEMP0 x
    #defeval TEMP1 (
      \#define \TEMP2(TEMP0) f
    )
    TEMP1
    )TEMP2(v)
  #define LAMBDA(x,f,v) SILENT(
    #ifneq (v) ()
    #define TEMP3(a,b,c) EVAL(a,b,c)
    #else
    #define TEMP3(a,b,c) \LAMBDA(a,b)
    #endif
    )TEMP3(x,f,v)
  #define EVALAMBDA(x,y) SILENT(
    #defeval TEMP4 x
    #defeval TEMP5 y
    ) 
  #define APPLY(f,v) SILENT(
    #defeval TEMP6 ASIS(\EVA)f
    TEMP6
    )EVAL(TEMP4,TEMP5,v)
This yields the following results:
  LAMBDA(z,z+z)
    => LAMBDA(z,z+z)

  LAMBDA(z,z+z,2)
    => 2+2

  #define f LAMBDA(y,y*y)
  f
    => LAMBDA(y,y*y)

  APPLY(f,blah)
    => blah*blah

  APPLY(LAMBDA(t,t t),(t t))
    => (t t) (t t)

  LAMBDA(x,APPLY(f,(x+x)),urf)
    => (urf+urf)*(urf+urf)

  APPLY(APPLY(LAMBDA(x,LAMBDA(y,x*y)),foo),bar)
    => foo*bar

  #define test LAMBDA(y,`#ifeq y urf
  y is urf#else
  y is not urf#endif
  `)
  APPLY(test,urf)
    => urf is urf

  APPLY(test,foo)
    => foo is not urf


AUTHOR

Denis Auroux, e-mail: auroux@math.polytechnique.fr.

Please send me e-mail for any comments, questions or suggestions.

Many thanks to Michael Kifer for valuable feedback and suggestions, and for contributing various patches included in this version.

Please contact the new maintainer, Tristan Miller (psychonaut@nothingisreal.com), for feature requests, bug-reports and suggestions. The new web page of gpp is http://www.nothingisreal.com/gpp/