Tuesday, April 15, 2008

Tips on designing a preprocessor for C++ with Antlr

Learn how to use Antlr to create a C++ preprocessor. With this approach to creating a C++ compiler, you don't need a separate preprocessor engine; instead, the preprocessing engine is integrated as part of the lexer.

One of the first steps in creating a custom compiler in C++ is developing the compiler’s preprocessor engine. Typically, C++ compilers have a separate preprocessing engine that accomplishes the following four basic tasks:

  • Defines and redefines macros (#define, #undef).
  • Supports header files (the #include directive).
  • Supports conditional compilation (#ifdef, #ifndef, #else, #endif).
  • Strips all comments from the input source listing.

Typically, the output of the preprocessor engine is then fed to the C++ lexer/parser combo. This article discusses the design of a C++ preprocessor using Antlr. You should have some familiarity with Antlr and the concept of top-down recursive descent parsers. All code discussed in this article is known to work on Antlr-2.7.2 and is compiled using gcc-3.4.4.

Create a single parser for the preprocessor

A salient benefit of using Antlr as your parser-generator tool of choice for creating the C++ compiler is that a separate preprocessor engine isn't necessary. The preprocessor engine may be integrated as part of the lexer. To understand this strategy, recall how lexers and parsers typically work. The lexer processes the raw data (in this case, the .h/.cpp files), creates tokens from this data, and passes the tokens to the parser. The parser in turn processes the tokens and validates against the grammar, does semantic checking, and ultimately generates assembly code.

The key to having a single compiler + preprocessor engine lies in proper modification of the lexer. The Antlr lexer is extended to preprocess the C++ sources and only pass the relevant tokens to the parser. The parser is never aware of the preprocessing part; it's only concerned with the tokens it receives from the lexer. Consider the following snippet:

#define USE_STREAMS
   
#ifdef USE_STREAMS 
#include <iostream>
#else
# include <stdio.h>
#endif 
   

 

Assume that this snippet is declared as part of a C++ source file. Because USE_STREAMS is defined, the lexer must follow the #ifdef branch, read the iostream header, and pass the tokens found there on to the parser. Maintaining the #define mapping and evaluating the #ifdef conditional are the added responsibilities of the lexer.
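To make the expected behavior concrete, here is a small illustrative sketch in plain C++ of the decision the lexer must make for the snippet above. The selectInclude function is purely hypothetical -- in the real design this decision happens inside the lexer's token actions, not in a standalone function:

```cpp
#include <set>
#include <string>

// Illustrative only: given the set of macros defined so far, decide
// which header the #ifdef/#else block above selects. Because
// USE_STREAMS was #define'd earlier, the #ifdef branch wins and
// <iostream> is the header whose tokens reach the parser.
std::string selectInclude(const std::set<std::string>& defines) {
    return defines.count("USE_STREAMS") ? "iostream" : "stdio.h";
}
```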

Enhance the Antlr lexer by defining new tokens

Lexers that work on the C++ language define tokens per the C++ language standard. For example, you usually find tokens that represent variable names and language keywords defined in an average lexer file.

To combine the preprocessing engine with a lexer, you must define new token types corresponding to the preprocessor constructs. On encountering these constructs, the lexer takes the necessary action. The lexer must also strip comments from the sources. In this case, on encountering the token for a comment, the lexer ignores the token and continues looking for the next available token. It's important to note that these tokens aren't defined as part of the language standard; they're essentially implementation-defined tokens.

Define new tokens in the lexer

The lexer must contain the following tokens, along with language-mandated ones: COMMENT_TOKEN, INCLUDE_TOKEN, DEFINE_TOKEN, UNDEF_TOKEN, IFDEF_TOKEN, ELSE_TOKEN, and ENDIF_TOKEN. This article discusses a prototype lexer that supports the preprocessor macros mentioned earlier and variable declaration for integer and long data types. The barebones lexer/parser combo that processes these tokens is shown in Listing 1.


Listing 1. A bare-bones lexer/parser that supports preprocessor tokens and long/int declarations
 
                
header {
#include "Main.hpp"
#include <string>
#include <iostream>
}

options {
            language = Cpp;
}

class cppParser extends Parser;

start: ( declaration )+ ;
declaration: type a:IDENTIFIER   (COMMA b:IDENTIFIER)* SEMI;
type: INT | LONG; 

{
#include <fstream>
#include "cppParser.hpp"
}

class cppLexer extends Lexer;

options { 
            charVocabulary = '\3'..'\377';
            k=5;
}

tokens { 
  INT="int";
  LONG="long";
}
           
DEFINE_TOKEN: "#define"  WS macroText:IDENTIFIER WS  macroArgs:MACRO_TEXT;
UNDEF_TOKEN: "#undef" IDENTIFIER;  
IFDEF_TOKEN: ("ifdef" | "#ifndef") WS IDENTIFIER;
ELSE_TOKEN: ("else" | "elsif" WS IDENTIFIER);            
ENDIF_TOKEN: "endif";
INCLUDE_TOKEN:  "#include" (WS)? f:STRING;
    
COMMENT_TOKEN: 
  (
  "//" (~'\n')* '\n' { newline( ); } 
   |
  "/*" (
       {LA(2) != '/'}? '*' | '\n' { newline( ); } | ~('*'|'\n')
       )*
  "*/"
  );

IDENTIFIER: ('a'..'z'|'A'..'Z'|'_')('a'..'z'|'A'..'Z'|'_'|'0'..'9')* ;
STRING: '"'! ( ~'"' )* '"'! ;
SEMI: ';';
COMMA: ',';

WS :  ( ' ' | '\t' | '\f' | '\n' {newline();})+;
            
MACRO_TEXT:  ( ~'\n' )*   ;
    

 

Define the lexer/parser combo to strip comments

Comments can be written in either the C++ // style or the C-style /* */ multiline style. The lexer is extended to include a rule for COMMENT_TOKEN. On encountering this token, the lexer must be explicitly told not to pass it to the parser, but instead to skip it and continue looking for the next available token. You do this using Antlr's $setType function. You don't need to add support at the parser end, because the comment token never reaches it. See Listing 2.


Listing 2. Defining the lexer to strip comments from the C/C++ code
 
                
COMMENT_TOKEN: 
  (
  "//" (~'\n')* '\n' { newline( ); } 
   |
 "/*" (
           {LA(2) != '/'}? '*' | '\n' { newline( ); } | ~('*'|'\n')
         )*
  "*/"
  )
  { $setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP); };
      

 

The code is simple enough for C++-style comments. For the C-style comment, note the {LA(2) != '/'}? used before matching *. It means that if there's a / after the *, this alternative isn't matched, and the lexer instead matches the closing */. This is necessary because a C comment ends at the first */ -- a fragment like /* this is */ an invalid comment */ is not one valid comment. This also implies that the minimum look-ahead needed by the lexer is 2. The expression {LA(2) != '/'}? is known as a semantic predicate -- a hint that tells the lexer which alternative to choose based on the upcoming characters in the input stream.
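As an aside, the effect of this rule can be sketched in plain C++. The stripComments helper below is illustrative, not part of the generated lexer; it applies the same two-character lookahead the predicate encodes -- inside a C-style comment, a '*' is consumed only when the next character is not '/', so the first */ always ends the comment:

```cpp
#include <string>

// Illustrative sketch of the comment-stripping rule: copy input to
// output, skipping // comments to end of line and /* */ comments up
// to the first "*/" (mirroring the {LA(2) != '/'}? predicate).
std::string stripComments(const std::string& src) {
    std::string out;
    for (std::size_t i = 0; i < src.size(); ) {
        if (src.compare(i, 2, "//") == 0) {            // C++-style comment
            while (i < src.size() && src[i] != '\n') ++i;
        } else if (src.compare(i, 2, "/*") == 0) {     // C-style comment
            i += 2;
            while (i + 1 < src.size() &&
                   !(src[i] == '*' && src[i + 1] == '/'))
                ++i;                                    // the LA(2) != '/' case
            i += 2;                                     // consume "*/"
        } else {
            out += src[i++];
        }
    }
    return out;
}
```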



 

Include file processing

As discussed earlier, the parser needs to see a continuous stream of tokens. The lexer therefore needs to switch streams if and when it encounters an included file and switch back to the previous stream on encountering the end of the included file. The parser isn't aware of this stream switch and essentially treats included files as a continuous token stream. You can achieve this behavior several ways using Antlr; for a detailed discussion, see the Resources section.

This article uses Antlr's TokenStreamSelector class. Instead of initializing the parser with the lexer as in Listing 1, the parser is initialized with a TokenStreamSelector object. Internally, the TokenStreamSelector class maintains a stack of lexers. When the lexer encounters a new input stream, it creates a new lexer to process that stream and pushes the new lexer onto the TokenStreamSelector object, so that all future tokens (until the end of the new stream is reached) are sourced from the new lexer. It then invokes the retry method of the TokenStreamSelector class, which sources the next token from the new input stream. The only remaining task is to switch back to the previous input stream on encountering EOF in an included file. To allow for user-defined actions at the end of an input stream, Antlr provides the predefined routine uponEOF. You modify this routine to switch back to the previous input stream by calling TokenStreamSelector::pop(). Listing 3 shows how the parser is initialized; Listing 4 shows the include-file processing in the lexer.


Listing 3. Initializing the parser with a TokenStreamSelector object
 
                
#include <iostream>
#include <fstream>
…
TokenStreamSelector selector;
cppParser* parser;
cppLexer* mainLexer;

int main(int argc,char** argv)
{
   try {
       std::ifstream inputstream("test.c", std::ifstream::in);
       mainLexer = new cppLexer(inputstream);
       // notify selector about starting lexer; name for convenience
       selector.addInputStream(mainLexer, "main");
       selector.select("main"); // start with main lexer

       // Create parser attached to selector
       parser = new cppParser (selector);
       parser->setFilename("test.c");
       parser->start();
  } 
  catch (std::exception& e) {
    std::cerr << "exception: " << e.what() << std::endl;
  }
  return 0;
} 
      

 

The lexer-related changes are described in Listing 4.


Listing 4. Include file processing in the lexer
 
                
class cppLexer extends Lexer;
…

{
public:
    void uponEOF()  {
        if ( selector.getCurrentStream() != mainLexer ) {
            selector.pop(); // return to old lexer/stream
            selector.retry();
        }
        else {
             ANTLR_USE_NAMESPACE(std)cout << "Hit EOF of main file\n"  ;
        }
    }
}

INCLUDE_TOKEN:  "#include" (WS)? f:STRING
    {
    ANTLR_USING_NAMESPACE(std)
    // create lexer to handle include
    string name = f->getText();
    ifstream* input = new ifstream(name.c_str());
    if (!*input) {
        cerr << "cannot find file " << name << endl;
    }
    cppLexer* sublexer = new cppLexer (*input);
    // make sure errors are reported in right file
    sublexer->setFilename(name);
    parser->setFilename(name);

    // push the previous lexer stream and make sublexer current
    selector.push(sublexer);
    // ignore this token, re-look for token in the switched stream
    selector.retry(); // throws TokenStreamRetryException
    }
    ;
    

 

Note that the TokenStreamSelector object is deliberately a global variable, because it must be shared with the lexer's uponEOF method. Also, in the uponEOF method, after calling pop you must call retry so that the TokenStreamSelector looks for the next token in the now-current stream. Note that the include file processing described here isn't complete; it depends on conditional macro support. For example, if a source snippet includes a file within a #ifdef A .. #endif block where A isn't defined, then the processing for INCLUDE_TOKEN must be discarded. This is further explained in the Handle macros and Conditional macro support sections.



 

Handle macros

Macro handling is primarily accomplished with the help of a lookup table; this article uses the standard STL map container. In its simplest form, support of macros means supporting constructs like #define A and #undef A. You can do this easily by inserting the string "A" into the map on encountering #define and erasing it from the map on encountering #undef. The map is typically defined as part of the lexer. Also, note that the tokens for #define and #undef must not reach the parser -- for this reason, you add $setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP); as part of the corresponding token processing. Listing 5 shows this behavior.


Listing 5. Supporting #define and #undef
 
                
class cppLexer extends Lexer;
…

{
bool processingMacro;
std::map<std::string, std::string> macroDefns;
public:
    void uponEOF()  {
    … // Code for include processing
    }
}
…
DEFINE_TOKEN: "#define" {processingMacro=true;} 
    WS macroText:IDENTIFIER WS macroArgs:MACRO_TEXT
  {
  macroDefns[macroText->getText()] = string(macroArgs->getText());
 $setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP);
  } ;
                           
UNDEF_TOKEN: "#undef" WS macroText:IDENTIFIER
  {
  macroDefns.erase(macroText->getText());
  $setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP);
  };

MACRO_TEXT: {processingMacro}? ( ~'\n' )* {processingMacro=false;};
      


 

Conditional macro support

Broadly speaking, supporting conditional macros means that you need to account for which tokens reach the parser, depending upon the conditional. For example, look at the following snippet:

#define A
#ifdef A
#ifdef B
int c;
#endif
int d;
#endif
      

 

On encountering the token for the first #ifdef, you must decide whether subsequent tokens that are encountered should be passed on to the parser, depending on whether A has been previously defined. At this stage, you use the hash table defined earlier.

Next, you must consider support for nested #ifdef blocks. Because you need to check the path condition on encountering each #ifdef, it's logical to maintain a stack of path conditions. Each #ifdef/#elif pushes a path condition onto the top of the stack; the matching #endif pops it off.
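The stack discipline can be modeled in isolation. The PathConditions class below is a hypothetical sketch (the class name and methods are illustrative; the actual bookkeeping lives in the lexer actions of Listing 7). A branch is active only when its macro test succeeds and the enclosing block is itself active:

```cpp
#include <stack>

// Hypothetical model of the path-condition stack: each #ifdef pushes
// a condition, the matching #endif pops it, and tokens are forwarded
// to the parser only while the top of the stack is true.
class PathConditions {
public:
    bool active() const { return pc.empty() || pc.top(); }
    void onIfdef(bool macroDefined) { pc.push(active() && macroDefined); }
    void onElse() {
        bool taken = pc.top();      // was the #ifdef branch taken?
        pc.pop();
        pc.push(active() && !taken); // #else is active only otherwise
    }
    void onEndif() { pc.pop(); }
private:
    std::stack<bool> pc;
};
```

Running the earlier snippet through this model: #ifdef A is active (A is defined), the nested #ifdef B is not (so int c; is skipped), and after its #endif the declaration int d; reaches the parser again.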

Before you see the code that explains the concept in greater detail, you need to understand another important method in the Antlr lexer class: nextToken, which the parser calls continuously to retrieve the next token from the input stream. You can't modify this routine directly, so the optimal approach is to derive a class from the Antlr cppLexer class and redefine the nextToken method based on the path condition. If the path condition is true, the routine returns the next available token to the parser; otherwise, it continues until the matching #endif is found in the token stream.

Listing 6 shows the source for the derived lexer class. There should be no direct instantiation of cppLexer in the code; the parser should be initialized with an instance of the derived lexer class.


Listing 6. Defining a new lexer class
 
                
class cppAdvancedLexer : public cppLexer 
  {
  public: 
    cppAdvancedLexer(ANTLR_USE_NAMESPACE(std)istream& in) : cppLexer(in) { }
    RefToken nextToken()
      {
      // keep looking for a token until you don't
      // get a retry exception
      for (;;) {
        try {
          RefToken _next = cppLexer::nextToken();
          if (processToken.empty() || processToken.top()) // defined in cppLexer
            return _next;
          }
        catch (TokenStreamRetryException& /*r*/) {
          // just retry "forever"
          }
        }
      }
  };
    

 

The lexer-related changes are shown in Listing 7.


Listing 7. Conditional macro support in the lexer
 
                
{
public:
    std::stack<bool> processToken;
    std::stack<bool> pathProcessed;
    std::map<std::string, std::string> macroDefns;
    void uponEOF() {
    … // Code for include processing defined earlier
    }
}

IFDEF_TOKEN {bool negate = false; } :
  ("#ifdef" | "#ifndef" {negate = true;} ) WS macroText:IDENTIFIER
  {
  bool macroAlreadyDefined =
      macroDefns.find(macroText->getText()) != macroDefns.end();
  // a new branch is active only if the enclosing block is also active
  bool parentActive = processToken.empty() || processToken.top();
  processToken.push(parentActive &&
                    (negate ? !macroAlreadyDefined : macroAlreadyDefined));
  pathProcessed.push(processToken.top());
  $setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP);
  }
  ;
  
ELSE_TOKEN:  ("#else"
  {
   bool& pathCondition = processToken.top();
   pathCondition = !processToken.top(); // no other path is true
   } 
   | 
   "#elsif" WS macroText:IDENTIFIER
    {
     if (!processToken.top() && !pathProcessed.top())  {
       if (macroDefns.find(macroText->getText()) != macroDefns.end()) {
          processToken.push(true);
          pathProcessed.push(true);
       }
     }
     else {
        bool& condition = pathProcessed.top();
        condition = false;
     }
   )
   {
   $setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP);
   }
   ;

ENDIF_TOKEN: "#endif"
   {
   processToken.pop();
   pathProcessed.pop();
   $setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP);
   };

 

Examining the code, you see that processToken is defined as a stack of booleans that stores the path conditions. On encountering #ifdef, the lexer pushes a new path condition recording whether the tested macro is defined; tokens reach the parser only while the condition at the top of the stack is true.


 

Support expressions as part of macros

You can use a macro to represent a complex expression that involves numbers, identifiers, mathematical operators, and even function calls. To do so, you must use the macroArgs string to replace every occurrence of the macro definition in the post-processed C++ code; the context determines whether this replacement is syntactically valid. For example, consider this situation: #define A b, c. Somewhere down the line you have int A;, which is valid C++; but switch(A) is clearly invalid C++. You need to parse the stream associated with the string on a per-occurrence basis and let the grammar rules determine the syntactic validity of the context.

The code for this approach works as follows:

  1. Capture the token corresponding to the macroText identifier in the cppAdvancedLexer::nextToken.
  2. Retrieve the macroArgs text from the hash table.
  3. Create an istream for macroArgs and a new copy of the lexer that parses this stream.
  4. Attach this lexer to the TokenStreamSelector object, and call retry -- this fetches the token from the new stream that was just created.
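Steps 2 and 3 can be sketched in isolation. The firstWordOfMacro helper below is illustrative only -- the real hand-off to a new cppAdvancedLexer appears in Listing 8 -- but it shows how the macro body is looked up in the map and wrapped in a stream for re-lexing:

```cpp
#include <map>
#include <sstream>
#include <string>

// Illustrative sketch: look up a token's text in the macro table and,
// if it names a macro, stream its replacement text (here we just pull
// the first whitespace-delimited word, where the real code would hand
// the stream to a fresh lexer).
std::string firstWordOfMacro(const std::map<std::string, std::string>& defs,
                             const std::string& name) {
    std::map<std::string, std::string>::const_iterator it = defs.find(name);
    if (it == defs.end()) return name;      // not a macro: token unchanged
    std::istringstream body(it->second);    // stream over the replacement text
    std::string tok;
    body >> tok;                            // the sublexer would tokenize this
    return tok;
}
```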

The uponEOF method is provided in cppLexer, which takes care of switching context on encountering the end of the stream. Listing 8 illustrates this discussion.


Listing 8. Modifying the nextToken method of the lexer class for macro replacement
 
                
RefToken cppAdvancedLexer::nextToken()
  {
  // keep looking for a token until you don't
  // get a retry exception
  for (;;) {
    try {
       RefToken _next = cppLexer::nextToken();
       map<string, string>::iterator defI = macroDefns.find(_next->getText());
       if (defI != macroDefns.end() && defI->second != "")
         {
         // lex the macro's replacement text as a new input stream;
         // the stream is heap-allocated so that it outlives this scope
         std::stringstream* macroStream = new std::stringstream(defI->second);
         cppAdvancedLexer* macrolexer = new cppAdvancedLexer(*macroStream);
         macrolexer->setFilename(this->getFilename());
         selector.push(macrolexer);
         selector.retry(); // throws TokenStreamRetryException
         }
       if (processToken.empty() || processToken.top())
         return _next;
       }
    catch (TokenStreamRetryException& /*r*/) {
      // just retry "forever"
      }
    }
  }
            

 

Conclusion

This article discussed the basic methodology you can follow to create a C++ preprocessor using Antlr. Developing a full-fledged C++ preprocessor is beyond the scope of a single article. Although this article doesn't deal with potential errors and error-handling strategies, a real-world preprocessor must account for these factors; for example, the ordering of tokens in conditional macro processing is significant, and you must take steps to ensure that it's handled correctly.
