Learn how to use Antlr to create a C++ preprocessor. With this approach to building a C++ compiler, you don't need a separate preprocessor engine; instead, the preprocessing can be integrated as part of the lexer.
One of the first steps in creating a custom compiler in
C++ is developing the compiler’s preprocessor engine.
Typically, C++ compilers have a separate preprocessing
engine that accomplishes the following four basic tasks:
- Defines and redefines macros (#define, #undef).
- Supports header files (the #include directive).
- Supports conditional compilation (#ifdef, #ifndef, #else, #endif).
- Strips all comments from the input source listing.
Typically, the output of the preprocessor engine is then
fed to the C++ lexer/parser combo. This article discusses
the design of a C++ preprocessor using Antlr. You should
have some familiarity with Antlr and the concept of top-down
recursive descent parsers. All code discussed in this
article is known to work on Antlr-2.7.2 and is compiled
using gcc-3.4.4.
Create a single parser for the preprocessor
A salient benefit of using Antlr as your parser-generator
tool of choice for creating the C++ compiler is that a
separate preprocessor engine isn't necessary. The
preprocessor engine may be integrated as part of the lexer.
To understand this strategy, recall how lexers and parsers
typically work. The lexer processes the raw data (in this
case, the .h/.cpp files), creates tokens from this data, and
passes the tokens to the parser. The parser in turn
processes the tokens and validates against the grammar, does
semantic checking, and ultimately generates assembly code.
The key to having a single compiler + preprocessor engine
lies in proper modification of the lexer. The Antlr lexer is
extended to preprocess the C++ sources and only pass the
relevant tokens to the parser. The parser is never aware of
the preprocessing part; it's only concerned with the tokens
it receives from the lexer. Consider the following snippet:
#define USE_STREAMS
#ifdef USE_STREAMS
#include <iostream>
#else
# include <stdio.h>
#endif
Assume that this snippet is part of a C++ source file. Because USE_STREAMS is defined, the lexer must follow the #include <iostream> branch and pass the parser the tokens it finds there. The #define mapping and the #ifdef conditional evaluation are added responsibilities of the lexer.
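Conceptually, after the lexer has done its preprocessing work, the snippet reduces to the following (an illustrative view -- USE_STREAMS is merely recorded in the lexer's macro table and produces no tokens of its own):
#include <iostream>
The parser then receives only the tokens that result from lexing the contents of the iostream header.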
Enhance the Antlr lexer by defining new tokens
Lexers that work on the C++ language define tokens as per the C++ language standard. For example, you usually find tokens that represent variable names and language keywords defined in an average lexer file.
To combine the preprocessing engine with a lexer, you
must define new token types corresponding to the
preprocessor constructs. On encountering these constructs,
the lexer takes the necessary action. The lexer must also strip comments from the sources. In this case, on
encountering the token for a comment, the lexer needs to
ignore the token and continue looking for the next available
token. It's important to note that these tokens aren't
defined as part of the language standard; they're
essentially implementation-defined tokens.
Define new tokens in the lexer
The lexer must contain the following tokens, along with language-mandated ones: COMMENT_TOKEN, INCLUDE_TOKEN, DEFINE_TOKEN, UNDEF_TOKEN, IFDEF_TOKEN, ELSE_TOKEN, and ENDIF_TOKEN. This article discusses a prototype lexer that supports the preprocessor macros mentioned earlier and variable declarations for the integer and long data types. The bare-bones lexer/parser combo that processes these tokens is shown in Listing 1.
Listing 1. A bare-bones lexer/parser
that supports preprocessor tokens and long/int declarations
header {
#include "Main.hpp"
#include <string>
#include <iostream>
}
options {
language = Cpp;
}
class cppParser extends Parser;
start: ( declaration )+ ;
declaration: type a:IDENTIFIER (COMMA b:IDENTIFIER)* SEMI;
type: INT | LONG;
{
#include <fstream>
#include "cppParser.hpp"
}
class cppLexer extends Lexer;
options {
charVocabulary = '\3'..'\377';
k=5;
}
tokens {
INT="int";
LONG="long";
}
DEFINE_TOKEN: "#define" WS macroText:IDENTIFIER WS macroArgs:MACRO_TEXT;
UNDEF_TOKEN: "#undef" WS IDENTIFIER;
IFDEF_TOKEN: ("#ifdef" | "#ifndef") WS IDENTIFIER;
ELSE_TOKEN: ("#else" | "#elsif" WS IDENTIFIER);
ENDIF_TOKEN: "#endif";
INCLUDE_TOKEN: "#include" (WS)? f:STRING;
COMMENT_TOKEN:
(
"//" (~'\n')* '\n' { newline( ); }
|
"/*" (
{LA(2) != '/'}? '*' | '\n' { newline( ); } | ~('*'|'\n')
)*
"*/"
);
IDENTIFIER: ('a'..'z'|'A'..'Z'|'_')('a'..'z'|'A'..'Z'|'_'|'0'..'9')* ;
STRING: '"'! ( ~'"' )* '"'! ;
SEMI: ';';
COMMA: ',';
WS : ( ' ' | '\t' | '\f' | '\n' {newline();})+ { $setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP); };
MACRO_TEXT: ( ~'\n' )* ;
Define the lexer/parser combo to strip comments
Comments can be written in either the C++ // style or the C-style /* */ multiline style. The lexer is extended to include a rule for COMMENT_TOKEN. On encountering this token, the lexer must be explicitly told not to pass the token to the parser, but instead to skip it and continue looking for the next available token. You do this using Antlr's $setType function. You don't need to add support at the parser end, because the lexer never passes the comment token along. See Listing 2.
Listing 2. Defining the lexer to strip
comments from the C/C++ code
COMMENT_TOKEN:
(
"//" (~'\n')* '\n' { newline( ); }
|
"/*" (
{LA(2) != '/'}? '*' | '\n' { newline( ); } | ~('*'|'\n')
)*
"*/"
)
{ $setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP); };
The code is simple enough for C++-style comments. For the C-style comment, note the {LA(2) != '/'}? used before matching * . It means that if there's a / after the * , this alternative isn't matched, and the lexer instead matches the closing */ . This is necessary because C-style comments don't nest: in /* this is */ an invalid comment */ , the comment ends at the first */ , so the trailing text is invalid C++. It also implies that the minimum look-ahead needed by the lexer is 2. The {LA(2) != '/'}? construct is known as a semantic predicate -- a hint that tells the lexer what to look for next in the input stream.
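To see the comment stripping in action, you can feed a few samples through the generated combo. The following driver is a minimal sketch; it assumes the cppLexer and cppParser classes generated from Listing 1, with the SKIP action of Listing 2 in place:
#include <sstream>
#include "cppLexer.hpp"
#include "cppParser.hpp"
int main()
{
    // Both comment styles are consumed inside the lexer;
    // the parser only ever sees the int/long declarations.
    std::stringstream src("/* C style */ int a; // C++ style\nlong b;\n");
    cppLexer lexer(src);
    cppParser parser(lexer);
    parser.start();
    return 0;
}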
Include file processing
As discussed earlier, the parser needs to see a
continuous stream of tokens. The lexer therefore needs to
switch streams if and when it encounters an included file
and switch back to the previous stream on encountering the
end of the included file. The parser isn't aware of this
stream switch and essentially treats included files as a
continuous token stream. You can achieve this behavior
several ways using Antlr; for a detailed discussion, see the
Resources
section.
This article uses Antlr's TokenStreamSelector
class. Instead of initializing the parser with the lexer as
in Listing 1,
the parser is initialized with a TokenStreamSelector
object. Internally, the TokenStreamSelector
class maintains a stack of lexers. When the lexer encounters
a new input stream, it creates a new lexer to process the new stream and subsequently attaches the new lexer to the TokenStreamSelector object so that all future tokens (until the end of the new stream is reached) are sourced from the new lexer. This is followed by
invoking the retry method of the
TokenStreamSelector class, which now sources tokens
from the new input stream. The only other thing that remains
to be done is to switch back to the previous input stream on
encountering EOF in an included file. To allow for user-defined actions on reaching the end of an input stream, Antlr provides the predefined routine uponEOF. You must modify this routine to switch back to the previous input stream by calling TokenStreamSelector::pop().
Listing 3 shows a code snippet for include-file processing.
Listing 3. Initializing the parser
with a TokenStreamSelector object
#include <iostream>
#include <fstream>
…
TokenStreamSelector selector;
cppParser* parser;
cppLexer* mainLexer;
int main(int argc,char** argv)
{
try {
std::ifstream inputstream("test.c", std::ifstream::in);
mainLexer = new cppLexer(inputstream);
// notify selector about starting lexer; name for convenience
selector.addInputStream(mainLexer, "main");
selector.select("main"); // start with main lexer
// Create parser attached to selector
parser = new cppParser (selector);
parser->setFilename("test.c");
parser->start();
}
catch (std::exception& e) {
std::cerr << "exception: " << e.what() << std::endl;
}
return 0;
}
The lexer-related changes are described in Listing 4.
Listing 4. Include file processing in
the lexer
class cppLexer extends Lexer;
…
{
public:
void uponEOF() {
if ( selector.getCurrentStream() != mainLexer ) {
selector.pop(); // return to old lexer/stream
selector.retry();
}
else {
ANTLR_USE_NAMESPACE(std)cout << "Hit EOF of main file\n" ;
}
}
}
INCLUDE_TOKEN: "#include" (WS)? f:STRING
{
ANTLR_USING_NAMESPACE(std)
// create lexer to handle include
string name = f->getText();
ifstream* input = new ifstream(name.c_str());
if (!*input) {
cerr << "cannot find file " << name << endl;
}
cppLexer* sublexer = new cppLexer (*input);
// make sure errors are reported in right file
sublexer->setFilename(name);
parser->setFilename(name);
// push the previous lexer stream and make sublexer current
selector.push(sublexer);
// ignore this token, re-look for token in the switched stream
selector.retry(); // throws TokenStreamRetryException
}
;
Note that the TokenStreamSelector object is intentionally a global variable, because it must be shared with the lexer's uponEOF method. Also, in the uponEOF method, after calling pop you must call retry so that the TokenStreamSelector again looks into the restored stream for the next token. Note that the include-file processing described here isn't complete; it depends on conditional macro support. For example, if a source snippet includes a file within a #ifdef A .. #endif block where A isn't defined, then the processing for INCLUDE_TOKEN must be discarded. This is further explained in the Handle macros and Conditional macro support sections.
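For example, once conditional support is wired in, an include guarded by an undefined macro is never opened (illustrative input; the header name is hypothetical):
#ifdef A
#include "never_opened.h" // A isn't defined: INCLUDE_TOKEN is discarded
#endif
int x; // parsed normally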
Handle macros
Macro handling is primarily accomplished with the help of a lookup table; this article uses the standard STL map container. In its simplest form, support of macros implies supporting constructs like #define A and #undef A. You can easily do this by inserting the string "A" into the table on encountering #define and removing it on encountering #undef. The table is typically defined as part of the lexer. Also, note that the tokens for #define and #undef must not reach the parser -- for this reason, you add $setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP); as part of the corresponding token processing. Listing 5 shows this behavior.
Listing 5. Supporting #define and
#undef
class cppLexer extends Lexer;
…
{
bool processingMacro;
std::map<std::string, std::string> macroDefns;
public:
void uponEOF() {
… // Code for include processing
}
}
…
DEFINE_TOKEN: "#define" {processingMacro=true;}
WS macroText:IDENTIFIER WS macroArgs:MACRO_TEXT
{
macroDefns[macroText->getText()] = string(macroArgs->getText());
$setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP);
} ;
UNDEF_TOKEN: "#undef" WS macroText:IDENTIFIER
{
macroDefns.erase(macroText->getText());
$setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP);
};
MACRO_TEXT: {processingMacro}? ( ~'\n' )* {processingMacro=false;};
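Independent of the grammar, the bookkeeping in Listing 5 amounts to a small lookup-table class. The following standalone sketch uses illustrative names that aren't part of the generated lexer:
#include <map>
#include <string>
class MacroTable {
    std::map<std::string, std::string> defns_;
public:
    // #define NAME BODY
    void define(const std::string& name, const std::string& body)
        { defns_[name] = body; }
    // #undef NAME
    void undef(const std::string& name)
        { defns_.erase(name); }
    // queried on #ifdef / #ifndef
    bool isDefined(const std::string& name) const
        { return defns_.find(name) != defns_.end(); }
};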
Conditional macro support
Broadly speaking, supporting conditional macros means
that you need to account for which tokens reach the parser,
depending upon the conditional. For example, look at the
following snippet:
#define A
#ifdef A
#ifdef B
int c;
#endif
int d;
#endif
On encountering the token for the first #ifdef , you must
decide whether subsequent tokens that are encountered should
be passed on to the parser, depending on whether A
has been previously defined. At this stage, you use the hash
table defined earlier.
Next, you must consider support of nested #ifdef blocks. Because you need to check the path condition on encountering each #ifdef, it's logical to maintain a stack of path conditions. Each #ifdef/#elsif pushes its path condition onto the stack; the matching #endif pops it off. (Standard C++ spells the else-if directive #elif and allows a full expression; the prototype grammar accepts a simplified #elsif IDENTIFIER form.)
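The stack discipline itself can be sketched in isolation from the grammar (illustrative names; note that a nested branch may pass tokens only if every enclosing branch does):
#include <stack>
struct ConditionStack {
    std::stack<bool> live;
    // called on #ifdef/#ifndef with the branch's condition
    // (already negated for #ifndef)
    void enterIfdef(bool defined) {
        bool parentLive = live.empty() || live.top();
        live.push(parentLive && defined);
    }
    // called on the matching #endif
    void leaveEndif() { live.pop(); }
    // should the current token reach the parser?
    bool passTokens() const { return live.empty() || live.top(); }
};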
Before you see the code that explains the concept in
greater detail, you need to understand another important
method in the Antlr lexer class: nextToken ,
which the parser calls continuously to retrieve the next
token from the input stream. You can't modify this routine
directly, so the best approach is to derive a class from the generated cppLexer class and redefine the
nextToken method based on the path condition. If the
path condition is true, the routine returns the next
available token to the parser; otherwise, it continues until
the matching #endif is found in the token
stream.
Listing 6 shows the source for the derived lexer class.
There should be no direct instantiation of cppLexer in the code; instead, the lexer registered with the TokenStreamSelector should be an instance of the derived class, as the sketch after Listing 6 shows.
Listing 6. Defining a new lexer class
class cppAdvancedLexer : public cppLexer
{
public:
cppAdvancedLexer(ANTLR_USE_NAMESPACE(std)istream& in) : cppLexer(in) { }
RefToken nextToken()
{
// keep looking for a token until you don't
// get a retry exception
for (;;) {
try {
RefToken _next = cppLexer::nextToken();
if (processToken.empty() || processToken.top()) // defined in cppLexer
return _next;
}
catch (TokenStreamRetryException& /*r*/) {
// just retry "forever"
}
}
}
};
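Accordingly, the driver from Listing 3 changes only in the lexer it constructs (a sketch, reusing the globals declared in Listing 3; mainLexer can keep its cppLexer* type because cppAdvancedLexer derives from it):
int main(int argc, char** argv)
{
    std::ifstream inputstream("test.c", std::ifstream::in);
    // construct the derived lexer instead of cppLexer
    mainLexer = new cppAdvancedLexer(inputstream);
    selector.addInputStream(mainLexer, "main");
    selector.select("main");
    parser = new cppParser(selector);
    parser->setFilename("test.c");
    parser->start();
    return 0;
}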
The lexer-related changes are shown in Listing 7.
Listing 7. Conditional macro support
in the lexer
{
public:
std::stack<bool> processToken;
std::stack<bool> pathProcessed;
std::map<std::string, std::string> macroDefns;
void uponEOF() {
… // Code for include processing defined earlier
}
}
IFDEF_TOKEN {bool negate = false; } :
("#ifdef" | "#ifndef" {negate = true;} ) WS macroText:IDENTIFIER
{
bool macroAlreadyDefined =
    (macroDefns.find(macroText->getText()) != macroDefns.end());
bool condition = negate ? !macroAlreadyDefined : macroAlreadyDefined;
// a nested branch is live only if the enclosing branch is live
bool parentLive = processToken.empty() || processToken.top();
processToken.push(parentLive && condition);
// records whether this #if chain is already satisfied (or dead)
pathProcessed.push(!parentLive || condition);
$setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP);
}
;
ELSE_TOKEN: ("#else"
{
bool& pathCondition = processToken.top();
pathCondition = !processToken.top(); // no other path is true
}
|
"#elsif" WS macroText:IDENTIFIER
{
if (!processToken.top() && !pathProcessed.top()) {
if (macroDefns.find(macroText->getText()) != macroDefns.end()) {
processToken.push(true);
pathProcessed.push(true);
}
}
else {
bool& condition = pathProcessed.top();
condition = false;
}
)
{
$setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP);
}
;
ENDIF_TOKEN: "#endif"
{
processToken.pop();
pathProcessed.pop();
$setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP);
};
Examining the code, you see that processToken is defined as a stack of booleans that stores the path conditions. On encountering #ifdef, the lexer pushes a new path condition: true only if the tested macro satisfies the directive and the enclosing branch is itself live. The derived lexer's nextToken method then consults the top of this stack to decide whether a token reaches the parser.
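Tracing the earlier snippet makes this concrete (an illustrative walk-through of the processToken stack):
// #define A  ->  A enters the macro table
// #ifdef A   ->  processToken: [true]          (A is defined)
// #ifdef B   ->  processToken: [true, false]   (B isn't defined)
// int c;     ->  top() == false: tokens are skipped
// #endif     ->  processToken: [true]
// int d;     ->  top() == true: tokens reach the parser
// #endif     ->  processToken: []              (empty: pass everything)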
Support expressions as part of macros
You can use a macro to represent a complex expression
that involves numbers, identifiers, mathematical operators,
and even function calls. To do so, you must use the macroArgs string to replace every occurrence of the macro name in the post-processed C++ code; the context determines whether this replacement is syntactically valid. For example, consider this situation: #define A b, c. Somewhere down the line, int A; is valid C++, but switch(A) is clearly invalid C++. You need to parse the stream associated with the replacement text on a per-occurrence basis and let the grammar rules determine the syntactic validity of the context.
The code for this approach works as follows:
- Capture the token corresponding to the macroText identifier in cppAdvancedLexer::nextToken.
- Retrieve the macroArgs text from the macro table.
- Create an istream for macroArgs and a new copy of the lexer to parse this stream.
- Attach this lexer to the TokenStreamSelector object, and call retry -- this fetches the next token from the newly created stream.
The uponEOF method provided in cppLexer takes care of switching context back on reaching the end of the stream. Listing 8 illustrates this discussion.
Listing 8. Modifying the nextToken method of the lexer class for macro replacement
RefToken cppAdvancedLexer::nextToken()
{
// keep looking for a token until you don't
// get a retry exception
for (;;) {
try {
RefToken _next = cppLexer::nextToken();
map<string, string>::iterator defI = macroDefns.find(_next->getText());
if (defI != macroDefns.end() && defI->second != "")
{
// stream the macro's replacement text; allocated on the heap so it
// outlives this call (cf. the ifstream in Listing 4)
std::stringstream* macroStream = new std::stringstream(defI->second);
cppAdvancedLexer* macrolexer = new cppAdvancedLexer(*macroStream);
macrolexer->setFilename(this->getFilename());
selector.push(macrolexer);
selector.retry();
}
if (processToken.empty() || processToken.top())
return _next;
}
catch (TokenStreamRetryException& /*r*/) {
// just retry "forever"
}
}
}
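With this in place, an expression-style macro is expanded at each point of use. For example, given the prototype's int/long grammar, the following illustrative input parses cleanly:
#define LIST a, b
int LIST;  // the lexer re-lexes "a, b": equivalent to int a, b;
long d;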
Conclusion
This article discussed the basic methodology you can
follow to create a C++ preprocessor using Antlr. Developing
a full-fledged C++ preprocessor is beyond the scope of a
single article. Although this article doesn't deal with
potential errors and error-handling strategies, a real-world
preprocessor must account for these factors; for example,
the ordering of tokens in conditional macro processing is
significant, and you must take steps to ensure that it's
handled correctly.