Parsifal Software


XIDEK
Extensible Interpreter Development Kit
Reference documentation


dxi: Direct Execution Interpreters

Introduction

The baseline "Direct Execution" interpreters, found in cll\base\dxi.syn and pll\base\dxi.syn execute the statements in a script as they parse the script. For most aspects of the interpreter, this is an extremely simple and straightforward way to approach the problem. It does not require extensive support modules as is the case with other approaches.

This immediacy, however, is purchased at the price of substantial complications in parsing. Because the parser must somehow repeat the body of loops as indicated by the loop control parameters or skip a block of code as directed by an if statement, some substantial changes to the syntax are required. In addition, the parser must also support a variety of execution modes.

Apart from these problems, however, the dxi interpreters are straightforward and easy to modify. For many problems, i.e., those which do not add new loop or conditional constructs, direct execution is probably the easiest approach.

In what follows, we describe first the features of the CLL version of the dxi.syn syntax file and how it differs from the cll.syn file. Then we describe the differences required for the PLL version.


C Prologue

Introduction

The C Prologue is the initial piece of embedded C/C++ code in a syntax file. AnaGram copies it to the parser file before writing any generated code, so it is the appropriate place to include definitions and declarations needed by the parser. In particular, any data type that is used as a token type must be defined in the C prologue, either explicitly or by means of an include file.

Include File

The iostream.h system header file is included to support the file output performed by the dump and print statements.

Mode Enumeration

When executing a script at parse time, it is often necessary to skip over a portion of the script without executing it. When skipping over code, it is also necessary to be able to determine when it is time to stop skipping code. The Mode enumeration identifies the various states the parser may occupy with respect to skipping code. The values of the enumeration are as follows:
executeMode
The parser is executing the script as it parses.

continueMode
The parser has encountered a continue statement and is skipping the remainder of a loop. When the end of the loop is found, the input pointer will be reset to continue the loop and the mode will be reset to executeMode.

breakMode
The parser is skipping to the end of the loop. This can occur because the loop condition has evaluated to false or the parser has encountered a break statement. When the end of the loop is found, the mode will be reset to the value it had on entering the loop. Thus there is no confusion in the case of nested loops.

elseMode
The parser has encountered an if statement whose condition evaluated to false. When an else is found that matches the if or the end of the if statement is encountered without finding an else the mode is reset to executeMode. The elseMode mode is also used for conditional expressions.

blockMode
The parser has finished the true side of an if, and is skipping over the else block. On reaching the end of the else block, the mode is reset to the mode in effect at the beginning of the if statement. Thus, there is no confusion in the case of if-else statements inside an else block.

scanMode
The parser is scanning the beginning of a for loop to identify its principal parts.

ParseContext Struct

The ParseContext struct is used in place of the FileLocation class to capture and record the mode as well as the location of the parser within the script. The class inherits from FileLocation and therefore captures the input pointer, the line number and the column number. The struct adds the following field:
Mode mode;
This field is used to record the parser mode.

In the configuration section of the syntax file, the context type is specified as ParseContext. The GET_CONTEXT macro calls the location() member function of the parser control block which in turn constructs a ParseContext object to store on the context stack.

The value of ParseContext corresponding to the beginning of a rule is given by the CONTEXT and PCONTEXT(PCB) macros.

VALUE Macro

The VALUE macro controls the evaluation of expressions in accordance with the current execution mode. It also checks the result for errors.

The argument to the macro is an expression. If the current mode is executeMode the expression is evaluated. If the mode is anything else, an uninitialized Value object is returned.

Note that the expression is evaluated using the arithmetic operators defined in the Value class. If they encounter an error, such as division by zero, they will throw an exception.


Configuration Section

Configuration parameters that are common to all the parsers in the kit are discussed here. The following additional configuration parameters are set in the CLL version of dxi.syn:
lexeme {SCRIPT, WHILE LOOP, DO LOOP, STATEMENT, FOR FIELD}
The listed tokens are wrapper tokens for the selector keywords used in the CLL version of the parser to identify the actual syntax to be parsed on a particular call to the parser. Declaring them to be lexemes causes AnaGram to look for white space following the detection of the wrapper token rather than following the detection of the keyword. This apparently subtle distinction allows the reduction procedures attached to the wrapper rules to be executed before white space is scanned. This causes white space to be scanned after the input pointer is changed.

wrapper {AgStack<Value>}
The wrapper declaration ensures that constructors and destructors are properly called when instances of the named classes are stored on the parser value stack. If the wrapper declaration were not used, the objects would be stored on the stack by coercing a pointer. This would cause the classes to malfunction. The rule of thumb is that if a class or any of its member fields overrides the assignment operator, it should have a wrapper declared.

parser name = dxi
This statement causes the generated parser to be named dxi. The struct defining the parser control block will be named dxi_pcb_struct and a typedef dxi_pcb_type is also defined to be equivalent to struct dxi_pcb_struct.

Parser Control Block Extensions

The parser control block is a struct used in an AnaGram parser to contain the entire state of a parse.

In order to support reentrancy, it is convenient to declare local data used by the parser in the parser control block. It is also convenient to declare functions used by the parser as members of the parser control block, though this is not, strictly speaking, necessary.

Added Fields

Dataset &dataset;
This field provides a reference to the Dataset object the interpreter is to operate on. It is set by the constructor.

Mode mode;
This field controls the mode of the parser, whether it is executing as it parses or simply skipping over text.

int loopActive;
loopActive is initialized to zero by the constructor of the parser control block. It is incremented on entering a loop and decremented on loop exit. It is used to determine whether break or continue statements are allowed. Since the PLL version does not implement break and continue statements, it does not require this switch.

FileLocation startLocation;
startLocation contains the actual starting location for the parse, as distinguished from the pointer to the initial keyword which determines which actual grammar is to be used.

Value returnValue;
This is the value that will be returned to the calling program as the value of the interpret() function. If it is not set explicitly by a return statement, the value will be have type uninitType

Local Functions

The following functions are declared to be member functions of the parser control block. The actual implementations of these functions are found in the embedded C portion of the syntax file.

Modifications to the Syntax

Introduction

There are two sets of modifications to the language syntax in the dxi interpreters. The first is straightforward: to simplify calculation, arithmetic operators, which in the language syntax are factored out of the expression syntax, have been substituted back into the expression syntax. The only significant effect of this substitution is to increase the number of states in the parser.

The alternative would be to pass an appropriate function pointer as the value of the various expression tokens. Such an approach would detract seriously from the basic simplicity of the direct execution approach to interpreting.

The second set of modifications results from the problems of implementing loops. In order to compute while parsing it is necessary to repeat the code inside loops. There are several ways to do this of varying complexity. It is possible, for example, to "unroll" loops, by backing up the input pointer and reparsing the loop. The syntax that correctly describes the actual input to the parser then is quite different from, and substantially more complex than, the syntax of the actual script language. Writing a nonambiguous syntax to describe this process is nontrivial.

Fortunately, there is a simpler technique for handling loops. In this approach, the syntax for the main parser identifies only the loop header. A reduction procedure then makes recursive calls to the parser to parse the loop. When the reduction procedure returns, the loop has been correctly executed and the input pointer has been advanced over the body of the loop, so that from the point of view of the main parser, the body of the loop doesn't even exist.

To the main parser, a while loop consists simply of the keyword "while". A do while loop consists solely of the keyword "do". A for loop is more complicated. It retains the entire control portion of the loop. The reason for this is that the condition and increment fields of the loop do not appear physically in the script in the order that they are executed. Thus, the parser must do a preliminary scan to determine the locations of these fields.

Two functions, simpleLoop() and forLoop() have been added to the parser control block to control the execution of loops. Both of these functions invoke parseSyntax() to actually make the recursive calls to the parser.

This approach makes the execution of loops quite easy, but at the expense of setting the parser up to handle the recursive calls correctly. In particular, this requires setting up the grammar so that the actual syntax to be parsed can be specified when the parser is called.


Syntax Selection

In order to support the recursive calls used to implement loops, the grammar for the direct execution parsers has been modified to allow the program that calls the parser to specify precisely what it should parse.

The technique used here is to use a number of "selector" keywords to select the specific syntax:

  • "script"
  • "whileLoop"
  • "do"
  • "statement"
  • "forField"
The keyword is specified as an argument to the parseSyntax() function which sets up the pointer in the parser control block. The second argument to parseSyntax() is a FileLocation object which is stored in the parser control block where the setStart() function can use it to set up the initial location in the script for the parse.

The way it works is as follows. Each selector keyword has a wrapper token:

    SCRIPT
     -> "script"                             =PCB.setStart();

    WHILE LOOP
     -> "whileLoop"                          =PCB.setStart();

    DO LOOP
     -> "do"                                 =PCB.setStart();

    STATEMENT
     -> "statement"                          =PCB.setStart();

    FOR FIELD
     -> "forField"                           =PCB.setStart();
The wrapper tokens, in turn, are used to select the actual syntax to be parsed. This is done in the specification of the "start" token for the grammar:
  (Value) grammar
    -> SCRIPT, statements, eof                       =Value();
    -> WHILE LOOP, loop condition, statement         =Value();
    -> DO LOOP, statement, DO WHILE, loop condition  =Value();
    -> STATEMENT, statement                          =Value();
    -> FOR FIELD, optional expression:x, ';' + ')'         =x;
When the selector keyword is recognized, the setStart() function is called to change the input pointer in the parser control block to point to the position in the script which was previously provided as an argument to the parseSyntax() function. The line and column numbers are set properly as well, in order to make sure the location of error diagnostics can be correctly identified.

Note that the parser must be prepared to return a Value object to support the for field case.

There is one other rather subtle detail that must be handled correctly to make this scheme work. The disregard white space feature is implemented by rewriting the grammar to tie optional white space to each lexeme. That means that, were no special measures taken, when the rewrite takes place white space would be tied to the selector keywords and would be parsed over before the reduction procedures are called. In and of itself, there's no problem with this, except that the white space that needs to be skipped is that which is seen after the reduction procedure tied to the selector keyword has been called and the input pointer changed. Therefore, the wrapper tokens have been declared as lexemes. This causes the white space to be scanned only after the reduction procedures have been invoked to modify the input pointer.

One unfortunate aspect of this selector method is that the File Trace feature of AnaGram cannot be used to parse a sample script file. This happens because the actual stream of characters seen by the parser differs substantially from the content of the script file.


While and Do While Loops

Because while and do while loops are executed recursively by calls to parseSyntax(), which advances the input pointer over the body of the loop, while and do while statements are represented in the main syntax simply by the keywords "while" and "do" respectively.

The function simpleLoop() controls the actual execution of the loops. simpleLoop() in turn, calls parseSyntax() recursively, using the specified selector string as an argument.

This change in the representation of the while statement means that the while statement is no longer recursive, as seen from the primary syntax. Thus the while statement is no longer involved in the "dangling else" problem. For example, consider the statement:

  if (x) while (y) if (x < y) {...} else {...}
Because the second if statement is completely invisible to the primary syntax, there is no possible question about which if the else is associated with. Thus instead of having both open while and closed while statements in the grammar, there is only one while statement, and it is classified as a simple statement.

There is an important point to notice in the parsing of the WHILE LOOP and DO LOOP syntaxes. In the reduction procedure for the loop condition rule, which is used for both while loops and do while loops, the outcome of the condition is tested only if the parser mode is executeMode. Otherwise, the mode is always set to breakMode, guaranteeing an exit from the loop.

In order to handle the loop condition properly, the "while" keyword used in the DO LOOP syntax is wrapped by the DO WHILE token. The reduction procedure attached to the wrapper calls changeMode()to change the parser mode back to executeMode if has been set to continueMode in the course of executing the loop.


For Loops

The treatment of for loops is substantially more complicated than that of while and do while loops. A for statement consists of four separate fields: the initialization field, the condition field, the increment field, and the body of the loop itself. These fields must be executed in the correct order, an order which differs from the order in which the fields appear physically in the script.

The first field in a for statement is the initialization field. It must be executed whatever the outcome of the loop condition. Of course, it is not executed if the parser mode is not executeMode. Because the condition field and the increment field must be executed independently, under the control of the forLoop() function, the parser mode is set to scanMode, so that the locations of these fields can be captured without executing them. Since the expressions in these fields can have side effects, executing them when they should not be executed could lead to errors in the values calculated by the script.

The locations of the condition and increment fields are passed as arguments to forLoop(), which, as its first order of business, calls restoreMode() to reset the parser mode to the value it had when the for loop was encountered. Thus, if the mode was anything but executeMode, the entire body of the loop will be skipped over without being executed.

As with the while and do while loops, the body of a for loop is not visible to the script syntax, and thus, like them, is not involved in the "dangling else" problem and can therefore be treated as a simple statement.


Embedded C

Macro Definitions

SYNTAX_ERROR
This definition overrides the default definition of SYNTAX_ERROR that AnaGram provides. The default definition simply writes the diagnostic to stderr and causes the parser to return with the exit_flag field of the parser control block set to AG_SYNTAX_ERROR_CODE. The overriding definition calls reportError() which formats an error message and throws an InterpreterError exception.

GET_CONTEXT
The GET_CONTEXT macro implements the context tracking feature of the AnaGram parser. In particular, it creates a ParseContext object that describes the current location in the input file and stores it on the context stack.

ParseContext Member Functions

ParseContext::ParseContext(dxi_pcb_struct &pcb);
The constructor for ParseContext initializes its member fields from the corresponding fields of the parser control block.

Parser Control Block Member Functions

dxi_pcb_struct(Dataset &);
The constructor for the parser control block sets the dataset reference. It initializes mode to executeMode and loopActive to zero. The pointer and startLocation fields are set by parseSyntax().

void parseSyntax(const char *selector, const FileLocation &start);
The parse syntax function parses the text beginning at the location specified by start using the syntax selected by selector. It does this by setting startLocation to the value given by start, setting the input pointer for the parser to the string addressed by selector, and then calling the parser function, dxi(), from a try block. If an ErrorMessage exception is thrown during the execution of the specified script text, it is caught, and reportError() is called to add location information to the message, before throwing an ErrorDiagnostic exception.

void setStart();
setStart() sets the pointer, line, and column fields of the parser control block with the starting location in the script for the parse that is to take place.

void simpleLoop(const char *selector);
simpleLoop executes a while loop or a do while loop depending on the value of selector, which may be either "whileLoop" or "doLoop". Execution of the loop is performed by recursive calls to the parser.

The first thing simpleLoop() does is to capture the current ParseContext in loopLocation so it will have a record of the beginning location of the loop. Then it creates a parser control block, loopPCB, to control the nested parser calls. The loopActive flag is set in loopPCB and the mode is set. Note that the mode is set to breakMode if the mode of the parent parser is anything but executeMode.

With the setup of the nested parser complete, it is executed by a call to parseSyntax(). When the loop condition is encountered by the parser as it parses the nested loop, it will set the mode of the parser to breakMode if the condition is false. Thus when parseSyntax() exits after one pass over the loop, the value of mode can be used to determine whether to continue the loop or to exit. Notice that the location of the parent parser is set to the final location of the nested parser, so that the portion of the script parsed by the nested parser is never seen by the parent parser.

void forLoop(const FileLocation &, const FileLocation &);
forLoop() executes a for loop by making recursive calls to the parser. On entry, the mode of the parent parser is reset to the value it had when it encountered the for statement, since it had been set to scanMode to pass over the condition and increment fields without executing them.

Next, forLoop() captures the current ParseContext in statementLocation so it will have a record of the location of the statement that is controlled by the loop. Then it creates a parser control block, loopPCB, to control the nested parser calls. The loopActive flag is set in loopPCB and the mode is set. Note that the mode is set to breakMode if the mode of the parent parser is anything but executeMode.

Execution of the loop is performed by recursive calls to the parser using the parseSyntax() function. If the mode of the parent parser is executeMode, the loop condition is evaluated. If the result is false, the mode is set to breakMode. Then the statement is executed. The location of the end of the statement is captured and the location of the parent parser is set so it can eventually continue following the for statement. If the mode of the nested parser is now breakMode, forLoop() simply returns. Otherwise, the increment expression is evaluated, and the loop continues.

ParseContext location();
This function merely creates and returns a ParseContext object capturing the current state of the parse. It is used only by the GET_CONTEXT macro. It should not be necessary, but using the ParseContext constructor directly in the GET_CONTEXT macro appears to confuse a widely used C++ compiler.

void setLocation(const FileLocation &l);
This function sets the pointer, line and column number fields of the parser control block from the argument FileLocation object. Note that the mode field is unaffected.

void setReturnValue(const Value &v);
Sets the return value of the interpret() function and terminates the parse.

int changeMode(Mode currentMode, Mode desiredMode);
This function changes mode conditionally. If the actual parser mode is the same as currentMode, then mode is set to desiredMode.

The return value is non-zero if mode was changed, zero if it remains unchanged.

void interruptLoop(Mode newMode);
This function is invoked on encountering a break or continue statement. It sets mode to the specified value which may be either breakMode or continueMode.

void restoreMode(Mode currentMode);
This function does a conditional restore of the parsing mode. If mode is equal to currentMode then mode is restored from the context stack. Otherwise, mode is left unchanged.

void restoreMode();
The parser mode, mode, is restored from the context stack.

void dump(const AgString &) const;
The dump() function implements the dump statement. It locates the variable whose name is provided as an argument, then prints the variable name and value on cout.

void print(const Value &) const;
The argument is converted to a string and printed on cout.

Value powm(Value &lv, const Value &exp) const;;
This function is used to implement the assignment version of the exponentiation operator. The first argument is an lvalue which identifies a variable. The value of the variable is raised to the power specified by the second argument.

void reportError();
The parsing engine calls this function (as directed by the SYNTAX_ERROR macro) when it encounters a syntax error. The function formats the error information and throws an ErrorDiagnostic exception.

void reportError(const char *msg);
This function is available to add context information to an error message and then throw an ErrorDiagnostic exception. It is used by the interpret() function to handle exceptions thrown during execution by methods belonging to the Value class.

The External Interface: interpret()

The external interface for the interpreter declares an instance of a parser control block. It then calls the parseSyntax() function specifying the syntax selector "script" and a FileLocation object specifying the starting location of the script.


PLL Differences

Introduction

The differences between the pll version of dxi.syn and pll.syn are very nearly identical to the corresponding CLL differences discussed above. The exceptions are as follows:
  • To support the very different for loop syntax the following changes were made:
  • Different requirements for selector keywords.
    • The syntax selector keyword "repeat" replaces "do".
    • There is no need for a "forField" selector.
  • Unsupported functionality:
    • There is no need for a powm() function.
    • Because there are no break or continue statements a loopActive switch is not necessary.

ForControl Struct

Introduction

The ForControl structure, defined in the C Prolog section of the syntax file, is used only by the PLL version of the dxi interpreter. It is used to encapsulate for loop parameters. Constructors have been defined as a convenience, though they are not really necessary. Note that since the struct contains an instance of the Value class, and is used as a token value in the parse, there must be a wrapper declaration for it in the configuration section of the file.

Member Fields

Value lvalue;
This field identifies the control variable for the loop.

long begin;
This is the initial value for the loop.

long increment;
This is the loop increment.

long end;
This is the final value for the loop.

Constructors

ForControl();
Null constructor.

ForControl(const Value &lv, const Value &b, const Value &i, const Value &e);
Initializes all member fields.

Parser Control Block Member Functions

ForControl forControl(const Value &, const Value &, const Value &, const Value &);
This function checks the parser mode to determine whether to use the null constructor or the real constructor for the ForControl struct. If the parser is not in executeMode the parameters to the function are not initialized.

void forLoop(ForControl &);
This function corresponds to the forLoop() function defined in the CLL version of dxi.

forLoop() executes a for loop by making recursive calls to the parser. On entry, the mode of the parent parser is reset to the value it had when it encountered the for statement, since it had been set to scanMode to pass over the condition and increment fields without executing them.

Next, forLoop() captures the current ParseContext in statementLocation so it will have a record of the location of the statement that is controlled by the loop. Then it creates a parser control block, loopPCB, to control the nested parser calls.

After checking that the loop increment value is non-zero, it calculates the number of times the loop body must be executed. Because the increment can be negative, it is much easier to calculate the count than to test the loop variable on each pass through the loop.

The mode is then set to breakMode if the mode of the parent parser is anything but executeMode. If the mode is executeMode the loop variable is set to the beginning value.

Execution of the loop is performed by recursive calls to the parser using the parseSyntax() function. Note that if the mode is breakMode, the loop will be parsed exactly once and the function will return. Otherwise, the loop body is executed the requisite number of times and the loop variable is updated after each pass through the loop body.


Table of Contents | Parsifal Software Home Page


XIDEK
Extensible Interpreter Development Kit
Copyright © 1997-2002, Parsifal Software.
All Rights Reserved.