Parsifal Software


XIDEK
Extensible Interpreter Development Kit
Reference documentation


CLL Syntax

Introduction

The file cll.syn contains the syntax that describes the C-like Language used by one set of baseline interpreters in the Extensible Interpreter Development Kit. This syntax supports the basic elements of a script language: recognition of variable names and constants, evaluation of expressions, and execution of conditional and loop statements, all based on the familiar C syntax. These may fairly be regarded as the portions of C that have become more or less standard and are not the subject of any significant controversy. Omitted are declarations, pointers, subscripts, and switch statements.

The syntax also contains three additional elements. To illustrate changes in expression syntax, Fortran style exponentiation has been added. To aid in debugging scripts and to illustrate how to add new statements to the grammar, dump and print statements have been added.

The files dci.syn and ast.syn use this syntax unchanged except for the addition of reduction procedures. The file dxi.syn introduces a number of syntactic changes in order to facilitate computation. The changes do not, however, affect the language recognized by the syntax.

cll.syn is a convenient peg on which to hang a discussion of those features of the script language and the parsers that are common to all the interpreters in the development kit. Apart from the lexical section of the grammar, it lacks reduction procedures, since they differ from interpreter to interpreter. cll.syn is not itself used in any of the interpreters in the kit. It can be used by developers as a clean starting point for their own development.

This document describes the elements of cll.syn in the order in which they appear.


C Prologue

The "C Prologue" is the initial piece of embedded C/C++ code in a syntax file. AnaGram copies it to the parser file before writing any generated code, so it is the appropriate place to include definitions and declarations needed by the parser. In particular, any data type that is used as a token type must be defined in the C prologue, either explicitly or by means of an include file.

There are two header files that are required for all the parsers:

comdefs.h
Provides the common definitions, primarily data type definitions, used in all the interpreters in the Extensible Interpreter Development Kit.

math.h
The standard math library is used by the numeric conversion syntax found in all of the interpreters.


Configuration Section

Introduction

The "configuration section" of a syntax file contains switches and declarations used to specify specific attributes of a parser. The following configuration parameters are common to all of the syntax files in the kit.
disregard white space
The white space token is defined to consist of blanks, tabs, newlines and other non-printing characters, as well as both C and C++ style comments. The disregard option works by rewriting the grammar so that each lexical token is replaced by a compound token that consists of the original lexical token followed by zero or more white space tokens.

Lexical tokens are raw text characters plus any tokens declared in lexeme statements.

distinguish lexemes
In the grammar as it stands, the distinguish lexemes statement is not absolutely necessary, and could be removed. It is wise to include this statement, since extensions to the grammar could easily make it necessary.

lexeme {integer, real, name, string element, character constant}
The lexeme statement ensures that the disregard logic will not be applied to characters interior to the listed tokens. Instead, white space will automatically absorbed following the specified tokens.

distinguish keywords {'a-z' + 'A-Z'}
This declaration specifies that any keyword which consists entirely of alphabetic characters must be followed by a character which is not alphabetic in order to be recognized as a keyword. Without this provision, a statement such as
done = true;
would be recognized as an erroneous "do" statement.

wrapper {AgString, Value}
The wrapper declaration ensures that constructors and destructors are properly called when instances of the named classes are stored on the parser value stack. If the wrapper declaration were not used, the objects would be stored on the stack by coercing a pointer. This would cause the classes to malfunction. The rule of thumb is that if a class or any of its member fields overrides the assignment operator, it should have a wrapper declared.

context type = FileLocation
The context type declaration requests AnaGram to maintain a context stack and specifies the data type to be stored on the stack. The context stack is used to provide information about the state of the program as it was at any point in the parse. Access to the contents of the context stack is provide by the CONTEXT and PCONTEXT macros. The context data is provided by the GET_CONTEXT macro.

The direct execution interpreters use ParseContext, which derives from FileLocation, instead of FileLocation.

parser file name = "#.cpp"
This statement directs AnaGram to give the parser file the same name as the syntax file with the file extension set to cpp.

header file name = "../include/#.h"
This statement directs AnaGram to give the header file the same name as the syntax file and to place it in the include directory. Without this statement the header file would be written to the directory that contains the syntax file.

test file mask = "*.scf"
This statement tells the File Trace option to initially display only files with the extension scf when it pops up a file selection menu. The common demo program also uses the same default extension, so if the extension is changed, it should be changed in both places.

no cr
The no cr directive tells AnaGram not to insert carriage returns in the code it generates. Generally, compilers running in a win32 environment function correctly whether or not carriage returns are present. In any unix environment, however, carriage returns must be avoided. Thus maximum portability is guaranteed by specifying no cr.

pointer input
The pointer input directive informs AnaGram that it should set the parser up to read its input from an array in memory using a pointer. The pointer variable in the parser control block is declared, by default, as unsigned char *. The current value of the pointer is available to reduction procedures as PCB.pointer.

pointer type = const unsigned char *
The default type of the input pointer is unsigned char *. This statement simply changes it to constant, since the input text is not changed by any of the parsers.

reentrant parser
The reentramt parser directive tells AnaGram to configure the parser function as a reentrant function that takes a pointer to the parser control block as an argument. Such parsers are threadsafe as long as reduction procedures do not set or modify static or global variables. The extend pcb statement is used in all three parsers in the development kit to add extra variables to the parser control block to avoid the use of static or global variables.


Statement Syntax

Introduction

A script is a sequence of statements. The types of statements supported are the same as in C with the exception of switch statements and declarations. In addition, there are simple dump and print statement.

If and If-Else Statements

It is conventional to write the syntax for the if and if-else statements in C and C++ as follows:
  if statement
    -> "if", '(', expression, ')', statement
    -> "if", '(', expression, ')', statement, "else", statement
This has the advantage of succinctness and clarity. It has the disadvantage of ambiguity. In the statement
    if (condition_one) if (condition_two) do_something();
    else do_something_different();
Does the else belongs to the first or the second if? Although, by convention the else belongs to the second if, that is, to the if physically closest to it, both interpretations are consistent with the above syntax. This problem is often referred to as the "dangling else problem", or the "if-else ambiguity".

There are two ways to deal with the problem: resort to some sort of trick to force the parser to select the correct interpretation, or rewrite the grammar to avoid the difficulty. Using AnaGram, it is possible to use a sticky directive to handle the problem. Generally, however, it is best not to resort to such tricks. The parsers in the development kit, therefore, have been written in such a manner as to avoid this ambiguity.

In order to support this technique, statements are classified as "open statements" or "closed statements", depending on whether or not they could be followed by an else clause.

There are, therefore, both open and closed if statements:

  open statement
    -> if condition, statement
    -> if condition, closed statement, "else", open statement

  closed statement
    -> if condition, closed statement, "else", closed statement
Notice that a simple if statement is always an open statement. If-else statements are open or closed depending on whether the statement following the else is open or closed.

The script language designer should note that the dangling else problem is a direct result of what is fundamentally unnecessary generality in the definition of C. If the definitions were

  if statement
    -> "if", '(', expression, ')', compound statement
    -> "if", '(', expression, ')', compound statement,
       "else", compound statement
there would be no problem. In other words, by allowing the user to avoid writing a pair of curly braces, substantial complication has been introduced into the language, as well as substantial opportunities for introducing bugs into programs. Certainly every programmer has seen the following problem:
    if (condition)
      do_something();
      do_something_else();
where the call to do_something_else() is not controlled by the if, in spite of the suggestive indentation. If the language had been defined to only allow compound statements to be controlled by if statements, this opportunity for error would not exist.


While Statements

While statements are exactly as in C. In order to handle the dangling else problem, the syntax contains two while statements: an open while statement and a closed while statement.
  open statement
    -> WHILE, '(', expression, ')', open statement

  closed statement
    -> WHILE, '(', expression, ')', closed statement
Note that a while statement is open or closed depending on whether the statement controlled by the while is open or closed.


For Statements

For statements are exactly as in C. In order to handle the dangling else problem, the syntax contains two for statements: an open for statement and a closed for statement.
  open statement
    -> FOR, '(', optional expression, ';',
       optional expression, ';',
       optional expression, ')', open statement

  closed statement
    -> FOR, '(', optional expression, ';',
       optional expression, ';',
       optional expression, ')', closed statement
Note that a for statement is open or closed depending on whether the statement controlled by the for is open or closed.


Do-While Statements

The do-while statement is exactly as in C. Since the statement controlled by the do and while keywords is entirely enclosed by them, the scope of an if between the do and while keywords is limited by the while. Thus there is no opportunity for a dangling else, and the do-while statement is a closed statement.


Break and Continue Statements

The break and continue statements are exactly as in C. These statements, though apparently simple, are among the most difficult to implement in interpreters. The reason is that all other statements fit into a neat hierarchical structure. The break and continue statements do not fit neatly in that hierarchy. Thus a good deal of the complexity of the interpreters in the development kit is directly a result of choosing to implement break and continue. A script language designer might well choose to avoid break and continue in the interest of simplicity.


Expression Statements

An expression statement is exactly like an expression statement in C: simply an expression followed by a semicolon.


Null Statement

The conventional definition of an expression statement consists of an optional expression followed by a semicolon. It is more convenient to think of the expression statement as always having a non-vanishing expression, and adding a "null statement" to deal with a free-standing semicolon.


Compound Statement

A compound statement, as in C, consists of zero or more statements enclosed in curly braces.


Return Statement

A return statement, as in C, consists of the keyword "return" and an optional expression. The return statement causes the interpreter to exit to the calling program returning the value of the specified expression. If no expression is provided, the return value is a Value object with type equal to uninitType.


Dump Statement

The dump statement has been added to aid in testing and debugging. It also serves as an example of how to add non-C statements to the syntax.

The dump statement consists of the keyword "dump", followed by a comma-delimited list of variable names. All the interpreters implement this statement by printing the name of each listed variable and its value, each variable on a line by itself.


Print Statement

The print statement is a second example of adding a non-C statement to the syntax. It takes a comma delimited list of assignment expressions and prints each value in turn. Note that the print statement could just as well have been written
  print statement
    -> "print", arg list


Expression Syntax

Introduction

The expression syntax used for all the interpreters in the development kit is almost that of C and C++. It does not support subscripts, pointers, casts other than (long) or (double), or the general function call syntax. On the other hand, it does support exponentiation using the ** operator as in Fortran. If exponentiation is not desired, it may be removed quite easily, as described here.

The expression syntax, following conventional usage, implements operator precedence using normal syntactic conventions. It does not use shortcuts such as operator precedence declarations.

The conventional method for writing expression syntax is to define a token for each level of precedence, beginning with the lowest level. At least two rules are written for each of these tokens. The first rule simply consists of the token that defines the next higher level of precedence. The other rules define the actual operators at the given precedence level. These latter rules are written as left or right recursive rules, depending on whether the relevant operator groups to the left or to the right.

Operators are said to "group to the left" or "group to the right" depending on the implicit parenthesization that is understood. Subtraction, for example, groups to the left. We interpret

    a - b - c
as
    (a - b) - c
It would contravene mathematical conventions to compute it as
    a - (b - c)
On the other hand, the logical or operator groups to the right:
    x || y || z
is fairly interpreted as
    x || (y || z)
From a mathematical point of view, associative operators may be grouped either to the left or the right. Nonassociative operators, to be consistent with conventional usage, must be grouped to the left.

Since grouping operators to the right requires right recursion and right recursion runs a remote risk of parser stack overflow, it is preferable to group operators to the left wherever it makes sense to do so.

In the following, the tokens that define expression syntax are described in order of increasing precedence.


expression

  expression
    -> assignment expression
    -> expression, ',', assignment expression
These rules define the comma operator as the lowest level of precedence in the expression syntax. Since expression is left recursive, items group to the left. The next higher level of precedence is specified by assignment expression.

Note that the expression to the left of the comma must be evaluated and its value discarded in order to allow for side effects. The proper functioning of the initialization, condition, and increment fields of the for statement depend on this property of the comma operator.

Note that if you need a comma delimited list of expressions, as in arg list and the print statement, you should use assignment expression rather than expression to avoid confusion with the comma operator.


assignment expression

  assignment expression
    -> conditional expression
    -> name, assignment op, assignment expression

  assignment op
    -> '='
    -> "+="
    -> "-="
    -> "*="
    -> "/="
    -> "%="
    -> "|="
    -> "&="
    -> "^="
    -> "<<="
    -> ">>="
Assignment operators group to the right, since it would not make sense for them to group to the left. They take precedence over the comma operator.


conditional expression

  conditional expression
    -> logical or expression
    -> logical or expression, '?', expression,
       ':', conditional expression
The conditional expression is a ternary operator that groups to the right. The conditional expression, as defined in C, presents an implementation challenge, since of the two expressions selected by the logical expression, only one is to be executed. If both were to be executed and the results of one simply discarded, there would be a problem with possible side effects created by the expression that was discarded.

The ? operator groups to the right and takes precedence over the comma operator.


logical or expression

  logical or expression
    -> logical and expression
    -> logical and expression, "||", logical or expression
The logical or expression groups to the left. Note that like the conditional expression and the logical and expression there is a problem with side effects since the expression to the right of the operator is to be evaluated only if the expression to the left returns false.

The || operator groups to the right and takes precedence over ?.


logical and expression

  logical and expression
    -> inclusive or expression
    -> inclusive or expression, "&&", logical and expression
The logical and expression groups to the left. Note that like the conditional expression and the logical or expression there is a problem with side effects since the expression to the right of the operator is to be evaluated only if the expression to the left returns true.

The && operator groups to the right and takes precedence over ||.


inclusive or expression

  inclusive or expression
    -> exclusive or expression
    -> inclusive or expression, '|', exclusive or expression
The | operator groups to the left and takes precedence over &&.


exclusive or expression

  exclusive or expression
    -> and expression
    -> exclusive or expression, '^', and expression
The ^ operator groups to the left and takes precedence over |.


and expression

  and expression
    -> equality expression
    -> and expression, '&', equality expression
The & operator groups to the left and takes precedence over ^.


equality expression

  equality expression
    -> relational expression
    -> equality expression, equality op, relational expression

  equality op
    -> "=="
    -> "!="
Equality operators group to the left and take precedence over &.


relational expression

  relational expression
    -> shift expression
    -> relational expression, relational op, shift expression

  relational op
    -> '<'
    -> "<="
    -> '>'
    -> ">="
Relational operators group to the left and take precedence over equality operators.


shift expression

  shift expression
    -> additive expression
    -> shift expression, shift op, additive expression

  shift op
    -> "<<"
    -> ">>"
Shift operators group to the left and take precedence over relational operators.


additive expression

  additive expression
    -> multiplicative expression
    -> additive expression, additive op, multiplicative expression

  additive op
    -> '+'
    -> '-'
Note that additive operators group to the left and take precedence over shift operators.


multiplicative expression

  multiplicative expression
    -> unary expression
    -> multiplicative expression, multiplicative op, unary expression

  multiplicative op
    -> '*'
    -> '/'
    -> '%'
Multiplicative operators group to the left and take precedence over additive operators.


unary expression

  unary expression
    -> factor
    -> '+', unary expression
    -> unary op, unary expression

  unary op
    -> '-'
    -> '!'
    -> '~'
The unary + is treated separately, since it does not normally require any computation.

Unary operators group to the right and take precedence over multiplicative operators.


factor

  factor
    -> primary
    -> primary, "**", unary expression
The exponentiation operator has a distinctive operator precedence, in that its precedence is different to the right and to the left. It takes precedence over leading minus signs for example:
    -x ** y
is construed as
    -(x ** y)
but at the same time, it is possible to write:
    x ** -y
without the need for parentheses around the right hand operand. These rules reflect normal mathematical conventions, though they are difficult to achieve with simple precedence grammars.

To remove support for exponentiation, it is sufficient to rewrite the productions for unary expression replacing factor with primary:

  unary expression
    -> primary
    -> '+', unary expression
    -> unary op, unary expression

primary

  primary
    -> '(', expression, ')'
    -> constant
    -> lvalue
    -> "++", lvalue
    -> "--", lvalue
    -> lvalue, "++"
    -> lvalue, "--"
    -> function call
    -> '(', "long", ')', primary
    -> '(', "double", ')', primary
Because the development kit language does not support arbitrary casts, subscripts, or pointers, primary is somewhat different from the corresponding token definition in a standard C syntax.

Note that primary has simple right recursion for the two cast operators that are defined and the much more complex center recursion for parentheses. Otherwise, primary is the leaf node of the expression tree. Unlike C and C++, the auto-increment and auto-decrement operators are defined simply as operating directly on variables. This is possible because the development kit does not support pointers.


function call

  function call
    -> name, '(', optional arg list, ')'
The function call definition is designed to allow function calls with a comma delimited argument list. The rules for the arg list token use the assignment expression token instead of expression since, in a calling sequence, comma is used to delimit arguments. If the grammar had been written only for use with parsers that generate an abstract syntax tree, it would have been possible to write the function call rule as:
  function call
    -> name, '(', optional expression, ')'
since tree-walk functions could separate out the individual arguments.


lvalue

An lvalue is a reference to a value. In the Value class, lvalues are implemented as pointers. It would have been possible to implement the interpreters without using an lvalue construct, but extensions, such as adding arrays and subscripting, would have been noticeably more complex, since they would require changes to the baseline syntax rather than the simple addition of new productions.


constant

The interpreters in the kit support recognition of real, integer, string and character constants. In each case the constant value is stored in a Value object.


name

The syntactic rules for name are the same as for variable names in C and C++. Names consist of an arbitrary length string of alphabetic characters, digits and underscores, subject to the side condition that the first character is not a digit.


Lexical Rules

Introduction

AnaGram parsers do not customarily use separate lexical scanners. Instead they use conventional Backus-Naur form to describe the grammar right down to the character level. The significant difference between AnaGram and conventional parsers, is that the terminal tokens represent character sets rather than individual characters, although many character sets may well consist of a single character.

Character set expression may be used directly as tokens in any rule. As an alternative, they may be named for convenience. In this grammar, four character sets are given explicit names.

Since the parser is configured to take its input from a string in memory, eof is defined simply as a null byte. If, on the other hand, the parser were intended to take stream input, one might write:

     eof = -1 + 0 + ^D + ^Z
Such a definition would allow for all of the conventional representations of end of file.

The definitions of the name token and the white space token are self-explanatory.


String Constants

String constants follow the same rules as C and C++. Support for strings is entirely dependent on the handling of string values in the Value class.


Numeric Constants

The recognition of numeric constants would be self-explanatory except for one significant problem. In the C/C++ scheme of things, the conventional scheme for representing an octal integer is to begin it with a leading zero. Decimal integers must begin with a non-zero digit, so there is no difficulty in distinguishing decimal from octal integers. The difficulty arises because there is no restriction on the initial digit of a floating point number.

The upshot of this is that what starts out looking as though it is an octal integer can suddenly turn into a floating point number. If this should happen, the octal conversion is wrong and needs to be redone. The makeDecimal() function recovers the original digits and reconverts them as decimal. The token hybrid integer is used to accumulate these constants.



Table of Contents | Parsifal Software Home Page


Interpreter Development Kit
Copyright © 1997-2002, Parsifal Software.
All Rights Reserved.