XIDEK
|
CLL SyntaxIntroductionThe file cll.syn contains the syntax that describes the C-like Language used by one set of baseline interpreters in the Extensible Interpreter Development Kit. This syntax supports the basic elements of a script language: recognition of variable names and constants, evaluation of expressions, and execution of conditional and loop statements, all based on the familiar C syntax. These may fairly be regarded as the portions of C that have become more or less standard and are not the subject of any significant controversy. Omitted are declarations, pointers, subscripts, and switch statements.The syntax also contains three additional elements. To illustrate changes in expression syntax, Fortran style exponentiation has been added. To aid in debugging scripts and to illustrate how to add new statements to the grammar, dump and print statements have been added. The files dci.syn and ast.syn use this syntax unchanged except for the addition of reduction procedures. The file dxi.syn introduces a number of syntactic changes in order to facilitate computation. The changes do not, however, affect the language recognized by the syntax. cll.syn is a convenient peg on which to hang a discussion of those features of the script language and the parsers that are common to all the interpreters in the development kit. Apart from the lexical section of the grammar, it lacks reduction procedures, since they differ from interpreter to interpreter. cll.syn is not itself used in any of the interpreters in the kit. It can be used by developers as a clean starting point for their own development. This document describes the elements of cll.syn in the order in which they appear.
C PrologueThe "C Prologue" is the initial piece of embedded C/C++ code in a syntax file. AnaGram copies it to the parser file before writing any generated code, so it is the appropriate place to include definitions and declarations needed by the parser. In particular, any data type that is used as a token type must be defined in the C prologue, either explicitly or by means of an include file.There are two header files that are required for all the parsers:
Configuration Section
IntroductionThe "configuration section" of a syntax file contains switches and declarations used to specify specific attributes of a parser. The following configuration parameters are common to all of the syntax files in the kit.
Statement Syntax
IntroductionA script is a sequence of statements. The types of statements supported are the same as in C with the exception of switch statements and declarations. In addition, there are simple dump and print statement.
If and If-Else StatementsIt is conventional to write the syntax for the if and if-else statements in C and C++ as follows:if statement -> "if", '(', expression, ')', statement -> "if", '(', expression, ')', statement, "else", statementThis has the advantage of succinctness and clarity. It has the disadvantage of ambiguity. In the statement if (condition_one) if (condition_two) do_something(); else do_something_different();Does the else belongs to the first or the second if? Although, by convention the else belongs to the second if, that is, to the if physically closest to it, both interpretations are consistent with the above syntax. This problem is often referred to as the "dangling else problem", or the "if-else ambiguity". There are two ways to deal with the problem: resort to some sort of trick to force the parser to select the correct interpretation, or rewrite the grammar to avoid the difficulty. Using AnaGram, it is possible to use a sticky directive to handle the problem. Generally, however, it is best not to resort to such tricks. The parsers in the development kit, therefore, have been written in such a manner as to avoid this ambiguity. In order to support this technique, statements are classified as "open statements" or "closed statements", depending on whether or not they could be followed by an else clause. There are, therefore, both open and closed if statements: open statement -> if condition, statement -> if condition, closed statement, "else", open statement closed statement -> if condition, closed statement, "else", closed statementNotice that a simple if statement is always an open statement. If-else statements are open or closed depending on whether the statement following the else is open or closed. The script language designer should note that the dangling else problem is a direct result of what is fundamentally unnecessary generality in the definition of C. If the definitions were if statement -> "if", '(', expression, ')', compound statement -> "if", '(', expression, ')', compound statement, "else", compound statementthere would be no problem. In other words, by allowing the user to avoid writing a pair of curly braces, substantial complication has been introduced into the language, as well as substantial opportunities for introducing bugs into programs. Certainly every programmer has seen the following problem: if (condition) do_something(); do_something_else();where the call to do_something_else() is not controlled by the if, in spite of the suggestive indentation. If the language had been defined to only allow compound statements to be controlled by if statements, this opportunity for error would not exist.
While StatementsWhile statements are exactly as in C. In order to handle the dangling else problem, the syntax contains two while statements: an open while statement and a closed while statement.open statement -> WHILE, '(', expression, ')', open statement closed statement -> WHILE, '(', expression, ')', closed statementNote that a while statement is open or closed depending on whether the statement controlled by the while is open or closed.
For StatementsFor statements are exactly as in C. In order to handle the dangling else problem, the syntax contains two for statements: an open for statement and a closed for statement.open statement -> FOR, '(', optional expression, ';', optional expression, ';', optional expression, ')', open statement closed statement -> FOR, '(', optional expression, ';', optional expression, ';', optional expression, ')', closed statementNote that a for statement is open or closed depending on whether the statement controlled by the for is open or closed.
Do-While StatementsThe do-while statement is exactly as in C. Since the statement controlled by the do and while keywords is entirely enclosed by them, the scope of an if between the do and while keywords is limited by the while. Thus there is no opportunity for a dangling else, and the do-while statement is a closed statement.
Break and Continue StatementsThe break and continue statements are exactly as in C. These statements, though apparently simple, are among the most difficult to implement in interpreters. The reason is that all other statements fit into a neat hierarchical structure. The break and continue statements do not fit neatly in that hierarchy. Thus a good deal of the complexity of the interpreters in the development kit is directly a result of choosing to implement break and continue. A script language designer might well choose to avoid break and continue in the interest of simplicity.
Expression StatementsAn expression statement is exactly like an expression statement in C: simply an expression followed by a semicolon.
Null StatementThe conventional definition of an expression statement consists of an optional expression followed by a semicolon. It is more convenient to think of the expression statement as always having a non-vanishing expression, and adding a "null statement" to deal with a free-standing semicolon.
Compound StatementA compound statement, as in C, consists of zero or more statements enclosed in curly braces.
Return StatementA return statement, as in C, consists of the keyword "return" and an optional expression. The return statement causes the interpreter to exit to the calling program returning the value of the specified expression. If no expression is provided, the return value is a Value object with type equal to uninitType.
Dump StatementThe dump statement has been added to aid in testing and debugging. It also serves as an example of how to add non-C statements to the syntax.The dump statement consists of the keyword "dump", followed by a comma-delimited list of variable names. All the interpreters implement this statement by printing the name of each listed variable and its value, each variable on a line by itself.
Print StatementThe print statement is a second example of adding a non-C statement to the syntax. It takes a comma delimited list of assignment expressions and prints each value in turn. Note that the print statement could just as well have been writtenprint statement -> "print", arg list
Expression Syntax
IntroductionThe expression syntax used for all the interpreters in the development kit is almost that of C and C++. It does not support subscripts, pointers, casts other than (long) or (double), or the general function call syntax. On the other hand, it does support exponentiation using the ** operator as in Fortran. If exponentiation is not desired, it may be removed quite easily, as described here.The expression syntax, following conventional usage, implements operator precedence using normal syntactic conventions. It does not use shortcuts such as operator precedence declarations. The conventional method for writing expression syntax is to define a token for each level of precedence, beginning with the lowest level. At least two rules are written for each of these tokens. The first rule simply consists of the token that defines the next higher level of precedence. The other rules define the actual operators at the given precedence level. These latter rules are written as left or right recursive rules, depending on whether the relevant operator groups to the left or to the right. Operators are said to "group to the left" or "group to the right" depending on the implicit parenthesization that is understood. Subtraction, for example, groups to the left. We interpret a - b - cas (a - b) - cIt would contravene mathematical conventions to compute it as a - (b - c)On the other hand, the logical or operator groups to the right: x || y || zis fairly interpreted as x || (y || z)From a mathematical point of view, associative operators may be grouped either to the left or the right. Nonassociative operators, to be consistent with conventional usage, must be grouped to the left. Since grouping operators to the right requires right recursion and right recursion runs a remote risk of parser stack overflow, it is preferable to group operators to the left wherever it makes sense to do so. In the following, the tokens that define expression syntax are described in order of increasing precedence.
expressionexpression -> assignment expression -> expression, ',', assignment expressionThese rules define the comma operator as the lowest level of precedence in the expression syntax. Since expression is left recursive, items group to the left. The next higher level of precedence is specified by assignment expression. Note that the expression to the left of the comma must be evaluated and its value discarded in order to allow for side effects. The proper functioning of the initialization, condition, and increment fields of the for statement depend on this property of the comma operator. Note that if you need a comma delimited list of expressions, as in arg list and the print statement, you should use assignment expression rather than expression to avoid confusion with the comma operator.
assignment expressionassignment expression -> conditional expression -> name, assignment op, assignment expression assignment op -> '=' -> "+=" -> "-=" -> "*=" -> "/=" -> "%=" -> "|=" -> "&=" -> "^=" -> "<<=" -> ">>="Assignment operators group to the right, since it would not make sense for them to group to the left. They take precedence over the comma operator.
conditional expressionconditional expression -> logical or expression -> logical or expression, '?', expression, ':', conditional expressionThe conditional expression is a ternary operator that groups to the right. The conditional expression, as defined in C, presents an implementation challenge, since of the two expressions selected by the logical expression, only one is to be executed. If both were to be executed and the results of one simply discarded, there would be a problem with possible side effects created by the expression that was discarded. The ? operator groups to the right and takes precedence over the comma operator.
logical or expressionlogical or expression -> logical and expression -> logical and expression, "||", logical or expressionThe logical or expression groups to the left. Note that like the conditional expression and the logical and expression there is a problem with side effects since the expression to the right of the operator is to be evaluated only if the expression to the left returns false. The || operator groups to the right and takes precedence over ?.
logical and expressionlogical and expression -> inclusive or expression -> inclusive or expression, "&&", logical and expressionThe logical and expression groups to the left. Note that like the conditional expression and the logical or expression there is a problem with side effects since the expression to the right of the operator is to be evaluated only if the expression to the left returns true. The && operator groups to the right and takes precedence over ||.
inclusive or expressioninclusive or expression -> exclusive or expression -> inclusive or expression, '|', exclusive or expressionThe | operator groups to the left and takes precedence over &&.
exclusive or expressionexclusive or expression -> and expression -> exclusive or expression, '^', and expressionThe ^ operator groups to the left and takes precedence over |.
and expressionand expression -> equality expression -> and expression, '&', equality expressionThe & operator groups to the left and takes precedence over ^.
equality expressionequality expression -> relational expression -> equality expression, equality op, relational expression equality op -> "==" -> "!="Equality operators group to the left and take precedence over &.
relational expressionrelational expression -> shift expression -> relational expression, relational op, shift expression relational op -> '<' -> "<=" -> '>' -> ">="Relational operators group to the left and take precedence over equality operators.
shift expressionshift expression -> additive expression -> shift expression, shift op, additive expression shift op -> "<<" -> ">>"Shift operators group to the left and take precedence over relational operators.
additive expressionadditive expression -> multiplicative expression -> additive expression, additive op, multiplicative expression additive op -> '+' -> '-'Note that additive operators group to the left and take precedence over shift operators.
multiplicative expressionmultiplicative expression -> unary expression -> multiplicative expression, multiplicative op, unary expression multiplicative op -> '*' -> '/' -> '%'Multiplicative operators group to the left and take precedence over additive operators.
unary expressionunary expression -> factor -> '+', unary expression -> unary op, unary expression unary op -> '-' -> '!' -> '~'The unary + is treated separately, since it does not normally require any computation. Unary operators group to the right and take precedence over multiplicative operators.
factorfactor -> primary -> primary, "**", unary expressionThe exponentiation operator has a distinctive operator precedence, in that its precedence is different to the right and to the left. It takes precedence over leading minus signs for example: -x ** yis construed as -(x ** y)but at the same time, it is possible to write: x ** -ywithout the need for parentheses around the right hand operand. These rules reflect normal mathematical conventions, though they are difficult to achieve with simple precedence grammars. To remove support for exponentiation, it is sufficient to rewrite the productions for unary expression replacing factor with primary: unary expression -> primary -> '+', unary expression -> unary op, unary expression primaryprimary -> '(', expression, ')' -> constant -> lvalue -> "++", lvalue -> "--", lvalue -> lvalue, "++" -> lvalue, "--" -> function call -> '(', "long", ')', primary -> '(', "double", ')', primaryBecause the development kit language does not support arbitrary casts, subscripts, or pointers, primary is somewhat different from the corresponding token definition in a standard C syntax. Note that primary has simple right recursion for the two cast operators that are defined and the much more complex center recursion for parentheses. Otherwise, primary is the leaf node of the expression tree. Unlike C and C++, the auto-increment and auto-decrement operators are defined simply as operating directly on variables. This is possible because the development kit does not support pointers.
function callfunction call -> name, '(', optional arg list, ')'The function call definition is designed to allow function calls with a comma delimited argument list. The rules for the arg list token use the assignment expression token instead of expression since, in a calling sequence, comma is used to delimit arguments. If the grammar had been written only for use with parsers that generate an abstract syntax tree, it would have been possible to write the function call rule as: function call -> name, '(', optional expression, ')'since tree-walk functions could separate out the individual arguments.
lvalueAn lvalue is a reference to a value. In the Value class, lvalues are implemented as pointers. It would have been possible to implement the interpreters without using an lvalue construct, but extensions, such as adding arrays and subscripting, would have been noticeably more complex, since they would require changes to the baseline syntax rather than the simple addition of new productions.
constantThe interpreters in the kit support recognition of real, integer, string and character constants. In each case the constant value is stored in a Value object.
nameThe syntactic rules for name are the same as for variable names in C and C++. Names consist of an arbitrary length string of alphabetic characters, digits and underscores, subject to the side condition that the first character is not a digit.
Lexical RulesIntroductionAnaGram parsers do not customarily use separate lexical scanners. Instead they use conventional Backus-Naur form to describe the grammar right down to the character level. The significant difference between AnaGram and conventional parsers, is that the terminal tokens represent character sets rather than individual characters, although many character sets may well consist of a single character.Character set expression may be used directly as tokens in any rule. As an alternative, they may be named for convenience. In this grammar, four character sets are given explicit names. Since the parser is configured to take its input from a string in memory, eof is defined simply as a null byte. If, on the other hand, the parser were intended to take stream input, one might write: eof = -1 + 0 + ^D + ^ZSuch a definition would allow for all of the conventional representations of end of file. The definitions of the name token and the white space token are self-explanatory.
String ConstantsString constants follow the same rules as C and C++. Support for strings is entirely dependent on the handling of string values in the Value class.
Numeric ConstantsThe recognition of numeric constants would be self-explanatory except for one significant problem. In the C/C++ scheme of things, the conventional scheme for representing an octal integer is to begin it with a leading zero. Decimal integers must begin with a non-zero digit, so there is no difficulty in distinguishing decimal from octal integers. The difficulty arises because there is no restriction on the initial digit of a floating point number.The upshot of this is that what starts out looking as though it is an octal integer can suddenly turn into a floating point number. If this should happen, the octal conversion is wrong and needs to be redone. The makeDecimal() function recovers the original digits and reconverts them as decimal. The token hybrid integer is used to accumulate these constants.
|
Table of Contents | | | Parsifal Software Home Page |
---|
Interpreter Development Kit
Copyright © 1997-2002, Parsifal Software.
All Rights Reserved.