Parsifal Software


XIDEK
Extensible Interpreter Development Kit


Building Custom Interpreters

Introduction

XIDEK provides a foundation for building interpreters for domain specific languages or "little languages". It provides the basic syntactic elements of an interpreter: control statements and expressions. The component parts of XIDEK are implemented using the AnaGram parser generator and the AGCLIB1 class library to provide a solid basis for the full range of interpreter applications, from the simplest to the most demanding.

Many applications require additional features in the script language. These can be implemented by modifying or extending one of the interpreters in the kit. In this chapter we will discuss some of the changes a developer might wish to make and how to go about making them.

Script interpreters are a useful technique for bridging the gap between the computer specialist on the one hand and the problem domain specialist on the other. A well designed interpreter allows the computer specialist to provide the implementation of problem domain functionality while the actual application of the functionality to specific problems is left to a problem domain specialist. For an excellent exposition of this approach, see

Arie van Deursen and Paul Klint.
Little languages: little maintenance?
Proceedings of DSL '97 (First ACM SIGPLAN Workshop on Domain-Specific Languages)
(Paris, France, 18 Jan. 1997).
Published as University of Illinois Computer Science Report,
http://www-sal.cs.uiuc.edu/~kamin/dsl/papers/van-deursen.ps
Also available at:
http://www.cwi.nl/~arie/papers/domain.pdf
It should be noted in passing that the authors of this report consider the maintenance of the script interpreter (in their case a translator that generated COBOL) to be a potential problem, because of the perceived difficulty of compiler technology. One of the primary design goals of XIDEK was to address this problem by providing a solid infrastructure for interpreter development with clear segregation of those parts of the problem which require "compiler technology". Another was to demonstrate that, given XIDEK and the AnaGram LALR Parser Generator, the amount of compiler technology required for development and maintenance of a useful language is actually not great in most cases.

Another way of looking at this use of interpreters is as an extension of the concept of object-oriented programming. From this point of view, the computer technologist can be seen as implementing classes which describe problem domain objects and operations on them. The script interpreter then allows a problem domain specialist to use these operations to specify and control the interactions among the problem domain objects, without the need for using a full-fledged programming language. The relationship between the software specialist and the problem domain specialist is then somewhat like the relationship between the computer architect who designs a CPU and the programmer.

For example, the computer architect defines various "objects", i.e., bytes, integers, floats, doubles, etc. and operations on them, i.e. machine instructions. The computer programmer, in turn, creatively combines these objects and operations to create a computer program. Just as a computer programmer uses assemblers and compilers to clarify and simplify his work, a problem domain specialist can use a set of well defined objects and an interpreter to create his own application without having to explain his specialized knowledge to a computer programmer.

The success of this approach depends strongly on making the script language congenial to and readily understandable by the problem domain specialist. This is more easily achieved with the support of a tool like XIDEK specifically designed to make modifications to the language as easy as possible.

Extensions to interpreters in the kit can be seen as falling into two categories: those which do and those which do not require modifications of or additions to the language syntax. In those cases where there are no modifications to the language, most developers will find themselves in familiar territory. For instance, modifying the host program interface, modifying error handling, adding new data types: all these are conventional programming exercises which require no prior experience with compiler construction technology.

Working Examples of Interpreter Extensions

The baseline interpreters have been designed to facilitate adding new features. They have been implemented in a modular fashion so that adding new features usually requires only adding new code to support them, not modification of existing code.

To illustrate the techniques involved in adding additional features to an interpreter, a number of examples have been provided which implement particular familiar features. These examples have been implemented for both the CLL and the PLL language, and for all four interpreter architectures.

Except for the complex arithmetic example, the examples implement useful interpreter features that have not been implemented in the baseline interpreters. Generally these are features where no consensus has yet evolved on how they should be implemented.

The examples are as follows:

  • Complex Arithmetic
    Link to Reference Documentation
    This example adds support for complex arithmetic to the baseline interpreters. No syntax modifications are required, but on the other hand, changes to the Value class are extensive.

  • Arrays and Subscripts
    Link to Reference Documentation
    This example adds support for arrays and subscripts to the baseline interpreters. It adds arrayType and charPointerType objects to the Value class. It illustrates the use of built-in functions, statement syntax and expression syntax to support the new data types.

  • Structures and Objects
    Link to Reference Documentation
    This example adds support for structures and objects to the baseline interpreters. It adds structureType and instanceType objects to the Value class. It illustrates the use of statement syntax and expression syntax to support the new data types.

  • Local Function Definitions
    Link to Reference Documentation
    This example adds support for local function definitions to the baseline interpreters. It adds functionType objects to the Value class. It illustrates the use of statement syntax to support the new data type. Note that no new expression syntax is required, since function calls are already implemented in the baseline syntax.

  • Combining Arrays, Structures and Functions
    This example simply combines the above three extensions into a single example containing array, structure and function functionality. Since these extensions are effectively independent, the modifications in this example are simply the sum of the modifications for the array, structure and function extensions.

    Note that the name list token is defined identically in both the structure example and the function example. Only one instance of name list is defined in the combination example. If it were desired to have separate name list tokens, they would have to be given different names, e.g., struct name list and function name list.

    For reference documentation, see the reference documentation for the component extensions.

The Host Program Interface

As perceived by end users, script interpreters can be either standalone interpreters or embedded interpreters. Standalone interpreters, by far the most familiar, are main programs that are invoked directly by end users to solve a particular problem. Embedded interpreters are interpreters that are incorporated into a larger program to provide an extra degree of user control over the program. In either case, the actual interpreter consists of one or more functions which are invoked by a host program. The difference between the two cases is entirely in the manner in which the text of the script and the data are handled before and after they are presented to the interpreter itself.

The host program interface provided by XIDEK is meant to be illustrative rather than definitive. This is particularly true for the dci, astci, and astxi architectures. An application that uses the dci or astci architecture might save the bytecode associated with a script and recompile only when a new or modified script is encountered. Similarly, an application that uses the astxi architecture might save the abstract syntax tree associated with a script and regenerate it only for new or modified scripts.
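As a concrete illustration of this caching strategy, the following C++ sketch keeps compiled scripts in a map keyed by the script text, so that an unchanged script is never recompiled. The `Bytecode` type and the `compileScript` function are placeholders standing in for the dci/astci compilation step; they are not part of the kit:

```cpp
#include <map>
#include <string>
#include <vector>

// Placeholder for the compiled form of a script (dci/astci bytecode,
// or an abstract syntax tree for astxi).
using Bytecode = std::vector<int>;

// Hypothetical compiler entry point; a real host program would invoke
// the XIDEK parser and code generator here.
Bytecode compileScript(const std::string& text) {
    return Bytecode(text.begin(), text.end());  // stand-in compilation
}

// Cache compiled scripts so only new or modified scripts are compiled.
class ScriptCache {
    std::map<std::string, Bytecode> cache;
public:
    const Bytecode& get(const std::string& text) {
        auto it = cache.find(text);
        if (it == cache.end())
            it = cache.emplace(text, compileScript(text)).first;
        return it->second;
    }
    std::size_t size() const { return cache.size(); }
};
```

A production host might key the cache by file name and timestamp, or by a hash of the script, rather than by the full text.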

Error handling in XIDEK, similarly, is meant to be illustrative. The challenge in handling errors is to deliver all relevant information about the error to the appropriate party. From this point of view, error handling is essentially part of the interface to the host program and should be modified to suit your requirements.

Adding New Data Types

The first step in almost any extension is to add new data types to represent the particular kinds of data used by the problem domain. This is a completely straightforward process:
  1. Add new type specifications to the Value class.
  2. Add constructors as necessary.
  3. Add new methods as necessary to wrap the methods for the new type.
  4. Add cases to the relevant switch statements in the Value class methods. If a particular Value class method is not relevant to a new data type, nothing special need be done, since all such methods throw an exception by default for data types that are not explicitly handled.
Since all of the example extensions described below require new data types, any of them can be used as an example of how to proceed.
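The four steps above can be sketched as follows. This is a minimal stand-in for the kit's Value class, not its actual implementation; it shows a new type tag (here complexType), a new constructor, and the switch-based dispatch in which unsupported types fall through to a default that throws:

```cpp
#include <complex>
#include <stdexcept>

// Minimal stand-in for the kit's Value class, illustrating the
// dispatch pattern described above; the real class has more types.
class Value {
public:
    enum Type { integerType, realType, complexType };  // step 1: new type tag
    Type type;
    long i;
    double r;
    std::complex<double> c;

    Value(long v)                 : type(integerType), i(v) {}
    Value(double v)               : type(realType),    r(v) {}
    Value(std::complex<double> v) : type(complexType), c(v) {}  // step 2

    // Steps 3 and 4: wrap the operation, adding a case per supported
    // type. Operands are assumed to be of like type for brevity; a
    // full implementation would coerce mixed operands first.
    Value operator+(const Value& rhs) const {
        switch (type) {
        case integerType: return Value(i + rhs.i);
        case realType:    return Value(r + rhs.r);
        case complexType: return Value(c + rhs.c);
        default: throw std::runtime_error("operation not defined for type");
        }
    }
};
```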

In some cases, particularly where the target problem domain is mathematically oriented, user level operations can be handled by overloading conventional arithmetic operators. In these cases, adding new data types is all that is necessary to make a useful extension of one of the baseline interpreters. This special case is illustrated by the complex arithmetic example.


Adding New Operations

In the more general case, the target problem domain requires new operations as well as new data types. The new operations reflect the methods defined on the new data types.

Depending on the nature of an operation on a new data type, there are several ways it might be implemented: as a built-in function, as a separate statement, or as an arithmetic operator. The choice of implementation depends on various circumstances, consistency with the rest of the language, and the requirements of the users of the language, among others.

Built-in Functions

Built-in functions are the simplest means of adding new functionality, since they require only an addition to the function tables. They don't require any new syntax. Examples are the real(), imag() and conj() functions defined in the complex arithmetic extension. Another example is the size() function in the arrays and subscripts example.

Built-in functions are perhaps the implementation of choice when there is a single argument and the functional notation is not misleading.

Note that in the arrays and subscripts example, for consistency with the other examples, a new statement was defined to declare an array. It would have been easier, in fact, to simply create an array() function and add it to the function table. Then, instead of writing

  array x(size);
to create an array and assign it to the variable x, the user would simply write
  x = array(size);
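The mechanics of a function table can be sketched as follows. The table and the `callBuiltin` routine are illustrative, not the kit's actual code, and use `double` in place of the Value class; the point is that adding a built-in requires only a new table entry, with no syntax changes:

```cpp
#include <cmath>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// A built-in function takes an argument list and returns a value.
using Builtin = std::function<double(const std::vector<double>&)>;

// The function table: adding a feature means adding one entry here.
std::map<std::string, Builtin> functionTable = {
    {"abs",  [](const std::vector<double>& args) {
        return std::fabs(args.at(0)); }},
    {"size", [](const std::vector<double>& args) {
        return static_cast<double>(args.size()); }},  // placeholder semantics
};

// The interpreter's call site needs only a table lookup.
double callBuiltin(const std::string& name,
                   const std::vector<double>& args) {
    auto it = functionTable.find(name);
    if (it == functionTable.end())
        throw std::runtime_error("unknown function: " + name);
    return it->second(args);
}
```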

Adding New Statement Syntax

Adding new statement syntax to implement new functionality is very nearly as easy as adding new built-in functions to the function tables as long as the new statement begins with a unique keyword. Because of the use of the keyword, the new statement syntax usually will not conflict with any existing statement syntax.

The baseline interpreters contain two instances of functionality provided by special statement syntax: the dump statement and the print statement. Note that these statements actually provide for comma delimited lists of arguments, a capability that is not available when using built-in functions.

The arrays and subscripts example uses a simple declaration statement to declare arrays. It should be noted that with very little effort, this statement syntax could be extended to allow any number of arrays to be defined in a single statement. Consider the following syntax for example:

  simple statement
   -> array statement, ';'

  array statement
   -> "array", name, '[', expression, ']'
   -> array statement, ',', name, '[', expression, ']'
This would allow for a comma delimited list of declarations:
  array x[17], y[17], z[17];
Note that the use of this more complex syntax would require no additional support code. Thus the decision to use the simpler or more complex syntax depends solely on the convenience to the end user, since the implementation and maintenance costs are essentially the same for both.
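One way to see why the more complex syntax adds no support cost: both rules can reduce through a single helper routine, called once for each declared array, whether the statement declares one array or a comma-delimited list. The names in this sketch are hypothetical, not kit code:

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical environment mapping variable names to arrays.
std::map<std::string, std::vector<double>> environment;

// Single support routine shared by both grammar variants: each
// "name '[' expression ']'" unit in a declaration reduces to one
// call, so the grammar can grow without new support code.
void declareArray(const std::string& name, std::size_t size) {
    environment[name] = std::vector<double>(size, 0.0);
}
```

A statement such as `array x[17], y[17], z[3];` would then simply produce three calls to `declareArray`.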

The structures and objects example uses a noticeably more complicated declaration statement, based on the C struct declaration. Note that the statement has three key fields, the name of the struct, the body of the struct, and a list of instance names. Any one of these three fields may be omitted. The syntax is written by enumerating the allowable combinations.

The local function example uses two new statements: the local statement and the function definition statement. The local statement is not absolutely necessary, but it provides a means for assuring that global variables are not inadvertently modified by a function call.

The function definition statement is the only syntax modification in the set of examples where the changes for the PLL version are not identical to those for the CLL version. The difference is the treatment of delimiters for the function body: begin/end in the one case, braces in the other.

When operations on problem domain objects involve several parameters, it is often desirable to use statements that contain "unnecessary" keywords in order to improve readability. For example:

statement
 -> "set", name, "to", expression, "volts", ';'
would allow statements of the form:
  set controlVoltage to baselineLevel + 3 volts;
Although this syntax is wordy, and most software developers would deem it tedious, it has the distinct advantage of being readable by end-users, leaving no doubt as to the units involved.


Adding New Expression Syntax

A third technique for adding functionality to an interpreter is to extend the expression syntax. This, however, requires more experience with parsing techniques than adding statement syntax, since the opportunities for introducing ambiguity are substantially greater. Because of the potential difficulties, extending the expression syntax should only be done in those cases where an improvement in clarity and convenience is clear.

There are two primary alternatives to extending expression syntax: using built-in functions and overloading existing operators. The choice should generally be determined by conventional usage in the problem domain. For instance, in complex arithmetic, the extraction of the real and imaginary part is conventionally written using functional notation. The complex conjugate, however, is commonly denoted by a bar over a variable name or expression. Because of the limitations of editing programs, this is not possible to represent conveniently in a script. The choice then is one between a built-in function, conj(), and an overloaded operator. An argument could be made that in the complex arithmetic example one should overload the '~' operator, which would actually be easier to implement than the conj() function. On the other hand, the meaning of conj() will be readily apparent to anybody familiar with complex variables who reads a script. The same cannot be said for the use of '~'.

The baseline interpreters contain one example of extending expression syntax: the Fortran style exponentiation operator. This provides the use of familiar notation for those who would otherwise find it lacking.

Both the arrays and subscripts example and the structures and objects example use simple expression syntax extensions, implementing subscript notation in the one case, member field notation in the other. In both cases, the notation is familiar from C and Pascal.

Keywords can be used in extending expression syntax as well as in statement syntax. Instead of using built-in functions to implement methods with multiple parameters, it can sometimes be advantageous to use "syntactic sugar", as it is sometimes called, to identify particular arguments. For instance, to specify a circle, one might use the following syntax:

primary
 -> "circle", "of"?, "radius:", expression, "at", '(',
     assignment expression, ',', assignment expression, ')'
Note the use of assignment expression tokens for the coordinates. In the CLL grammar the expression token may contain a comma which would conflict with the comma separating the x and y coordinates.

This syntax would allow a script to contain statements of the form:

  c = circle of radius z at (x+5, y-7);
Compare the readability of this statement with the more conventional function call notation:
  c = circle(z, x+5, y-7);
Although the function call notation is easy enough to use, a third party who reads the script has no way of telling which function argument corresponds to which parameter of the circle.

When making extensions of this sort, make sure that the rule is not right recursive. There must be some token, usually a terminal such as a keyword or punctuation mark, that delimits the expression on the right. Otherwise, the rule will generate conflicts in the grammar.

A more complex example of what can be done by extending expression syntax is illustrated in the file matrix.syn. This file adds rules to the expression syntax of the CLL language to allow more congenial representation of matrix and vector arithmetic. These extensions include:

  • The use of a postfixed '~' character to represent the inverse of a matrix. x~ is far less cumbersome than x**-1.

  • The use of a postfixed single quote character to represent the transpose of a matrix. Thus x' represents the transpose of x, in accordance with ordinary mathematical notation. Contrast the clarity of x'*y and x*y' with transpose(x)*y and x*transpose(y).

  • The use of vertical bars to represent the norm of a vector or an absolute value. Thus |x| is more in line with traditional mathematical usage than abs(x).

  • The use of square brackets to denote n-tuples. This allows specifying the components of a vector: V = [x, y, z]; The notation has been set up so that n-tuples can nest. Thus the elements of a matrix could be specified by writing:
    M = [[x, y, z],
         [u, v, w],
         [a, b, c]];
    
    Also, if X, Y, and Z are vectors, one could write:
    M = [X, Y, Z];
    
Note that the absolute value notation and the tuple notation could be useful in many more contexts than simply matrix manipulation.

The changes to the syntax required to implement these extensions are as follows:

  • Adding the postfix operators requires that the definition of factor be changed from:
    factor
     -> primary             // next higher precedence level
     -> primary, "**", unary expression
    
    to:
    factor
     -> postfix expression  // next higher precedence level
     -> postfix expression, "**", unary expression
    
    postfix expression
     -> primary
     -> postfix expression, '~'                // inverse
     -> postfix expression, '\''               // transpose
    
    Note that it is necessary to interpolate a new level in the operator precedence hierarchy. If one attempts to simply add the following rules:
    primary
     -> primary, '~'                // inverse
     -> primary, '\''               // transpose
    
    the grammar will be ambiguous because primary already has right recursion due to the cast operators. Thus an expression like (double) x' could be parsed either as ((double) x)' or as (double)(x'). This expression is meaningless, of course, but that is irrelevant to the syntactic analysis.

  • To implement the absolute value and tuple features, two new rules for primary are required:
    primary
     -> '[', arg list, ']'              // tuple
     -> '|', additive expression, '|'   // absolute value
    
    Since these rules are neither left nor right recursive, there is no problem with ambiguity and no new level in the precedence hierarchy needs to be added.

    Note the use of additive expression in the rule for absolute value. This allows the user to write |x+y| or |x*y|, but requires parentheses if a more loosely binding operator is to be used within the vertical lines. If the rule had been written:

    primary
     -> '|', expression, '|'   // absolute value
    
    the grammar would yield numerous conflicts, since given the complete expression syntax, it is possible to require the absolute value of an expression such as x|y. In this case, the absolute value would be written |x|y|. Although such a construct is not, properly speaking, ambiguous, it can't be parsed with an LALR-1 grammar. At the same time, no user is going to find it any easier to understand than the parser does.
The remainder of the work for implementing these features amounts to adding reduction procedures to the above rules. Except for the tuple definition, the functionality represented by these rules would already be a part of a vector/matrix package. Implementing the tuple definition is a little more interesting, but the necessary techniques should be clear from studying the treatment of arg list in the function call logic and the treatment of arrays in the arrays and subscripts example.
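The reduction procedure for the tuple rule amounts to packaging the accumulated argument list into a single value. The following sketch uses a simplified stand-in for the Value class (a real one would use the kit's type-tag scheme) to show how nesting, as in the matrix example above, falls out naturally:

```cpp
#include <vector>

// Simplified stand-in for a Value that can hold either a scalar or a
// tuple of Values, so tuples can nest as in M = [[x,y,z],[u,v,w]].
// (An empty tuple is not representable in this sketch.)
struct Value {
    double scalar = 0.0;
    std::vector<Value> tuple;   // non-empty => this Value is a tuple
    bool isTuple() const { return !tuple.empty(); }
};

// Reduction for "primary -> '[', arg list, ']'": wrap the accumulated
// argument list in a single tuple Value.
Value makeTuple(const std::vector<Value>& args) {
    Value v;
    v.tuple = args;
    return v;
}
```

Since the elements of the argument list are themselves Values, a tuple of tuples requires no further machinery.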


Other Ways to Customize an Interpreter

The purpose of the language recognized by the interpreter is to enable end users to write scripts confident that the interpreter will not misconstrue their intentions. Ideally, it also ought to be possible for someone other than the writer of the script to read and understand the script without undue effort.

To an extent, software specialists have become so accustomed to the peculiarities of conventional programming languages that they often do not recognize how inscrutable their programs can be.

There are two basic classes of end users: those who are comfortable with mathematical notation and those who find it confusing and hard to understand. The latter group, of course, are a much greater challenge to the language designer than the former. On the other hand, the mathematically sophisticated user may well appreciate the implementation of notation commonly used in his field, as in the matrix arithmetic syntax described above.

If the end users of an interpreter are not mathematically inclined, they should be spared any notation that cannot be effectively explained within the domain of their expertise.

Much of the syntax of C/C++ certainly falls within this category for people who are not professional programmers. As can be seen from the description of PLL, the Pascal-like language used in this kit, significant simplifications can be made without sacrificing functionality. Some operations simply become somewhat more long-winded and tedious, but certainly less obscure. More importantly, it will be noticed that, with the exception of the for statement, the support code in PLL is no different from that in CLL.

As an example of the visual difference between the two syntaxes, a script to make a table of sines and cosines would look like this in CLL:

  for (i = 0; i <= 90; i++) {
    angle = pi*i/360;
    print angle, "deg: ", sin(angle), "  ", cos(angle), "\n";
  }
but would look like this using the modified syntax in PLL:
  for i := 0 to 90 do begin
    angle := pi*i/360;
    print angle, "deg: ", sin(angle), "  ", cos(angle), "\n";
end
An important issue is to avoid forcing the user to make distinctions where he perceives no difference. Following the example of Pascal, for instance, PLL uses ":=" to distinguish the assignment statement from the relational equality operator. This distinction, however, is not at all important, as can be readily seen by replacing ":=" with '=' in pll.syn and using AnaGram to analyze the grammar. The choice between ":=" and '=', then, is a matter of determining which notation is more congenial to the end users of the interpreter.

Other syntactic changes can be made with essentially no effort. In PLL, the rule for the assignment statement could be changed from:

  assignment statement
  -> lvalue, ":=", expression
to:
  assignment statement
  -> "set", lvalue, "to", expression
without requiring any changes to reduction procedures or other supporting code.

Similarly, the C assignment operators, such as "+=", "-=", "*=", or "/=", can be represented by productions of the form:

  simple statement
  -> "add", expression, "to", lvalue
  -> "subtract", expression, "from", lvalue
  -> "multiply", lvalue, "by", expression
  -> "divide", lvalue, "by", expression

Another easy modification has to do with function calls. Consider the following function definition:

  function presentValue(interestRate, amount, elapsedTime) {
    ...
  }
There are three arguments. A call to this function is conventionally written thus:
  x = presentValue(7.85, 2734.23, 5);
Anyone reading a script who finds this call might rely on an accompanying comment, if it exists, to determine the meaning of the statement, but without looking up the definition of the function he cannot assure himself that the arguments to the function are in the correct order.

It is quite easy to add syntax to a language to provide self-documenting function calls. Here is how this function call would look with a very simple syntax change:

  given
    amount = 2734.23;
    elapsedTime = 5;
    interestRate = 7.85;
  calculate x = presentValue;
Here the arguments are given by name and the order of the arguments is not significant. Such a function call notation, though wordy and tedious from the point of view of a professional programmer has the virtue of great clarity and can be easily read and checked for validity by the non-professional.

It goes without saying that when implementing such a notation, the developer would include code to check the argument names and assure a one-to-one matchup between the arguments provided and the parameters in the definition of the function.

Here is the syntax for the above statement:

  given statement
   -> "given", given parameters,
      "calculate", lvalue, '=', name, ';'

  given parameters
   ->
   -> given parameters, name, '=', expression, ';'
The keywords "given" and "calculate" are offered only as a suggestion. Any pair of keywords that do not conflict with other statements in the grammar could be used. The choice should be governed by acceptability to end users. Implementation of the necessary support code for this syntax should be moderately straightforward.
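The one-to-one matchup described above can be sketched as follows: the routine reorders the named arguments into the order of the function's parameter list and rejects missing or extra names. All names here are illustrative, and `double` again stands in for the Value class:

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical check for the "given ... calculate ..." statement:
// every parameter must be supplied exactly once, and no extras.
std::vector<double> matchArguments(
    const std::vector<std::string>& paramNames,
    const std::map<std::string, double>& given)
{
    if (given.size() != paramNames.size())
        throw std::runtime_error("wrong number of arguments");
    std::vector<double> ordered;
    for (const auto& name : paramNames) {
        auto it = given.find(name);
        if (it == given.end())
            throw std::runtime_error("missing argument: " + name);
        ordered.push_back(it->second);  // reorder into definition order
    }
    return ordered;
}
```

With the arguments reordered, the existing function call machinery can be used unchanged.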

Although with respect to readability the above syntax is a clear improvement over the conventional function call, it is possible to improve it further. The problem, from the point of view of the problem domain specialist, is that the arguments are still simply numbers. In the real world, numbers are usually meaningless unless the corresponding units of measurement are also specified. The above example would be a good deal more reliable if the function call were written thus:

  given
    amount = 2734.23 (USD);
    elapsedTime = 5 (yrs);
    interestRate = 7.85 (percent/yr);
  calculate x = presentValue;
Very few, if any, programming languages provide for units. Units are, however, moderately easy to implement and pay a rich reward in reliability of scripts. It should not be overlooked that a Mars probe was lost a few years ago because of confusion over units. Though the loss had nothing to do with scripts, it is easy to see that the traditional software usage of leaving units unstated might well have been contributory.

Adding support for units, then, is another modification that can provide significant utility and reliability at a rather low cost.

The file units.syn shows how to add support for units to the syntax. It is identical to cll.syn except for the addition of syntax sufficient to implement units. The changes are as follows:

  1. Add a token to distinguish numeric constants from constants in general:
      numeric constant
       -> integer
       -> real
    
  2. Define unit expression:
      unit expression
       -> name
       -> unit expression, '*', name
       -> unit expression, '/', name
    
  3. Add a new production for primary:
      primary
       -> numeric constant, '(', unit expression, ')'
    
  4. Modify print statement to allow an optional unit specification:
      print statement
       -> "print", print expression
       -> print statement, ',', print expression
    
      print expression
       -> assignment expression
       -> assignment expression,
          "units:", '(', unit expression, ')'
    
Several observations should be made:
  • Units may only be used with constants. The units associated with the values of variables must be computed as the value of the variable is computed.

  • The parentheses around unit expression protect against ambiguity of the star and slash operators. A star or slash within the parentheses is part of the unit expression. A star or slash outside the parentheses is part of the overall expression of which the unit expression is only part.

  • The "units" keyword in print expression serves to specify that these are the units in which the specified expression is to be printed. Note that the specified expression might itself have units:
      print 5280 (ft) units: (m);
    
    This statement might be used to print the length of a mile in meters.

Implementation of the necessary support code for a units implementation is beyond the scope of this kit. Nevertheless, it is a moderately straightforward undertaking for any reasonably skilled programmer. The first step is to define a Unit class. The data fields of a Unit instance would describe its relationship to the base units of the system and would contain a conversion factor to convert to the base units.

The second step is to define a QuantityWithUnits class that would contain a real value and either a unit object or a reference to a unit object. The Value class would then be extended to wrap an instance of QuantityWithUnits, and the appropriate arithmetic operators would be implemented. Note that addition and subtraction can only occur if their operands have equivalent dimensions. The units of products and quotients can be generated from the units of the operands.
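The two classes just described might be sketched as follows, reduced to two base dimensions (length and time) for brevity. None of this is kit code; it only illustrates the dimension checks on addition and the generation of units for products described above:

```cpp
#include <array>
#include <stdexcept>

// Sketch of a Unit: exponents of the base units plus a conversion
// factor to those base units.
struct Unit {
    std::array<int, 2> dims;  // exponents of the base units (m, s)
    double factor;            // conversion factor to base units
};

const Unit meters  = {{1, 0}, 1.0};
const Unit feet    = {{1, 0}, 0.3048};
const Unit seconds = {{0, 1}, 1.0};

struct QuantityWithUnits {
    double value;
    Unit unit;

    // Addition requires operands of equivalent dimension; the result
    // is expressed in the left operand's unit.
    QuantityWithUnits operator+(const QuantityWithUnits& rhs) const {
        if (unit.dims != rhs.unit.dims)
            throw std::runtime_error("dimension mismatch");
        double rhsConverted = rhs.value * rhs.unit.factor / unit.factor;
        return {value + rhsConverted, unit};
    }

    // Units of a product are generated from the units of the operands.
    QuantityWithUnits operator*(const QuantityWithUnits& rhs) const {
        Unit u{{unit.dims[0] + rhs.unit.dims[0],
                unit.dims[1] + rhs.unit.dims[1]},
               unit.factor * rhs.unit.factor};
        return {value * rhs.value, u};
    }

    // Convert to another unit of the same dimension, as the "units:"
    // clause of the print statement requires.
    double in(const Unit& target) const {
        if (unit.dims != target.dims)
            throw std::runtime_error("dimension mismatch");
        return value * unit.factor / target.factor;
    }
};
```

The `in` method is what `print 5280 (ft) units: (m);` would ultimately invoke.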

With this foundation, the reduction procedures needed to handle the units can readily be written for any of the interpreter architectures.




XIDEK
Extensible Interpreter Development Kit
Copyright © 1997-2001, Parsifal Software.
All Rights Reserved.