Parsifal Software


XIDEK
Extensible Interpreter Development Kit
Reference documentation



Virtual Machines

A virtual machine, or bytecode interpreter, is a sort of simulation of a computer, usually used to implement a script language. You define an instruction set for this machine according to the needs of your script language.

The instructions for the virtual computer are called "bytecodes". Each bytecode uniquely identifies a function to be executed by a "bytecode interpreter", which simply picks up the bytecodes in order and invokes the appropriate functions. Java source code, for example, is compiled into bytecode and a variety of bytecode interpreters are available to execute this bytecode on different platforms.

Where does parsing come in? You need to compile statements in the script language into bytecode instructions for your virtual machine. To do this, the compiler needs to parse the statements, identify their component parts, and transform them into bytecode.

The best way to create the compiler, from the point of view of freedom from bugs, ease in making changes, and maintainability, is to generate it from a formal grammar using a parser generator such as AnaGram. A grammar also has the advantage of being self-documenting.

In the Extensible Interpreter Development Kit, the ScriptMethod class defines the imaginary computer. It has functions for listing and executing compiled bytecode.

Bytecode Interpreter Definitions

Introduction

The bcidefs.h and bcidefs.cpp files contain the classes and functions necessary to implement a simple bytecode interpreter. The interpreter uses the opcodes defined in opcodes.h. Text representations of the opcodes are found in the constant external array opcodeText[], defined in bcidefs.cpp. Note that for listings of generated code to be valid, any changes in the opcode enumeration must be reflected in the opcodeText[] array.

Two classes are defined: ScriptMethod, which implements a bytecode interpreter, and CodeFragment, which provides functions useful in compiling code for the interpreter. Both the astci and dci interpreters use these classes to compile and execute bytecode.


The Bytecode Architecture

The bytecode interpreter implements a very simple virtual machine architecture. There are no registers. Instead, a pushdown accumulator stack is used. Unary operations operate on the top stack element. Binary operations operate on the top two stack elements replacing them with the result of the operation. Apart from the DUMP instruction there is only one instruction that references memory: the LOCATE instruction which pushes a pointer to a specified value onto the stack. FETCH and STORE instructions then use this pointer to reference memory.

The values of variables are stored in an array of Value objects contained in a Dataset object. Variable names are identified in a dictionary, an instance of the AgDictionary< AgString> class. The dictionary assigns a unique integer to each variable name included in the dictionary. The integer is used to index the array of Value objects in the Dataset object to locate the value of the variable.

Constant values are stored in a constant array, accessed by immediate operators that index the array of constants.

Instructions may consist of one or two bytecodes. The first bytecode is the opcode. The second bytecode, if it exists, is the index of a variable value in the data array, the index of a constant in the constant array, or the offset of a branch instruction.

For convenience, bcidefs.h contains a typedef statement to define a Bytecode type. When defined as a signed char, the interpreter is limited to 128 variable names and 128 immediate constants. A more serious limitation is that branch instructions are limited to 127 Bytecodes in the forward direction and 128 Bytecodes in the reverse direction. To replace these limitations with less restrictive ones, simply change the typedef from signed char to signed short.

The implementation of the bytecode interpreter is found in the apply() function of the ScriptMethod class.


Opcodes

The Opcode enumeration is defined in opcodes.h.

The instructions that are implemented are as follows:

  • Instructions with no operand. Instructions consist of a single bytecode.
    HLT
    Stop execution of the interpreter. Cause an uninitialized Value object to be returned as the value of the apply() function.
    FETCH
    Dereference the pointer value on the top of the stack and replace it with the value it references.
    I_FETCH
    Dereference the pointer value on the top of the stack, increment the value it references and then replace the pointer with the incremented value.
    D_FETCH
    Dereference the pointer value on the top of the stack, decrement the value it references and then replace the pointer with the decremented value.
    FETCH_I
    Dereference the pointer value on the top of the stack, replace the pointer with the referenced value, and then increment the referenced value.
    FETCH_D
    Dereference the pointer value on the top of the stack, replace the pointer with the referenced value, and then decrement the referenced value.
    STORE
    Pop the value from the top of the stack, store it through the pointer that is now on the top of the stack. Now dereference the pointer on the top of the stack so that the value that was stored is now the top element of the stack.
    POP
    Remove top element from stack and discard it.
    SAP
    Pop the value from the top of the stack, and store it store it in the location specified by the next value on the stack. The location value is also popped from the stack and discarded.
    PRINT
    Pop the top element from the stack and print its value.
    RETURN
    Pop the top element from the stack and save it as the return value from the apply() function. Then stop execution of the interpreter.

  • Binary operators pop two values from the stack, compute and push the result on the stack.
    ADD
    Add
    SUB
    Subtract
    MUL
    Multiply
    DIV
    Divide
    IDIV
    Integer divide
    RDIV
    Real divide
    MOD
    Remainder
    POW
    Raise to power
    AND
    And
    IOR
    Inclusive or
    XOR
    Exclusive or
    LS
    Left shift
    RS
    Right shift

  • Binary operators on memory. These operators pop the top element from the stack, dereference the next stack element, fetch its value, compute the result, store the result in the referenced location, and replace the reference on the stack with the result of the computation.
    ADDM
    Add the top element to the referenced variable
    SUBM
    Subtract the top element from the referenced variable.
    MULM
    Multiply the referenced variable by the top element.
    DIVM
    Divide the referenced variable by the top element. >MODM
    Set the value of the referenced variable to the remainder after dividing it by the top element.
    ANDM
    And the referenced variable with the top element of the stack.
    IORM
    Set the referenced variable to the inclusive of it with the top element of the stack.
    XORM
    Set the referenced variable to the exclusive of it with the top element of the stack.
    LSM
    Left shift the value of the referenced variable the number of bits specified in the top element of the stack.
    RSM
    Right shift the value of the referenced variable the number of bits specified in the top element of the stack.
    POWM
    Raise the value of the referenced variable to the power specified by the top element of the stack.

  • Comparison operators pop two values from the stack and push a result of one or zero back on to the stack depending on whether the condition is true or false.
    LT
    less than
    LE
    less than or equal
    GT
    greater than
    GE
    greater than or equal
    EQ
    equal to
    NE
    not equal to

  • Unary operators pop a value from the stack, compute and push the result back on the stack.
    NEG
    Negate top element on stack
    NOT
    Logical NOT of top element on stack
    COM
    bitwise complement

  • Cast operators set the type of the top element on the stack.
    CAST_LONG
    Make value long. If the value is not integral it is truncated. If the type is not integer or real, an InterpreterError exception is thrown.
    CAST_DOUBLE
    Make value double. If the type is not integer or real, an InterpreterError exception is thrown.

  • The following instructions have a second parameter, K, which is either the index of a variable or the index of a constant.
    LOCATE K
    Push a pointer to the K'th variable onto the stack.
    PUSHC K
    Push the K'th constant onto the stack.
    DUMP K
    Print the name and value of the K'th variable.

  • Branch instructions. K is the offset in number of bytecodes relative to the address of the instruction after the branch instruction. That is, BR -2 is an infinite loop.
    LAND K
    Branch on false. Remove the top element of the stack if the branch is not taken.
    LOR K
    Branch on true. Remove the top element of the stack if the branch is not taken.
    BR K
    Branch unconditionally.
    BRF K
    Branch on false. Always remove the top element of the stack.
    BRT K
    Branch on true. Always remove the top element of the stack.
    CALL K
    Call the K'th function in the functionTable.
    RPT K
    If the top element of the stack is less than or equal to zero, pop the top of the stack and branch. This can be used to implement counted loops.


Implementation Classes

Constant Class

The Constant class is derived from the Value class. Its purpose is to handle a rather peculiar situation with constants such as 0 and 0.0, integer and floating point representations of zero, respectively. The Value class rightly does not distinguish these two values. Nevertheless, in order to handle some arithmetic operators, in particular the logical operators, the actual type of the constant must be known. If the constantList field of ScriptMethod were implemented using the Value class, constants with the same value but different type would be recorded as integer or real depending on whether the first instance of the constant in the script were integer or real. This would make it impossible to select the correct arithmetic operation, integer or double, when one of the operands was a constant.

The Constant class, therefore, overloads the equality operator to distinguish between integer and floating point zero, so that correct arithmetic operations can be selected. It also overloads the hash function in order to take the type field into account, so that integer and real constants with the same value can be distinguished.

Because constructors cannot be inherited, all the constructors of the Value class must be replicated.


ScriptMethod

Introduction

The ScriptMethod class implements the bytecode interpreter. A ScriptMethod object is constructed from an AgDictionary<AgString> object, an array of bytecodes and an array of constants. The dictionary identifies the variables used in the bytecode. The class has two methods: a list method which generates a listing of the bytecode and an apply method which executes the bytecode using a specific Dataset object.

Member Fields

AgDictionary<AgString> &dictionary;
Identifies variable names. Note that the dictionary must be defined externally. This allows several interpreters to share the same dictionary.

AgArray<Bytecode> bytecode;
Contains the bytecode to be executed.

AgArray<Constant> constantList;
Contains the constant values used by the script.

Constructors/Destructor

ScriptMethod(const char *text, AgDictionary<AgString> &d);
This ScriptMethod constructor is not implemented as part of bcidefs.cpp. The implementation is found in the actual bytecode compiler module.

ScriptMethod(const ScriptMethod &m);
Copy constructor.

~ScriptMethod();
The destructor has nothing to do.

Member Functions

Value apply(DataSet &d) const;
Executes the stored bytecode using the specified DataSet object. The DataSet object must have been constructed using the same AgDictionary<AgString> as was used to construct the ScriptMethod.

If the bytecode exits because of a return statement, the value of the expression, if any, specified in the return statement, will be returned as the value of the apply() function. If the script exited without executing a return statement, or no expression was specified with the return statement, the Value object returned has its type set to uninitType.

Values in the DataSet may be initialized before calling the apply() function and may be queried afterwards. A given Dataset object can be used with any number of ScriptMethod objects, so long as they are constructed from the same dictionary. Thus one ScriptMethod object can be used to initialize values for another, for instance.

const ScriptMethod &list(ostream &f) const;
This function may be used to write the generated bytecode to a file. The list function converts the offsets in branch instructions to "absolute" addresses, in order to make it easier to follow the code.


CodeFragment

Introduction

A CodeFragment object contains a sequence of bytecodes. CodeFragment objects can be concatenated to produce longer CodeFragment objects. This makes it possible to build up complex constructs from simpler sequences. Class methods are defined for appending instructions to a CodeFragment object, for concatenating CodeFragment objects and for implementing loop and conditional constructs. Using CodeFragments, it is possible to compile bytecode in a very intuitive manner.

In addition to the code itself, a CodeFragment object contains a list of branch locations in the code that represent loop exit or loop continue statements. The patchBranches() method can be used to set these branches to appropriate values once the actual continue and exit points of a loop are known.

The CodeFragment class contains a number of static member functions which create code for conditional and loop constructs by appropriately combining previously generated code.


Constructors/Destructor

CodeFragment()
Constructs an empty CodeFragment.

CodeFragment(const CodeFragment &)
Copy constructor.

~CodeFragment()
There is nothing for the destructor to do.

Member Fields

AgStack<Bytecode> bytecode;
The AgStack template class provides a convenient mechanism for accumulating bytecode.

AgStack<int> branchList;
The branchList stack contains the offset locations within the bytecode of the operands of loop continue and loop break branches in the CodeFragment object. The operand is set to 0 for continue, 1 for break. When two CodeFragment objects are concatenated, the concat() function updates the offsets appropriately.

Member Functions

int size();
Returns the length of the code in the object in Bytecodes.

AgArray<Bytecode> getBytecode();
Creates an AgArray<Bytecode> object containing the bytecode in the CodeFragment object.

CodeFragment &append(Opcode op);
Appends the specified opcode to the CodeFragment and returns a reference to the CodeFragment itself.

CodeFragment &append(Opcode op, int offset);
Appends the specified opcode and the integer parameter to the CodeFragment. The parameter is truncated to the size specified by the Bytecode typedef. For fetch and store instructions, the parameter is the index of the variable name in the AgDictionary<AgString> object. For branch instructions, the integer is the signed offset for the branch. The function returns a reference to the CodeFragment object.

CodeFragment &appendContinue(Opcode op);
Appends the specified opcode to the CodeFragment object and also an operand with the value equal to zero. The offset of the operand within the bytecode is pushed onto the branchList stack for later completion by the patchBranches() function.

CodeFragment &appendBreak(Opcode op);
Appends the specified opcode to the CodeFragment object and also an operand with the value equal to one. The offset of the operand within the bytecode is pushed onto the branchList stack for later completion by the patchBranches() function.

CodeFragment &concat(const CodeFragment &code);
The specified fragment is appended to the CodeFragment object. The function returns a reference to the compound object. The offsets in the branchList belonging to the argument object are incremented by the size of the CodeFragment object and added to the branchList of the object.

CodeFragment &patchBranches();
The patchBranches function scans the branchList and sets the target location of each loop continue branch to be the first instruction of the fragment. The target location of each break branch is set to be the first location following the fragment. The branchList is then reset.

int containsBreak();
Returns non-zero if the CodeFragment contains a break instruction, zero otherwise.

Static Member Functions

static CodeFragment ifStatement(CodeFragment &condition, CodeFragment &statement);
This function combines the code that evaluates the condition with the code that executes the statement to create the code for the complete if statement.

static CodeFragment ifElse(CodeFragment &condition, CodeFragment &onTrue, CodeFragment &onFalse);
Given the code to evaluate the condition, the code to be executed if the condition is true and the code to be executed if the condition is false, this function combines these pieces of code with the necessary branches to create the code for the entire if-else statement.

static CodeFragment whileLoop(CodeFragment &condition, CodeFragment &statement);
This function combines the code for the loop condition and the body of the loop with the appropriate branch instructions. To do so, it calls appendBreak(BRF)to append the an exit branch to the condition code and appendContinue(BR) to append a continue branch to the statement code. It then concatenates the two pieces of code and calls patchBranches() to calculate and insert the necessary offsets for these branches as well as for any break or continue statements that might be found inside the body of the loop.

static CodeFragment doLoop(CodeFragment &statement, CodeFragment &condition);
The do-while loop differs from a while loop only in that the loop condition is not executed the first time through the loop. Therefore, the code can be generated by prefacing the loop with a jump over the condition code and its conditional branch. Note that this is done by placing the code for the condition physically before the loop body instead of following the loop body as it actually occurs in the script.

static CodeFragment repeatLoop(CodeFragment &statementList, CodeFragment &condition);
The repeat until loop differs from a do-while loop in two respects. First, the first argument is effectively a list of statements rather than a single statement, although this makes no difference whatsoever to this function. Second, the treatment of the condition is the opposite of the treatment for the do-while loop. In the repeat-until loop the loop continues if the statement is false, and exits if the statement is true.

static CodeFragment forLoop(CodeFragment &initializer, CodeFragment &condition, CodeFragment &increment, CodeFragment &statement);
The code for a for-loop is laid out in the following order. First the initializing code, then the increment code, followed by the condition code and finally the body of the loop. Between the intializing code and the increment code there is an unconditional branch over the increment code, since the increment code must not be evaluated the first time through the loop. The beginning of the increment code is the continue location for the loop. With this layout, coding the loop is the same as for the while loop.

static CodeFragment simpleForLoop (const CodeFragment &lValue, const CodeFragment &begin, const CodeFragment &increment, const CodeFragment &end, const CodeFragment &statement);
This function lays out the code for a simple Pascal-like for loop. Because the loop parameters cannot be changed in the body of the loop, it is possible to calculate, before entering the loop, the number of times the statement is to be executed. This simplifies considerably the problem of dealing with negative increments.



Table of Contents | Parsifal Software Home Page


XIDEK
Extensible Interpreter Development Kit
Copyright © 1997-2002, Parsifal Software.
All Rights Reserved.