Parsifal Software


Introduction to Syntax Directed Parsing


(Adapted from Chapter 4, AnaGram User's Guide)


What is Syntax Directed Parsing?

Every programmer has to deal with input data. Usually processing input data depends on what has preceded, and often even on what follows, the input under consideration. Keeping track of these dependencies in order to know how to process the data is called parsing. It is often relatively easy to keep track of simple dependencies when first writing a program. As the program develops, as new features are added, as bugs are fixed, the dependencies often stop being simple. Input processing becomes a headache, since it is hard to keep track of or even identify all the particular cases. Changes to the program cause unexpected problems and program maintenance threatens to get out of control.

Syntax directed parsing is a technique for addressing these difficulties. In syntax directed parsing, the input part of a program is constructed automatically, by means of a standard algorithm, from a high level description of the input data structure. Code to perform any required processing of the data is attached to the description in a convenient way.

The description, being non-procedural, is generally easier to write and to modify than the equivalent program code, and much less likely to harbor bugs. It is easier to read, and easier to maintain. It is easy to re-use in other programs requiring the same input, thus encouraging uniform interfaces. The technique also simplifies the overall program by decoupling the input and processing components and providing a natural, modular structure.

To use syntax directed parsing you first write the input data description, called a grammar. A file which contains a grammar is called a syntax file. A parser generator, such as AnaGram, then can create, from the syntax file, a function (or program) called a parser, written in C or C++. The parser keeps track of all the dependencies in your input, and calls certain functions, reduction procedures, to deal with specific units or sequences of data as they are encountered.

Reduction procedures are the code you write to process your data. In your grammar they are linked to the structures in your input, so that your parser will call them at precisely the right times with precisely the right data. Note that with this technique you need only provide a non-procedural description of the input. The details of flow of control are handled entirely by the parser. When you write reduction procedures, you can concentrate entirely on what you have to do with the data. You don't have to encumber your code with switches and tests to determine the structure of your input.

The parsers you build using a parser generator such as AnaGram may be complete stand-alone programs, or they may serve as input routines for a more extensive program. Some programs may even use more than one parser.


Describing an Input Sequence

Writing a grammar consists of describing the acceptable input sequences for your program. The vehicle for describing an input sequence is called a production. Productions show how a logical component of the input can be made up of a sequence of more elementary components. A production that describes a date might be written:

  date
    -> name of month, day, comma, year 

The components of the input are called tokens. The sequence of tokens on the right side of the production is called a grammar rule, or rule for short. The individual tokens on the right side of the rule are also called rule elements.

The token on the left side of the production is called the reduction token for the rule. Tokens may have semantic values, as distinguished from syntactic values, which you can use in your reduction procedures. For instance, the value of name of month could be an integer in the range zero to eleven, or it could be a pointer to an ascii string. The value of day could be an integer in the range one to thirty-one.

A grammar consists of a number of such productions, each of which describes some component of the input in terms of other components. It does not take very many productions to describe quite complex input streams. A grammar for the C language, for instance, requires about two hundred productions.

Many people find the term "production" quite confusing. It comes from theoretical linguistics where it is used to describe how one may produce sequences which correspond to a set of grammatical rules. Ironically, the major usage of the idea has been in parsing where the interest is not so much in creating sequences which satisfy the grammatical rules as in decoding and analyzing such sequences. Nonetheless, it is convenient, in the above example, to say that the token date produces a sequence of tokens consisting of name of month, day, comma, and year. We also say that the sequence reduces to date.

There may be more than one production to describe a given component, if there is more than one way it may be represented. For instance,

  date
    -> day, name of month, year 

describes another common way of writing a date. In other words, a reduction token may produce a number of different grammar rules.

Tokens which appear on the left side of one or more productions are called nonterminal tokens. Those which appear only on the right sides of productions are called terminal tokens. Terminal tokens are the units which actually appear physically in the input. Nonterminal tokens are identified when a sequence of tokens that matches the right side of a production is seen in the input. When AnaGram analyzes a grammar, it assigns a unique token number to each token it finds in the grammar.

Nonterminal tokens, such as date in the example above, may appear in any grammar rule just as though they were input tokens. The token on the left side of a production can even appear on the right side as well. Such a production is called a recursive production. When a nonterminal token appears on the right side of a production, it may be represented in this context by any of the grammar rules it produces. Grammars described in this manner are called context free grammars since there is no contextual constraint on which of the rules that a token produces can appear in any given context.

Recursive productions may be either left recursive or right recursive. Left recursive productions are those where the recursively defined nonterminal appears as the first element in the recursive rule. Right recursive productions are those where it is the last element. If it appears anywhere between, the production is said to be center recursive.
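For example, a comma-separated list could be described either way. These productions are illustrative, not taken from the User's Guide:

  name list
    -> name
    -> name list, comma, name 

is left recursive, while

  name list
    -> name
    -> name, comma, name list 

is right recursive.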

Any nonterminal token which has a recursive production must also have at least one simple, non-recursive production. Otherwise, it is not possible to create a finite sequence of terminal tokens from the nonterminal token.

Recursion may also occur implicitly in a grammar when one of the tokens on the right side of a production itself has a production involving the token on the left. Such implicit recursion sometimes may involve numerous levels of productions. Implicit recursion occurs most commonly in describing constructs such as arithmetic expressions or the block structure of programming languages.

Clearly, grammars can accommodate multiple levels of structure in the input sequences they describe. There must, at the top, be a single token which encompasses the entire input. This special token is variously called the grammar token, the goal token or the start token.

AnaGram allows you to specify terminal tokens explicitly as ascii characters, or even as sets of ascii characters, right in the grammar. Thus, you may write '0-9' to represent the set of ascii digits, or 'A-Z' to represent the set of upper case letters. The semantic value of such a token is the ascii character code that actually appears in the input stream. If the various sets you use in your grammar overlap, they may not properly represent terminal tokens. In this case, AnaGram automatically extends your grammar appropriately.


How a Parser Works

The aim of a parser is to match its input with the full syntactic structure specified by the productions which make up the grammar. The primary component of a parser is an input buffer, sometimes thought of as a stack, into which tokens are shifted, or stored sequentially, as they are encountered in the input. At the same time that a token is shifted into the input buffer, its semantic value is pushed onto the value stack.

A token is not shifted into the buffer unless it "makes sense", that is, unless it is consistent with the rules of the grammar and with the input that has preceded it. If a token does not make sense, the parser signals a syntax error. In order to determine whether a token makes sense, the parser has a sort of decision table which provides a list of acceptable tokens for each of a number of states. The table also specifies what the parser is to do with each acceptable token.

When the table indicates that a token is to be shifted into the buffer, it also specifies a new state. The parser stacks the current state on a state stack and jumps to the new state. Thus every time a token is shifted into the input buffer, a state number is pushed onto the state stack. For each state of the parser, excepting only the initial state, there is a unique token which will cause a jump to that state. This token is called the characteristic token for the state.

When the rightmost, or most recent, tokens in the input buffer match the right side of a production precisely, the parser may (see below) replace the tokens that match the rule with a single token, the token on the left side of the production. This process of replacing a sequence of tokens with a single token is called reduction. The token that replaces the sequence of tokens is called the reduction token.

The actual mechanism of the reduction is quite important. At the same time that input tokens are removed from the input buffer, state numbers are popped from the state stack, so that when all input tokens matching the rule have been removed, the parser state has been restored to the value it had at the time the first token in the rule was seen. As state numbers are popped from the state stack, token values are popped from the value stack. If the rule has a reduction procedure, temporary variables are loaded with the values popped from the stack and the reduction procedure is called. The reduction token is now shifted into the input buffer just as though it were an input token. If the reduction procedure returned a result it is shifted into the value stack as the value of the reduction token. The parser stacks the current state again and jumps to a new state as determined by the parser tables.
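The shift/reduce cycle described above can be sketched in miniature. The following hand-coded fragment is only an illustration of the mechanism, not AnaGram-generated code: it folds the input buffer and value stack into a single array for brevity, omits the state stack, and hard-codes the grammar sum -> digit | sum, '+', digit, assuming well-formed input:

```c
/* Illustrative hand-coded shift/reduce sketch (not AnaGram output).
   Grammar:  sum -> digit
             sum -> sum, '+', digit
   The token buffer and value stack are folded into one array. */
static int parse_sum(const char *input)
{
    int stack[64];
    int top = 0;
    for (const char *p = input; *p != '\0'; p++) {
        stack[top++] = *p;                         /* shift the next token   */
        if (*p >= '0' && *p <= '9') {
            stack[top - 1] = *p - '0';             /* reduce: digit -> value */
            if (top >= 3) {                        /* top matches sum '+' digit */
                stack[top - 3] += stack[top - 1];  /* reduce to sum          */
                top -= 2;
            }
        }
    }
    return stack[0];   /* exactly one token remains: the goal token */
}
```

Each character is shifted, digits are immediately reduced to their values, and whenever the top of the stack matches the rule sum '+' digit it is replaced by the single reduction token sum, carrying the summed value.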

If there are no errors in your input, then when the last token has been read and shifted into the input buffer, and all possible reductions have been performed, there will be precisely one token in the input buffer: the grammar, or goal, token which describes your entire input. At this point your parser declares itself finished.

Reductions do not necessarily occur every time a rule matches the tokens in the input buffer. If the reduction token does not make sense, that is, if it is not consistent with the rules of the grammar and with the input that has preceded it, the parser will not perform the reduction. Suppose there are tokens in the input which match one of the rules given above for date. A reduction will not occur unless date is one of the tokens the parser is actually looking for at that stage in the input.

Even when the reduction token would make sense, there are still situations where the reduction would not take place. Suppose a grammar includes the two productions given above for date as well as the following:


  date
    -> name of month, day 

This production is the same as the first, but with no year specification. When a parser directed by this grammar has encountered name of month and day, it can't tell without looking further whether it has a short form or a long form date. In such a circumstance, the parser looks at the next following token, which is called a lookahead token. If the lookahead token is a comma, then, in the absence of other productions, the input is a long form date. If the lookahead token is not a comma, then the input is certainly not a long form date, and it is proper to reduce the short form production.

Suppose the lookahead token were a comma and the grammar were also to contain the following production:


  appointment
    -> date, comma, time 

Since a comma can follow date, according to this rule, and can also follow day according to the first production, it is impossible to determine, simply by looking at the lookahead token, whether the date was being given in short form or long form. One would have to look beyond the comma to see if what follows the comma matches the rules for time or for year. Although it is possible to build parsers which can do this, it is not generally feasible. Situations of this sort are called conflicts. AnaGram warns you about the conflicts in your grammar when it analyzes your grammar, and provides numerous facilities to help you correct them.


A Note on Notation

Context free grammars have been traditionally represented in the literature using Backus-Naur Form, or BNF. In Backus-Naur Form, certain characters, called metacharacters, are used to punctuate productions and all other printable characters are taken to represent themselves literally. Named tokens are denoted by enclosing the name within angle brackets < > . The left side of a production is distinguished from the right side by the characters ::= . If several productions have the same left side, the vertical bar | is used to separate them. The elements of a grammar rule are simply juxtaposed to indicate that one follows another. Blanks are ignored. Thus, in BNF, the first production given for date, above, would be:

  <date> ::= <name of month> <day>, <year>

AnaGram uses a notation more consonant with ordinary programming usage. Thus token names need not be bracketed and literal characters must be appropriately quoted. The elements of rules are joined by commas. Using this approach, there is no need for metacharacters and it becomes possible to make a number of useful extensions to the notation.


Reduction Procedures

Of course, the reason for parsing an input stream is to interpret the data in the stream and to process it in some useful manner. The primary tool for doing this is the reduction procedure, often referred to as a "semantic action". A reduction procedure is a piece of C code that is executed when a particular grammar rule is reduced. Often, a reduction procedure calculates a value which becomes the semantic value of the reduction token. The input to the reduction procedure consists of the values of the tokens that make up the grammar rule.

AnaGram allows you to assign C variable names to the tokens in a grammar rule, so that you can refer to them in the reduction procedure. To assign a C variable name, simply follow the token in the rule with a colon and the C variable name. Simple reduction procedures can be written as C or C++ expressions. At the end of the rule, write an equal sign and an expression, terminated by a semicolon. For example:

  (int) hex digit
     -> '0-9':d         =d-'0';
     -> 'a-f':d         =d-'a'+10;
     -> 'A-F':d         =d-'A'+10;   

When any one of these rules is matched, the value of the token in the rule is assigned to the temporary variable. The expression to the right of the equal sign is evaluated and the value of the expression is stored as the value of hex digit, which has been declared to be an int. These productions define hexadecimal digits as ascii characters, and calculate the binary value of the digit in terms of the ascii character code, d. The binary value becomes the value of hex digit. Hexadecimal digits can be combined to make hexadecimal numbers by writing the following productions:

  (int) hex number
    -> hex digit
    -> hex number:n, hex digit:d     =16*n+d;  

There are several important points to notice in this example. First, reduction procedures are executed "from the bottom up". That is, the reduction procedure for hex digit is executed before any reduction procedure for hex number. Second, if there is no reduction procedure for a production, the value of the first token in the rule is assigned to the reduction token. Thus, it is not necessary to provide a reduction procedure in the first production for hex number. Third, the reduction procedures for recursive productions are always executed after the reduction procedure, if any, for the non-recursive production which begins the recursion. Finally, when an input sequence is described using left recursion, as in this example, the elements of the sequence are processed left to right.
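The same left-to-right evaluation can be mirrored in plain C. This is a hand-written equivalent of the reduction procedures above, not code generated by AnaGram; hex_number applies 16*n+d to each digit in turn, just as the left recursive production does:

```c
/* Hand-written equivalent (not generated code) of the reduction
   procedures for hex digit and hex number shown above. */
static int hex_digit(int d)
{
    if (d >= '0' && d <= '9') return d - '0';        /* '0-9' */
    if (d >= 'a' && d <= 'f') return d - 'a' + 10;   /* 'a-f' */
    return d - 'A' + 10;                             /* 'A-F' */
}

static int hex_number(const char *s)
{
    int n = hex_digit(*s++);            /* first digit begins the sequence */
    while (*s != '\0')
        n = 16 * n + hex_digit(*s++);   /* 16*n+d, applied left to right */
    return n;
}
```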

If you wish to process the elements of a sequence right to left, you may use right recursion. For example, it is sometimes convenient to define the fraction part of a decimal number thus:

  (double) fraction part
    -> '0-9':d                   =(d-'0')/10.;
    -> '0-9':d,fraction part:f   =(d-'0'+f)/10.;

In this case the leading digits are stored temporarily in the parse stack, and then the fraction part is evaluated right to left only when the last digit has been found.
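A hand-written C equivalent (again, not generated code) makes the right-to-left order visible: the recursive call must return before the enclosing digit can be folded in, so the last digit is evaluated first:

```c
/* Hand-written equivalent of the right recursive fraction part
   productions above; not AnaGram-generated code. */
static double fraction_part(const char *s)
{
    double d = s[0] - '0';
    if (s[1] == '\0')
        return d / 10.0;                       /* -> '0-9':d                 */
    return (d + fraction_part(s + 1)) / 10.0;  /* -> '0-9':d, fraction part:f */
}
```

For example, fraction_part("25") evaluates the 5 first, giving 0.5, and then computes (2 + 0.5)/10 = 0.25.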

Reduction procedures can be more complex than simple expressions. After the equal sign you may include an arbitrary block of C or C++ code, enclosed in braces, { }. To return a semantic value for the reduction token simply use a return statement. Of course, reduction procedures have the full resources of C or C++ at their disposal. They may set and interrogate global variables and may call functions freely.

Since the reduction procedures you write will probably need some support code, such as #include statements and declarations, you may incorporate C or C++ code into your syntax file at any point. You need only enclose it in braces ({}). Such code is called embedded C. All embedded C code is also copied to the parser file, and precedes all of your reduction procedures.


Building a Parser

In order to round out a parser into a functioning program it needs input procedures, as well as error diagnosis and recovery capabilities. AnaGram has a number of options available which give you a high degree of flexibility in configuring a parser to suit your particular needs. All of the options are provided with reasonable defaults, so that you can safely disregard any option until you need the features it provides.


Invoking a Parser

Normally, AnaGram configures parsers as functions which you can call from elsewhere in your program. In this situation, you call the parser, it processes its input, and returns either when it has finished or cannot proceed because of errors. Alternatively, if you set the event driven configuration switch, your parser will be configured so that you have two procedures to call: an initializer and a parser. In the event driven configuration you start the parse by calling the initializer and then you call the parser once for each unit of input. Using the event driven mode makes it quite easy to configure a parser as a filter and to chain several parsers together so that the output from one parser is the input to the next. Such multi-stage parsing is a convenient way to deal with complex input that is not context free.


Communicating with a Parser

The complete status of your parser is contained in a single data structure called a parser control block. All communications with a parser take place via the parser control block. Input procedures must place input data in the appropriate field in the parser control block. When the parse is complete or encounters an error, the results of the parse may be found in the parser control block. When AnaGram builds a parser it includes, in the header file it generates, a typedef statement which defines the structure of the parser control block.


Parser Input

The input to your parser may be either characters read directly from an input stream, or tokens accumulated by a pre-processor or lexical scanner. The way you provide input to your parser depends on how your grammar defines input tokens and also on whether or not you have requested an event driven parser.

If your parser is event driven, you provide its input by storing the input code and the input value, if any, into the parser control block and calling the parser.

If you have set the pointer input configuration switch in your syntax file, you simply initialize the pointer field in your parser control block before you call your parser. Your parser will then read its input directly from memory by simply incrementing the pointer as necessary.

Otherwise, your parser will invoke a macro called GET_INPUT every time it needs more input. You may define GET_INPUT according to your needs. You can define it so that it calls an input function, or you can define it so that it executes in-line code each time it is invoked. Your GET_INPUT macro should store its input code in the input_code field of the parser control block. If you do not write a GET_INPUT macro, AnaGram will provide one which will read characters from stdin.
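As a minimal sketch, a GET_INPUT that reads from an arbitrary stream instead of stdin might look like the following. The struct here is a stand-in: the real parser control block type and its instance name come from the header AnaGram generates, and only the input_code field is taken from the description above:

```c
#include <stdio.h>

/* Stand-in for the generated parser control block; only the
   input_code field is drawn from the documentation above. */
static struct { int input_code; } PCB;

static FILE *parse_source;   /* stream the parser should read from */

/* Hypothetical GET_INPUT reading one character per invocation. */
#define GET_INPUT (PCB.input_code = getc(parse_source))
```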

If your grammar does not define terminal tokens in terms of ascii characters or external token numbers, your GET_INPUT will have to determine the appropriate internal token number for each input token. To assist you in determining these token numbers AnaGram provides a typedef enum statement in the header file. You can then use named constants to specify the internal token numbers for the input tokens.


Error Handling

Your parser must be prepared to deal with erroneous input. There are two aspects to error handling: diagnosing the error, and recovering from the error. On encountering an error, your parser will invoke a macro called SYNTAX_ERROR. If you do not provide a definition for SYNTAX_ERROR, AnaGram will provide a simple error diagnostic. AnaGram can also provide automatic error diagnoses which pinpoint the location of the error.

AnaGram provides two options for error recovery: error token resynchronization and automatic resynchronization. These are techniques for getting your parser back in synch with its input so it can proceed after encountering an error. Normally, if you do not select one of these recovery techniques, your parser will terminate when it encounters an error; however, you may override this default if you wish and provide your own recovery technique.



AnaGram parser generator
Copyright ©1993-2002, Parsifal Software.
All Rights Reserved.

